Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add hardware topology module #105

Merged
merged 40 commits into from
Apr 23, 2024
Merged

add hardware topology module #105

merged 40 commits into from
Apr 23, 2024

Conversation

bg-furiosa
Copy link
Contributor

@bg-furiosa bg-furiosa commented Mar 28, 2024

PR Description

  • add a hardware topology module to the device API that can detect the server's hardware topology and measure the distance between devices.
  • this is back porting of go implementation[1][2].

What type of PR is this?

  • /kind feature

Special notes for reviewer

./show_topology
Authorization required, but no authorization protocol specified
+--------+----------------------+----------------------+----------------------+----------------------+
| Device | npu0                 | npu1                 | npu2                 | npu3                 |
+--------+----------------------+----------------------+----------------------+----------------------+
| npu0   | LinkTypeSoc          | LinkTypeCPU          | LinkTypeInterconnect | LinkTypeInterconnect |
+--------+----------------------+----------------------+----------------------+----------------------+
| npu1   | LinkTypeCPU          | LinkTypeSoc          | LinkTypeInterconnect | LinkTypeInterconnect |
+--------+----------------------+----------------------+----------------------+----------------------+
| npu2   | LinkTypeInterconnect | LinkTypeInterconnect | LinkTypeSoc          | LinkTypeCPU          |
+--------+----------------------+----------------------+----------------------+----------------------+
| npu3   | LinkTypeInterconnect | LinkTypeInterconnect | LinkTypeCPU          | LinkTypeSoc          |
+--------+----------------------+----------------------+----------------------+----------------------+

Copy link
Collaborator

@libc-furiosa libc-furiosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before reviewing at the code level, I want to comment on the design.

First, we need to simplify the implementation of Topology.
Right now, the struct is behaving mutable with state, but it can actually have deterministic values at initialization time.
This makes the interface more compact and clear.

Second, if you need to add a dependency on hwloc, it would be better to build it from source code and include it in the build output via a static link rather than relying on some arbitrary OS package manager.
Meanwhile, there are hwloc implementations on crates.io. Please comment if you think we should use them.
(e.g., https://github.com/HadrienG2/hwlocality)

Signed-off-by: bg-furiosa <[email protected]>
Signed-off-by: bg-furiosa <[email protected]>
Signed-off-by: bg-furiosa <[email protected]>
@bg-furiosa bg-furiosa requested a review from libc-furiosa April 8, 2024 10:37
Copy link
Collaborator

@libc-furiosa libc-furiosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started by looking at the big flows. I'm not completely done with the review, but I'll continue with the review after seeing the changes for comments.

device-api/Cargo.toml Outdated Show resolved Hide resolved
device-api/Cargo.toml Outdated Show resolved Hide resolved
device-api/Cargo.toml Outdated Show resolved Hide resolved
device-api/src/topology/mod.rs Outdated Show resolved Hide resolved
device-api/src/device.rs Outdated Show resolved Hide resolved
device-api/src/topology/mod.rs Outdated Show resolved Hide resolved
device-api/src/topology/mod.rs Outdated Show resolved Hide resolved
device-api/src/topology/mod.rs Outdated Show resolved Hide resolved
device-api/src/topology/mod.rs Outdated Show resolved Hide resolved
device-api/src/topology/hwloc.rs Outdated Show resolved Hide resolved
Signed-off-by: bg-furiosa <[email protected]>
Signed-off-by: bg-furiosa <[email protected]>
Signed-off-by: bg-furiosa <[email protected]>
Signed-off-by: bg-furiosa <[email protected]>
Signed-off-by: bg-furiosa <[email protected]>
Signed-off-by: bg-furiosa <[email protected]>
@bg-furiosa bg-furiosa requested a review from libc-furiosa April 12, 2024 12:12
Copy link
Collaborator

@libc-furiosa libc-furiosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments. Let's just resolve those and be done with it.

@@ -8,6 +8,8 @@ license = { workspace = true }
homepage = { workspace = true }
repository = { workspace = true }
readme = { workspace = true }
build = "build.rs"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be omitted.

use std::collections::BTreeMap;

use itertools::iproduct;
use libc::link;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this dependency? If it is not used, you can remove this line and remove the libc dependency as well.

@@ -0,0 +1,36 @@
use cli_table::{print_stdout, Cell, Style, Table};
use furiosa_device::{list_devices, topology, DeviceError};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
use furiosa_device::{list_devices, topology, DeviceError};
use furiosa_device::{list_devices, topology::Topology, DeviceError};

return Ok(());
}

let topology = topology::Topology::new(devices.clone())?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let topology = topology::Topology::new(devices.clone())?;
let topology = Topology::new(devices.clone())?;

Comment on lines 15 to 16
let mut rows = vec![];
let mut header = vec!["Device".cell().bold(true)];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut rows = vec![];
let mut header = vec!["Device".cell().bold(true)];
let mut rows = Vec::with_capacity(devices.len() + 1);
let mut header = Vec::with_capacity(devices.len() + 1);
header.push("Device".cell().bold(true));

Comment on lines 84 to 90
let mut link_type: LinkType = LinkTypeUnknown;

if dev1_bdf == dev2_bdf {
link_type = LinkTypeSoc
} else {
link_type = topology_provider.get_common_ancestor_obj(&dev1_bdf, &dev2_bdf)?;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut link_type: LinkType = LinkTypeUnknown;
if dev1_bdf == dev2_bdf {
link_type = LinkTypeSoc
} else {
link_type = topology_provider.get_common_ancestor_obj(&dev1_bdf, &dev2_bdf)?;
}
let link_type = if dev1_bdf == dev2_bdf {
LinkTypeSoc
} else {
topology_provider.get_common_ancestor_obj(&dev1_bdf, &dev2_bdf)?
}

(dev1_bdf, dev2_bdf)
};

topology_matrix.insert(key, link_type);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
topology_matrix.insert(key, link_type);
topology_matrix.entry(key).or_insert(link_type);

}
}

pub unsafe fn hwloc_get_common_ancestor_obj(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why the access specifier for this function is pub? Are there other places it's used?

static ref BDF_NOTATION_PATTERN: Regex = Regex::new(r"^(?:(?P<domain>[0-9a-fA-F]+):)?(?P<bus>[0-9a-fA-F]+):(?P<dev>[0-9a-fA-F]+)\.(?P<func>[0-9a-fA-F]+)").unwrap();
}

pub unsafe fn hwloc_get_pcidev_by_busidstring(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why the access specifier for this function is pub? (2)

thiserror = "1"
tokio = { workspace = true }
tracing = "0.1"
tracing-subscriber = { version = "0.3.1", features = ["env-filter", "json"] }
libc = "0.2.153"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure this is the dependency we actually need.

Copy link
Collaborator

@libc-furiosa libc-furiosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@libc-furiosa libc-furiosa merged commit 1e45fa8 into furiosa-ai:main Apr 23, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants