The RPC implementation shells out to the nvidia-smi command and parses its output without validation or sanitization; if the parsing logic is not robust, unexpected or malformed output can break it, and executing external commands without sanitization is a general injection risk.
let gpu_usage = if let Ok(output) = std::process::Command::new("nvidia-smi")
    .args(vec![
        "--query-gpu=memory.total,memory.used",
        "--format=csv,noheader,nounits",
    ])
    .output()
{
    let output = String::from_utf8_lossy(&output.stdout);
    let usage = output
        .lines()
        .map(|l| l.split_once(",").unwrap())
        .map(|(total, used)| {
            (
                ByteSize::mib(total.trim().parse::<u64>().unwrap()).as_u64(),
                ByteSize::mib(used.trim().parse::<u64>().unwrap()).as_u64(),
            )
        })
        .collect();
    Some(usage)
} else {
    println!("Failed to execute nvidia-smi, maybe it's not installed");
    None
};
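As a minimal sketch of more defensive parsing (not code from this PR; the helper name is hypothetical and the use of the bytesize crate is inferred from the snippet above), malformed lines could be skipped instead of unwrapped, so unexpected nvidia-smi output degrades to missing data rather than a panic:

use bytesize::ByteSize;

// Sketch: tolerant parsing of `--query-gpu=memory.total,memory.used
// --format=csv,noheader,nounits` output; lines that fail to parse are skipped.
fn parse_gpu_usage(stdout: &str) -> Vec<(u64, u64)> {
    stdout
        .lines()
        .filter_map(|line| line.split_once(','))
        .filter_map(|(total, used)| {
            let total_mib = total.trim().parse::<u64>().ok()?;
            let used_mib = used.trim().parse::<u64>().ok()?;
            Some((ByteSize::mib(total_mib).as_u64(), ByteSize::mib(used_mib).as_u64()))
        })
        .collect()
}

The handler could then treat an empty vector as "no GPU data available" rather than panicking on a bad line.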
The resource usage component relies on external commands like nvidia-smi and system calls that may not be available or behave consistently across different environments, potentially causing panics or incorrect data display.
The error handling in the cscs_resource_usage function could be improved to better propagate and handle network or communication errors when connecting to the RPC endpoint.
The nvidia-smi command execution could fail silently, leading to missed GPU usage data. Add error handling to log failures and return default values when nvidia-smi is not available or returns invalid data.
async fn resource_usage(self, _context: ::tarpc::context::Context) -> ResourceUsage {
let mut sys = System::new_all();
sys.refresh_all();
tokio::time::sleep(sysinfo::MINIMUM_CPU_UPDATE_INTERVAL).await;
sys.refresh_processes_specifics(ProcessesToUpdate::All, true, ProcessRefreshKind::nothing().with_cpu());
let Ok(pid) = get_current_pid() else {
return ResourceUsage::default();
};
let Some(process) = sys.process(pid) else {
return ResourceUsage::default();
};
-let gpu_usage = if let Ok(output) = std::process::Command::new("nvidia-smi")
+let gpu_usage = match std::process::Command::new("nvidia-smi")
.args(vec![
"--query-gpu=memory.total,memory.used",
"--format=csv,noheader,nounits",
])
.output()
{
-    let output = String::from_utf8_lossy(&output.stdout);
-    let usage = output
-        .lines()
-        .map(|l| l.split_once(",").unwrap())
-        .map(|(total, used)| {
-            (
-                ByteSize::mib(total.trim().parse::<u64>().unwrap()).as_u64(),
-                ByteSize::mib(used.trim().parse::<u64>().unwrap()).as_u64(),
-            )
-        })
-        .collect();
-    Some(usage)
-} else {
-    println!("Failed to execute nvidia-smi, maybe it's not installed");
-    None
+    Ok(output) => {
+        if !output.status.success() {
+            eprintln!("nvidia-smi command failed: {}", String::from_utf8_lossy(&output.stderr));
+            None
+        } else {
+            let output = String::from_utf8_lossy(&output.stdout);
+            let usage = output
+                .lines()
+                .map(|l| l.split_once(",").unwrap())
+                .map(|(total, used)| {
+                    (
+                        ByteSize::mib(total.trim().parse::<u64>().unwrap()).as_u64(),
+                        ByteSize::mib(used.trim().parse::<u64>().unwrap()).as_u64(),
+                    )
+                })
+                .collect();
+            Some(usage)
+        }
+    }
+    Err(e) => {
+        eprintln!("Failed to execute nvidia-smi: {e}");
+        None
+    }
};
ResourceUsage {
cpu: process.cpu_usage() / sys.cpus().len() as f32,
rss: process.memory(),
vsz: process.virtual_memory(),
gpu: gpu_usage,
}
}
Suggestion importance[1-10]: 7
Why: The suggestion addresses a potential issue where nvidia-smi command failures could go unnoticed, leading to missed GPU data. The improved code adds proper error logging and handling for command execution failures and invalid outputs, enhancing robustness.
Medium
General
Add validation for resource usage data updates
The component does not handle potential errors during data updates. If the resource usage data is malformed or unavailable, the UI may become inconsistent. Add validation and error handling to ensure robustness.
impl Component<Msg, UserEvent> for ResourceUsage {
fn on(&mut self, ev: Event<UserEvent>) -> Option<Msg> {
let _ = match ev {
Event::Keyboard(KeyEvent { code: Key::Left, .. }) => self.perform(Cmd::Move(Direction::Left)),
Event::Keyboard(KeyEvent { code: Key::Right, .. }) => self.perform(Cmd::Move(Direction::Right)),
Event::Keyboard(KeyEvent { code: Key::Home, .. }) => self.perform(Cmd::GoTo(Position::Begin)),
Event::Keyboard(KeyEvent { code: Key::End, .. }) => self.perform(Cmd::GoTo(Position::End)),
Event::Keyboard(KeyEvent { code: Key::Esc, .. }) => {
return Some(Msg::Job(JobMsg::Close));
}
Event::User(UserEvent::Cscs(CscsEvent::GotJobResourceUsage(ru))) => {
+            // Validate resource usage data before updating UI
+            if ru.cpu.is_nan() || ru.cpu < 0.0 {
+                eprintln!("Invalid CPU usage data received: {:?}", ru.cpu);
+                return Some(Msg::None);
+            }
self.attr(
Attribute::Custom(UPDATE_CPU_DATA),
AttrValue::Payload(PropPayload::One(PropValue::F64(ru.cpu as f64))),
);
+            if ru.rss == 0 && ru.vsz == 0 {
+                eprintln!("Zero memory usage data received");
+                return Some(Msg::None);
+            }
self.attr(
Attribute::Custom(UPDATE_MEMORY_DATA),
AttrValue::Payload(PropPayload::Tup2((PropValue::U64(ru.rss), PropValue::U64(ru.vsz)))),
);
if let Some(gpu) = ru.gpu {
+                if gpu.is_empty() {
+                    eprintln!("Empty GPU usage data received");
+                    return Some(Msg::None);
+                }
self.attr(
Attribute::Custom(UPDATE_GPU_DATA),
AttrValue::Payload(PropPayload::Vec(gpu.iter().map(|g| PropValue::U64(g.1)).collect())),
);
}
CmdResult::None
}
_ => CmdResult::None,
};
Some(Msg::None)
}
}
Suggestion importance[1-10]: 6
__
Why: The suggestion identifies a potential issue with unvalidated data updates that could lead to UI inconsistencies. The improved code adds checks for invalid CPU usage, zero memory usage, and empty GPU data, improving the robustness of the component.
Low
Optimize endpoint creation for RPC calls
The function creates a new endpoint for each RPC call, which is inefficient. Reuse existing endpoints or implement connection pooling to reduce overhead and improve performance.
pub async fn cscs_resource_usage(job_id: i64, system: Option<String>) -> Result<ResourceUsage> {
let endpoint_id = get_endpoint_id(job_id, system).await?;
let alpn: Vec<u8> = b"/coman/rpc".to_vec();
+ // Consider reusing a single endpoint instance instead of creating new ones
let secret_key = SecretKey::generate(&mut rand::rng());
let endpoint = Endpoint::builder().secret_key(secret_key).bind().await?;
match endpoint.connect(endpoint_id, &alpn).await {
Ok(connection) => {
let (iroh_send, iroh_recv) = connection.open_bi().await?;
let combined = Duplex::new(iroh_recv, iroh_send);
let codec_builder = LengthDelimitedCodec::builder();
let framed = codec_builder.new_framed(combined);
let transport = serde_transport::new(framed, Bincode::default());
let client = ComanRPCClient::new(client::Config::default(), transport);
client
.spawn()
.resource_usage(context::current())
.await
.wrap_err("couldn't get resource usage from remote")
}
Err(e) => Err(e).wrap_err("couldn't establish tunnel to remote"),
}
}
Suggestion importance[1-10]: 3
Why: The suggestion points out a potential performance concern with creating new endpoints for each RPC call. However, the implementation is actually correct as-is since the endpoint creation is necessary for establishing secure connections. The suggestion doesn't propose a concrete optimization and is more of a general observation rather than a critical improvement.
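For reference, one shape such reuse could take — a sketch only, assuming iroh's Endpoint handle is cheaply cloneable and that the project uses eyre for errors as the surrounding code suggests — is a lazily initialized, process-wide endpoint:

use iroh::{Endpoint, SecretKey};
use tokio::sync::OnceCell;

// Sketch: bind one endpoint on first use and hand out clones afterwards.
static SHARED_ENDPOINT: OnceCell<Endpoint> = OnceCell::const_new();

async fn shared_endpoint() -> eyre::Result<Endpoint> {
    let endpoint = SHARED_ENDPOINT
        .get_or_try_init(|| async {
            let secret_key = SecretKey::generate(&mut rand::rng());
            Endpoint::builder()
                .secret_key(secret_key)
                .bind()
                .await
                .map_err(|e| eyre::eyre!("failed to bind iroh endpoint: {e}"))
        })
        .await?;
    Ok(endpoint.clone())
}

cscs_resource_usage could then call shared_endpoint().await? in place of the per-call key generation and bind().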
The code assumes there is always at least one GPU and that the first GPU index is valid. This can cause panics if no GPUs are present or if the device index is invalid. Add error handling for NVML operations.
-let nvml = Nvml::init().unwrap();
-let device = nvml.device_by_index(0).unwrap();
-let memory_info = device.memory_info().unwrap();
+let mut gpu_memory_used = 0u64;
+if let Ok(nvml) = Nvml::init() {
+    if let Ok(device) = nvml.device_by_index(0) {
+        if let Ok(memory_info) = device.memory_info() {
+            gpu_memory_used = memory_info.used;
+        }
+    }
+}
ResourceUsage {
cpu: cpu_usage,
rss: sys.used_memory(),
-    gpu: memory_info.used,
+    gpu: gpu_memory_used,
}
Suggestion importance[1-10]: 8
Why: The original code assumes NVML operations will always succeed, which can lead to panics. The improved code adds proper error handling using if let Ok() patterns to safely handle potential failures in NVML initialization and device access.
Medium
Add error handling for spawned RPC tasks
The spawn function does not handle errors from spawned tasks, which could lead to silent failures. Add error logging to ensure any task failures are reported.
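The PR does not show the spawn helper itself, so the following is only a hypothetical sketch of the suggested pattern: accept a fallible future and log its error instead of dropping it.

use std::future::Future;

// Hypothetical sketch: the real `spawn` helper is not part of this diff.
// A task's error is logged rather than silently discarded.
fn spawn<F, E>(fut: F)
where
    F: Future<Output = Result<(), E>> + Send + 'static,
    E: std::fmt::Display + Send + 'static,
{
    tokio::spawn(async move {
        if let Err(e) = fut.await {
            eprintln!("spawned RPC task failed: {e}");
        }
    });
}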
Why: The suggestion correctly identifies that unhandled errors in spawned tasks can lead to silent failures. The improved code adds error logging to report task failures, which enhances the robustness of the RPC server implementation.
Low
General
Improve secret key reuse for RPC connections
The function creates a new secret key for each RPC call, which is inefficient and potentially insecure. Reuse a single secret key or generate it once per session instead.
+pub async fn cscs_resource_usage(job_id: i64, system: Option<String>) -> Result<ResourceUsage> {
+    let endpoint_id = get_endpoint_id(job_id, system).await?;
+    let alpn: Vec<u8> = b"/coman/rpc".to_vec();
+    let secret_key = SecretKey::generate(&mut rand::rng());
+    let endpoint = Endpoint::builder().secret_key(secret_key).bind().await?;
+    match endpoint.connect(endpoint_id, &alpn).await {
+        Ok(connection) => {
+            let (iroh_send, iroh_recv) = connection.open_bi().await?;
+            let combined = Duplex::new(iroh_recv, iroh_send);
+            let codec_builder = LengthDelimitedCodec::builder();
+            let framed = codec_builder.new_framed(combined);
+            let transport = serde_transport::new(framed, Bincode::default());
+            let client = ComanRPCClient::new(client::Config::default(), transport);
+            client
+                .spawn()
+                .resource_usage(context::current())
+                .await
+                .wrap_err("couldn't get resource usage from remote")
+        }
+        Err(e) => Err(e).wrap_err("couldn't establish tunnel to remote"),
+    }
+}
Suggestion importance[1-10]: 3
Why: While the suggestion identifies a potential inefficiency in generating a new secret key for each RPC call, the improved code provided is identical to the existing code. Therefore, the suggestion doesn't actually propose a meaningful change and only points out a general concern without providing a concrete solution.
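To illustrate what per-session reuse might look like — a sketch only, assuming iroh's SecretKey implements Clone; persisting the key to disk between runs would be a further step not shown here — the key could be generated once per process and cloned for every connection:

use std::sync::OnceLock;
use iroh::SecretKey;

// Sketch: one key per process, cloned for each endpoint/connection.
static SESSION_KEY: OnceLock<SecretKey> = OnceLock::new();

fn session_secret_key() -> SecretKey {
    SESSION_KEY
        .get_or_init(|| SecretKey::generate(&mut rand::rng()))
        .clone()
}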
The code assumes NVML initialization and GPU device access will always succeed. Add proper error handling for these operations to prevent panics if GPU is unavailable or inaccessible.
-let nvml = Nvml::init().unwrap();
-let device = nvml.device_by_index(0).unwrap();
-let memory_info = device.memory_info().unwrap();
+let mut gpu_memory_used = 0u64;
+if let Ok(nvml) = Nvml::init() {
+    if let Ok(device) = nvml.device_by_index(0) {
+        if let Ok(memory_info) = device.memory_info() {
+            gpu_memory_used = memory_info.used;
+        }
+    }
+}
ResourceUsage {
cpu: cpu_usage,
rss: sys.used_memory(),
-    gpu: memory_info.used,
+    gpu: gpu_memory_used,
}
Suggestion importance[1-10]: 8
Why: The original code uses unwrap() on potentially failing GPU operations, which can cause panics. The improved code adds proper error handling using if let Ok() patterns to safely handle cases where GPU access might fail, improving robustness.
Medium
Improve error handling in RPC communication
The RPC client creation and call may fail silently due to missing error propagation. Ensure that all potential failure points in the RPC communication chain are properly handled and reported.
pub async fn cscs_resource_usage(job_id: i64, system: Option<String>) -> Result<ResourceUsage> {
let endpoint_id = get_endpoint_id(job_id, system).await?;
let alpn: Vec<u8> = b"/coman/rpc".to_vec();
let secret_key = SecretKey::generate(&mut rand::rng());
let endpoint = Endpoint::builder().secret_key(secret_key).bind().await?;
match endpoint.connect(endpoint_id, &alpn).await {
Ok(connection) => {
let (iroh_send, iroh_recv) = connection.open_bi().await?;
let combined = Duplex::new(iroh_recv, iroh_send);
let codec_builder = LengthDelimitedCodec::builder();
let framed = codec_builder.new_framed(combined);
let transport = serde_transport::new(framed, Bincode::default());
let client = ComanRPCClient::new(client::Config::default(), transport);
-        client
+        let result = client
.spawn()
.resource_usage(context::current())
.await
-            .wrap_err("couldn't get resource usage from remote")
+            .map_err(|e| eyre::eyre!("RPC call failed: {}", e))
+            .wrap_err("couldn't get resource usage from remote")?;
+        Ok(result)
}
-    Err(e) => Err(e).wrap_err("couldn't establish tunnel to remote"),
+    Err(e) => Err(eyre::eyre!("Connection failed: {}", e)).wrap_err("couldn't establish tunnel to remote"),
}
}
Suggestion importance[1-10]: 7
Why: The suggestion addresses potential silent failures in RPC communication by adding explicit error mapping and wrapping. While this improves error reporting, it doesn't address fundamental architectural concerns about error propagation in the broader system.
Medium
General
Add panic recovery to RPC server tasks
The spawn function doesn't handle potential panics from the spawned future. This could lead to unhandled panics in the RPC server, which could crash the application or go unnoticed.
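One way to add that recovery — a sketch under the assumption that tasks run on the tokio runtime, since the actual server-side spawn code is not shown in this diff and the helper name is illustrative — is to await the JoinHandle in a supervising task and log panics or cancellations:

use std::future::Future;

// Hypothetical sketch: observe the JoinHandle so a panic is logged instead of lost.
fn spawn_rpc_task<F>(fut: F)
where
    F: Future<Output = ()> + Send + 'static,
{
    let handle = tokio::spawn(fut);
    tokio::spawn(async move {
        if let Err(e) = handle.await {
            if e.is_panic() {
                eprintln!("RPC task panicked: {e}");
            } else {
                eprintln!("RPC task was cancelled: {e}");
            }
        }
    });
}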
Why: The suggestion enhances fault tolerance by adding panic recovery to spawned tasks. However, this is a defensive measure rather than a critical fix, and the improvement is moderate in terms of overall system stability.
PR Type
Enhancement
Description
Add RPC support for system metrics
Introduce resource usage viewing in TUI
Enable CLI command for resource usage
Extend job status handling
Diagram Walkthrough
File Walkthrough
18 files
Add new ID for resource usage view
Add resource usage message
Integrate resource usage component and actions
Add resource usage event
Add CLI resource usage command
Register RPC handler for port forwarding
Implement RPC service for resource usage
Add resource usage component
Create resource usage visualization component
Add keyboard shortcut for resource usage
Update path listing with hidden files option
Add requeued job status
Add CLI resource usage command implementation
Implement resource usage fetching and connection handling
Add resource usage polling port
Integrate resource usage components into main application
Improve Docker image parsing and error handling
Add show hidden flag to filesystem listing
2 files
Create RPC module
Enable DCGM hook
2 files