add rpc support for getting system metrics by Panaetius · Pull Request #59 · SwissDataScienceCenter/coman

Panaetius · 2026-01-26T15:31:30Z

PR Type

Enhancement

Description

Add RPC support for system metrics
Introduce resource usage viewing in TUI
Enable CLI command for resource usage
Extend job status handling

Diagram Walkthrough

flowchart LR
  A["CLI Command: Resource Usage"] -- "triggers" --> B["RPC Handler"]
  B -- "fetches" --> C["System Metrics"]
  C -- "sends" --> D["TUI Component"]
  D -- "updates" --> E["Resource Usage View"]

File Walkthrough

Relevant files

Enhancement

18 files

ids.rs `Add new ID for resource usage view`	+1/-0
messages.rs `Add resource usage message`	+1/-0
model.rs `Integrate resource usage component and actions`	+59/-3
user_events.rs `Add resource usage event`	+2/-0
app.rs `Add CLI resource usage command`	+9/-1
exec.rs `Register RPC handler for port forwarding`	+6/-0
rpc.rs `Implement RPC service for resource usage`	+123/-0
mod.rs `Add resource usage component`	+1/-0
resource_usage.rs `Create resource usage visualization component`	+542/-0
workload_list.rs `Add keyboard shortcut for resource usage`	+18/-1
client.rs `Update path listing with hidden files option`	+2/-2
types.rs `Add requeued job status`	+2/-0
cli.rs `Add CLI resource usage command implementation`	+36/-3
handlers.rs `Implement resource usage fetching and connection handling`	+94/-38
ports.rs `Add resource usage polling port`	+52/-3
main.rs `Integrate resource usage components into main application`	+14/-4
types.rs `Improve Docker image parsing and error handling`	+13/-7
filesystem_api.rs `Add show hidden flag to filesystem listing`	+5/-1

Configuration changes

2 files

mod.rs `Create RPC module`	+1/-0
config.toml `Enable DCGM hook`	+1/-0

Additional files

2 files

pr-review.yaml	+1/-0
Cargo.toml	+9/-1

github-actions · 2026-01-26T15:50:28Z

PR Reviewer Guide 🔍

(Review updated until commit `3577357`)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review Potential Security Issue The RPC implementation directly executes `nvidia-smi` command without proper input validation or sanitization, which could lead to command injection vulnerabilities if the output parsing logic is not robust enough. let gpu_usage = if let Ok(output) = std::process::Command::new("nvidia-smi") .args(vec![ "--query-gpu=memory.total,memory.used", "--format=csv,noheader,nounits", ]) .output() { let output = String::from_utf8_lossy(&output.stdout); let usage = output .lines() .map(\|l\| l.split_once(",").unwrap()) .map(\|(total, used)\| { ( ByteSize::mib(total.trim().parse::<u64>().unwrap()).as_u64(), ByteSize::mib(used.trim().parse::<u64>().unwrap()).as_u64(), ) }) .collect(); Some(usage) } else { println!("Failed to execute nvidia-smi, maybe it's not installed"); None }; Data Handling Concerns The resource usage component relies on external commands like `nvidia-smi` and system calls that may not be available or behave consistently across different environments, potentially causing panics or incorrect data display. Event::Keyboard(KeyEvent { code: Key::Home, .. }) => self.perform(Cmd::GoTo(Position::Begin)), Event::Keyboard(KeyEvent { code: Key::End, .. }) => self.perform(Cmd::GoTo(Position::End)), Event::Keyboard(KeyEvent { code: Key::Esc, .. }) => { return Some(Msg::Job(JobMsg::Close)); } Event::User(UserEvent::Cscs(CscsEvent::GotJobResourceUsage(ru))) => { self.attr( Attribute::Custom(UPDATE_CPU_DATA), AttrValue::Payload(PropPayload::One(PropValue::F64(ru.cpu as f64))), ); self.attr( Attribute::Custom(UPDATE_MEMORY_DATA), AttrValue::Payload(PropPayload::Tup2((PropValue::U64(ru.rss), PropValue::U64(ru.vsz)))), ); if let Some(gpu) = ru.gpu { self.attr( Attribute::Custom(UPDATE_GPU_DATA), AttrValue::Payload(PropPayload::Vec(gpu.iter().map(\|g\| PropValue::U64(g.1)).collect())), ); } CmdResult::None } _ => CmdResult::None, }; Some(Msg::None) } } // START CPU struct CpuUsage { component: Chart, Error Handling The error handling in `cscs_resource_usage` function could be improved to better propagate and handle network or communication errors when connecting to the RPC endpoint. pub async fn cscs_resource_usage(job_id: i64, system: Option<String>) -> Result<ResourceUsage> { let endpoint_id = get_endpoint_id(job_id, system).await?; let alpn: Vec<u8> = b"/coman/rpc".to_vec(); let secret_key = SecretKey::generate(&mut rand::rng()); let endpoint = Endpoint::builder().secret_key(secret_key).bind().await?; match endpoint.connect(endpoint_id, &alpn).await { Ok(connection) => { let (iroh_send, iroh_recv) = connection.open_bi().await?; let combined = Duplex::new(iroh_recv, iroh_send); let codec_builder = LengthDelimitedCodec::builder(); let framed = codec_builder.new_framed(combined); let transport = serde_transport::new(framed, Bincode::default()); let client = ComanRPCClient::new(client::Config::default(), transport); client .spawn() .resource_usage(context::current()) .await .wrap_err("couldn't get resource usage from remote") } Err(e) => Err(e).wrap_err("couldn't establish tunnel to remote"), } }

github-actions · 2026-01-26T15:50:31Z

Persistent review updated to latest commit 3369f5a

github-actions · 2026-01-26T15:50:47Z

PR Code Suggestions ✨

Latest suggestions up to 3577357

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Improve error handling for GPU usage collection The nvidia-smi command execution could fail silently, leading to missed GPU usage data. Add error handling to log failures and return default values when nvidia-smi is not available or returns invalid data. coman/src/cli/rpc.rs [51-93] async fn resource_usage(self, _context: ::tarpc::context::Context) -> ResourceUsage { let mut sys = System::new_all(); sys.refresh_all(); tokio::time::sleep(sysinfo::MINIMUM_CPU_UPDATE_INTERVAL).await; sys.refresh_processes_specifics(ProcessesToUpdate::All, true, ProcessRefreshKind::nothing().with_cpu()); let Ok(pid) = get_current_pid() else { return ResourceUsage::default(); }; let Some(process) = sys.process(pid) else { return ResourceUsage::default(); }; - let gpu_usage = if let Ok(output) = std::process::Command::new("nvidia-smi") + let gpu_usage = match std::process::Command::new("nvidia-smi") .args(vec![ "--query-gpu=memory.total,memory.used", "--format=csv,noheader,nounits", ]) .output() { - let output = String::from_utf8_lossy(&output.stdout); - let usage = output - .lines() - .map(\|l\| l.split_once(",").unwrap()) - .map(\|(total, used)\| { - ( - ByteSize::mib(total.trim().parse::<u64>().unwrap()).as_u64(), - ByteSize::mib(used.trim().parse::<u64>().unwrap()).as_u64(), - ) - }) - .collect(); - Some(usage) - } else { - println!("Failed to execute nvidia-smi, maybe it's not installed"); - None + Ok(output) => { + if !output.status.success() { + eprintln!("nvidia-smi command failed: {}", String::from_utf8_lossy(&output.stderr)); + None + } else { + let output = String::from_utf8_lossy(&output.stdout); + let usage = output + .lines() + .map(\|l\| l.split_once(",").unwrap()) + .map(\|(total, used)\| { + ( + ByteSize::mib(total.trim().parse::<u64>().unwrap()).as_u64(), + ByteSize::mib(used.trim().parse::<u64>().unwrap()).as_u64(), + ) + }) + .collect(); + Some(usage) + } + } + Err(e) => { + eprintln!("Failed to execute nvidia-smi: {e}"); + None + } }; ResourceUsage { cpu: process.cpu_usage() / sys.cpus().len() as f32, rss: process.memory(), vsz: process.virtual_memory(), gpu: gpu_usage, } } Suggestion importance[1-10]: 7 __ Why: The suggestion addresses a potential issue where `nvidia-smi` command failures could go unnoticed, leading to missed GPU data. The improved code adds proper error logging and handling for command execution failures and invalid outputs, enhancing robustness.	Medium
General	Add validation for resource usage data updates The component does not handle potential errors during data updates. If the resource usage data is malformed or unavailable, the UI may become inconsistent. Add validation and error handling to ensure robustness. coman/src/components/resource_usage.rs [57-88] impl Component<Msg, UserEvent> for ResourceUsage { fn on(&mut self, ev: Event<UserEvent>) -> Option<Msg> { let _ = match ev { Event::Keyboard(KeyEvent { code: Key::Left, .. }) => self.perform(Cmd::Move(Direction::Left)), Event::Keyboard(KeyEvent { code: Key::Right, .. }) => self.perform(Cmd::Move(Direction::Right)), Event::Keyboard(KeyEvent { code: Key::Home, .. }) => self.perform(Cmd::GoTo(Position::Begin)), Event::Keyboard(KeyEvent { code: Key::End, .. }) => self.perform(Cmd::GoTo(Position::End)), Event::Keyboard(KeyEvent { code: Key::Esc, .. }) => { return Some(Msg::Job(JobMsg::Close)); } Event::User(UserEvent::Cscs(CscsEvent::GotJobResourceUsage(ru))) => { + // Validate resource usage data before updating UI + if ru.cpu.is_nan() \|\| ru.cpu < 0.0 { + eprintln!("Invalid CPU usage data received: {:?}", ru.cpu); + return Some(Msg::None); + } self.attr( Attribute::Custom(UPDATE_CPU_DATA), AttrValue::Payload(PropPayload::One(PropValue::F64(ru.cpu as f64))), ); + if ru.rss == 0 && ru.vsz == 0 { + eprintln!("Zero memory usage data received"); + return Some(Msg::None); + } self.attr( Attribute::Custom(UPDATE_MEMORY_DATA), AttrValue::Payload(PropPayload::Tup2((PropValue::U64(ru.rss), PropValue::U64(ru.vsz)))), ); if let Some(gpu) = ru.gpu { + if gpu.is_empty() { + eprintln!("Empty GPU usage data received"); + return Some(Msg::None); + } self.attr( Attribute::Custom(UPDATE_GPU_DATA), AttrValue::Payload(PropPayload::Vec(gpu.iter().map(\|g\| PropValue::U64(g.1)).collect())), ); } CmdResult::None } _ => CmdResult::None, }; Some(Msg::None) } } Suggestion importance[1-10]: 6 __ Why: The suggestion identifies a potential issue with unvalidated data updates that could lead to UI inconsistencies. The improved code adds checks for invalid CPU usage, zero memory usage, and empty GPU data, improving the robustness of the component.	Low
General	Optimize endpoint creation for RPC calls The function creates a new endpoint for each RPC call which is inefficient. Reuse existing endpoints or implement connection pooling to reduce overhead and improve performance. coman/src/cscs/handlers.rs [174-197] pub async fn cscs_resource_usage(job_id: i64, system: Option<String>) -> Result<ResourceUsage> { let endpoint_id = get_endpoint_id(job_id, system).await?; let alpn: Vec<u8> = b"/coman/rpc".to_vec(); + // Consider reusing a single endpoint instance instead of creating new ones let secret_key = SecretKey::generate(&mut rand::rng()); let endpoint = Endpoint::builder().secret_key(secret_key).bind().await?; match endpoint.connect(endpoint_id, &alpn).await { Ok(connection) => { let (iroh_send, iroh_recv) = connection.open_bi().await?; let combined = Duplex::new(iroh_recv, iroh_send); let codec_builder = LengthDelimitedCodec::builder(); let framed = codec_builder.new_framed(combined); let transport = serde_transport::new(framed, Bincode::default()); let client = ComanRPCClient::new(client::Config::default(), transport); client .spawn() .resource_usage(context::current()) .await .wrap_err("couldn't get resource usage from remote") } Err(e) => Err(e).wrap_err("couldn't establish tunnel to remote"), } } Suggestion importance[1-10]: 3 __ Why: The suggestion points out a potential performance concern with creating new endpoints for each RPC call. However, the implementation is actually correct as-is since the endpoint creation is necessary for establishing secure connections. The suggestion doesn't propose a concrete optimization and is more of a general observation rather than a critical improvement.	Low

Previous suggestions

Suggestions up to commit 3369f5a

Category	Suggestion	Impact
Possible issue	Handle potential NVML errors gracefully The code assumes there is always at least one GPU and that the first GPU index is valid. This can cause panics if no GPUs are present or if the device index is invalid. Add error handling for NVML operations. coman/src/cli/rpc.rs [42-49] -let nvml = Nvml::init().unwrap(); -let device = nvml.device_by_index(0).unwrap(); -let memory_info = device.memory_info().unwrap(); +let mut gpu_memory_used = 0u64; +if let Ok(nvml) = Nvml::init() { + if let Ok(device) = nvml.device_by_index(0) { + if let Ok(memory_info) = device.memory_info() { + gpu_memory_used = memory_info.used; + } + } +} ResourceUsage { cpu: cpu_usage, rss: sys.used_memory(), - gpu: memory_info.used, + gpu: gpu_memory_used, } Suggestion importance[1-10]: 8 __ Why: The original code assumes NVML operations will always succeed, which can lead to panics. The improved code adds proper error handling using `if let Ok()` patterns to safely handle potential failures in NVML initialization and device access.	Medium
Possible issue	Add error handling for spawned RPC tasks The spawn function does not handle errors from spawned tasks, which could lead to silent failures. Add error logging to ensure any task failures are reported. coman/src/cli/rpc.rs [79-81] async fn spawn(fut: impl Future<Output = ()> + Send + 'static) { - tokio::spawn(fut); + tokio::spawn(async move { + if let Err(e) = fut.await { + eprintln!("RPC server task failed: {e}"); + } + }); } Suggestion importance[1-10]: 6 __ Why: The suggestion correctly identifies that unhandled errors in spawned tasks can lead to silent failures. The improved code adds error logging to report task failures, which enhances the robustness of the RPC server implementation.	Low
General	Improve secret key reuse for RPC connections The function creates a new secret key for each RPC call which is inefficient and potentially insecure. Reuse a single secret key or generate it once per session instead. coman/src/cscs/handlers.rs [174-197] +pub async fn cscs_resource_usage(job_id: i64, system: Option<String>) -> Result<ResourceUsage> { + let endpoint_id = get_endpoint_id(job_id, system).await?; + let alpn: Vec<u8> = b"/coman/rpc".to_vec(); + let secret_key = SecretKey::generate(&mut rand::rng()); + let endpoint = Endpoint::builder().secret_key(secret_key).bind().await?; + match endpoint.connect(endpoint_id, &alpn).await { + Ok(connection) => { + let (iroh_send, iroh_recv) = connection.open_bi().await?; + let combined = Duplex::new(iroh_recv, iroh_send); + let codec_builder = LengthDelimitedCodec::builder(); + let framed = codec_builder.new_framed(combined); + let transport = serde_transport::new(framed, Bincode::default()); + let client = ComanRPCClient::new(client::Config::default(), transport); + client + .spawn() + .resource_usage(context::current()) + .await + .wrap_err("couldn't get resource usage from remote") + } + Err(e) => Err(e).wrap_err("couldn't establish tunnel to remote"), + } +} + Suggestion importance[1-10]: 3 __ Why: While the suggestion identifies a potential inefficiency in generating a new secret key for each RPC call, the improved code provided is identical to the existing code. Therefore, the suggestion doesn't actually propose a meaningful change and only points out a general concern without providing a concrete solution.	Low

Suggestions up to commit 3369f5a

Category	Suggestion	Impact
Possible issue	Handle potential GPU access errors gracefully The code assumes NVML initialization and GPU device access will always succeed. Add proper error handling for these operations to prevent panics if GPU is unavailable or inaccessible. coman/src/cli/rpc.rs [34-49] -let nvml = Nvml::init().unwrap(); -let device = nvml.device_by_index(0).unwrap(); -let memory_info = device.memory_info().unwrap(); +let mut gpu_memory_used = 0u64; +if let Ok(nvml) = Nvml::init() { + if let Ok(device) = nvml.device_by_index(0) { + if let Ok(memory_info) = device.memory_info() { + gpu_memory_used = memory_info.used; + } + } +} ResourceUsage { cpu: cpu_usage, rss: sys.used_memory(), - gpu: memory_info.used, + gpu: gpu_memory_used, } Suggestion importance[1-10]: 8 __ Why: The original code uses `unwrap()` on potentially failing GPU operations, which can cause panics. The improved code adds proper error handling using `if let Ok()` patterns to safely handle cases where GPU access might fail, improving robustness.	Medium
Possible issue	Improve error handling in RPC communication The RPC client creation and call may fail silently due to missing error propagation. Ensure that all potential failure points in the RPC communication chain are properly handled and reported. coman/src/cscs/handlers.rs [174-197] pub async fn cscs_resource_usage(job_id: i64, system: Option<String>) -> Result<ResourceUsage> { let endpoint_id = get_endpoint_id(job_id, system).await?; let alpn: Vec<u8> = b"/coman/rpc".to_vec(); let secret_key = SecretKey::generate(&mut rand::rng()); let endpoint = Endpoint::builder().secret_key(secret_key).bind().await?; match endpoint.connect(endpoint_id, &alpn).await { Ok(connection) => { let (iroh_send, iroh_recv) = connection.open_bi().await?; let combined = Duplex::new(iroh_recv, iroh_send); let codec_builder = LengthDelimitedCodec::builder(); let framed = codec_builder.new_framed(combined); let transport = serde_transport::new(framed, Bincode::default()); let client = ComanRPCClient::new(client::Config::default(), transport); - client + let result = client .spawn() .resource_usage(context::current()) .await - .wrap_err("couldn't get resource usage from remote") + .map_err(\|e\| eyre::eyre!("RPC call failed: {}", e)) + .wrap_err("couldn't get resource usage from remote")?; + Ok(result) } - Err(e) => Err(e).wrap_err("couldn't establish tunnel to remote"), + Err(e) => Err(eyre::eyre!("Connection failed: {}", e)).wrap_err("couldn't establish tunnel to remote"), } } Suggestion importance[1-10]: 7 __ Why: The suggestion addresses potential silent failures in RPC communication by adding explicit error mapping and wrapping. While this improves error reporting, it doesn't address fundamental architectural concerns about error propagation in the broader system.	Medium
General	Add panic recovery to RPC server tasks The spawn function doesn't handle potential panics from the spawned future. This could lead to unhandled exceptions in the RPC server which would crash the application. coman/src/cli/rpc.rs [79-81] async fn spawn(fut: impl Future<Output = ()> + Send + 'static) { - tokio::spawn(fut); + tokio::spawn(async move { + if let Err(e) = fut.await { + eprintln!("RPC server task panicked: {:?}", e); + } + }); } Suggestion importance[1-10]: 6 __ Why: The suggestion enhances fault tolerance by adding panic recovery to spawned tasks. However, this is a defensive measure rather than a critical fix, and the improvement is moderate in terms of overall system stability.	Low

coman/src/cli/mod.rs

coman/Cargo.toml

coman/src/cli/app.rs

coman/src/cscs/cli.rs

coman/src/cli/exec.rs

coman/src/main.rs

coman/src/cli/exec.rs

coman/src/cli/rpc.rs

coman/src/cscs/handlers.rs

Cargo.lock

github-actions · 2026-01-30T09:11:43Z

Persistent review updated to latest commit 3577357

add rpc implementation

5d5822a

Panaetius force-pushed the rpc-support branch from b9c0669 to 3369f5a Compare January 26, 2026 15:39

Panaetius marked this pull request as ready for review January 26, 2026 15:45

Panaetius requested a review from a team as a code owner January 26, 2026 15:45

Panaetius marked this pull request as draft January 26, 2026 15:45

github-actions bot added the Review effort 4/5 label Jan 26, 2026