
add rpc support for getting system metrics #59

Merged
Panaetius merged 5 commits into main from rpc-support on Jan 30, 2026

Conversation


@Panaetius Panaetius commented Jan 26, 2026

PR Type

Enhancement


Description

  • Add RPC support for system metrics

  • Introduce resource usage viewing in TUI

  • Enable CLI command for resource usage

  • Extend job status handling


Diagram Walkthrough

flowchart LR
  A["CLI Command: Resource Usage"] -- "triggers" --> B["RPC Handler"]
  B -- "fetches" --> C["System Metrics"]
  C -- "sends" --> D["TUI Component"]
  D -- "updates" --> E["Resource Usage View"]
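
Not shown in the walkthrough, but useful for orientation: a minimal sketch of the tarpc service this PR appears to define, inferred from the generated ComanRPCClient and the resource_usage call quoted in the review below. The field types are assumptions based on how ru.cpu, ru.rss, ru.vsz, and ru.gpu are used in the excerpts.

use serde::{Deserialize, Serialize};

// Hypothetical reconstruction of the shared RPC types; names are inferred
// from the ComanRPCClient used in the excerpts below.
#[derive(Clone, Debug, Default, Serialize, Deserialize)]
pub struct ResourceUsage {
    pub cpu: f32,                     // process CPU usage averaged over cores, in percent
    pub rss: u64,                     // resident set size, in bytes
    pub vsz: u64,                     // virtual memory size, in bytes
    pub gpu: Option<Vec<(u64, u64)>>, // per-GPU (total, used) memory, in bytes
}

#[tarpc::service]
pub trait ComanRPC {
    /// Returns a point-in-time snapshot of the job's resource usage.
    async fn resource_usage() -> ResourceUsage;
}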

File Walkthrough

Relevant files

Enhancement (18 files)

ids.rs             Add new ID for resource usage view                          +1/-0
messages.rs        Add resource usage message                                  +1/-0
model.rs           Integrate resource usage component and actions              +59/-3
user_events.rs     Add resource usage event                                    +2/-0
app.rs             Add CLI resource usage command                              +9/-1
exec.rs            Register RPC handler for port forwarding                    +6/-0
rpc.rs             Implement RPC service for resource usage                    +123/-0
mod.rs             Add resource usage component                                +1/-0
resource_usage.rs  Create resource usage visualization component               +542/-0
workload_list.rs   Add keyboard shortcut for resource usage                    +18/-1
client.rs          Update path listing with hidden files option                +2/-2
types.rs           Add requeued job status                                     +2/-0
cli.rs             Add CLI resource usage command implementation               +36/-3
handlers.rs        Implement resource usage fetching and connection handling   +94/-38
ports.rs           Add resource usage polling port                             +52/-3
main.rs            Integrate resource usage components into main application   +14/-4
types.rs           Improve Docker image parsing and error handling             +13/-7
filesystem_api.rs  Add show hidden flag to filesystem listing                  +5/-1

Configuration changes (2 files)

mod.rs             Create RPC module                                           +1/-0
config.toml        Enable DCGM hook                                            +1/-0

Additional files (2 files)

pr-review.yaml                                                                 +1/-0
Cargo.toml                                                                     +9/-1

@Panaetius Panaetius marked this pull request as ready for review January 26, 2026 15:45
@Panaetius Panaetius requested a review from a team as a code owner January 26, 2026 15:45
@Panaetius Panaetius marked this pull request as draft January 26, 2026 15:45

github-actions bot commented Jan 26, 2026

PR Reviewer Guide 🔍

(Review updated until commit 3577357)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Potential Security Issue

The RPC implementation shells out to the nvidia-smi command and parses its output with unwrap(), so any unexpected output (a missing comma, a non-numeric value) panics the RPC handler. Because the binary and its arguments are fixed, the realistic risk is a crash rather than command injection, but the parsing logic still needs hardening.

let gpu_usage = if let Ok(output) = std::process::Command::new("nvidia-smi")
    .args(vec![
        "--query-gpu=memory.total,memory.used",
        "--format=csv,noheader,nounits",
    ])
    .output()
{
    let output = String::from_utf8_lossy(&output.stdout);
    let usage = output
        .lines()
        .map(|l| l.split_once(",").unwrap())
        .map(|(total, used)| {
            (
                ByteSize::mib(total.trim().parse::<u64>().unwrap()).as_u64(),
                ByteSize::mib(used.trim().parse::<u64>().unwrap()).as_u64(),
            )
        })
        .collect();
    Some(usage)
} else {
    println!("Failed to execute nvidia-smi, maybe it's not installed");
    None
};
Data Handling Concerns

The resource usage component relies on external commands like nvidia-smi and system calls that may not be available or behave consistently across different environments, potentially causing panics or incorrect data display.

            Event::Keyboard(KeyEvent { code: Key::Home, .. }) => self.perform(Cmd::GoTo(Position::Begin)),
            Event::Keyboard(KeyEvent { code: Key::End, .. }) => self.perform(Cmd::GoTo(Position::End)),
            Event::Keyboard(KeyEvent { code: Key::Esc, .. }) => {
                return Some(Msg::Job(JobMsg::Close));
            }
            Event::User(UserEvent::Cscs(CscsEvent::GotJobResourceUsage(ru))) => {
                self.attr(
                    Attribute::Custom(UPDATE_CPU_DATA),
                    AttrValue::Payload(PropPayload::One(PropValue::F64(ru.cpu as f64))),
                );
                self.attr(
                    Attribute::Custom(UPDATE_MEMORY_DATA),
                    AttrValue::Payload(PropPayload::Tup2((PropValue::U64(ru.rss), PropValue::U64(ru.vsz)))),
                );
                if let Some(gpu) = ru.gpu {
                    self.attr(
                        Attribute::Custom(UPDATE_GPU_DATA),
                        AttrValue::Payload(PropPayload::Vec(gpu.iter().map(|g| PropValue::U64(g.1)).collect())),
                    );
                }
                CmdResult::None
            }
            _ => CmdResult::None,
        };
        Some(Msg::None)
    }
}

// START CPU
struct CpuUsage {
    component: Chart,
Error Handling

The error handling in cscs_resource_usage function could be improved to better propagate and handle network or communication errors when connecting to the RPC endpoint.

pub async fn cscs_resource_usage(job_id: i64, system: Option<String>) -> Result<ResourceUsage> {
    let endpoint_id = get_endpoint_id(job_id, system).await?;

    let alpn: Vec<u8> = b"/coman/rpc".to_vec();
    let secret_key = SecretKey::generate(&mut rand::rng());
    let endpoint = Endpoint::builder().secret_key(secret_key).bind().await?;

    match endpoint.connect(endpoint_id, &alpn).await {
        Ok(connection) => {
            let (iroh_send, iroh_recv) = connection.open_bi().await?;
            let combined = Duplex::new(iroh_recv, iroh_send);
            let codec_builder = LengthDelimitedCodec::builder();
            let framed = codec_builder.new_framed(combined);
            let transport = serde_transport::new(framed, Bincode::default());
            let client = ComanRPCClient::new(client::Config::default(), transport);
            client
                .spawn()
                .resource_usage(context::current())
                .await
                .wrap_err("couldn't get resource usage from remote")
        }
        Err(e) => Err(e).wrap_err("couldn't establish tunnel to remote"),
    }
}
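
For completeness, a hedged sketch of how the CLI side might consume this function; the helper name and output format are assumptions, and only cscs_resource_usage and the ResourceUsage fields come from the excerpts above:

use bytesize::ByteSize;

// Hypothetical CLI glue; job_id and system would come from clap arguments
// in the real command.
async fn print_resource_usage(job_id: i64, system: Option<String>) -> Result<()> {
    let ru = cscs_resource_usage(job_id, system).await?;
    println!("cpu: {:.1}%", ru.cpu);
    println!("mem: {} rss / {} vsz", ByteSize::b(ru.rss), ByteSize::b(ru.vsz));
    if let Some(gpus) = ru.gpu {
        for (i, (total, used)) in gpus.iter().enumerate() {
            println!("gpu{i}: {} used of {}", ByteSize::b(*used), ByteSize::b(*total));
        }
    }
    Ok(())
}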

@github-actions

Persistent review updated to latest commit 3369f5a


github-actions bot commented Jan 26, 2026

PR Code Suggestions ✨

Latest suggestions up to 3577357

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Improve error handling for GPU usage collection

The nvidia-smi command execution could fail silently, leading to missed GPU usage
data. Add error handling to log failures and return default values when nvidia-smi
is not available or returns invalid data.

coman/src/cli/rpc.rs [51-93]

 async fn resource_usage(self, _context: ::tarpc::context::Context) -> ResourceUsage {
     let mut sys = System::new_all();
     sys.refresh_all();
     tokio::time::sleep(sysinfo::MINIMUM_CPU_UPDATE_INTERVAL).await;
     sys.refresh_processes_specifics(ProcessesToUpdate::All, true, ProcessRefreshKind::nothing().with_cpu());
     let Ok(pid) = get_current_pid() else {
         return ResourceUsage::default();
     };
     let Some(process) = sys.process(pid) else {
         return ResourceUsage::default();
     };
-    let gpu_usage = if let Ok(output) = std::process::Command::new("nvidia-smi")
+    let gpu_usage = match std::process::Command::new("nvidia-smi")
         .args(vec![
             "--query-gpu=memory.total,memory.used",
             "--format=csv,noheader,nounits",
         ])
         .output()
     {
-        let output = String::from_utf8_lossy(&output.stdout);
-        let usage = output
-            .lines()
-            .map(|l| l.split_once(",").unwrap())
-            .map(|(total, used)| {
-                (
-                    ByteSize::mib(total.trim().parse::<u64>().unwrap()).as_u64(),
-                    ByteSize::mib(used.trim().parse::<u64>().unwrap()).as_u64(),
-                )
-            })
-            .collect();
-        Some(usage)
-    } else {
-        println!("Failed to execute nvidia-smi, maybe it's not installed");
-        None
+        Ok(output) => {
+            if !output.status.success() {
+                eprintln!("nvidia-smi command failed: {}", String::from_utf8_lossy(&output.stderr));
+                None
+            } else {
+                let output = String::from_utf8_lossy(&output.stdout);
+                let usage: Vec<(u64, u64)> = output
+                    .lines()
+                    // Skip malformed lines instead of panicking on them.
+                    .filter_map(|l| l.split_once(','))
+                    .filter_map(|(total, used)| {
+                        Some((
+                            ByteSize::mib(total.trim().parse::<u64>().ok()?).as_u64(),
+                            ByteSize::mib(used.trim().parse::<u64>().ok()?).as_u64(),
+                        ))
+                    })
+                    .collect();
+                Some(usage)
+            }
+        }
+        Err(e) => {
+            eprintln!("Failed to execute nvidia-smi: {e}");
+            None
+        }
     };
 
     ResourceUsage {
         cpu: process.cpu_usage() / sys.cpus().len() as f32,
         rss: process.memory(),
         vsz: process.virtual_memory(),
         gpu: gpu_usage,
     }
 }
Suggestion importance[1-10]: 7


Why: The suggestion addresses a potential issue where nvidia-smi command failures could go unnoticed, leading to missed GPU data. The improved code adds proper error logging and handling for command execution failures and invalid outputs, enhancing robustness.

Medium
General
Add validation for resource usage data updates

The component does not handle potential errors during data updates. If the resource
usage data is malformed or unavailable, the UI may become inconsistent. Add
validation and error handling to ensure robustness.

coman/src/components/resource_usage.rs [57-88]

 impl Component<Msg, UserEvent> for ResourceUsage {
     fn on(&mut self, ev: Event<UserEvent>) -> Option<Msg> {
         let _ = match ev {
             Event::Keyboard(KeyEvent { code: Key::Left, .. }) => self.perform(Cmd::Move(Direction::Left)),
             Event::Keyboard(KeyEvent { code: Key::Right, .. }) => self.perform(Cmd::Move(Direction::Right)),
             Event::Keyboard(KeyEvent { code: Key::Home, .. }) => self.perform(Cmd::GoTo(Position::Begin)),
             Event::Keyboard(KeyEvent { code: Key::End, .. }) => self.perform(Cmd::GoTo(Position::End)),
             Event::Keyboard(KeyEvent { code: Key::Esc, .. }) => {
                 return Some(Msg::Job(JobMsg::Close));
             }
             Event::User(UserEvent::Cscs(CscsEvent::GotJobResourceUsage(ru))) => {
+                // Validate resource usage data before updating UI
+                if ru.cpu.is_nan() || ru.cpu < 0.0 {
+                    eprintln!("Invalid CPU usage data received: {:?}", ru.cpu);
+                    return Some(Msg::None);
+                }
                 self.attr(
                     Attribute::Custom(UPDATE_CPU_DATA),
                     AttrValue::Payload(PropPayload::One(PropValue::F64(ru.cpu as f64))),
                 );
+                if ru.rss == 0 && ru.vsz == 0 {
+                    eprintln!("Zero memory usage data received");
+                    return Some(Msg::None);
+                }
                 self.attr(
                     Attribute::Custom(UPDATE_MEMORY_DATA),
                     AttrValue::Payload(PropPayload::Tup2((PropValue::U64(ru.rss), PropValue::U64(ru.vsz)))),
                 );
                 if let Some(gpu) = ru.gpu {
+                    if gpu.is_empty() {
+                        eprintln!("Empty GPU usage data received");
+                        return Some(Msg::None);
+                    }
                     self.attr(
                         Attribute::Custom(UPDATE_GPU_DATA),
                         AttrValue::Payload(PropPayload::Vec(gpu.iter().map(|g| PropValue::U64(g.1)).collect())),
                     );
                 }
                 CmdResult::None
             }
             _ => CmdResult::None,
         };
         Some(Msg::None)
     }
 }
Suggestion importance[1-10]: 6


Why: The suggestion identifies a potential issue with unvalidated data updates that could lead to UI inconsistencies. The improved code adds checks for invalid CPU usage, zero memory usage, and empty GPU data, improving the robustness of the component.

Low
Optimize endpoint creation for RPC calls

The function creates a new endpoint for each RPC call which is inefficient. Reuse
existing endpoints or implement connection pooling to reduce overhead and improve
performance.

coman/src/cscs/handlers.rs [174-197]

 pub async fn cscs_resource_usage(job_id: i64, system: Option<String>) -> Result<ResourceUsage> {
     let endpoint_id = get_endpoint_id(job_id, system).await?;
 
     let alpn: Vec<u8> = b"/coman/rpc".to_vec();
+    // Consider reusing a single endpoint instance instead of creating new ones
     let secret_key = SecretKey::generate(&mut rand::rng());
     let endpoint = Endpoint::builder().secret_key(secret_key).bind().await?;
 
     match endpoint.connect(endpoint_id, &alpn).await {
         Ok(connection) => {
             let (iroh_send, iroh_recv) = connection.open_bi().await?;
             let combined = Duplex::new(iroh_recv, iroh_send);
             let codec_builder = LengthDelimitedCodec::builder();
             let framed = codec_builder.new_framed(combined);
             let transport = serde_transport::new(framed, Bincode::default());
             let client = ComanRPCClient::new(client::Config::default(), transport);
             client
                 .spawn()
                 .resource_usage(context::current())
                 .await
                 .wrap_err("couldn't get resource usage from remote")
         }
         Err(e) => Err(e).wrap_err("couldn't establish tunnel to remote"),
     }
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion points out a potential performance concern with creating new endpoints for each RPC call. However, the implementation is actually correct as-is since the endpoint creation is necessary for establishing secure connections. The suggestion doesn't propose a concrete optimization and is more of a general observation rather than a critical improvement.

Low
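
For reference, since the Why above notes that no concrete optimization was proposed: one shape endpoint reuse could take is a lazily initialized, process-wide endpoint. This is a sketch under the assumption that sharing a single iroh endpoint (and secret key) across RPC calls is acceptable here:

use iroh::{Endpoint, SecretKey};
use tokio::sync::OnceCell;

// Hypothetical process-wide endpoint, bound once on first use and then
// reused by every subsequent RPC call.
static ENDPOINT: OnceCell<Endpoint> = OnceCell::const_new();

async fn shared_endpoint() -> Result<&'static Endpoint> {
    ENDPOINT
        .get_or_try_init(|| async {
            let secret_key = SecretKey::generate(&mut rand::rng());
            Ok(Endpoint::builder().secret_key(secret_key).bind().await?)
        })
        .await
}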

Previous suggestions

Suggestions up to commit 3369f5a
Category | Suggestion | Impact
Possible issue
Handle potential NVML errors gracefully

The code assumes there is always at least one GPU and that the first GPU index is
valid. This can cause panics if no GPUs are present or if the device index is
invalid. Add error handling for NVML operations.

coman/src/cli/rpc.rs [42-49]

-let nvml = Nvml::init().unwrap();
-let device = nvml.device_by_index(0).unwrap();
-let memory_info = device.memory_info().unwrap();
+let mut gpu_memory_used = 0u64;
+if let Ok(nvml) = Nvml::init() {
+    if let Ok(device) = nvml.device_by_index(0) {
+        if let Ok(memory_info) = device.memory_info() {
+            gpu_memory_used = memory_info.used;
+        }
+    }
+}
 ResourceUsage {
     cpu: cpu_usage,
     rss: sys.used_memory(),
-    gpu: memory_info.used,
+    gpu: gpu_memory_used,
 }
Suggestion importance[1-10]: 8


Why: The original code assumes NVML operations will always succeed, which can lead to panics. The improved code adds proper error handling using if let Ok() patterns to safely handle potential failures in NVML initialization and device access.

Medium
Add error handling for spawned RPC tasks

The spawn function does not handle errors from spawned tasks, which could lead to
silent failures. Add error logging to ensure any task failures are reported.

coman/src/cli/rpc.rs [79-81]

 async fn spawn(fut: impl Future<Output = ()> + Send + 'static) {
-    tokio::spawn(fut);
+    // fut yields (), so failures (including panics) must be observed via
+    // the JoinHandle; a detached watcher task logs them without blocking.
+    let handle = tokio::spawn(fut);
+    tokio::spawn(async move {
+        if let Err(e) = handle.await {
+            eprintln!("RPC server task failed: {e}");
+        }
+    });
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that unhandled errors in spawned tasks can lead to silent failures. The improved code adds error logging to report task failures, which enhances the robustness of the RPC server implementation.

Low
General
Improve secret key reuse for RPC connections

The function creates a new secret key for each RPC call which is inefficient and
potentially insecure. Reuse a single secret key or generate it once per session
instead.

coman/src/cscs/handlers.rs [174-197]

+pub async fn cscs_resource_usage(job_id: i64, system: Option<String>) -> Result<ResourceUsage> {
+    let endpoint_id = get_endpoint_id(job_id, system).await?;
 
+    let alpn: Vec<u8> = b"/coman/rpc".to_vec();
+    let secret_key = SecretKey::generate(&mut rand::rng());
+    let endpoint = Endpoint::builder().secret_key(secret_key).bind().await?;
 
+    match endpoint.connect(endpoint_id, &alpn).await {
+        Ok(connection) => {
+            let (iroh_send, iroh_recv) = connection.open_bi().await?;
+            let combined = Duplex::new(iroh_recv, iroh_send);
+            let codec_builder = LengthDelimitedCodec::builder();
+            let framed = codec_builder.new_framed(combined);
+            let transport = serde_transport::new(framed, Bincode::default());
+            let client = ComanRPCClient::new(client::Config::default(), transport);
+            client
+                .spawn()
+                .resource_usage(context::current())
+                .await
+                .wrap_err("couldn't get resource usage from remote")
+        }
+        Err(e) => Err(e).wrap_err("couldn't establish tunnel to remote"),
+    }
+}
+
Suggestion importance[1-10]: 3


Why: While the suggestion identifies a potential inefficiency in generating a new secret key for each RPC call, the improved code provided is identical to the existing code. Therefore, the suggestion doesn't actually propose a meaningful change and only points out a general concern without providing a concrete solution.

Low
Suggestions up to commit 3369f5a
Category | Suggestion | Impact
Possible issue
Handle potential GPU access errors gracefully

The code assumes NVML initialization and GPU device access will always succeed. Add
proper error handling for these operations to prevent panics if GPU is unavailable
or inaccessible.

coman/src/cli/rpc.rs [34-49]

-let nvml = Nvml::init().unwrap();
-let device = nvml.device_by_index(0).unwrap();
-let memory_info = device.memory_info().unwrap();
+let mut gpu_memory_used = 0u64;
+if let Ok(nvml) = Nvml::init() {
+    if let Ok(device) = nvml.device_by_index(0) {
+        if let Ok(memory_info) = device.memory_info() {
+            gpu_memory_used = memory_info.used;
+        }
+    }
+}
 ResourceUsage {
     cpu: cpu_usage,
     rss: sys.used_memory(),
-    gpu: memory_info.used,
+    gpu: gpu_memory_used,
 }
Suggestion importance[1-10]: 8


Why: The original code uses unwrap() on potentially failing GPU operations, which can cause panics. The improved code adds proper error handling using if let Ok() patterns to safely handle cases where GPU access might fail, improving robustness.

Medium
Improve error handling in RPC communication

The RPC client creation and call may fail silently due to missing error propagation.
Ensure that all potential failure points in the RPC communication chain are properly
handled and reported.

coman/src/cscs/handlers.rs [174-197]

 pub async fn cscs_resource_usage(job_id: i64, system: Option<String>) -> Result<ResourceUsage> {
     let endpoint_id = get_endpoint_id(job_id, system).await?;
 
     let alpn: Vec<u8> = b"/coman/rpc".to_vec();
     let secret_key = SecretKey::generate(&mut rand::rng());
     let endpoint = Endpoint::builder().secret_key(secret_key).bind().await?;
 
     match endpoint.connect(endpoint_id, &alpn).await {
         Ok(connection) => {
             let (iroh_send, iroh_recv) = connection.open_bi().await?;
             let combined = Duplex::new(iroh_recv, iroh_send);
             let codec_builder = LengthDelimitedCodec::builder();
             let framed = codec_builder.new_framed(combined);
             let transport = serde_transport::new(framed, Bincode::default());
             let client = ComanRPCClient::new(client::Config::default(), transport);
-            client
+            let result = client
                 .spawn()
                 .resource_usage(context::current())
                 .await
-                .wrap_err("couldn't get resource usage from remote")
+                .map_err(|e| eyre::eyre!("RPC call failed: {}", e))
+                .wrap_err("couldn't get resource usage from remote")?;
+            Ok(result)
         }
-        Err(e) => Err(e).wrap_err("couldn't establish tunnel to remote"),
+        Err(e) => Err(eyre::eyre!("Connection failed: {}", e)).wrap_err("couldn't establish tunnel to remote"),
     }
 }
Suggestion importance[1-10]: 7


Why: The suggestion addresses potential silent failures in RPC communication by adding explicit error mapping and wrapping. While this improves error reporting, it doesn't address fundamental architectural concerns about error propagation in the broader system.

Medium
General
Add panic recovery to RPC server tasks

The spawn function doesn't observe panics from the spawned future. Tokio catches a panicking task rather than crashing the process, so a failed RPC handler would simply disappear without any log output.

coman/src/cli/rpc.rs [79-81]

 async fn spawn(fut: impl Future<Output = ()> + Send + 'static) {
-    tokio::spawn(fut);
+    // A panic inside fut aborts the task and surfaces as a JoinError on
+    // its JoinHandle, so watch the handle from a detached task.
+    let handle = tokio::spawn(fut);
+    tokio::spawn(async move {
+        if let Err(e) = handle.await {
+            eprintln!("RPC server task panicked: {e:?}");
+        }
+    });
 }
Suggestion importance[1-10]: 6


Why: The suggestion enhances fault tolerance by adding panic recovery to spawned tasks. However, this is a defensive measure rather than a critical fix, and the improvement is moderate in terms of overall system stability.

Low

@Panaetius Panaetius force-pushed the rpc-support branch 5 times, most recently from dca7872 to bde1c41 Compare January 27, 2026 09:07
@Panaetius Panaetius force-pushed the rpc-support branch 5 times, most recently from 74c914f to 4e55d2e Compare January 27, 2026 13:08
@Panaetius Panaetius force-pushed the rpc-support branch 3 times, most recently from 9995730 to 33f932c Compare January 28, 2026 20:28
@Panaetius Panaetius force-pushed the rpc-support branch 2 times, most recently from 511071e to 3577357 Compare January 29, 2026 14:53
@Panaetius Panaetius marked this pull request as ready for review January 30, 2026 08:49
@github-actions

Persistent review updated to latest commit 3577357

@Panaetius Panaetius merged commit be25035 into main Jan 30, 2026
2 checks passed
@Panaetius Panaetius deleted the rpc-support branch January 30, 2026 09:41