
Pr2 job gen#32

Merged
rbx merged 2 commits into FairRootGroup:master from enlorenz:pr2-job_gen
Mar 12, 2026

Conversation

@enlorenz
Contributor

Dependent on Pr1

@coderabbitai
Contributor

coderabbitai bot commented Mar 11, 2026

Caution

Review failed

Pull request was closed or merged during review

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 55.56%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check — ❓ Inconclusive: the title 'Pr2 job gen' is vague and non-descriptive; it uses abbreviations and generic terminology that don't convey meaningful information about the specific changes. Resolution: use a more descriptive title that captures the main change, such as 'Add job arrival scaling and replay modes for workload generation' or 'Implement configurable job arrival scaling with exact replay options'.
  • Description check — ❓ Inconclusive: the description 'Dependent on Pr1' provides minimal information and doesn't describe any actual changes in the pull request. Resolution: expand the description to explain what job generation features are being added, such as job arrival scaling, replay modes, and instances tracking.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
train.py (1)

137-151: ⚠️ Potential issue | 🔴 Critical

Remove or reintroduce carry_over_state before this ships.

This call now passes carry_over_state, but ComputeClusterEnv.__init__ in src/environment.py (Line 63-77) does not accept that keyword. The result is an immediate TypeError during startup, before the first reset().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@train.py` around lines 137 - 151, The call to ComputeClusterEnv in train.py
passes carry_over_state but ComputeClusterEnv.__init__ (in src/environment.py)
does not accept that keyword, causing a TypeError; either remove the
carry_over_state argument from the ComputeClusterEnv(...) call in train.py or
add a carry_over_state parameter (with a default, store it on self and use it
where env state persistence is handled) to ComputeClusterEnv.__init__ so the
signature matches callers (update any internal logic that relies on
preserving/restoring state to reference self.carry_over_state).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/reward_calculation.py`:
- Around line 50-53: The NEGATIVE_PRICE_OVERDRIVE_* configuration lets computed
price_reward exceed the normalized upper bound; change the logic so price_reward
is always clamped to 1.0 (or disable the above-one path) by updating the branch
that reads NEGATIVE_PRICE_OVERDRIVE_ALLOW_ABOVE_ONE and uses
NEGATIVE_PRICE_OVERDRIVE_MAX_REWARD: ensure any calculation that yields
price_reward uses min(computed_value, 1.0) (and remove or ignore
NEGATIVE_PRICE_OVERDRIVE_MAX_REWARD when allowing above-one is false), and apply
the same clamp/update for the duplicate logic referenced around the other block
(the second occurrence at lines ~164-168) so price_weight remains within the
configured normalized range.
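A sketch of the clamp this comment asks for; the helper name and parameters are hypothetical stand-ins for the NEGATIVE_PRICE_OVERDRIVE_* configuration values, not the actual reward-calculation code:

```python
def clamp_price_reward(raw_reward: float,
                       allow_above_one: bool = False,
                       overdrive_max_reward: float = 1.0) -> float:
    # Unless the overdrive path explicitly allows values above one,
    # keep price_reward within the normalized upper bound of 1.0.
    upper = overdrive_max_reward if allow_above_one else 1.0
    return min(raw_reward, upper)
```

The same clamp would be applied at both occurrences the comment mentions so price_weight stays within the configured normalized range.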

In `@src/sampler_jobs.py`:
- Around line 356-362: The hourly-job branch currently copies raw
agg_job['nnodes'] and ['cores_per_node'] into hourly_job which can exceed
simulation limits; clamp those values to the configured max_nodes_per_job and
cores_per_node before assigning (e.g., replace agg_job['nnodes'] with
min(agg_job['nnodes'], max_nodes_per_job) and agg_job['cores_per_node'] with
min(agg_job['cores_per_node'], cores_per_node)) while keeping duration_hours,
original_job_count and instances as-is so the replayed requests fit the
simulated cluster.
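The upper-bound clamp described above can be sketched as follows; the function and parameter names are illustrative, not the actual sampler code:

```python
def clamp_replayed_resources(nnodes: int, cores: int,
                             max_nodes_per_job: int,
                             cores_per_node: int) -> tuple:
    # Cap raw trace values so replayed requests fit the simulated cluster.
    return (min(nnodes, max_nodes_per_job), min(cores, cores_per_node))
```

A later pass in this review additionally wraps these in max(1, ...) so malformed traces cannot yield zero-sized jobs.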

In `@src/workload_generator.py`:
- Around line 140-167: The current fallback clones existing sampled jobs (using
new_jobs_durations/new_jobs_nodes/new_jobs_cores) when job_arrival_scale != 1.0,
which artificially reduces variance and corrupts correlations; instead either
apply job_arrival_scale at the arrival-count sampling stage or generate fresh
samples for any additional arrivals from the original source sampler. Remove the
cloning branch that uses whole/frac to extend scaled_durations/nodes/cores and
replace it with logic that (a) computes the desired scaled_count from
new_jobs_count and job_arrival_scale, and then (b) if scaled_count >
new_jobs_count, calls the same sampling routines used originally to produce
single job attributes (i.e., the source's duration/node/core samplers or the
function that produced new_jobs_durations/new_jobs_nodes/new_jobs_cores) to
create the extra jobs so all sampled tuples are independently drawn; keep
arrival_scale_applied_in_source handling and update
new_jobs_count/new_jobs_durations/new_jobs_nodes/new_jobs_cores accordingly.
- Around line 89-98: The code assumes jobs_sampler.sample_one_hourly(wrap=True)
always returns a dict with "hourly_jobs" but it can return {} causing a
KeyError; to fix, capture the return value from sample_one_hourly into a
variable (e.g., sample = jobs_sampler.sample_one_hourly(wrap=True)), check if
sample and "hourly_jobs" in sample (or use sample.get("hourly_jobs", [])) before
iterating, and only extend
new_jobs_count/new_jobs_durations/new_jobs_nodes/new_jobs_cores when there are
actual jobs from sample["hourly_jobs"]; reference sample_one_hourly,
jobs_sampler, and the "hourly_jobs" key and the new_jobs_* variables when
locating the code to change.

In `@train_iter.py`:
- Around line 242-247: The validation in train_iter.py should mirror train.py's
path normalization so whitespace-only --jobs values are rejected; import or copy
the same norm_path() logic used in train.py (see lines referenced around
norm_path() usage) and apply it to args.jobs before the checks for
args.jobs_exact_replay and args.jobs_exact_replay_aggregate, then use the
normalized value in the existing conditional checks (e.g., check normalized_jobs
rather than args.jobs) so blank/whitespace paths are treated as empty and
rejected consistently.
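A minimal sketch of the normalization this comment refers to, assuming norm_path() simply strips whitespace (the real helper in train.py may do more, e.g. path expansion):

```python
def norm_path(value):
    # Treat None or whitespace-only CLI path values as empty strings so
    # downstream "was a path provided?" checks behave consistently.
    return value.strip() if value else ""

# Illustrative use before honoring replay flags:
#   jobs = norm_path(args.jobs)
#   if args.jobs_exact_replay and not jobs:
#       parser.error("--jobs-exact-replay requires a non-empty --jobs")
```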

---

Outside diff comments:
In `@train.py`:
- Around line 137-151: The call to ComputeClusterEnv in train.py passes
carry_over_state but ComputeClusterEnv.__init__ (in src/environment.py) does not
accept that keyword, causing a TypeError; either remove the carry_over_state
argument from the ComputeClusterEnv(...) call in train.py or add a
carry_over_state parameter (with a default, store it on self and use it where
env state persistence is handled) to ComputeClusterEnv.__init__ so the signature
matches callers (update any internal logic that relies on preserving/restoring
state to reference self.carry_over_state).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 051899fe-651e-48ae-ac60-760fa2540328

📥 Commits

Reviewing files that changed from the base of the PR and between 1e983e8 and fcfa434.

📒 Files selected for processing (7)
  • src/environment.py
  • src/reward_calculation.py
  • src/sampler_hourly.py
  • src/sampler_jobs.py
  • src/workload_generator.py
  • train.py
  • train_iter.py

… replay/scaling modes

- add core-hour-aware template semantics in sampler conversion:
  - sub-hour aggregated bins emit one hourly-equivalent template (`instances=1`)
  - hour+ bins preserve multiplicity via `instances=count`
- update jobs replay to honor `instances` when expanding sampled templates
- add global `job_arrival_scale` multiplier in job generation (supports fractional scaling with probabilistic rounding)
- introduce exact timeline replay for `--jobs` via `--jobs-exact-replay`
- add optional per-step aggregation in exact replay via `--jobs-exact-replay-aggregate`
- ensure exact+aggregate path also uses `instances` (same conserving behavior as template replay)
- wire new controls through `ComputeClusterEnv`, `train.py`, and `train_iter.py`
- add CLI validation and startup diagnostics for replay mode and arrival scale
- report `max_new_jobs_per_hour` based on expanded instance counts
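The fractional arrival scaling with probabilistic rounding mentioned above can be sketched as follows (an illustrative stand-alone helper, not the exact code in src/sampler_hourly.py):

```python
import math
import random

def scale_arrivals(count: int, arrival_scale: float,
                   rng: random.Random) -> int:
    # Multiply the sampled per-hour arrival count by the scale factor and
    # round the fractional remainder probabilistically, so the expected
    # scaled count equals count * arrival_scale exactly.
    scaled = count * arrival_scale
    whole = math.floor(scaled)
    frac = scaled - whole
    return whole + (1 if rng.random() < frac else 0)
```

With arrival_scale = 1.3 and count = 4, for example, the helper returns 5 about 80% of the time and 6 about 20% of the time, averaging 5.2 jobs per hour.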
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
src/workload_generator.py (1)

89-98: ⚠️ Potential issue | 🟡 Minor

Guard against empty return from sample_one_hourly().

sample_one_hourly() returns {} when no periods are loaded, causing a KeyError on the direct ["hourly_jobs"] access.

Suggested fix
         else:
             # Use pre-aggregated hourly templates for pattern-based replay.
-            jobs = jobs_sampler.sample_one_hourly(wrap=True)["hourly_jobs"]
-            if len(jobs) > 0:
+            sampled = jobs_sampler.sample_one_hourly(wrap=True)
+            jobs = sampled.get("hourly_jobs", [])
+            if jobs:
                 for job in jobs:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/workload_generator.py` around lines 89 - 98, Guard against
sample_one_hourly() returning an empty dict by avoiding direct indexing into
["hourly_jobs"]; call jobs_sampler.sample_one_hourly(wrap=True) into a variable
(e.g., sample = jobs_sampler.sample_one_hourly(...)) and then read
sample.get("hourly_jobs", []) or check "if sample and 'hourly_jobs' in sample"
before iterating; update the block that currently uses jobs =
jobs_sampler.sample_one_hourly(wrap=True)["hourly_jobs"] so it safely handles {}
and proceeds only when hourly_jobs is a non-empty list (affecting
jobs_sampler.sample_one_hourly, jobs variable, and the subsequent for loop).
🧹 Nitpick comments (3)
src/workload_generator.py (2)

158-158: Consider adding strict=True to zip() for defensive coding.

While the lists are constructed together and should always have the same length, adding strict=True would catch any future bugs where they diverge.

Suggested fix
-        for d, n, c in zip(new_jobs_durations, new_jobs_nodes, new_jobs_cores):
+        for d, n, c in zip(new_jobs_durations, new_jobs_nodes, new_jobs_cores, strict=True):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/workload_generator.py` at line 158, The loop over new_jobs_durations,
new_jobs_nodes, new_jobs_cores uses zip(...) without strict checking; update the
iteration in the function that contains the line with for d, n, c in
zip(new_jobs_durations, new_jobs_nodes, new_jobs_cores): to call zip(...,
strict=True) so Python raises if the three lists ever differ in length
(defensive coding). Ensure the runtime environment is Python 3.10+ or guard
accordingly if compatibility is required.

140-167: Arrival scaling via job cloning alters workload variance.

The global fallback scales by duplicating existing sampled jobs. This preserves the expected count but changes the variance and correlation structure of the workload—identical jobs are cloned rather than independently sampled. For synthetic workloads (workload_gen or legacy randomizer), consider applying scaling at the arrival-count sampling stage or freshly sampling additional jobs.

This may be acceptable for exact-replay paths but is worth noting for statistical sampling sources.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/workload_generator.py` around lines 140 - 167, The current global
fallback (when arrival_scale_applied_in_source is False) clones sampled jobs by
repeating new_jobs_durations/new_jobs_nodes/new_jobs_cores which changes
variance and correlations; instead, for synthetic/legacy sources (e.g.,
workload_gen/randomizer) apply scaling at the arrival-count sampling step or
freshly sample additional independent jobs rather than duplicating entries:
detect when job_arrival_scale > 1.0 and, for the fractional/whole parts, either
increase the sampled arrival count before invoking the sampler or call the
sampler anew to generate independent extra jobs, using the same seeding/prng
(np_random) to preserve reproducibility; update the code paths around
job_arrival_scale, new_jobs_count, and the arrays
new_jobs_durations/new_jobs_nodes/new_jobs_cores so they contain independently
sampled jobs rather than clones.
src/sampler_hourly.py (1)

316-316: Remove redundant int() cast.

math.floor() already returns an int in Python 3, making the outer int() call unnecessary.

Suggested fix
-            whole = int(math.floor(arrival_scale))
+            whole = math.floor(arrival_scale)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/sampler_hourly.py` at line 316, Remove the redundant outer int() cast
when computing the integer part of arrival_scale; replace the expression that
assigns to the variable whole (currently using int(math.floor(arrival_scale)))
with a single call to math.floor(arrival_scale) so you rely on Python 3's floor
returning an int (look for the assignment to whole in src/sampler_hourly.py
where arrival_scale is used).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@train.py`:
- Line 148: The call to ComputeClusterEnv (or where carry_over_state is passed)
references args.carry_over_state which doesn't exist and ComputeClusterEnv has
no such parameter; remove the carry_over_state=args.carry_over_state argument
from the call site (the line containing carry_over_state=args.carry_over_state)
and ensure no other references to args.carry_over_state remain (or if
persistence behavior is required, add a properly defined CLI flag and a matching
parameter to ComputeClusterEnv such as carry_over_state and wire them up
consistently).

---

Duplicate comments:
In `@src/workload_generator.py`:
- Around line 89-98: Guard against sample_one_hourly() returning an empty dict
by avoiding direct indexing into ["hourly_jobs"]; call
jobs_sampler.sample_one_hourly(wrap=True) into a variable (e.g., sample =
jobs_sampler.sample_one_hourly(...)) and then read sample.get("hourly_jobs", [])
or check "if sample and 'hourly_jobs' in sample" before iterating; update the
block that currently uses jobs =
jobs_sampler.sample_one_hourly(wrap=True)["hourly_jobs"] so it safely handles {}
and proceeds only when hourly_jobs is a non-empty list (affecting
jobs_sampler.sample_one_hourly, jobs variable, and the subsequent for loop).

---

Nitpick comments:
In `@src/sampler_hourly.py`:
- Line 316: Remove the redundant outer int() cast when computing the integer
part of arrival_scale; replace the expression that assigns to the variable whole
(currently using int(math.floor(arrival_scale))) with a single call to
math.floor(arrival_scale) so you rely on Python 3's floor returning an int (look
for the assignment to whole in src/sampler_hourly.py where arrival_scale is
used).

In `@src/workload_generator.py`:
- Line 158: The loop over new_jobs_durations, new_jobs_nodes, new_jobs_cores
uses zip(...) without strict checking; update the iteration in the function that
contains the line with for d, n, c in zip(new_jobs_durations, new_jobs_nodes,
new_jobs_cores): to call zip(..., strict=True) so Python raises if the three
lists ever differ in length (defensive coding). Ensure the runtime environment
is Python 3.10+ or guard accordingly if compatibility is required.
- Around line 140-167: The current global fallback (when
arrival_scale_applied_in_source is False) clones sampled jobs by repeating
new_jobs_durations/new_jobs_nodes/new_jobs_cores which changes variance and
correlations; instead, for synthetic/legacy sources (e.g.,
workload_gen/randomizer) apply scaling at the arrival-count sampling step or
freshly sample additional independent jobs rather than duplicating entries:
detect when job_arrival_scale > 1.0 and, for the fractional/whole parts, either
increase the sampled arrival count before invoking the sampler or call the
sampler anew to generate independent extra jobs, using the same seeding/prng
(np_random) to preserve reproducibility; update the code paths around
job_arrival_scale, new_jobs_count, and the arrays
new_jobs_durations/new_jobs_nodes/new_jobs_cores so they contain independently
sampled jobs rather than clones.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0d032729-9180-40ae-ad55-c1117a9261ba

📥 Commits

Reviewing files that changed from the base of the PR and between fcfa434 and d3b1b59.

📒 Files selected for processing (6)
  • src/environment.py
  • src/sampler_hourly.py
  • src/sampler_jobs.py
  • src/workload_generator.py
  • train.py
  • train_iter.py

Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (4)
src/workload_generator.py (2)

90-91: ⚠️ Potential issue | 🟡 Minor

Guard empty hourly samples before subscripting.

sample_one_hourly() can return {} when no periods are loaded, so Line 91 raises KeyError instead of cleanly producing no arrivals.

Suggested fix
-            jobs = jobs_sampler.sample_one_hourly(wrap=True)["hourly_jobs"]
+            sampled = jobs_sampler.sample_one_hourly(wrap=True)
+            jobs = sampled.get("hourly_jobs", [])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/workload_generator.py` around lines 90 - 91, sample_one_hourly() can
return an empty dict so subscripting ["hourly_jobs"] raises KeyError; update the
code where jobs = jobs_sampler.sample_one_hourly(wrap=True)["hourly_jobs"] to
first capture the result (e.g., hourly =
jobs_sampler.sample_one_hourly(wrap=True)) and guard it (if not hourly or
"hourly_jobs" not in hourly) then set jobs to an empty list or skip processing;
reference sample_one_hourly, jobs_sampler, and the "hourly_jobs" key when making
the change.

140-167: ⚠️ Potential issue | 🟠 Major

Don’t scale the fallback randomizer by cloning sampled jobs.

This block repeats already-sampled (duration, nodes, cores) tuples. train_iter.py Line 248-254 only disables scaling for --workload-gen, so the legacy no-source path still reaches this code and loses the independent sampling behavior of arrivals vs. job attributes. Either resample extra jobs from the original source or reject --job-arrival-scale for unsupported sources.

Based on learnings, workload generator configuration uses four core attributes: arrivals (jobs/hour), duration (hours), nodes, and cores-per-node, each independently sampled.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/workload_generator.py` around lines 140 - 167, The current scaling logic
clones already-sampled (duration,nodes,cores) tuples (using
scaled_durations.extend(new_jobs_durations * whole) etc.), which breaks
independent sampling of arrivals vs. job attributes; instead, when
arrival_scale_applied_in_source is False and job_arrival_scale != 1.0, resample
attributes for each additional arrival from the original source/distribution
rather than duplicating the existing lists (i.e., replace the
extend/append-by-cloning strategy for whole and frac with calls to the source
sampling routine used elsewhere to draw durations, nodes, and cores
independently for each scaled job), or if the source cannot support resampling,
explicitly reject/raise an error when --job-arrival-scale is used for that
source (make the check use arrival_scale_applied_in_source and the CLI flag
--job-arrival-scale to refuse unsupported combinations).
train_iter.py (1)

248-254: ⚠️ Potential issue | 🟠 Major

Fail fast instead of silently rewriting --job-arrival-scale.

Resetting the requested scale to 1.0 can burn an entire sweep on settings different from what the caller asked for. parser.error(...) is safer here and makes the unsupported combination explicit before any child processes start.

Suggested fix
-    if args.workload_gen and args.job_arrival_scale != 1.0:
-        print(
-            "Warning: --job-arrival-scale is not allowed with --workload-gen; "
-            "resetting it to 1.0. Use workload generator arrival settings instead.",
-            file=sys.stderr,
-        )
-        args.job_arrival_scale = 1.0
+    if args.workload_gen and args.job_arrival_scale != 1.0:
+        parser.error("--job-arrival-scale is not supported with --workload-gen")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@train_iter.py` around lines 248 - 254, Replace the silent reset of
args.job_arrival_scale with a fail-fast parser error: when args.workload_gen is
true and args.job_arrival_scale != 1.0, call parser.error("...") with a clear
message that --job-arrival-scale is incompatible with --workload-gen (mention to
use workload generator arrival settings) instead of setting
args.job_arrival_scale = 1.0; update the code around the existing check (which
references args.workload_gen and args.job_arrival_scale) to raise parser.error
so the program exits immediately with an explicit message.
src/sampler_jobs.py (1)

355-357: ⚠️ Potential issue | 🟠 Major

Clamp replayed resources on the lower bound too.

This branch now caps oversized values, but it can still emit 0 nodes or 0 cores_per_node from malformed traces because parse_jobs() stores 0 when nnodes == 0 or ncpus < nnodes. That creates impossible jobs for the simulator; mirror the raw replay path and use max(1, min(...)) here.

Suggested fix
-                nnodes = min(agg_job['nnodes'], max_nodes_per_job)
-                cores_per_node_needed = min(agg_job['cores_per_node'], cores_per_node)
+                nnodes = max(1, min(agg_job['nnodes'], max_nodes_per_job))
+                cores_per_node_needed = max(1, min(agg_job['cores_per_node'], cores_per_node))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/sampler_jobs.py` around lines 355 - 357, The clamping currently only caps
oversized values and can produce zero-sized jobs; update the calculations for
nnodes and cores_per_node_needed to enforce a lower bound of 1 by wrapping the
existing min(...) with max(1, ...). Specifically modify the usage of
agg_job['nnodes'] and agg_job['cores_per_node'] so nnodes = max(1,
min(agg_job['nnodes'], max_nodes_per_job)) and cores_per_node_needed = max(1,
min(agg_job['cores_per_node'], cores_per_node)) to match the raw replay behavior
and avoid zero-valued jobs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/sampler_jobs.py`:
- Around line 355-357: The clamping currently only caps oversized values and can
produce zero-sized jobs; update the calculations for nnodes and
cores_per_node_needed to enforce a lower bound of 1 by wrapping the existing
min(...) with max(1, ...). Specifically modify the usage of agg_job['nnodes']
and agg_job['cores_per_node'] so nnodes = max(1, min(agg_job['nnodes'],
max_nodes_per_job)) and cores_per_node_needed = max(1,
min(agg_job['cores_per_node'], cores_per_node)) to match the raw replay behavior
and avoid zero-valued jobs.

In `@src/workload_generator.py`:
- Around line 90-91: sample_one_hourly() can return an empty dict so
subscripting ["hourly_jobs"] raises KeyError; update the code where jobs =
jobs_sampler.sample_one_hourly(wrap=True)["hourly_jobs"] to first capture the
result (e.g., hourly = jobs_sampler.sample_one_hourly(wrap=True)) and guard it
(if not hourly or "hourly_jobs" not in hourly) then set jobs to an empty list or
skip processing; reference sample_one_hourly, jobs_sampler, and the
"hourly_jobs" key when making the change.
- Around line 140-167: The current scaling logic clones already-sampled
(duration,nodes,cores) tuples (using scaled_durations.extend(new_jobs_durations
* whole) etc.), which breaks independent sampling of arrivals vs. job
attributes; instead, when arrival_scale_applied_in_source is False and
job_arrival_scale != 1.0, resample attributes for each additional arrival from
the original source/distribution rather than duplicating the existing lists
(i.e., replace the extend/append-by-cloning strategy for whole and frac with
calls to the source sampling routine used elsewhere to draw durations, nodes,
and cores independently for each scaled job), or if the source cannot support
resampling, explicitly reject/raise an error when --job-arrival-scale is used
for that source (make the check use arrival_scale_applied_in_source and the CLI
flag --job-arrival-scale to refuse unsupported combinations).

In `@train_iter.py`:
- Around line 248-254: Replace the silent reset of args.job_arrival_scale with a
fail-fast parser error: when args.workload_gen is true and
args.job_arrival_scale != 1.0, call parser.error("...") with a clear message
that --job-arrival-scale is incompatible with --workload-gen (mention to use
workload generator arrival settings) instead of setting args.job_arrival_scale =
1.0; update the code around the existing check (which references
args.workload_gen and args.job_arrival_scale) to raise parser.error so the
program exits immediately with an explicit message.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5bbfa476-dca5-4fb6-a504-88867c55c17e

📥 Commits

Reviewing files that changed from the base of the PR and between d3b1b59 and e94d87a.

📒 Files selected for processing (5)
  • src/sampler_hourly.py
  • src/sampler_jobs.py
  • src/workload_generator.py
  • train.py
  • train_iter.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • train.py

Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
train_iter.py (1)

244-245: ⚠️ Potential issue | 🟡 Minor

Whitespace-only --jobs values still pass validation.

The past review comment about path normalization remains unaddressed. A whitespace-only --jobs value (e.g., --jobs " ") is truthy in Python, so the check not args.jobs won't catch it. The sweep can then spawn child processes that fail immediately in train.py.

Suggested fix
-    if args.jobs_exact_replay and not args.jobs:
+    if args.jobs_exact_replay and not (args.jobs and args.jobs.strip()):
         parser.error("--jobs-exact-replay requires --jobs")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@train_iter.py` around lines 244 - 245, The current check uses if
args.jobs_exact_replay and not args.jobs which misses whitespace-only values;
update the validation for args.jobs (when args.jobs_exact_replay is set) to
normalize and trim the value (call .strip() and optionally os.path.normpath) and
then check emptiness, e.g. treat a jobs value that is "" after stripping as
invalid and call parser.error("--jobs-exact-replay requires a non-empty
--jobs"); reference args.jobs_exact_replay, args.jobs, and parser.error to
locate and change the check so child processes in train.py are not spawned with
an empty/whitespace jobs path.
src/workload_generator.py (1)

89-98: ⚠️ Potential issue | 🟡 Minor

Guard against empty hourly replay sample (unaddressed from prior review).

sample_one_hourly() returns {} when no periods are loaded, so direct ["hourly_jobs"] subscript raises KeyError. The past review comment was marked as addressed but the code still uses direct subscript access.

Suggested fix
         else:
             # Use pre-aggregated hourly templates for pattern-based replay.
-            jobs = jobs_sampler.sample_one_hourly(wrap=True)["hourly_jobs"]
-            if len(jobs) > 0:
+            sampled = jobs_sampler.sample_one_hourly(wrap=True)
+            jobs = sampled.get("hourly_jobs", [])
+            if jobs:
                 for job in jobs:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/workload_generator.py` around lines 89 - 98, The code assumes
jobs_sampler.sample_one_hourly() always returns a dict with "hourly_jobs" and
will KeyError when it returns {}, so update the block in workload_generator.py
to safely handle empty responses from sample_one_hourly(): call
jobs_sampler.sample_one_hourly() into a variable (e.g., sample =
jobs_sampler.sample_one_hourly(wrap=True)), then extract jobs =
sample.get("hourly_jobs", []) or guard with if sample and "hourly_jobs" in
sample before iterating; keep the existing loop that expands instances into
new_jobs_count/new_jobs_durations/new_jobs_nodes/new_jobs_cores unchanged.
🧹 Nitpick comments (3)
src/workload_generator.py (2)

158-158: Add strict=True to zip() for defensive programming.

While the lists are constructed in parallel and should always have the same length, adding strict=True would catch any future bugs that cause length mismatches.

Suggested fix
-        for d, n, c in zip(new_jobs_durations, new_jobs_nodes, new_jobs_cores):
+        for d, n, c in zip(new_jobs_durations, new_jobs_nodes, new_jobs_cores, strict=True):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/workload_generator.py` at line 158, The loop using
zip(new_jobs_durations, new_jobs_nodes, new_jobs_cores) should use strict=True
to catch any length mismatches defensively; update the loop in the code that
iterates with "for d, n, c in zip(new_jobs_durations, new_jobs_nodes,
new_jobs_cores)" to call zip with strict=True so Python will raise a ValueError
if the lists differ in length, keeping the unpacking of d, n, c and subsequent
logic unchanged.

82-82: Redundant outer int() on math.ceil() result.

math.ceil() returns an int in Python 3, so the outer int() call is unnecessary. The inner int(job['duration_minutes']) is appropriate for defensive parsing.

Suggested fix
-                    duration_hours = max(1, int(math.ceil(int(job['duration_minutes']) / 60)))
+                    duration_hours = max(1, math.ceil(int(job['duration_minutes']) / 60))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/workload_generator.py` at line 82, The expression computing
duration_hours redundantly wraps math.ceil(...) with int(); update the
calculation so it uses duration_minutes parsed defensively (keep
int(job['duration_minutes'])) and assign duration_hours = max(1,
math.ceil(int(job['duration_minutes']) / 60)) (i.e., remove the outer int call
around math.ceil) so duration_hours remains an int without the redundant
conversion; modify the assignment where duration_hours is defined in
workload_generator.py to reference math.ceil and job['duration_minutes'] as
described.
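
A quick check of the claim that `math.ceil()` already returns an `int` in Python 3, using the 60-minute conversion from the review:

```python
import math

# math.ceil() returns int for float inputs in Python 3,
# so an outer int() wrapper is a no-op:
duration_hours = max(1, math.ceil(90 / 60))
assert isinstance(duration_hours, int)
assert duration_hours == 2

# The max(1, ...) floor still applies for very short jobs:
assert max(1, math.ceil(10 / 60)) == 1
```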
src/sampler_hourly.py (1)

316-316: Remove redundant int() cast.

math.floor() returns an int in Python 3, so the outer int() call is unnecessary.

Suggested fix
-            whole = int(math.floor(arrival_scale))
+            whole = math.floor(arrival_scale)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/sampler_hourly.py` at line 316, The assignment to variable `whole` wraps
math.floor with an unnecessary int() call; replace `whole =
int(math.floor(arrival_scale))` with a direct call to
`math.floor(arrival_scale)` (e.g., `whole = math.floor(arrival_scale)`) in the
function/block where `whole` is defined so the value remains an int without the
redundant cast; verify any type hints or downstream code expecting an int still
work (no further changes needed).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/workload_generator.py`:
- Around line 89-98: The code assumes jobs_sampler.sample_one_hourly() always
returns a dict with "hourly_jobs" and will KeyError when it returns {}, so
update the block in workload_generator.py to safely handle empty responses from
sample_one_hourly(): call jobs_sampler.sample_one_hourly() into a variable
(e.g., sample = jobs_sampler.sample_one_hourly(wrap=True)), then extract jobs =
sample.get("hourly_jobs", []) or guard with if sample and "hourly_jobs" in
sample before iterating; keep the existing loop that expands instances into
new_jobs_count/new_jobs_durations/new_jobs_nodes/new_jobs_cores unchanged.
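
The defensive pattern this comment asks for, shown in isolation (the `hourly_jobs` key comes from the review; the stub standing in for `jobs_sampler.sample_one_hourly()` is hypothetical):

```python
def sample_one_hourly_stub(empty: bool) -> dict:
    """Stand-in for jobs_sampler.sample_one_hourly(); may return {}."""
    return {} if empty else {"hourly_jobs": [{"instances": 2}]}

# Unsafe: sample["hourly_jobs"] raises KeyError on an empty result.
# Safe: .get() with a default turns the expansion loop into a no-op.
sample = sample_one_hourly_stub(empty=True)
jobs = sample.get("hourly_jobs", [])
assert jobs == []

sample = sample_one_hourly_stub(empty=False)
jobs = sample.get("hourly_jobs", [])
assert len(jobs) == 1
```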

In `@train_iter.py`:
- Around line 244-245: The current check uses if args.jobs_exact_replay and not
args.jobs which misses whitespace-only values; update the validation for
args.jobs (when args.jobs_exact_replay is set) to normalize and trim the value
(call .strip() and optionally os.path.normpath) and then check emptiness, e.g.
treat a jobs value that is "" after stripping as invalid and call
parser.error("--jobs-exact-replay requires a non-empty --jobs"); reference
args.jobs_exact_replay, args.jobs, and parser.error to locate and change the
check so child processes in train.py are not spawned with an empty/whitespace
jobs path.
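
A self-contained sketch of the tightened validation (the flag names are taken from the review; the parser setup itself is an assumption about the surrounding code):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--jobs", default="")
parser.add_argument("--jobs-exact-replay", action="store_true")

def validate(args: argparse.Namespace) -> None:
    # .strip() catches whitespace-only values that a plain
    # truthiness check on args.jobs would let through.
    if args.jobs_exact_replay and not args.jobs.strip():
        parser.error("--jobs-exact-replay requires a non-empty --jobs")

# A whitespace-only --jobs value is now rejected:
args = parser.parse_args(["--jobs-exact-replay", "--jobs", "   "])
rejected = False
try:
    validate(args)
except SystemExit:  # parser.error() exits after printing the message
    rejected = True
assert rejected
```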

---

Nitpick comments:
In `@src/sampler_hourly.py`:
- Line 316: The assignment to variable `whole` wraps math.floor with an
unnecessary int() call; replace `whole = int(math.floor(arrival_scale))` with a
direct call to `math.floor(arrival_scale)` (e.g., `whole =
math.floor(arrival_scale)`) in the function/block where `whole` is defined so
the value remains an int without the redundant cast; verify any type hints or
downstream code expecting an int still work (no further changes needed).

In `@src/workload_generator.py`:
- Line 158: The loop using zip(new_jobs_durations, new_jobs_nodes,
new_jobs_cores) should use strict=True to catch any length mismatches
defensively; update the loop in the code that iterates with "for d, n, c in
zip(new_jobs_durations, new_jobs_nodes, new_jobs_cores)" to call zip with
strict=True so Python will raise a ValueError if the lists differ in length,
keeping the unpacking of d, n, c and subsequent logic unchanged.
- Line 82: The expression computing duration_hours redundantly wraps
math.ceil(...) with int(); update the calculation so it uses duration_minutes
parsed defensively (keep int(job['duration_minutes'])) and assign duration_hours
= max(1, math.ceil(int(job['duration_minutes']) / 60)) (i.e., remove the outer
int call around math.ceil) so duration_hours remains an int without the
redundant conversion; modify the assignment where duration_hours is defined in
workload_generator.py to reference math.ceil and job['duration_minutes'] as
described.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 26a78fcb-32ec-4b61-8b61-0364ff0f374d

📥 Commits

Reviewing files that changed from the base of the PR and between e94d87a and 7758691.

📒 Files selected for processing (5)
  • src/sampler_hourly.py
  • src/sampler_jobs.py
  • src/workload_generator.py
  • train.py
  • train_iter.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • train.py

… mode

Apply `--job-arrival-scale` to the `--hourly-jobs` path so true statistical
sampling can be scaled up or down directly at the sampled arrival-count level.

Key changes:
- `HourlySampler.sample_aggregated(..., arrival_scale=...)` now scales sampled counts
- `generate_jobs()` forwards `job_arrival_scale` to hourly sampler
- avoid double-scaling by skipping global post-scaling when hourly path already applied it
  - sampling mode: `--hourly-jobs`
@rbx rbx merged commit 2c3dfa7 into FairRootGroup:master Mar 12, 2026
3 of 4 checks passed
@enlorenz enlorenz deleted the pr2-job_gen branch March 12, 2026 11:30
@coderabbitai coderabbitai bot mentioned this pull request Mar 12, 2026