feat(balancer): add oMLX loaded-first selector#152
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughAdds OMLXLoadedFirstSelector and registers it in the balancer factory; the selector prefers endpoints that report a requested model as loaded via TTL-cached oMLX status probes, schedules background refreshes, and falls back to least-connections. Unit tests cover selection, alias resolution, probe failures, and timeout behaviour. ChangesoMLX Loaded-First Balancer Strategy
Sequence DiagramsequenceDiagram
participant Client
participant OMLXLoadedFirstSelector
participant Cache
participant oMLXEndpoint
participant InnerSelector
Client->>OMLXLoadedFirstSelector: Select(ctx, endpoints)
OMLXLoadedFirstSelector->>OMLXLoadedFirstSelector: Extract model from context
loop for each endpoint
OMLXLoadedFirstSelector->>Cache: Check status (with TTL)
alt Cache hit (fresh)
Cache-->>OMLXLoadedFirstSelector: Loaded model list
else Cache miss or stale
OMLXLoadedFirstSelector->>oMLXEndpoint: GET /v1/models/status (with timeout)
oMLXEndpoint-->>OMLXLoadedFirstSelector: JSON {models: [...]}
OMLXLoadedFirstSelector->>Cache: Store status with timestamp
Cache-->>OMLXLoadedFirstSelector: (stored)
end
OMLXLoadedFirstSelector->>OMLXLoadedFirstSelector: Filter by loaded status
end
alt Loaded endpoints exist
OMLXLoadedFirstSelector->>InnerSelector: Select from loaded subset
InnerSelector-->>OMLXLoadedFirstSelector: Least-connected endpoint
else No loaded endpoints
OMLXLoadedFirstSelector->>InnerSelector: Select from all endpoints
InnerSelector-->>OMLXLoadedFirstSelector: Least-connected endpoint
end
OMLXLoadedFirstSelector-->>Client: Selected endpoint
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
internal/adapter/balancer/omlx_loaded_first.go (1)
24-27: ⚡ Quick winFallback strategy diverges from the recommended production balancer.
This selector composes and falls back to
LeastConnectionsSelector. Confirm this is intended given the project guidance to prefer Priority-based balancing for production deployments.As per coding guidelines for
internal/adapter/balancer/**/*.go: "Use Priority-based load balancing strategy as recommended for production deployments".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/adapter/balancer/omlx_loaded_first.go` around lines 24 - 27, OMLXLoadedFirstSelector currently falls back to LeastConnectionsSelector which conflicts with the project guideline to use priority-based balancing in production; update the selector to fall back to the priority-based implementation (e.g., replace composition of LeastConnectionsSelector with the PrioritySelector/PriorityBasedSelector used elsewhere) or make the fallback configurable and document the choice; locate the selection logic inside OMLXLoadedFirstSelector and swap the fallback selector instantiation/usage to the Priority-based selector (or add a flag/use existing priority selector type) so the ordering semantics follow the recommended priority-based balancer.internal/adapter/balancer/omlx_loaded_first_test.go (1)
34-53: ⚡ Quick winAdd coverage for the probe-failure fallback path.
Current tests cover loaded-preference, least-connections fallback (all cold), and alias resolution, but not the documented behaviour where a failed/unreachable status probe falls back to least-connections. A test pointing one endpoint at a closed/non-responsive server would lock in that contract.
Want me to draft this test case?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/adapter/balancer/omlx_loaded_first_test.go` around lines 34 - 53, Add a new test (e.g., TestOMLXLoadedFirstSelector_ProbeFailureFallsBackToLeastConnections) that uses NewOMLXLoadedFirstSelector and the existing makeOMLXStatusEndpoint helpers to simulate an unreachable probe: create a "busy" endpoint via makeOMLXStatusEndpoint and immediately call its cleanup to close the probe server (so status probe will fail), create an "idle" endpoint that stays running, call selector.IncrementConnections(busyEndpoint) to mark busy, then call selector.Select with both endpoints (context set with constants.ContextModelKey) and assert the selected endpoint is "idle"; use the same Select, IncrementConnections and makeOMLXStatusEndpoint identifiers to locate code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/adapter/balancer/omlx_loaded_first.go`:
- Around line 119-131: When fetchStatus fails in isModelLoaded, create and cache
a negative/empty Endpoint status (no loadedModels) via the existing storeStatus
call so future isModelLoaded calls hit the cache instead of re-probing;
specifically, in isModelLoaded after err != nil construct an empty/failed status
(no entries in loadedModels or an explicit unreachable flag if your
EndpointStatus supports it) and call
selector.storeStatus(endpoint.GetURLString(), negativeStatus) before returning
false; optionally set a shorter TTL for that negative entry if storeStatus
supports TTL.
- Around line 86-104: loadedEndpoints is currently performing synchronous
per-endpoint probes via isModelLoaded (which calls fetchStatus) on the request
hot path causing latency and lacking negative caching; change loadedEndpoints
(and usages from RetryHandler.ExecuteWithRetry -> selector.Select) to only
consult a shared cache of endpoint model-ready state (populated by a background
refresher with bounded concurrency) and stop doing HTTP GETs inline; update
isModelLoaded/fetchStatus to write both positive and negative entries with short
TTLs into that cache and ensure a background goroutine or worker pool refreshes
entries (with bounded concurrency and backoff) so Select/loadedEndpoints only
reads cached values and never blocks on network probes.
---
Nitpick comments:
In `@internal/adapter/balancer/omlx_loaded_first_test.go`:
- Around line 34-53: Add a new test (e.g.,
TestOMLXLoadedFirstSelector_ProbeFailureFallsBackToLeastConnections) that uses
NewOMLXLoadedFirstSelector and the existing makeOMLXStatusEndpoint helpers to
simulate an unreachable probe: create a "busy" endpoint via
makeOMLXStatusEndpoint and immediately call its cleanup to close the probe
server (so status probe will fail), create an "idle" endpoint that stays
running, call selector.IncrementConnections(busyEndpoint) to mark busy, then
call selector.Select with both endpoints (context set with
constants.ContextModelKey) and assert the selected endpoint is "idle"; use the
same Select, IncrementConnections and makeOMLXStatusEndpoint identifiers to
locate code.
In `@internal/adapter/balancer/omlx_loaded_first.go`:
- Around line 24-27: OMLXLoadedFirstSelector currently falls back to
LeastConnectionsSelector which conflicts with the project guideline to use
priority-based balancing in production; update the selector to fall back to the
priority-based implementation (e.g., replace composition of
LeastConnectionsSelector with the PrioritySelector/PriorityBasedSelector used
elsewhere) or make the fallback configurable and document the choice; locate the
selection logic inside OMLXLoadedFirstSelector and swap the fallback selector
instantiation/usage to the Priority-based selector (or add a flag/use existing
priority selector type) so the ordering semantics follow the recommended
priority-based balancer.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9e466093-b968-470b-a459-524fbac2f6a3
📒 Files selected for processing (4)
internal/adapter/balancer/factory.gointernal/adapter/balancer/factory_test.gointernal/adapter/balancer/omlx_loaded_first.gointernal/adapter/balancer/omlx_loaded_first_test.go
|
Thanks for the contribution, could we run |
|
We also don't officially support omlx but I will get cracking on adding that official support too (unless you're planning to add support?). I'll next get an opportunity mid this week to work on Olla (Wed). |
|
Thanks @thushan, Olla is core to our Apple/Mac oMLX cluster effort: From our side, a setup with four Mac Max/Ultra nodes (64-512GB) plus a fifth Mac as a cache hub over Thunderbolt for fast model sync is an excellent hardware platform for private inference. Using agent-driven operations (for example with Hermes-style workflows and skills) makes cluster management and tuning much more practical. Right now, we are focused more on validating high-demand real-world scenarios than on deep Olla optimization. You clearly have stronger expertise there today. If this direction is useful, we are committed to investing in further Olla optimization for oMLX on Mac. |
|
@sg-shag / @bongiozzo: Have some thoughts.
I have the oMLX profile ready, may I add you both to test/check it when I PR? I'm currently away and have PopOS only on my laptop. PR for oMLX native profile: For the new load balancer, i've captured the details here: Feel free to add your thoughts (which I'm proposing superseeds this). |
|
@thushan I'd like to test the pr too. Currently using the project of @bongiozzo . |
Summary
proxy.load_balancer: omlx-loaded-first./v1/models/statusreports the requested backend model asloaded.Why
oMLX can serve multiple local models from one node-level process. It advertises local model directories through
/v1/models, then loads models lazily and evicts by LRU as memory pressure changes.With plain least-connections, Olla can send a request to a compatible but cold oMLX node while another compatible node already has that model resident in memory. This selector keeps Olla's existing model routing semantics intact, but reorders the already-compatible endpoint set to prefer hot model placement first.
Desired behavior:
omlx-loaded-firstprobes cached/v1/models/statusstate for those candidates.loaded, least-connections selects among that hot subset.Notes
This is intentionally small and oMLX-specific for review. I would also be happy to reshape it into a more generic
loaded-firstbalancer if you would rather centralize provider-specific loaded-state collection in discovery/model registry instead of inside a balancer.The implementation currently uses a short in-memory status cache and treats probe failures as a normal fallback to least-connections.
Test Plan
git diff --checklocally.go test ./internal/adapter/balancerbecause the current workspace does not have the Go toolchain installed.Summary by CodeRabbit
New Features
Tests