Skip to content

feat(balancer): add oMLX loaded-first selector#152

Open
sg-shag wants to merge 4 commits into
thushan:mainfrom
sg-shag:feat/omlx-loaded-first-balancer
Open

feat(balancer): add oMLX loaded-first selector#152
sg-shag wants to merge 4 commits into
thushan:mainfrom
sg-shag:feat/omlx-loaded-first-balancer

Conversation

@sg-shag

@sg-shag sg-shag commented May 30, 2026

Copy link
Copy Markdown

Summary

  • Add proxy.load_balancer: omlx-loaded-first.
  • Prefer compatible oMLX endpoints where /v1/models/status reports the requested backend model as loaded.
  • Fall back to least-connections when no compatible endpoint is hot or status probing fails.
  • Preserve model-alias routing by probing the endpoint-specific backend model id after alias resolution.

Why

oMLX can serve multiple local models from one node-level process. It advertises local model directories through /v1/models, then loads models lazily and evicts by LRU as memory pressure changes.

With plain least-connections, Olla can send a request to a compatible but cold oMLX node while another compatible node already has that model resident in memory. This selector keeps Olla's existing model routing semantics intact, but reorders the already-compatible endpoint set to prefer hot model placement first.

Desired behavior:

  1. Existing registry/routing code filters to healthy compatible endpoints.
  2. omlx-loaded-first probes cached /v1/models/status state for those candidates.
  3. If one or more candidates report the requested backend model as loaded, least-connections selects among that hot subset.
  4. If none are hot, least-connections selects among the full compatible subset and lets oMLX cold-load normally.

Notes

This is intentionally small and oMLX-specific for review. I would also be happy to reshape it into a more generic loaded-first balancer if you would rather centralize provider-specific loaded-state collection in discovery/model registry instead of inside a balancer.

The implementation currently uses a short in-memory status cache and treats probe failures as a normal fallback to least-connections.

Test Plan

  • Added unit coverage for loaded endpoint preference.
  • Added unit coverage for least-connections fallback when no candidate is hot.
  • Added unit coverage for endpoint-specific alias model lookup.
  • Ran git diff --check locally.
  • Not run locally: go test ./internal/adapter/balancer because the current workspace does not have the Go toolchain installed.

Summary by CodeRabbit

  • New Features

    • Adds a new model-aware routing strategy that prefers endpoints with the requested model already loaded, falls back to least-connected when needed, and is now listed among selectable strategies.
    • Background probes refresh model availability without blocking requests.
  • Tests

    • Expanded unit tests covering preference logic, alias resolution, probe failures, and non-blocking cold-cache behaviour.

@coderabbitai

coderabbitai Bot commented May 30, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 39442f5a-8d72-433a-9087-b9bcf139671d

📥 Commits

Reviewing files that changed from the base of the PR and between 3a89436 and 162791f.

📒 Files selected for processing (1)
  • internal/adapter/balancer/omlx_loaded_first.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/adapter/balancer/omlx_loaded_first.go

Walkthrough

Adds OMLXLoadedFirstSelector and registers it in the balancer factory; the selector prefers endpoints that report a requested model as loaded via TTL-cached oMLX status probes, schedules background refreshes, and falls back to least-connections. Unit tests cover selection, alias resolution, probe failures, and timeout behaviour.

Changes

oMLX Loaded-First Balancer Strategy

Layer / File(s) Summary
Factory registration and constant definition
internal/adapter/balancer/factory.go, internal/adapter/balancer/factory_test.go
Adds DefaultBalancerOMLXLoadedFirst constant and registers NewOMLXLoadedFirstSelector in the factory. Updates factory tests to verify the strategy is discoverable and instantiable.
OMLXLoadedFirstSelector structure and core selection logic
internal/adapter/balancer/omlx_loaded_first.go
Defines the selector struct with configuration constants (status path, cache TTL, HTTP timeout), JSON/cache types, constructor, and main Select() entry point. Selection reads the requested model from context, delegates to an inner least-connections selector, and exposes Name(), IncrementConnections, and DecrementConnections.
Endpoint filtering, alias resolution, and status caching
internal/adapter/balancer/omlx_loaded_first.go
Implements model filtering logic that resolves backend model IDs using optional context-provided alias maps, checks per-endpoint load status with TTL-aware caching, and schedules de-duplicated background refreshes with bounded concurrency.
HTTP status fetching, storage and URL construction
internal/adapter/balancer/omlx_loaded_first.go
Performs HTTP GET requests to endpoint-specific oMLX status endpoints with timeout handling, JSON decoding, HTTP status validation, stores cache entries with timestamps, and constructs status URLs by parsing endpoint base URLs and trimming query/fragment.
OMLXLoadedFirstSelector test coverage
internal/adapter/balancer/omlx_loaded_first_test.go
Adds tests that prefer endpoints reporting the model as loaded, fall back to least-connections when none are loaded, resolve model aliases from context, handle probe failures by caching empty status and falling back, and ensure cold (uncached) slow probes do not block selection. Includes helpers for mock and delayed oMLX status servers.

Sequence Diagram

sequenceDiagram
  participant Client
  participant OMLXLoadedFirstSelector
  participant Cache
  participant oMLXEndpoint
  participant InnerSelector

  Client->>OMLXLoadedFirstSelector: Select(ctx, endpoints)
  OMLXLoadedFirstSelector->>OMLXLoadedFirstSelector: Extract model from context
  loop for each endpoint
    OMLXLoadedFirstSelector->>Cache: Check status (with TTL)
    alt Cache hit (fresh)
      Cache-->>OMLXLoadedFirstSelector: Loaded model list
    else Cache miss or stale
      OMLXLoadedFirstSelector->>oMLXEndpoint: GET /v1/models/status (with timeout)
      oMLXEndpoint-->>OMLXLoadedFirstSelector: JSON {models: [...]}
      OMLXLoadedFirstSelector->>Cache: Store status with timestamp
      Cache-->>OMLXLoadedFirstSelector: (stored)
    end
    OMLXLoadedFirstSelector->>OMLXLoadedFirstSelector: Filter by loaded status
  end
  alt Loaded endpoints exist
    OMLXLoadedFirstSelector->>InnerSelector: Select from loaded subset
    InnerSelector-->>OMLXLoadedFirstSelector: Least-connected endpoint
  else No loaded endpoints
    OMLXLoadedFirstSelector->>InnerSelector: Select from all endpoints
    InnerSelector-->>OMLXLoadedFirstSelector: Least-connected endpoint
  end
  OMLXLoadedFirstSelector-->>Client: Selected endpoint
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title directly and accurately summarises the main change: introducing a new oMLX loaded-first selector for the load balancer, which is the core functionality added across all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
internal/adapter/balancer/omlx_loaded_first.go (1)

24-27: ⚡ Quick win

Fallback strategy diverges from the recommended production balancer.

This selector composes and falls back to LeastConnectionsSelector. Confirm this is intended given the project guidance to prefer Priority-based balancing for production deployments.

As per coding guidelines for internal/adapter/balancer/**/*.go: "Use Priority-based load balancing strategy as recommended for production deployments".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/adapter/balancer/omlx_loaded_first.go` around lines 24 - 27,
OMLXLoadedFirstSelector currently falls back to LeastConnectionsSelector which
conflicts with the project guideline to use priority-based balancing in
production; update the selector to fall back to the priority-based
implementation (e.g., replace composition of LeastConnectionsSelector with the
PrioritySelector/PriorityBasedSelector used elsewhere) or make the fallback
configurable and document the choice; locate the selection logic inside
OMLXLoadedFirstSelector and swap the fallback selector instantiation/usage to
the Priority-based selector (or add a flag/use existing priority selector type)
so the ordering semantics follow the recommended priority-based balancer.
internal/adapter/balancer/omlx_loaded_first_test.go (1)

34-53: ⚡ Quick win

Add coverage for the probe-failure fallback path.

Current tests cover loaded-preference, least-connections fallback (all cold), and alias resolution, but not the documented behaviour where a failed/unreachable status probe falls back to least-connections. A test pointing one endpoint at a closed/non-responsive server would lock in that contract.

Want me to draft this test case?

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/adapter/balancer/omlx_loaded_first_test.go` around lines 34 - 53,
Add a new test (e.g.,
TestOMLXLoadedFirstSelector_ProbeFailureFallsBackToLeastConnections) that uses
NewOMLXLoadedFirstSelector and the existing makeOMLXStatusEndpoint helpers to
simulate an unreachable probe: create a "busy" endpoint via
makeOMLXStatusEndpoint and immediately call its cleanup to close the probe
server (so status probe will fail), create an "idle" endpoint that stays
running, call selector.IncrementConnections(busyEndpoint) to mark busy, then
call selector.Select with both endpoints (context set with
constants.ContextModelKey) and assert the selected endpoint is "idle"; use the
same Select, IncrementConnections and makeOMLXStatusEndpoint identifiers to
locate code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/adapter/balancer/omlx_loaded_first.go`:
- Around line 119-131: When fetchStatus fails in isModelLoaded, create and cache
a negative/empty Endpoint status (no loadedModels) via the existing storeStatus
call so future isModelLoaded calls hit the cache instead of re-probing;
specifically, in isModelLoaded after err != nil construct an empty/failed status
(no entries in loadedModels or an explicit unreachable flag if your
EndpointStatus supports it) and call
selector.storeStatus(endpoint.GetURLString(), negativeStatus) before returning
false; optionally set a shorter TTL for that negative entry if storeStatus
supports TTL.
- Around line 86-104: loadedEndpoints is currently performing synchronous
per-endpoint probes via isModelLoaded (which calls fetchStatus) on the request
hot path causing latency and lacking negative caching; change loadedEndpoints
(and usages from RetryHandler.ExecuteWithRetry -> selector.Select) to only
consult a shared cache of endpoint model-ready state (populated by a background
refresher with bounded concurrency) and stop doing HTTP GETs inline; update
isModelLoaded/fetchStatus to write both positive and negative entries with short
TTLs into that cache and ensure a background goroutine or worker pool refreshes
entries (with bounded concurrency and backoff) so Select/loadedEndpoints only
reads cached values and never blocks on network probes.

---

Nitpick comments:
In `@internal/adapter/balancer/omlx_loaded_first_test.go`:
- Around line 34-53: Add a new test (e.g.,
TestOMLXLoadedFirstSelector_ProbeFailureFallsBackToLeastConnections) that uses
NewOMLXLoadedFirstSelector and the existing makeOMLXStatusEndpoint helpers to
simulate an unreachable probe: create a "busy" endpoint via
makeOMLXStatusEndpoint and immediately call its cleanup to close the probe
server (so status probe will fail), create an "idle" endpoint that stays
running, call selector.IncrementConnections(busyEndpoint) to mark busy, then
call selector.Select with both endpoints (context set with
constants.ContextModelKey) and assert the selected endpoint is "idle"; use the
same Select, IncrementConnections and makeOMLXStatusEndpoint identifiers to
locate code.

In `@internal/adapter/balancer/omlx_loaded_first.go`:
- Around line 24-27: OMLXLoadedFirstSelector currently falls back to
LeastConnectionsSelector which conflicts with the project guideline to use
priority-based balancing in production; update the selector to fall back to the
priority-based implementation (e.g., replace composition of
LeastConnectionsSelector with the PrioritySelector/PriorityBasedSelector used
elsewhere) or make the fallback configurable and document the choice; locate the
selection logic inside OMLXLoadedFirstSelector and swap the fallback selector
instantiation/usage to the Priority-based selector (or add a flag/use existing
priority selector type) so the ordering semantics follow the recommended
priority-based balancer.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9e466093-b968-470b-a459-524fbac2f6a3

📥 Commits

Reviewing files that changed from the base of the PR and between fd57b87 and 3b43052.

📒 Files selected for processing (4)
  • internal/adapter/balancer/factory.go
  • internal/adapter/balancer/factory_test.go
  • internal/adapter/balancer/omlx_loaded_first.go
  • internal/adapter/balancer/omlx_loaded_first_test.go

Comment thread internal/adapter/balancer/omlx_loaded_first.go
Comment thread internal/adapter/balancer/omlx_loaded_first.go Outdated
@thushan

thushan commented May 31, 2026

Copy link
Copy Markdown
Owner

Thanks for the contribution, could we run make ready and get the alignment stuff autofixed, we'll review this properly soon after and run it on some Mac Mini's and the Studio rigs we have with oMLX.

@thushan

thushan commented May 31, 2026

Copy link
Copy Markdown
Owner

We also don't officially support omlx but I will get cracking on adding that official support too (unless you're planning to add support?). I'll next get an opportunity mid this week to work on Olla (Wed).

@bongiozzo

bongiozzo commented Jun 2, 2026

Copy link
Copy Markdown

Thanks @thushan,

Olla is core to our Apple/Mac oMLX cluster effort:
https://github.com/shared-goals/thunder-forge

From our side, a setup with four Mac Max/Ultra nodes (64-512GB) plus a fifth Mac as a cache hub over Thunderbolt for fast model sync is an excellent hardware platform for private inference.

Using agent-driven operations (for example with Hermes-style workflows and skills) makes cluster management and tuning much more practical.

Right now, we are focused more on validating high-demand real-world scenarios than on deep Olla optimization. You clearly have stronger expertise there today.
That said, we already see strong potential for agentic software development workloads and memory-processing workloads (for example, Hindsight-style memory systems).

If this direction is useful, we are committed to investing in further Olla optimization for oMLX on Mac.

@thushan thushan self-assigned this Jun 7, 2026
@thushan

thushan commented Jun 7, 2026

Copy link
Copy Markdown
Owner

@sg-shag / @bongiozzo:
I've spent sometime this weekend looking into oMLX and the things mentioned in this PR.

Have some thoughts.

  • Instead of a provider specific load balancer, we can go for a warm-first (generic) load balancer so it's reusable across others that may come into light (with a configurable fallback)
  • Natively support oMLX so we have a profile instead of relying on openai-compatible.

I have the oMLX profile ready, may I add you both to test/check it when I PR? I'm currently away and have PopOS only on my laptop.

PR for oMLX native profile:
#167

For the new load balancer, i've captured the details here:
#166

Feel free to add your thoughts (which I'm proposing superseeds this).

@gnezim

gnezim commented Jun 7, 2026

Copy link
Copy Markdown

@thushan I'd like to test the pr too. Currently using the project of @bongiozzo .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants