Concurrency limits per model and per provider #108

Open
opened 2026-05-23 16:35:53 -04:00 by jasoncouture · 0 comments
jasoncouture commented 2026-05-23 16:35:53 -04:00 (Migrated from github.com)

Cap how many concurrent requests can be in flight against a given model and against a given provider (separate dials). Backed by a semaphore/queue so requests above the cap wait their turn instead of stampeding the backend.

Why: local providers (e.g. Ollama on a single GPU) can often run only one inference at a time. Hosted providers impose rate and concurrency limits. Without a limit, parallel agent activity (main loop + compaction + heartbeat sub-agent, etc.) will either fail with provider errors or serialize through whatever locking the provider has, which is invisible to the framework.

Surface:

  • Provider config: `maxConcurrent: N` (default unlimited).
  • Model config: optional per-model override (one provider, multiple models with different limits).
  • FIFO queue when limit reached; cancellation token honored while waiting.
  • Metric / OTEL counter for queue depth and wait time so saturation is visible.
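
A minimal sketch of the gate described above, in Python with `asyncio`, purely to illustrate the behaviour. The names `ConcurrencyGate`, `slot`, `queue_depth` and `total_wait_s` are hypothetical and not part of the framework. It assumes `asyncio.Semaphore` wakes waiters in arrival order (true of current CPython, but not a documented guarantee), and it uses task cancellation as a stand-in for a cancellation token.

```python
import asyncio
import time
from contextlib import asynccontextmanager
from typing import Optional


class ConcurrencyGate:
    """Caps in-flight requests; callers above the cap wait for a free slot."""

    def __init__(self, max_concurrent: Optional[int] = None):
        # None mirrors "default unlimited": no semaphore, no waiting.
        self._sem = (
            asyncio.Semaphore(max_concurrent) if max_concurrent is not None else None
        )
        self.queue_depth = 0      # callers currently waiting for a slot
        self.total_wait_s = 0.0   # cumulative time spent waiting

    @asynccontextmanager
    async def slot(self):
        if self._sem is None:
            yield
            return
        start = time.monotonic()
        self.queue_depth += 1
        try:
            # If the calling task is cancelled while queued, CancelledError
            # propagates from here and no slot is consumed.
            await self._sem.acquire()
        finally:
            self.queue_depth -= 1
            self.total_wait_s += time.monotonic() - start
        try:
            yield
        finally:
            self._sem.release()
```

A caller would wrap each provider request in `async with gate.slot(): ...`. One possible way to get the two separate dials is to hold one gate per model and one per provider and nest them in a fixed order (model first, then provider), so a request waiting on a saturated model does not occupy a provider slot. `queue_depth` and `total_wait_s` stand in for whatever metric or OTEL instrument ends up being exported.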

Prerequisite for partial compaction (sibling issue): parallel compaction needs to coexist with the main loop on the same provider without races. Also useful on its own.
