Concurrency limits per model and per provider #108
Reference
jasoncouture/llama-shears#108
Cap how many concurrent requests can be in flight against a given model and against a given provider (separate dials). Backed by a semaphore/queue so requests above the cap wait their turn instead of stampeding the backend.
Why: local providers (Ollama on a single GPU) can typically run only one inference at a time, and hosted providers enforce rate and concurrency limits. Without a limit in the framework, parallel agent activity (main loop + compaction + heartbeat sub-agent, etc.) will either fail with provider errors or serialize through whatever locking the provider has, which is invisible to the framework.
Surface:
Prerequisite for partial compaction (sibling) — parallel compaction needs to coexist with the main loop on the same provider without races. Also useful on its own.