llama.cpp native API provider #2

New issue

Closed

opened 2026-05-07 01:27:16 -04:00 by jasoncouture · 1 comment

jasoncouture commented

2026-05-07 01:27:16 -04:00

(Migrated from github.com)

Separate provider targeting llama-server's native endpoints (/completion, /embedding, slot management) for the extra knobs OpenAI-compat hides — logprobs, multi-slot batching, finer-grained sampling controls.

Sibling to the OpenAI-compatible provider; choose one or the other per agent based on which control surface the deployment needs.

Tracked in TASKS.md.

Separate provider targeting `llama-server`'s *native* endpoints (`/completion`, `/embedding`, slot management) for the extra knobs OpenAI-compat hides — logprobs, multi-slot batching, finer-grained sampling controls. Sibling to the OpenAI-compatible provider; choose one or the other per agent based on which control surface the deployment needs. Tracked in [TASKS.md](../blob/main/TASKS.md).

jasoncouture commented

2026-05-07 20:41:06 -04:00

(Migrated from github.com)

Closed by PR #43 — design pivoted from a separate /completion//embedding provider to the OpenAI-compat provider with OpenAIProviderOptions.ExtraRequestParams (deep-merged into every request body). llama-server's native knobs (cache_prompt, slot_id, samplers, n_probs, min_p, etc.) round-trip via that field, so a single provider covers api.openai.com, llama-server, vLLM, LM Studio, and TabbyAPI. If a future need surfaces for the raw /completion endpoint specifically, reopen with the use case.

Closed by PR #43 — design pivoted from a separate `/completion`/`/embedding` provider to the OpenAI-compat provider with `OpenAIProviderOptions.ExtraRequestParams` (deep-merged into every request body). llama-server's native knobs (`cache_prompt`, `slot_id`, `samplers`, `n_probs`, `min_p`, etc.) round-trip via that field, so a single provider covers api.openai.com, llama-server, vLLM, LM Studio, and TabbyAPI. If a future need surfaces for the raw `/completion` endpoint specifically, reopen with the use case.