Experimental: Parallel ACompletions
parallel_acompletions lets you fire multiple Router.acompletion requests concurrently from a simple list, handling retries/fallbacks per underlying call exactly as the normal router would. A sister iterator interface streams results as soon as each finishes.
Status
- Flag gated (OFF by default).
- API surface, naming, and result shape may change.
- Enable it explicitly before relying on it in production.
Enabling
export LITELLM_ENABLE_PARALLEL_ACOMPLETIONS=1
Basic Usage
import os, asyncio

# Set the flag before importing litellm; it is evaluated on import.
os.environ["LITELLM_ENABLE_PARALLEL_ACOMPLETIONS"] = "1"

from litellm import Router
from litellm.router_utils.parallel_acompletion import RouterParallelRequest

router = Router(model_list=[
    {
        "model_name": "gpt35",
        "litellm_params": {"model": "gpt-3.5-turbo", "api_key": "..."}
    }
])

requests = [
    RouterParallelRequest(model="gpt35", messages=[{"role": "user", "content": "Hello"}]),
    RouterParallelRequest(model="gpt35", messages=[{"role": "user", "content": "Tell me a joke"}]),
]

async def main():
    results = await router.parallel_acompletions(requests, concurrency=4, preserve_order=True)
    for r in results:
        # Each result carries either a response or an exception.
        if r.exception:
            print("ERR:", r.exception)
        else:
            print(r.response)

asyncio.run(main())
Iterator Form (completion order)
async for result in router.iter_parallel_acompletions(requests, concurrency=4):
    ...
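To make "completion order" concrete, here is a minimal sketch built on plain asyncio rather than on the router. `fake_call`, `iter_in_completion_order`, and `collect` are hypothetical names used only for illustration; the sleep stands in for a model call, and nothing here is taken from LiteLLM's implementation.

```python
import asyncio

async def fake_call(index, delay):
    # Stand-in for one router call: sleeps instead of contacting a model.
    await asyncio.sleep(delay)
    return index, delay

async def iter_in_completion_order(delays):
    # Hypothetical helper: start every task, then yield results as they finish.
    tasks = [asyncio.create_task(fake_call(i, d)) for i, d in enumerate(delays)]
    for finished in asyncio.as_completed(tasks):
        yield await finished

async def collect(delays):
    # Gather the input indices in the order their tasks completed.
    return [index async for index, _ in iter_in_completion_order(delays)]
```

With delays of 0.2, 0.01, and 0.1 seconds, the second request is yielded first and the first request last, which is why each result needs to carry its original index.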
Behavior & Error Semantics
- return_exceptions=True (default)
  - Each item in the results has either response or exception set.
  - Iterator form yields all results; you handle result.exception per item.
- return_exceptions=False (fail-fast)
  - parallel_acompletions(...) raises on the first error and cancels remaining tasks.
  - iter_parallel_acompletions(...) raises on the first error and stops iteration; any outstanding tasks are cancelled.
- Concurrency
  - A bounded semaphore limits in-flight calls at the orchestration layer.
  - For very large request lists, all tasks are scheduled, but the concurrency limit caps the number of calls actually in flight.
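The concurrency behavior can be illustrated with a small asyncio sketch. It assumes nothing about the router's internals: `run_bounded` and its inner `fake_call` are hypothetical, and the sleep stands in for a model call. Every task is created immediately, but the semaphore decides how many run their body at once.

```python
import asyncio

async def run_bounded(n_tasks, concurrency):
    """Run n_tasks fake calls and return the peak number in flight."""
    semaphore = asyncio.BoundedSemaphore(concurrency)
    in_flight = 0
    peak = 0

    async def fake_call():
        nonlocal in_flight, peak
        # Tasks queue here; only `concurrency` of them hold the semaphore.
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the real request
            in_flight -= 1

    # All tasks are scheduled, but the semaphore caps in-flight work.
    await asyncio.gather(*(fake_call() for _ in range(n_tasks)))
    return peak
```

Calling `asyncio.run(run_bounded(20, 4))` schedules twenty tasks while the observed peak stays at four.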
Flag Gating Notes
- The feature is controlled by LITELLM_ENABLE_PARALLEL_ACOMPLETIONS and evaluated on import.
- If toggling the env var at runtime, restart the process (or reload modules) to apply.
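The reason a restart is needed can be shown with a sketch of a flag that is read once and then cached. `read_flag` and `FeatureGate` are hypothetical and are not LiteLLM's code; the sketch also assumes "1" is the enabling value, since that is the only value this page shows.

```python
import os

FLAG_NAME = "LITELLM_ENABLE_PARALLEL_ACOMPLETIONS"

def read_flag(name=FLAG_NAME):
    # Assumption: only the value "1" enables the feature.
    return os.environ.get(name) == "1"

class FeatureGate:
    """Captures the flag once, the way a module-level constant does at import."""

    def __init__(self):
        self.enabled = read_flag()

os.environ[FLAG_NAME] = "1"
gate = FeatureGate()           # snapshot taken here
os.environ.pop(FLAG_NAME)      # later change to the environment
stale = gate.enabled           # still True: the snapshot is not refreshed
fresh = read_flag()            # False: a fresh read sees the change
```

The cached value keeps reporting the old state until the object (or, for a module-level constant, the module) is created again.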
Arguments
| Param | Description |
|---|---|
| requests | List of RouterParallelRequest (model, messages, optional kwargs) |
| concurrency | Max in-flight tasks at the orchestration layer (default 8) |
| return_exceptions | If True (default), errors are captured per result; if False, the first error cancels all remaining tasks |
| preserve_order | For parallel_acompletions: if True, the final list matches input order |
Result Object
Each result is a RouterParallelResult:
{
  "index": 0,
  "request": RouterParallelRequest(...),
  "response": <ModelResponse or custom>,
  "exception": Optional[Exception]
}
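One way to consume results of this shape is to split them into successes and failures. The sketch below uses `StandInResult`, a hypothetical dataclass that only mirrors the four fields listed above; it is not the library's RouterParallelResult class, and `split_results` is an illustrative helper.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class StandInResult:
    # Hypothetical stand-in mirroring the documented fields.
    index: int
    request: Any
    response: Any = None
    exception: Optional[Exception] = None

def split_results(results):
    """Return (successes, failures), each sorted by original request index."""
    ordered = sorted(results, key=lambda r: r.index)
    successes = [r for r in ordered if r.exception is None]
    failures = [r for r in ordered if r.exception is not None]
    return successes, failures
```

Sorting by index restores input order even when the results were collected in completion order.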
Error Handling
- With return_exceptions=True you get a result entry containing exception.
- With return_exceptions=False, the first exception aborts remaining tasks (standard asyncio.gather propagation).
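The two modes can be sketched with plain asyncio. This is an illustration under stated assumptions, not the router's implementation: `fake_call` and `run_all` are hypothetical. Note that asyncio.gather on its own propagates the first exception but leaves sibling tasks running, so the fail-fast branch of the sketch cancels them explicitly.

```python
import asyncio

async def fake_call(index, fail, delay):
    # Stand-in for one request; optionally fails after `delay` seconds.
    await asyncio.sleep(delay)
    if fail:
        raise ValueError(f"request {index} failed")
    return index

async def run_all(specs, return_exceptions=True):
    tasks = [asyncio.create_task(fake_call(i, fail, delay))
             for i, (fail, delay) in enumerate(specs)]
    if return_exceptions:
        # Capture mode: every task finishes; errors come back as values.
        return await asyncio.gather(*tasks, return_exceptions=True)
    try:
        # Fail-fast mode: the first exception propagates out of gather.
        return await asyncio.gather(*tasks)
    except Exception:
        # Cancel whatever is still running, wait for it to settle, re-raise.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        raise
```

In capture mode the caller inspects each entry; in fail-fast mode the caller wraps the call in try/except and receives only the first error.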
Why Flag Gated?
To allow fast iteration on:
- Naming (parallel_acompletions, iter_parallel_acompletions)
- Result shape (object vs tuple)
- Performance tuning (fair queueing, cancellation semantics)
Provide feedback in the GitHub issue / PR.
Roadmap Ideas
- Per-request timeouts.
- Integrated progress callbacks.
- Batch adaptive concurrency.
- Streaming passthrough merge (aggregate token usage).