Roadmap Notes¶
This page tracks planned ergonomics work, which is deliberately implemented in phases so that each phase can be tested and verified independently.
Phase 1: Model Catalog Filtering¶
Improve ooai-llm models list and the matching Python API so users can inspect
models across providers without manually remembering provider-specific names.
Implemented filters:
Provider filters: one provider, repeated providers, or all providers.
Capability filters: chat, reasoning, coding, vision, function_calling, and cheap.
Date filters: release/date-like suffix filters such as --released-after and --released-before.
Cost filters: maximum input and output USD cost per one million tokens.
Context filters: minimum context/input-token window.
Sort modes: recency, provider, model name, total cost, input cost, output cost, and context window.
Output formats: Rich table when available, plain table fallback, JSON, and CSV.
The practical model inventory command is:
ooai-llm models list \
  --source litellm \
  --providers openai,anthropic,mistral \
  --reasoning-only \
  --released-after 2026-01 \
  --max-output-cost-per-1m 200 \
  --sort output_cost
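The filter semantics behind the command above can be sketched in plain Python. This is a minimal illustration, not the package's implementation: `CatalogEntry`, `filter_models`, and the sample rows are hypothetical, the prices are made up, and the sketch assumes that `--released-after` is inclusive and that dates compare as `YYYY-MM` strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog row; field names are illustrative only."""
    provider: str
    name: str
    released: str              # "YYYY-MM", compared lexicographically
    output_cost_per_1m: float  # USD per one million output tokens
    reasoning: bool


def filter_models(entries, *, providers=None, reasoning_only=False,
                  released_after=None, max_output_cost_per_1m=None):
    """Apply provider, capability, date, and cost filters, then sort by output cost."""
    selected = []
    for entry in entries:
        if providers is not None and entry.provider not in providers:
            continue
        if reasoning_only and not entry.reasoning:
            continue
        # Assumption: the date bound is inclusive.
        if released_after is not None and entry.released < released_after:
            continue
        if (max_output_cost_per_1m is not None
                and entry.output_cost_per_1m > max_output_cost_per_1m):
            continue
        selected.append(entry)
    return sorted(selected, key=lambda e: e.output_cost_per_1m)


# Made-up entries for illustration; not real models or prices.
catalog = [
    CatalogEntry("openai", "model-a", "2026-02", 12.0, True),
    CatalogEntry("anthropic", "model-b", "2026-03", 8.0, True),
    CatalogEntry("mistral", "model-c", "2025-11", 2.0, True),
    CatalogEntry("openai", "model-d", "2026-04", 300.0, True),
    CatalogEntry("google", "model-e", "2026-02", 1.0, False),
]

result = filter_models(
    catalog,
    providers={"openai", "anthropic", "mistral"},
    reasoning_only=True,
    released_after="2026-01",
    max_output_cost_per_1m=200,
)
names = [entry.name for entry in result]
```

Each filter is optional and independent, which mirrors how the CLI flags can be combined freely.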
Phase 1a: Catalog Cost Intelligence¶
Add a planning layer over the model catalog so users can answer questions like “what is cheapest for a 10k input / 2k output call?”, “which coding models are available?”, and “how many calls of one model cost the same as one call of another?”
Implemented surfaces:
ModelCallShape for explicit input/output token assumptions.
compare_model_catalog(...) and compare_model_candidates(...) for cost-ranked model estimates.
get_cheapest_models(...) and get_coding_model_comparison(...) convenience helpers.
ModelCostComparison.model_list(), model_dict(), by_provider(), cheapest_per_provider(), and equivalents(...).
ooai-llm models compare with provider, date, capability, input/output token-limit, cost, coding/reasoning, tool-calling, structured-output, per-provider, budget, and baseline-ratio filters.
Example:
ooai-llm models compare \
  --source litellm \
  --providers openai,anthropic,google,deepseek,mistral \
  --input-tokens 10000 \
  --output-tokens 2000 \
  --tool-calling-only \
  --structured-output-only \
  --sort output_tokens \
  --per-provider
Catalog prices are estimates for planning. Observed provider usage metadata is still the source of truth for real post-call accounting.
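The arithmetic behind a call-shape estimate is simple enough to sketch. The names below (`CallShape`, `Price`, `estimate_call_cost`, `equivalent_calls`) and the prices are hypothetical stand-ins, not the package's API; the sketch only assumes that prices are quoted in USD per one million tokens, as the catalog filters above suggest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallShape:
    """Explicit token assumptions for one call (hypothetical stand-in)."""
    input_tokens: int
    output_tokens: int


@dataclass(frozen=True)
class Price:
    """USD per one million tokens; values used below are made up."""
    input_per_1m: float
    output_per_1m: float


def estimate_call_cost(shape: CallShape, price: Price) -> float:
    """Estimated USD cost of a single call with the given shape."""
    return (shape.input_tokens * price.input_per_1m
            + shape.output_tokens * price.output_per_1m) / 1_000_000


def equivalent_calls(baseline_cost: float, candidate_cost: float) -> float:
    """How many candidate calls cost the same as one baseline call."""
    return baseline_cost / candidate_cost


shape = CallShape(input_tokens=10_000, output_tokens=2_000)
prices = {
    "model-a": Price(input_per_1m=2.0, output_per_1m=10.0),
    "model-b": Price(input_per_1m=0.5, output_per_1m=2.0),
}
costs = {name: estimate_call_cost(shape, price) for name, price in prices.items()}
cheapest = min(costs, key=costs.get)
ratio = equivalent_calls(costs["model-a"], costs["model-b"])
```

With these made-up prices, one `model-a` call costs about as much as four and a half `model-b` calls. As the note above says, such figures are planning estimates only.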
Phase 1b: Model Suites For Experiments¶
Add reusable model suites so application code can get an ordered list or keyed dict of candidate models for comparisons, LangGraph node variants, and provider experiments.
Implemented surfaces:
get_model_suite(...) for preset-backed suites from current AppSettings.
model_suite_from_catalog(...) for dynamic suites from the same catalog filters used by ooai-llm models list.
ModelSuite.model_list(), ModelSuite.model_dict(), ModelSuite.to_profiles(), ModelSuite.create_llms(), and ModelSuite.create_runtimes().
Suite-level parallel_tool_calls forwarding into generated profiles for deterministic tool-loop comparisons.
ooai-llm models suite for table, JSON, or CSV output.
Example:
suite = get_model_suite("comparison", providers=["openai", "anthropic"])
profiles = suite.filter(roles=["cheap", "balanced"]).to_profiles()
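To make the "ordered list or keyed dict" idea concrete, here is a small stand-in for a suite. `SuiteMember` and `SuiteSketch` are hypothetical, the model strings are placeholders, and keying the dict by role is an assumption; the real `ModelSuite` may key and filter differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteMember:
    """One candidate model with the role it plays in the suite (hypothetical)."""
    role: str
    model: str


@dataclass(frozen=True)
class SuiteSketch:
    """Ordered, immutable collection of suite members (hypothetical)."""
    members: tuple

    def model_list(self):
        """Ordered list of model strings, e.g. for looping over candidates."""
        return [member.model for member in self.members]

    def model_dict(self):
        """Role-keyed dict, e.g. for wiring graph node variants by name."""
        return {member.role: member.model for member in self.members}

    def filter(self, roles):
        """Return a new suite containing only the requested roles."""
        wanted = set(roles)
        return SuiteSketch(tuple(m for m in self.members if m.role in wanted))


# Placeholder model strings, not real catalog entries.
suite = SuiteSketch((
    SuiteMember("cheap", "provider-a:small"),
    SuiteMember("balanced", "provider-b:medium"),
    SuiteMember("frontier", "provider-a:large"),
))
subset = suite.filter(roles=["cheap", "balanced"])
```

Returning a new suite from `filter` keeps the original ordering intact, so the same preset can feed several experiments.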
Phase 2: Serializable Chat Model Profiles¶
Add a Pydantic ChatModelProfile that can be serialized, checked into config,
and used to create a LangChain BaseChatModel.
The profile captures:
Model selection: model, provider, alias, preset.
Runtime kwargs: temperature, max_tokens, top_p, penalties, seed, stop, max_retries, parallel_tool_calls, timeout, streaming, model_kwargs, and other safe constructor kwargs.
Reasoning config through the existing ReasoningConfig.
Cache policy, cache namespace, and cache key strategy.
Tags, metadata, run names, and cost-accounting labels.
Optional model-default auto-refresh behavior.
Expected shape:
profile = ChatModelProfile(model="openai:gpt-5.4-mini", temperature=0)
llm = profile.create_llm(settings=settings)
bundle = profile.create_bundle(settings=settings)
Current milestone target:
JSON v1 profile serialization.
Profile CLI commands for validate, render, and resolve.
Namespaced cache key policy backed by a LangChain cache wrapper.
Profile IDs that flow into runtime metadata.
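A versioned JSON round trip and a namespaced cache key could look roughly like the following. This sketch uses a stdlib dataclass instead of Pydantic so that it runs without dependencies; `ProfileSketch`, `namespaced_cache_key`, the `version` field, and the key layout are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProfileSketch:
    """Dataclass stand-in for a serializable profile (hypothetical fields)."""
    id: str
    model: str
    temperature: float = 0.0
    cache_namespace: str = "default"
    tags: tuple = ()

    def to_json(self) -> str:
        # Embed an explicit schema version so later formats can be detected.
        return json.dumps({"version": 1, **asdict(self)}, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "ProfileSketch":
        payload = json.loads(text)
        if payload.pop("version", None) != 1:
            raise ValueError("unsupported profile version")
        payload["tags"] = tuple(payload["tags"])  # JSON arrays load as lists
        return cls(**payload)


def namespaced_cache_key(profile: ProfileSketch, prompt: str) -> str:
    """Hash namespace, profile id, and prompt so namespaces never collide."""
    material = json.dumps([profile.cache_namespace, profile.id, prompt])
    digest = hashlib.sha256(material.encode("utf-8")).hexdigest()
    return f"{profile.cache_namespace}:{digest}"


profile = ProfileSketch(id="cheap-default", model="provider-a:small",
                        cache_namespace="team-a", tags=("demo",))
restored = ProfileSketch.from_json(profile.to_json())
key_a = namespaced_cache_key(profile, "hello")
key_b = namespaced_cache_key(
    ProfileSketch(id="cheap-default", model="provider-a:small",
                  cache_namespace="team-b"),
    "hello",
)
```

Because the namespace is part of the hashed material, two teams sending the same prompt through the same profile id still get distinct cache entries.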
Phase 3: LLM Runtime Wrapper¶
Add an LLM class that owns a profile, settings, and a private runnable.
The runtime should:
Lazily build and cache the underlying chat model as a private attribute.
Rebuild the runnable when the profile or refreshed defaults change.
Expose invoke, ainvoke, stream, and batch style methods.
Attach usage/cost callbacks consistently for LangChain and LangGraph.
Expose accumulated usage totals and per-model/per-run summaries.
Expose stable id and uuid values that are attached to LangChain runtime metadata, usage events, and logs.
Current milestone target:
Lazy runtime construction from ChatModelProfile.
Observed post-call usage/cost recording from LangChain response metadata.
Optional ultilog logging facade with stdlib fallback.
Provider preflight token counting remains a follow-up phase.
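The lazy-build and rebuild-on-change behavior described for this phase can be sketched with a fake model factory. `ProfileSketch`, `RuntimeSketch`, and `fake_factory` are hypothetical; a real runtime would build a LangChain chat model where this sketch returns a plain object.

```python
from dataclasses import dataclass
from uuid import uuid4


@dataclass(frozen=True)
class ProfileSketch:
    """Minimal immutable profile used to detect configuration changes."""
    id: str
    model: str
    temperature: float = 0.0


class RuntimeSketch:
    """Owns a profile and lazily builds a private runnable (hypothetical)."""

    def __init__(self, profile, factory):
        self.profile = profile
        self.uuid = str(uuid4())   # identifies this concrete runtime instance
        self._factory = factory
        self._runnable = None
        self._built_from = None

    @property
    def id(self):
        # Assumption: the runtime id defaults to the profile id.
        return self.profile.id

    @property
    def runnable(self):
        # Build on first access, and rebuild if the profile has changed since.
        if self._runnable is None or self._built_from != self.profile:
            self._runnable = self._factory(self.profile)
            self._built_from = self.profile
        return self._runnable


build_calls = []


def fake_factory(profile):
    """Stand-in for real model construction; records each build."""
    build_calls.append(profile.model)
    return object()


runtime = RuntimeSketch(ProfileSketch("cheap", "provider-a:small"), fake_factory)
first = runtime.runnable
second = runtime.runnable            # cached, no rebuild
runtime.profile = ProfileSketch("cheap", "provider-a:small", temperature=0.7)
third = runtime.runnable             # profile changed, so rebuilt
```

Comparing frozen profiles by value keeps the rebuild check cheap and avoids rebuilding when an equal profile is reassigned.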
Phase 4: Agent And Middleware Integration¶
Keep ooai-llm focused on model profiles, runtime objects, cost accounting,
context accounting, and LangChain-compatible adapters. A future ooai-agents
package can build higher-level graph templates, agent profiles, memory, and
workflow orchestration on top of these primitives.
Package boundary for the future ooai-agents package:
ooai-llm owns provider/model concerns: model strings, model catalog, profiles, model suites, runtime construction, caches, reasoning kwargs, usage/cost accounting, context snapshots, and LangChain runnable adapters.
ooai-agents owns agent concerns: agent profiles, graph templates, node policies, memory plans, task decomposition, subagent specs, skill loading, sandbox/filesystem strategy, human-in-the-loop policy, and deployment workflow.
ooai-llm may expose lightweight adapter primitives that are useful without an agent package, but it should avoid owning long-running agent orchestration.
ooai-agents should consume ooai-llm through stable contracts such as ChatModelProfile, LLM, LLMRegistry, UsageRecorder, ContextSnapshot, and runtime.runnable.
Avoid dependency cycles. ooai-llm must not import ooai-agents; the agents package imports and composes ooai-llm.
Keep LangChain and Deep Agents integration behind thin adapter layers so ooai-agents can eventually support other runtimes without rewriting model accounting or catalog logic.
Integration targets from current LangChain and Deep Agents docs:
LangChain create_agent accepts a model string or initialized chat model and builds a LangGraph-backed agent runtime.
LangChain middleware supports dynamic model selection with wrap_model_call, dynamic system prompts with dynamic_prompt, dynamic tool filtering through request overrides, tool-error handling with wrap_tool_call, and runtime context through request.runtime.context.
LangChain custom middleware has node-style hooks (before_agent, before_model, after_model, after_agent) and wrap-style hooks (wrap_model_call, wrap_tool_call).
Deep Agents create_deep_agent accepts a provider:model string or an initialized chat model instance, and it layers planning, todo, filesystem, subagent, skills, backend, and summarization middleware around that model.
Deep Agents subagents can each define their own model, tools, system prompt, middleware, response format, filesystem permissions, and skills.
Deep Agents already owns deep-agent memory and context mechanics such as filesystem backends, store-backed long-term memory, tool-specific prompt guidance, subagent context isolation, offloading, and summarization.
Design principles:
ChatModelProfile.id is the stable human-readable configuration identifier.
LLM.id is the stable runtime routing identifier, defaulting to the profile id when available.
LLM.uuid identifies a concrete runtime instance and should flow into logs, LangChain runtime metadata, usage events, traces, and future graph/node diagnostics.
The registry layer should support lookups by key, profile id, runtime id, and UUID so agent middleware can route without rebuilding models unnecessarily.
Middleware should override LangChain requests with the selected runtime.runnable rather than making LLM pretend to be a full agent. LLM remains the model/runtime/accounting object; agent orchestration belongs in ooai-agents.
Keep the base model object separate from augmented, agent-ready bindings. A base LLM should know how to construct and account for a model; an augmented runtime should know how to attach prompts, tools, structured output, tool selection, and context policies for a specific agent turn.
Planned primitives:
LLMRegistry: keyed container for profiles and lazy LLM runtimes.
LLMRegistry.get(...): retrieve a runtime by key, runtime id, profile id, or UUID.
LLMRegistry.model_dict(...): expose selected models for routing tables, dashboards, and LangGraph node configuration.
LLMRegistry.usage_summary: aggregate usage across all registered runtimes through a shared UsageRecorder.
ContextBudget: requested/reserved token budget, warning threshold, and hard-limit policy.
ContextSnapshot: model, max context tokens, estimated input tokens, reserved output tokens, remaining tokens, context-used percentage, and count source.
AugmentedLLM or LLMBinding: request-ready wrapper around a base LLM plus prompt, tools, response format, context budget, and tool policy.
PromptPolicy: static system prompt, LangChain prompt template, ChatPromptTemplate, callable prompt builder, plus dynamic prompt blocks derived from state, runtime context, store, or previous tool results.
ToolPolicy: available tools, tool-selection strategy, parallel-tool-call preference, retry behavior, timeout behavior, and error handling.
ToolSelectionStrategy: static tools, all tools, filtered tools, semantic selection, permission-aware selection, or runtime-discovered tools.
AgentRoutingDecision: recorded model/tool/prompt choice for each agentic split so routing can be debugged and compared later.
AgentInjection: a lightweight export object that contains model, tools, middleware, system_prompt, response_format, context_schema, and optional Deep Agents-specific fields such as subagents or backend/store hints.
LangChain v1 middleware helpers for model routing, usage recording, budget checks, context snapshotting, dynamic prompt injection, and tool filtering.
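The multi-identifier lookup planned for the registry might be sketched as follows. `RuntimeStub` and `RegistrySketch` are hypothetical, and the lookup order (registry key first, then runtime id, profile id, and UUID) is an assumption rather than a settled design.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class RuntimeStub:
    """Stand-in for an LLM runtime carrying the three identifiers."""
    id: str
    profile_id: str
    uuid: str = field(default_factory=lambda: str(uuid4()))


class RegistrySketch:
    """Keyed container that resolves a runtime from any known identifier."""

    def __init__(self):
        self._by_key = {}

    def register(self, key, runtime):
        self._by_key[key] = runtime

    def get(self, ref):
        # Assumed precedence: registry key wins over other identifiers.
        if ref in self._by_key:
            return self._by_key[ref]
        for runtime in self._by_key.values():
            if ref in (runtime.id, runtime.profile_id, runtime.uuid):
                return runtime
        raise KeyError(ref)

    def model_dict(self):
        """Key-to-runtime-id mapping, e.g. for routing tables."""
        return {key: runtime.id for key, runtime in self._by_key.items()}


registry = RegistrySketch()
coding = RuntimeStub(id="coding-runtime", profile_id="profile-coding")
registry.register("coding", coding)
```

Because every lookup returns the already registered object, middleware can route by whichever identifier it has at hand without constructing a new model.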
Runnable and middleware compatibility contract:
LLM.runnable is the official LangChain interop boundary. It returns the underlying chat model object created by LangChain’s init_chat_model path and should remain suitable anywhere LangChain expects a chat model or Runnable.
LLM.invoke(...), LLM.ainvoke(...), LLM.batch(...), and LLM.stream(...) are convenience methods that add ooai usage/cost recording around the same runnable rather than defining a separate execution protocol.
LLM should not subclass provider-specific chat models unless LangChain requires it. Composition keeps provider behavior, tool calling, streaming, structured output, and future LangChain changes behind the native runnable.
Middleware should select a runtime and call request.override(model=runtime.runnable). The selected runtime’s id, uuid, profile id, tags, and cost labels should be injected into request metadata or config before dispatch.
Registry and middleware helpers should avoid importing LangChain middleware symbols at package import time. Use lazy imports so core ooai-llm remains usable without optional agent/middleware dependencies.
If a future LangChain v2 changes middleware classes, the adapter layer should be the only code that changes. Profiles, registries, usage events, context snapshots, and LLM.runnable should remain stable.
Any LangGraph node should be able to use either runtime.runnable directly or runtime.invoke(...) when ooai-managed usage recording is desired.
Base versus augmented LLM split:
LLM
├── Owns: profile, model construction, runnable, ids, usage recorder, cache
├── Stable across: requests, graph nodes, model registry, dashboards
└── Does not own: agent prompt, selected tools, response schema, memory policy
AugmentedLLM / LLMBinding
├── Owns: prompt policy, tool policy, response format, context budget
├── Built from: one base LLM plus per-agent/per-turn state
└── Produces: a bound runnable or model-call request override
This keeps model identity, pricing, cache behavior, and usage accounting stable while allowing prompts and tools to change per state, user, task, graph node, or middleware decision. It also prevents a cached base model from being polluted by request-specific tool bindings.
Tool policy shape:
tool_policy = ToolPolicy(
    selection="permission-aware",
    allow_parallel_tool_calls=False,
    max_tool_retries=2,
    retry_on_tool_errors=True,
    tool_timeout_seconds=30,
)
Tool binding should be late and request-scoped:
Select the base runtime from LLMRegistry.
Build a context snapshot for the current messages, system prompt, selected tools, and reserved output budget.
Select or filter tools according to state, runtime context, permissions, and context pressure.
Bind tools or structured output to runtime.runnable only for the current request.
Record an AgentRoutingDecision with selected model id, runtime UUID, prompt policy id, selected tool names, context-used percentage, and reason.
Dispatch through LangChain middleware with request.override(model=bound_model).
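The steps above can be sketched with stand-in objects to show why late binding keeps the cached base model clean. `ModelStub`, `RoutingDecision`, and `bind_for_request` are hypothetical; the only behavior borrowed from LangChain is the convention that binding tools returns a new object instead of mutating the model.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ModelStub:
    """Stand-in for a chat model; bind_tools returns a copy, never mutates."""
    name: str
    tools: tuple = ()

    def bind_tools(self, tools):
        return replace(self, tools=tuple(tools))


@dataclass(frozen=True)
class RoutingDecision:
    """Recorded choice for one request (subset of the planned fields)."""
    model_id: str
    tool_names: tuple
    context_used_percent: float
    reason: str


def bind_for_request(base, all_tools, permissions, context_used_percent):
    """Filter tools by permission, bind them for this request, record why."""
    tools = [tool for tool in all_tools if tool in permissions]
    bound = base.bind_tools(tools)
    decision = RoutingDecision(
        model_id=base.name,
        tool_names=tuple(tools),
        context_used_percent=context_used_percent,
        reason="permission-aware filter",
    )
    return bound, decision


base = ModelStub(name="coding")
bound, decision = bind_for_request(
    base,
    all_tools=["read_file", "edit_file", "run_tests"],
    permissions={"read_file"},
    context_used_percent=42.0,
)
```

The base stub still has no tools after the call, so a later request with different permissions starts from the same clean runtime.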
Prompt compatibility goals:
Support LangChain v1 dynamic prompts by generating or replacing request.system_message inside middleware.
Support older LangChain prompt concepts by accepting string prompts, PromptTemplate, ChatPromptTemplate, message-template lists, and plain callables that receive state/runtime context.
Keep prompt rendering separate from model construction. A profile picks the model; a prompt policy renders the agent/request prompt.
Preserve structured message content blocks when appending dynamic context, especially for provider-specific features like prompt caching.
Make prompt inputs explicit so dynamic prompts can be tested without invoking an LLM.
Expected create_agent integration:
augmented = registry.augment(
    "coding",
    prompt=PromptPolicy.from_chat_template(repo_review_prompt),
    tools=ToolPolicy(
        tools=[read_file, edit_file, run_tests],
        selection="filtered",
        allow_parallel_tool_calls=False,
        max_tool_retries=2,
    ),
    context=ContextBudget(max_used_percent=80, reserve_output_tokens=2_000),
)
agent = create_agent(**augmented.to_create_agent_kwargs())
The augmented object should be injectable into LangChain in two styles:
Simple style: pass model=augmented.default_model, tools=augmented.all_tools, and middleware=augmented.middleware.
Explicit style: pass the base runtime.runnable yourself and use only the middleware pieces you want.
Export style: call augmented.to_create_agent_kwargs() to get the exact keyword arguments for create_agent(...).
Expected create_deep_agent integration:
deep_agent = create_deep_agent(
    **augmented.to_deep_agent_kwargs(
        subagents=[
            registry.augment(
                "research",
                prompt=PromptPolicy.static("Research only. Return citations."),
                tools=ToolPolicy(tools=[web_search, read_file]),
            ).to_subagent(name="research-agent"),
        ],
    ),
    backend=backend,
    store=store,
)
Deep Agents compatibility notes:
Prefer passing runtime.runnable when model parameters, provider-specific constructor kwargs, reasoning config, cache behavior, or ooai metadata must be preserved. Passing a model string is acceptable for simple examples.
Do not duplicate Deep Agents filesystem, backend, long-term memory, skills, todo, or summarization middleware inside ooai-llm. Expose hooks and metadata that let those systems be observed and configured.
Treat Deep Agents subagents as separate augmented bindings with their own runtime UUID, prompt policy, tool policy, response format, and cost labels.
Capture parent/child runtime UUIDs in AgentRoutingDecision so subagent spend and context isolation can be traced.
Keep system_prompt as user/task instructions. Do not overwrite Deep Agents’ built-in prompt assembly; let Deep Agents append its harness/tool guidance.
Avoid pre-binding tools to models when structured output or dynamic model selection is active. Prefer request-scoped binding or request.override(...) inside middleware.
ContextSnapshot should complement Deep Agents’ built-in offloading and summarization by exposing context-used percentage and budget warnings before the deep-agent harness decides to compress.
Expected ooai-agents consumption shape:
registry = LLMRegistry.from_profiles(
    {
        "default": ChatModelProfile(model="openai:gpt-5.4-mini"),
        "research": ChatModelProfile(model="anthropic:claude-sonnet-4.6"),
    }
)
agent_profile = AgentProfile(
    name="research_assistant",
    default_llm="default",
    subagents=[
        SubagentProfile(
            name="research",
            llm="research",
            tools=["web_search", "read_file"],
        )
    ],
)
agent = create_ooai_agent(
    profile=agent_profile,
    llms=registry,
)
In this shape, ooai-agents decides what an agent is. ooai-llm supplies the
registered model runtimes, UUIDs, usage/cost recorder, catalog metadata,
context percentages, and LangChain-compatible runnable objects.
Old prompt-template compatibility shape:
from langchain_core.prompts import ChatPromptTemplate

repo_review_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are reviewing repo {repo_name}. Focus on {focus_area}."),
        ("placeholder", "{messages}"),
    ]
)
prompt_policy = PromptPolicy.from_chat_template(
    repo_review_prompt,
    input_mapper=lambda request: {
        "repo_name": request.runtime.context.repo_name,
        "focus_area": request.state.get("focus_area", "correctness"),
        "messages": request.messages,
    },
)
Expected LangChain v1 shape:
registry = LLMRegistry.from_profiles(
    {
        "cheap": ChatModelProfile(model="openai:gpt-5.4-mini", temperature=0),
        "coding": ChatModelProfile(model="anthropic:claude-sonnet-4.6"),
    }
)
middleware = registry.model_router(
    selector=lambda request, registry: "coding"
    if request.state.get("task_type") == "code"
    else "cheap"
)
agent = create_agent(
    model=registry["cheap"].runnable,
    tools=tools,
    middleware=[middleware],
)
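The routing idea in the shape above can be exercised without LangChain installed. In this sketch `RequestStub` and `make_model_router` are hypothetical stand-ins: the stub only mimics the `request.override(model=...)` and `handler(request)` pattern referenced in this document, and strings stand in for real runnables.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class RequestStub:
    """Stand-in for a middleware model request with an override() helper."""
    model: object
    state: dict = field(default_factory=dict)

    def override(self, **changes):
        # Return a modified copy; the original request is left untouched.
        return replace(self, **changes)


def make_model_router(runnables, selector):
    """Build a wrap-style hook that swaps the model before dispatch."""
    def wrap_model_call(request, handler):
        key = selector(request)
        return handler(request.override(model=runnables[key]))
    return wrap_model_call


# Strings stand in for runtime.runnable objects.
runnables = {"cheap": "cheap-runnable", "coding": "coding-runnable"}
router = make_model_router(
    runnables,
    lambda request: "coding" if request.state.get("task_type") == "code" else "cheap",
)


def handler(request):
    """Fake model call that reports which model it was given."""
    return request.model


code_result = router(
    RequestStub(model=runnables["cheap"], state={"task_type": "code"}), handler
)
chat_result = router(RequestStub(model=runnables["cheap"], state={}), handler)
```

Keeping the selector a plain function of the request makes routing rules testable without invoking any model.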
Context-budget shape:
snapshot = registry["coding"].context_snapshot(
    messages=state["messages"],
    reserve_output_tokens=2_000,
)
print(snapshot.context_used_percent)
print(snapshot.remaining_context_tokens)
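One plausible way to derive the two printed values is shown below. `SnapshotSketch` and `within_budget` are hypothetical, and the sketch assumes that reserved output tokens count toward the used portion of the window; the package may define the percentage differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotSketch:
    """Stand-in for a context snapshot (subset of the planned fields)."""
    max_context_tokens: int
    estimated_input_tokens: int
    reserved_output_tokens: int
    count_source: str

    @property
    def used_tokens(self) -> int:
        # Assumption: the output reservation counts as used context.
        return self.estimated_input_tokens + self.reserved_output_tokens

    @property
    def remaining_context_tokens(self) -> int:
        return max(self.max_context_tokens - self.used_tokens, 0)

    @property
    def context_used_percent(self) -> float:
        return 100.0 * self.used_tokens / self.max_context_tokens


def within_budget(snapshot: SnapshotSketch, max_used_percent: float) -> bool:
    """True when the snapshot stays at or under the allowed percentage."""
    return snapshot.context_used_percent <= max_used_percent


# Illustrative numbers, not tied to any real model's context window.
snapshot = SnapshotSketch(
    max_context_tokens=200_000,
    estimated_input_tokens=150_000,
    reserved_output_tokens=2_000,
    count_source="approximation",
)
```

With these numbers the window is 76 percent used with 48,000 tokens remaining, which would pass an 80 percent budget and fail a 75 percent one.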
Middleware responsibilities:
wrap_model_call: choose the runtime, attach the runnable, optionally filter tools, and add runtime/profile UUID metadata.
before_model: compute context percentage, trim or summarize messages, and inject dynamic system prompt blocks.
after_model: record response usage/cost, validate output, and persist telemetry.
wrap_tool_call: monitor tool calls and attach tool latency/error metadata to the same run/runtime identifiers.
Open questions to settle before implementation:
Whether LLMRegistry should live at top level or under an optional ooai_llm.agents module.
Whether exact provider preflight token counting belongs in the registry, LLM, or a separate token-estimation service.
How much middleware should be shipped in ooai-llm versus deferred to the future ooai-agents package.
Whether UUIDs should be purely runtime-generated or optionally persisted in checked-in profile registries for reproducible graph wiring.
Whether Deep Agents integration belongs in ooai-llm as optional adapters or only in the future ooai-agents package.
Whether AgentInjection should be a Pydantic model, a typed dict, or a small protocol so it can remain independent of LangChain import-time internals.
Which prompt and tool policy objects should live in ooai-llm as minimal adapter contracts versus in ooai-agents as full orchestration concepts.
Token Counting Notes¶
Core rule: use the same tokenizer or provider counting method as the model and payload shape you will actually call.
Local tokenizers are useful for fast estimates, but provider APIs are safer for real chat payloads with tools, JSON schemas, files, images, cached context, reasoning tokens, or provider-side wrappers.
Recommended layers:
Local estimate: tiktoken for OpenAI-family text, transformers or sentencepiece for open models.
Provider preflight: OpenAI Responses input-token count, Anthropic messages.count_tokens, Gemini count_tokens, or LiteLLM when it can route to provider-specific counters.
Post-call accounting: provider response usage metadata, LangChain callbacks, LangGraph callbacks, or LlamaIndex token-counting handlers.
Cost projection: LiteLLM pricing metadata, tokencost, or package-owned pricing tables, with provider usage metadata as the billing source of truth.
The runtime should preserve provenance for every token count:
count_source: Literal[
    "provider_preflight",
    "provider_usage_metadata",
    "local_tokenizer",
    "framework_callback",
    "approximation",
]
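Carrying the source alongside each count makes it possible to prefer the most trustworthy figure when several are available. The sketch below is illustrative only: `TokenCount`, `approximate_tokens`, `best_count`, and the `TRUST_ORDER` ranking are assumptions, and the four-characters-per-token fallback is a rough rule of thumb for English text, not an exact tokenizer.

```python
from dataclasses import dataclass
from typing import Literal

CountSource = Literal[
    "provider_preflight",
    "provider_usage_metadata",
    "local_tokenizer",
    "framework_callback",
    "approximation",
]

# Assumed trust ranking, most trusted first; the real policy may differ.
TRUST_ORDER = (
    "provider_usage_metadata",
    "provider_preflight",
    "framework_callback",
    "local_tokenizer",
    "approximation",
)


@dataclass(frozen=True)
class TokenCount:
    """A token count that always carries where it came from."""
    tokens: int
    count_source: CountSource


def approximate_tokens(text: str) -> TokenCount:
    """Last-resort estimate: roughly four characters per token."""
    return TokenCount(max(1, len(text) // 4), "approximation")


def best_count(counts):
    """Pick the count whose source ranks highest in TRUST_ORDER."""
    return min(counts, key=lambda count: TRUST_ORDER.index(count.count_source))


estimate = approximate_tokens("a" * 40)
preflight = TokenCount(12, "provider_preflight")
chosen = best_count([estimate, preflight])
```

Downstream code can then report both the number and its source, so an approximate count is never mistaken for a billed one.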
Provider-specific notes:
OpenAI: local tiktoken is good for plain text estimates; the OpenAI input-token-count endpoint is preferred for exact Responses-style payloads.
Anthropic: use messages.count_tokens for Claude request preflight, especially with tools and multimodal content.
Gemini: use google-genai count_tokens before generation.
LiteLLM: useful as a high-level multi-provider interface, but exactness depends on provider-specific support for the requested model and payload.
Hugging Face/open models: use the exact model tokenizer through transformers.AutoTokenizer; use tokenizers or sentencepiece for lower level or custom-tokenizer work.
LangChain/LangGraph: use framework callbacks for post-call accounting and chat-history trimming, not as a universal exact preflight billing source.