Roadmap Notes

This page tracks planned ergonomics improvements. They are implemented in phases so that each phase can be tested and verified independently.

Phase 1: Model Catalog Filtering

Improve ooai-llm models list and the matching Python API so users can inspect models across providers without manually remembering provider-specific names.

Implemented filters:

Provider filters: one provider, repeated providers, or all providers.
Capability filters: chat, reasoning, coding, vision, function_calling, and cheap.
Date filters: filters on release dates or date-like model-name suffixes, such as --released-after and --released-before.
Cost filters: maximum input and output USD cost per one million tokens.
Context filters: minimum context/input-token window.
Sort modes: recency, provider, model name, total cost, input cost, output cost, and context window.
Output formats: Rich table when available, plain table fallback, JSON, and CSV.

A practical model inventory command:

ooai-llm models list \
  --source litellm \
  --providers openai,anthropic,mistral \
  --reasoning-only \
  --released-after 2026-01 \
  --max-output-cost-per-1m 200 \
  --sort output_cost
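
A minimal sketch of how the filters above might compose in Python. CatalogEntry, filter_catalog, the field names, the model names, and the prices are all hypothetical placeholders, not the package's schema or real catalog data. Treating --released-after as an exclusive bound is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    # Hypothetical record; field names are illustrative only.
    provider: str
    model: str
    capabilities: frozenset
    released: str               # "YYYY-MM", so string comparison orders by date
    output_cost_per_1m: float   # USD per one million output tokens


def filter_catalog(entries, providers=None, capability=None,
                   released_after=None, max_output_cost_per_1m=None):
    """Apply provider, capability, date, and cost filters, then sort by output cost."""
    selected = []
    for entry in entries:
        if providers is not None and entry.provider not in providers:
            continue
        if capability is not None and capability not in entry.capabilities:
            continue
        # Assumption: "released after" excludes the boundary month itself.
        if released_after is not None and entry.released <= released_after:
            continue
        if (max_output_cost_per_1m is not None
                and entry.output_cost_per_1m > max_output_cost_per_1m):
            continue
        selected.append(entry)
    return sorted(selected, key=lambda e: e.output_cost_per_1m)


# Placeholder entries with made-up names and prices.
catalog = [
    CatalogEntry("openai", "example-reasoner", frozenset({"chat", "reasoning"}), "2026-02", 12.0),
    CatalogEntry("anthropic", "example-thinker", frozenset({"chat", "reasoning"}), "2026-03", 8.0),
    CatalogEntry("mistral", "example-chat", frozenset({"chat"}), "2026-03", 1.0),
    CatalogEntry("openai", "example-old", frozenset({"chat", "reasoning"}), "2025-06", 4.0),
]

matches = filter_catalog(
    catalog,
    providers={"openai", "anthropic", "mistral"},
    capability="reasoning",
    released_after="2026-01",
    max_output_cost_per_1m=200,
)
for entry in matches:
    print(entry.provider, entry.model, entry.output_cost_per_1m)
```

Each filter is optional and skipped when left as None, which mirrors how the CLI flags can be combined freely.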

Phase 1a: Catalog Cost Intelligence

Add a planning layer over the model catalog so users can answer questions like “what is cheapest for a 10k input / 2k output call?”, “which coding models are available?”, and “how many calls of one model cost the same as one call of another?”

Implemented surfaces:

ModelCallShape for explicit input/output token assumptions.
compare_model_catalog(...) and compare_model_candidates(...) for cost-ranked model estimates.
get_cheapest_models(...) and get_coding_model_comparison(...) convenience helpers.
ModelCostComparison.model_list(), model_dict(), by_provider(), cheapest_per_provider(), and equivalents(...).
ooai-llm models compare with provider, date, capability, input/output token-limit, cost, coding/reasoning, tool-calling, structured-output, per-provider, budget, and baseline-ratio filters.

Example:

ooai-llm models compare \
  --source litellm \
  --providers openai,anthropic,google,deepseek,mistral \
  --input-tokens 10000 \
  --output-tokens 2000 \
  --tool-calling-only \
  --structured-output-only \
  --sort output_tokens \
  --per-provider

Catalog prices are estimates for planning. Observed provider usage metadata is still the source of truth for real post-call accounting.
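
The arithmetic behind the planning questions above can be sketched as follows. CallShape and estimate_call_cost are hypothetical stand-ins (not the package's ModelCallShape API), and the per-million-token prices are placeholders chosen only to make the numbers easy to follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallShape:
    """Explicit token assumptions for one call (hypothetical helper)."""
    input_tokens: int
    output_tokens: int


def estimate_call_cost(shape, input_cost_per_1m, output_cost_per_1m):
    """Estimated USD cost of one call, given prices per one million tokens."""
    return (shape.input_tokens * input_cost_per_1m
            + shape.output_tokens * output_cost_per_1m) / 1_000_000


shape = CallShape(input_tokens=10_000, output_tokens=2_000)

# Placeholder prices, not real catalog data.
cheap = estimate_call_cost(shape, input_cost_per_1m=0.5, output_cost_per_1m=2.0)
premium = estimate_call_cost(shape, input_cost_per_1m=5.0, output_cost_per_1m=20.0)

# "How many calls of one model cost the same as one call of another?"
equivalent_calls = premium / cheap
print(f"cheap={cheap:.4f} USD, premium={premium:.4f} USD, ratio={equivalent_calls:.1f}")
```

The ratio depends on the call shape, which is why the token assumptions are explicit inputs rather than defaults.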

Phase 1b: Model Suites For Experiments

Add reusable model suites so application code can get an ordered list or keyed dict of candidate models for comparisons, LangGraph node variants, and provider experiments.

Implemented surfaces:

get_model_suite(...) for preset-backed suites from current AppSettings.
model_suite_from_catalog(...) for dynamic suites from the same catalog filters used by ooai-llm models list.
ModelSuite.model_list(), ModelSuite.model_dict(), ModelSuite.to_profiles(), ModelSuite.create_llms(), and ModelSuite.create_runtimes().
Suite-level parallel_tool_calls forwarding into generated profiles for deterministic tool-loop comparisons.
ooai-llm models suite for table, JSON, or CSV output.

Example:

suite = get_model_suite("comparison", providers=["openai", "anthropic"])
profiles = suite.filter(roles=["cheap", "balanced"]).to_profiles()

Phase 2: Serializable Chat Model Profiles

Add a Pydantic ChatModelProfile that can be serialized, checked into config, and used to create a LangChain BaseChatModel.

The profile captures:

Model selection: model, provider, alias, preset.
Runtime kwargs: temperature, max_tokens, top_p, penalties, seed, stop, max_retries, parallel_tool_calls, timeout, streaming, model_kwargs, and other safe constructor kwargs.
Reasoning config through the existing ReasoningConfig.
Cache policy, cache namespace, and cache key strategy.
Tags, metadata, run names, and cost-accounting labels.
Optional model-default auto-refresh behavior.

Expected shape:

profile = ChatModelProfile(model="openai:gpt-5.4-mini", temperature=0)
llm = profile.create_llm(settings=settings)
bundle = profile.create_bundle(settings=settings)

Current milestone target:

JSON v1 profile serialization.
Profile CLI commands for validate, render, and resolve.
Namespaced cache key policy backed by a LangChain cache wrapper.
Profile IDs that flow into runtime metadata.
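
A dependency-free sketch of versioned JSON serialization and a namespaced cache key. The roadmap calls for a Pydantic model; this uses a stdlib dataclass purely to stay self-contained. ProfileSketch, the "version" field, and the key layout (namespace, colon, SHA-256 digest) are assumptions for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ProfileSketch:
    # Hypothetical subset of profile fields.
    model: str
    temperature: Optional[float] = None
    cache_namespace: str = "default"

    def to_json(self) -> str:
        # sort_keys keeps the output stable so it can be diffed and hashed.
        return json.dumps({"version": 1, **asdict(self)}, sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "ProfileSketch":
        data = json.loads(raw)
        data.pop("version", None)  # a real loader would branch on the version
        return cls(**data)

    def cache_key(self, prompt: str) -> str:
        # The key changes whenever the profile or the prompt changes.
        payload = (self.to_json() + "\n" + prompt).encode("utf-8")
        return f"{self.cache_namespace}:{hashlib.sha256(payload).hexdigest()}"


profile = ProfileSketch(model="openai:gpt-5.4-mini", temperature=0)
restored = ProfileSketch.from_json(profile.to_json())
print(restored == profile)
print(profile.cache_key("hello")[:20])
```

Prefixing the digest with the namespace lets two applications share one cache backend without colliding on identical prompts.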

Phase 3: LLM Runtime Wrapper

Add an LLM class that owns a profile, settings, and a private runnable.

The runtime should:

Lazily build and cache the underlying chat model as a private attribute.
Rebuild the runnable when the profile or refreshed defaults change.
Expose invoke, ainvoke, stream, and batch style methods.
Attach usage/cost callbacks consistently for LangChain and LangGraph.
Expose accumulated usage totals and per-model/per-run summaries.
Expose stable id and uuid values that are attached to LangChain runtime metadata, usage events, and logs.

Current milestone target:

Lazy runtime construction from ChatModelProfile.
Observed post-call usage/cost recording from LangChain response metadata.
Optional ultilog logging facade with stdlib fallback.
Provider preflight token counting remains a follow-up phase.
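
The lazy-build and rebuild-on-change behavior could look roughly like this. LazyRuntime, update_profile, and the factory callable are hypothetical; a real runtime would hold a ChatModelProfile and build a LangChain chat model instead of the tuple used here.

```python
import uuid


class LazyRuntime:
    """Builds the underlying model on first use and rebuilds when the profile changes."""

    def __init__(self, profile, factory):
        self.uuid = str(uuid.uuid4())   # identifies this concrete runtime instance
        self._profile = dict(profile)
        self._factory = factory
        self._runnable = None           # private cache, filled lazily
        self._built_from = None         # profile snapshot used for the cached build

    @property
    def runnable(self):
        if self._runnable is None or self._built_from != self._profile:
            self._runnable = self._factory(self._profile)
            self._built_from = dict(self._profile)
        return self._runnable

    def update_profile(self, **changes):
        # The next access to .runnable notices the difference and rebuilds.
        self._profile.update(changes)


builds = []  # records every construction so the caching is observable


def fake_factory(profile):
    builds.append(dict(profile))
    return ("fake-model", profile["model"], profile.get("temperature"))


runtime = LazyRuntime({"model": "example:model", "temperature": 0}, fake_factory)
first = runtime.runnable
second = runtime.runnable            # served from the cache
runtime.update_profile(temperature=0.7)
third = runtime.runnable             # rebuilt because the profile changed
print(len(builds))
```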

Phase 4: Agent And Middleware Integration

Keep ooai-llm focused on model profiles, runtime objects, cost accounting, context accounting, and LangChain-compatible adapters. A future ooai-agents package can build higher-level graph templates, agent profiles, memory, and workflow orchestration on top of these primitives.

Package boundary for the future ooai-agents package:

ooai-llm owns provider/model concerns: model strings, model catalog, profiles, model suites, runtime construction, caches, reasoning kwargs, usage/cost accounting, context snapshots, and LangChain runnable adapters.
ooai-agents owns agent concerns: agent profiles, graph templates, node policies, memory plans, task decomposition, subagent specs, skill loading, sandbox/filesystem strategy, human-in-the-loop policy, and deployment workflow.
ooai-llm may expose lightweight adapter primitives that are useful without an agent package, but it should avoid owning long-running agent orchestration.
ooai-agents should consume ooai-llm through stable contracts such as ChatModelProfile, LLM, LLMRegistry, UsageRecorder, ContextSnapshot, and runtime.runnable.
Avoid dependency cycles. ooai-llm must not import ooai-agents; the agents package imports and composes ooai-llm.
Keep LangChain and Deep Agents integration behind thin adapter layers so ooai-agents can eventually support other runtimes without rewriting model accounting or catalog logic.

Integration targets from current LangChain and Deep Agents docs:

LangChain create_agent accepts a model string or initialized chat model and builds a LangGraph-backed agent runtime.
LangChain middleware supports dynamic model selection with wrap_model_call, dynamic system prompts with dynamic_prompt, dynamic tool filtering through request overrides, tool-error handling with wrap_tool_call, and runtime context through request.runtime.context.
LangChain custom middleware has node-style hooks (before_agent, before_model, after_model, after_agent) and wrap-style hooks (wrap_model_call, wrap_tool_call).
Deep Agents create_deep_agent accepts a provider:model string or an initialized chat model instance, and it layers planning, todo, filesystem, subagent, skills, backend, and summarization middleware around that model.
Deep Agents subagents can each define their own model, tools, system prompt, middleware, response format, filesystem permissions, and skills.
Deep Agents already owns deep-agent memory and context mechanics such as filesystem backends, store-backed long-term memory, tool-specific prompt guidance, subagent context isolation, offloading, and summarization.

Design principles:

ChatModelProfile.id is the stable human-readable configuration identifier.
LLM.id is the stable runtime routing identifier, defaulting to the profile id when available.
LLM.uuid identifies a concrete runtime instance and should flow into logs, LangChain runtime metadata, usage events, traces, and future graph/node diagnostics.
The registry layer should support lookups by key, profile id, runtime id, and UUID so agent middleware can route without rebuilding models unnecessarily.
Middleware should override LangChain requests with the selected runtime.runnable rather than making LLM pretend to be a full agent.
LLM remains the model/runtime/accounting object; agent orchestration belongs in ooai-agents.
Keep the base model object separate from augmented, agent-ready bindings. A base LLM should know how to construct and account for a model; an augmented runtime should know how to attach prompts, tools, structured output, tool selection, and context policies for a specific agent turn.

Planned primitives:

LLMRegistry: keyed container for profiles and lazy LLM runtimes.
LLMRegistry.get(...): retrieve a runtime by key, runtime id, profile id, or UUID.
LLMRegistry.model_dict(...): expose selected models for routing tables, dashboards, and LangGraph node configuration.
LLMRegistry.usage_summary: aggregate usage across all registered runtimes through a shared UsageRecorder.
ContextBudget: requested/reserved token budget, warning threshold, and hard-limit policy.
ContextSnapshot: model, max context tokens, estimated input tokens, reserved output tokens, remaining tokens, context-used percentage, and count source.
AugmentedLLM or LLMBinding: request-ready wrapper around a base LLM plus prompt, tools, response format, context budget, and tool policy.
PromptPolicy: static system prompt, LangChain prompt template, ChatPromptTemplate, callable prompt builder, plus dynamic prompt blocks derived from state, runtime context, store, or previous tool results.
ToolPolicy: available tools, tool-selection strategy, parallel-tool-call preference, retry behavior, timeout behavior, and error handling.
ToolSelectionStrategy: static tools, all tools, filtered tools, semantic selection, permission-aware selection, or runtime-discovered tools.
AgentRoutingDecision: recorded model/tool/prompt choice for each agentic split so routing can be debugged and compared later.
AgentInjection: a lightweight export object that contains model, tools, middleware, system_prompt, response_format, context_schema, and optional Deep Agents-specific fields such as subagents or backend/store hints.
LangChain v1 middleware helpers for model routing, usage recording, budget checks, context snapshotting, dynamic prompt injection, and tool filtering.
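
One possible shape for the multi-identifier lookup described for LLMRegistry.get(...). RegistrySketch and RuntimeSketch are hypothetical, and the resolution order (registry key first, then runtime id, profile id, and UUID) is an assumption that the real design may settle differently.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class RuntimeSketch:
    id: str           # stable runtime routing identifier
    profile_id: str   # human-readable configuration identifier
    uuid: str = field(default_factory=lambda: str(uuid4()))


class RegistrySketch:
    """Keyed container that also resolves runtimes by id, profile id, or UUID."""

    def __init__(self):
        self._by_key = {}

    def register(self, key, runtime):
        self._by_key[key] = runtime

    def get(self, ref):
        # Exact registry keys win; otherwise scan the other identifiers.
        if ref in self._by_key:
            return self._by_key[ref]
        for runtime in self._by_key.values():
            if ref in (runtime.id, runtime.profile_id, runtime.uuid):
                return runtime
        raise KeyError(ref)


registry = RegistrySketch()
coding = RuntimeSketch(id="coding-runtime", profile_id="coding-profile")
registry.register("coding", coding)

print(registry.get("coding") is registry.get(coding.uuid))
```

Because every lookup returns the same object, middleware can route by whichever identifier it has without rebuilding the model.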

Runnable and middleware compatibility contract:

LLM.runnable is the official LangChain interop boundary. It returns the underlying chat model object created by LangChain’s init_chat_model path and should remain suitable anywhere LangChain expects a chat model or Runnable.
LLM.invoke(...), LLM.ainvoke(...), LLM.batch(...), and LLM.stream(...) are convenience methods that add ooai usage/cost recording around the same runnable rather than defining a separate execution protocol.
LLM should not subclass provider-specific chat models unless LangChain requires it. Composition keeps provider behavior, tool calling, streaming, structured output, and future LangChain changes behind the native runnable.
Middleware should select a runtime and call request.override(model=runtime.runnable). The selected runtime’s id, uuid, profile id, tags, and cost labels should be injected into request metadata or config before dispatch.
Registry and middleware helpers should avoid importing LangChain middleware symbols at package import time. Use lazy imports so core ooai-llm remains usable without optional agent/middleware dependencies.
If a future LangChain v2 changes middleware classes, the adapter layer should be the only code that changes. Profiles, registries, usage events, context snapshots, and LLM.runnable should remain stable.
Any LangGraph node should be able to use either runtime.runnable directly or runtime.invoke(...) when ooai-managed usage recording is desired.
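
The lazy-import rule in the contract above can be sketched with importlib. load_optional and the extra name are hypothetical; the point is only that the optional module is resolved when a feature is requested, not when the package is imported.

```python
import importlib


def load_optional(module_name, extra):
    """Import an optional dependency on demand, with an actionable error if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install the optional {extra!r} extra to enable it."
        ) from exc


def build_helper(module_name, extra):
    # Called only when a middleware helper is actually requested,
    # so importing the core package never touches the optional module.
    return load_optional(module_name, extra)


# Demonstrated with a stdlib module so the sketch runs anywhere.
json_module = build_helper("json", extra="core")
print(json_module.dumps({"ok": True}))
```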

Base versus augmented LLM split:

LLM
├── Owns: profile, model construction, runnable, ids, usage recorder, cache
├── Stable across: requests, graph nodes, model registry, dashboards
└── Does not own: agent prompt, selected tools, response schema, memory policy

AugmentedLLM / LLMBinding
├── Owns: prompt policy, tool policy, response format, context budget
├── Built from: one base LLM plus per-agent/per-turn state
└── Produces: a bound runnable or model-call request override

This keeps model identity, pricing, cache behavior, and usage accounting stable while allowing prompts and tools to change per state, user, task, graph node, or middleware decision. It also prevents a cached base model from being polluted by request-specific tool bindings.

Tool policy shape:

tool_policy = ToolPolicy(
    selection="permission-aware",
    allow_parallel_tool_calls=False,
    max_tool_retries=2,
    retry_on_tool_errors=True,
    tool_timeout_seconds=30,
)

Tool binding should be late and request-scoped:

Select the base runtime from LLMRegistry.
Build a context snapshot for the current messages, system prompt, selected tools, and reserved output budget.
Select or filter tools according to state, runtime context, permissions, and context pressure.
Bind tools or structured output to runtime.runnable only for the current request.
Record an AgentRoutingDecision with selected model id, runtime UUID, prompt policy id, selected tool names, context-used percentage, and reason.
Dispatch through LangChain middleware with request.override(model=bound_model).

Prompt compatibility goals:

Support LangChain v1 dynamic prompts by generating or replacing request.system_message inside middleware.
Support older LangChain prompt concepts by accepting string prompts, PromptTemplate, ChatPromptTemplate, message-template lists, and plain callables that receive state/runtime context.
Keep prompt rendering separate from model construction. A profile picks the model; a prompt policy renders the agent/request prompt.
Preserve structured message content blocks when appending dynamic context, especially for provider-specific features like prompt caching.
Make prompt inputs explicit so dynamic prompts can be tested without invoking an LLM.
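
To illustrate the last goal, a prompt renderer can be a pure function of its inputs, which makes it testable without any model call. render_system_prompt and its parameters are hypothetical and handle only plain strings, not structured content blocks.

```python
def render_system_prompt(template, inputs, dynamic_blocks=()):
    """Render a system prompt from explicit inputs plus optional dynamic blocks."""
    # str.format raises KeyError when a required input is missing,
    # so incomplete inputs fail in tests rather than at request time.
    rendered = template.format(**inputs)
    return "\n\n".join([rendered, *dynamic_blocks])


prompt = render_system_prompt(
    "You are reviewing repo {repo_name}. Focus on {focus_area}.",
    {"repo_name": "demo-repo", "focus_area": "correctness"},
    dynamic_blocks=["Open files: a.py"],
)
print(prompt)
```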

Expected create_agent integration:

augmented = registry.augment(
    "coding",
    prompt=PromptPolicy.from_chat_template(repo_review_prompt),
    tools=ToolPolicy(
        tools=[read_file, edit_file, run_tests],
        selection="filtered",
        allow_parallel_tool_calls=False,
        max_tool_retries=2,
    ),
    context=ContextBudget(max_used_percent=80, reserve_output_tokens=2_000),
)

agent = create_agent(**augmented.to_create_agent_kwargs())

The augmented object should be injectable into LangChain in three styles:

Simple style: pass model=augmented.default_model, tools=augmented.all_tools, and middleware=augmented.middleware.
Explicit style: pass the base runtime.runnable yourself and use only the middleware pieces you want.
Export style: call augmented.to_create_agent_kwargs() to get the exact keyword arguments for create_agent(...).

Expected create_deep_agent integration:

deep_agent = create_deep_agent(
    **augmented.to_deep_agent_kwargs(
        subagents=[
            registry.augment(
                "research",
                prompt=PromptPolicy.static("Research only. Return citations."),
                tools=ToolPolicy(tools=[web_search, read_file]),
            ).to_subagent(name="research-agent"),
        ],
    ),
    backend=backend,
    store=store,
)

Deep Agents compatibility notes:

Prefer passing runtime.runnable when model parameters, provider-specific constructor kwargs, reasoning config, cache behavior, or ooai metadata must be preserved. Passing a model string is acceptable for simple examples.
Do not duplicate Deep Agents filesystem, backend, long-term memory, skills, todo, or summarization middleware inside ooai-llm. Expose hooks and metadata that let those systems be observed and configured.
Treat Deep Agents subagents as separate augmented bindings with their own runtime UUID, prompt policy, tool policy, response format, and cost labels.
Capture parent/child runtime UUIDs in AgentRoutingDecision so subagent spend and context isolation can be traced.
Keep system_prompt as user/task instructions. Do not overwrite Deep Agents’ built-in prompt assembly; let Deep Agents append its harness/tool guidance.
Avoid pre-binding tools to models when structured output or dynamic model selection is active. Prefer request-scoped binding or request.override(...) inside middleware.
ContextSnapshot should complement Deep Agents’ built-in offloading and summarization by exposing context-used percentage and budget warnings before the deep-agent harness decides to compress.
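
Tracing subagent spend through parent/child runtime UUIDs might work like this. DecisionSketch and spend_by_root are hypothetical, the UUIDs are shortened to readable labels, and the costs are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecisionSketch:
    runtime_uuid: str
    parent_runtime_uuid: Optional[str]   # None for a top-level agent
    cost_usd: float


def spend_by_root(decisions):
    """Roll each decision's cost up to the top-level runtime that spawned it."""
    parent_of = {d.runtime_uuid: d.parent_runtime_uuid for d in decisions}

    def root(runtime_uuid):
        # Walk the parent links until a runtime with no parent is reached.
        while parent_of.get(runtime_uuid) is not None:
            runtime_uuid = parent_of[runtime_uuid]
        return runtime_uuid

    totals = defaultdict(float)
    for decision in decisions:
        totals[root(decision.runtime_uuid)] += decision.cost_usd
    return dict(totals)


decisions = [
    DecisionSketch("parent", None, 0.5),
    DecisionSketch("research", "parent", 0.25),
    DecisionSketch("summarizer", "research", 0.125),
    DecisionSketch("other", None, 1.0),
]
totals = spend_by_root(decisions)
print(totals)
```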

Expected ooai-agents consumption shape:

registry = LLMRegistry.from_profiles(
    {
        "default": ChatModelProfile(model="openai:gpt-5.4-mini"),
        "research": ChatModelProfile(model="anthropic:claude-sonnet-4.6"),
    }
)

agent_profile = AgentProfile(
    name="research_assistant",
    default_llm="default",
    subagents=[
        SubagentProfile(
            name="research",
            llm="research",
            tools=["web_search", "read_file"],
        )
    ],
)

agent = create_ooai_agent(
    profile=agent_profile,
    llms=registry,
)

In this shape, ooai-agents decides what an agent is. ooai-llm supplies the registered model runtimes, UUIDs, usage/cost recorder, catalog metadata, context percentages, and LangChain-compatible runnable objects.

Old prompt-template compatibility shape:

from langchain_core.prompts import ChatPromptTemplate

repo_review_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are reviewing repo {repo_name}. Focus on {focus_area}."),
        ("placeholder", "{messages}"),
    ]
)

prompt_policy = PromptPolicy.from_chat_template(
    repo_review_prompt,
    input_mapper=lambda request: {
        "repo_name": request.runtime.context.repo_name,
        "focus_area": request.state.get("focus_area", "correctness"),
        "messages": request.messages,
    },
)

Expected LangChain v1 shape:

registry = LLMRegistry.from_profiles(
    {
        "cheap": ChatModelProfile(model="openai:gpt-5.4-mini", temperature=0),
        "coding": ChatModelProfile(model="anthropic:claude-sonnet-4.6"),
    }
)

middleware = registry.model_router(
    selector=lambda request, registry: "coding"
    if request.state.get("task_type") == "code"
    else "cheap"
)

agent = create_agent(
    model=registry["cheap"].runnable,
    tools=tools,
    middleware=[middleware],
)

Context-budget shape:

snapshot = registry["coding"].context_snapshot(
    messages=state["messages"],
    reserve_output_tokens=2_000,
)

print(snapshot.context_used_percent)
print(snapshot.remaining_context_tokens)
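
A sketch of the arithmetic such a snapshot might perform. SnapshotSketch is hypothetical, and counting the reserved output tokens as already used is an assumption about how the percentage is defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotSketch:
    max_context_tokens: int
    estimated_input_tokens: int
    reserved_output_tokens: int
    count_source: str   # provenance of the input estimate

    @property
    def used_tokens(self):
        # Assumption: reserved output counts against the window up front.
        return self.estimated_input_tokens + self.reserved_output_tokens

    @property
    def remaining_context_tokens(self):
        return max(self.max_context_tokens - self.used_tokens, 0)

    @property
    def context_used_percent(self):
        return 100.0 * self.used_tokens / self.max_context_tokens


snapshot = SnapshotSketch(
    max_context_tokens=100_000,
    estimated_input_tokens=48_000,
    reserved_output_tokens=2_000,
    count_source="approximation",
)
over_budget = snapshot.context_used_percent > 80   # e.g. a max_used_percent=80 policy
print(snapshot.context_used_percent, snapshot.remaining_context_tokens, over_budget)
```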

Middleware responsibilities:

wrap_model_call: choose the runtime, attach the runnable, optionally filter tools, and add runtime/profile UUID metadata.
before_model: compute context percentage, trim or summarize messages, and inject dynamic system prompt blocks.
after_model: record response usage/cost, validate output, and persist telemetry.
wrap_tool_call: monitor tool calls and attach tool latency/error metadata to the same run/runtime identifiers.
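
The wrap-style responsibility for tool calls can be illustrated with a plain wrapper. This is not LangChain's middleware API: with_tool_telemetry, the handler signature, and the record fields are hypothetical and show only the pattern of attaching latency and error metadata to a runtime identifier.

```python
import time


def with_tool_telemetry(handler, records, runtime_uuid):
    """Wrap a tool handler so each call records latency and any error."""

    def wrapped(tool_name, arguments):
        started = time.perf_counter()
        error = None
        try:
            return handler(tool_name, arguments)
        except Exception as exc:
            error = repr(exc)
            raise   # telemetry must not swallow tool errors
        finally:
            records.append({
                "runtime_uuid": runtime_uuid,
                "tool": tool_name,
                "latency_seconds": time.perf_counter() - started,
                "error": error,
            })

    return wrapped


def demo_handler(tool_name, arguments):
    if tool_name == "fail":
        raise ValueError("boom")
    return f"{tool_name} ok"


records = []
call_tool = with_tool_telemetry(demo_handler, records, runtime_uuid="runtime-1")
result = call_tool("read_file", {"path": "README.md"})
print(result, records[0]["error"])
```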

Open questions to settle before implementation:

Whether LLMRegistry should live at top level or under an optional ooai_llm.agents module.
Whether exact provider preflight token counting belongs in the registry, LLM, or a separate token-estimation service.
How much middleware should be shipped in ooai-llm versus deferred to the future ooai-agents package.
Whether UUIDs should be purely runtime-generated or optionally persisted in checked-in profile registries for reproducible graph wiring.
Whether Deep Agents integration belongs in ooai-llm as optional adapters or only in the future ooai-agents package.
Whether AgentInjection should be a Pydantic model, a typed dict, or a small protocol so it can remain independent of LangChain import-time internals.
Which prompt and tool policy objects should live in ooai-llm as minimal adapter contracts versus in ooai-agents as full orchestration concepts.

Token Counting Notes

Core rule: use the same tokenizer or provider counting method as the model and payload shape you will actually call.

Local tokenizers are useful for fast estimates, but provider APIs are safer for real chat payloads with tools, JSON schemas, files, images, cached context, reasoning tokens, or provider-side wrappers.

Recommended layers:

Local estimate: tiktoken for OpenAI-family text, transformers or sentencepiece for open models.
Provider preflight: OpenAI Responses input-token count, Anthropic messages.count_tokens, Gemini count_tokens, or LiteLLM when it can route to provider-specific counters.
Post-call accounting: provider response usage metadata, LangChain callbacks, LangGraph callbacks, or LlamaIndex token-counting handlers.
Cost projection: LiteLLM pricing metadata, tokencost, or package-owned pricing tables, with provider usage metadata as the billing source of truth.

The runtime should preserve provenance for every token count:

from typing import Literal

count_source: Literal[
    "provider_preflight",
    "provider_usage_metadata",
    "local_tokenizer",
    "framework_callback",
    "approximation",
]
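
A sketch of how a counter might prefer the most exact available method while keeping provenance. TokenCount and count_tokens are hypothetical, the provider and tokenizer callables are injected rather than real client calls, and the four-characters-per-token fallback is only a rough heuristic for English text.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class TokenCount:
    tokens: int
    count_source: str   # one of the provenance labels listed above


def count_tokens(
    text: str,
    provider_counter: Optional[Callable[[str], int]] = None,
    local_tokenizer: Optional[Callable[[str], list]] = None,
) -> TokenCount:
    """Prefer provider preflight, then a local tokenizer, then an approximation."""
    if provider_counter is not None:
        return TokenCount(provider_counter(text), "provider_preflight")
    if local_tokenizer is not None:
        return TokenCount(len(local_tokenizer(text)), "local_tokenizer")
    # Rough heuristic (assumption): about four characters per token.
    return TokenCount(max(1, len(text) // 4), "approximation")


estimate = count_tokens("abcdefgh")
local = count_tokens("a b c", local_tokenizer=str.split)   # whitespace split as a stand-in tokenizer
preflight = count_tokens("anything", provider_counter=lambda _text: 7)
print(estimate, local, preflight)
```

Carrying count_source alongside the number lets downstream budget checks decide how much margin to leave for less exact counts.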

Provider-specific notes:

OpenAI: local tiktoken is good for plain text estimates; the OpenAI input-token-count endpoint is preferred for exact Responses-style payloads.
Anthropic: use messages.count_tokens for Claude request preflight, especially with tools and multimodal content.
Gemini: use google-genai count_tokens before generation.
LiteLLM: useful as a high-level multi-provider interface, but exactness depends on provider-specific support for the requested model and payload.
Hugging Face/open models: use the exact model tokenizer through transformers.AutoTokenizer; use tokenizers or sentencepiece for lower-level or custom-tokenizer work.
LangChain/LangGraph: use framework callbacks for post-call accounting and chat-history trimming, not as a universal exact preflight billing source.