DSPy Integration Plan

This page records the DSPy integration contract across ooai-llm and the future ooai-agents package. The ooai-llm LM substrate is implemented; the program, runnable, node, optimizer, and artifact layers remain planned for ooai-agents.

The short version:

ooai-llm
|-- Owns DSPy model configuration from ChatModelProfile
|-- Creates and configures dspy.LM instances
|-- Extracts DSPy usage/cost into UsageRecorder
`-- Exposes thin, optional helpers behind ooai-llm[dspy]

ooai-agents
|-- Owns DSPy programs, signatures, modules, tools, and optimizers
|-- Wraps DSPy programs as LangChain Runnables or LangGraph nodes
|-- Converts dspy.Prediction outputs into messages, dict updates, or typed data
`-- Stores optimized DSPy artifacts and evaluation results

Research Summary

DSPy is not just another model provider. It is a framework for programming language-model systems through declarative signatures, modules, adapters, and optimizers.

The main pieces relevant to OOAI are:

  • dspy.LM: model client configured with LiteLLM-style model strings such as openai/gpt-5-mini, anthropic/claude-sonnet-..., or gemini/gemini-2.5-flash.

  • Signatures: typed input/output contracts such as "question -> answer: float" or class-based dspy.Signature declarations.

  • Modules: reusable strategies such as Predict, ChainOfThought, ReAct, ProgramOfThought, CodeAct, Parallel, and custom dspy.Module classes.

  • Adapters: conversion from signatures and examples into model messages, including chat, JSON, XML, and two-step styles.

  • Optimizers: offline or experiment-time compilation methods such as BootstrapFewShot, MIPROv2, GEPA, SIMBA, and finetuning workflows.

  • Runtime concerns: DSPy has its own caching, async, streaming, usage tracking, and save/load paths for compiled programs.

The key design consequence is that ooai-llm should provide model substrate support, while ooai-agents should provide program and workflow support.

Upstream Stability

DSPy currently documents a planned BaseLM transition for DSPy 3.3 through 4.0. To stay insulated from that change, the OOAI integration should avoid subclassing DSPy internals and prefer adapter functions and protocols.
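As a rough illustration of the "adapter functions and protocols" preference, the sketch below describes an LM client structurally instead of inheriting from a DSPy class. The names LMLike, describe_lm, and FakeLM are hypothetical and not part of ooai-llm or DSPy; the only assumption about a real LM client is that it exposes a model attribute and is callable.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class LMLike(Protocol):
    """Structural view of an LM client (hypothetical; not a DSPy type)."""

    model: str

    def __call__(self, *args: Any, **kwargs: Any) -> Any: ...


def describe_lm(lm: LMLike) -> dict[str, str]:
    """Adapter function: reads only what the protocol promises."""
    return {"model": lm.model, "client_type": type(lm).__name__}


class FakeLM:
    """Stand-in used here instead of a real DSPy LM client."""

    def __init__(self, model: str) -> None:
        self.model = model

    def __call__(self, *args: Any, **kwargs: Any) -> list[str]:
        return ["ok"]


info = describe_lm(FakeLM("openai/gpt-5-mini"))
```

Because the check is structural, a change in DSPy's base classes would not require changing describe_lm as long as the attribute and call surface remain.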

Package Boundary

ooai-llm implements the minimum useful bridge:

  • Optional extra: ooai-llm[dspy].

  • DSPyLMConfig: serializable config for dspy.LM.

  • create_dspy_lm(...): create a DSPy LM from ChatModelProfile, DSPyLMConfig, ModelString, or a raw model string.

  • configure_dspy_lm(...): create the LM and call dspy.configure(lm=...).

  • resolve_dspy_model_name(...): convert OOAI model choices to LiteLLM-style names.

  • extract_dspy_usage(...): pull prediction.get_lm_usage() or LM history usage into OOAI usage events when available.

  • record_dspy_usage(...): record DSPy usage into UsageRecorder.

  • create_dspy_lm_bundle(...): return a native DSPy LM with resolved model, metadata, config, and trace metadata.

  • Documentation and examples showing how to choose a model from the catalog or profile layer and hand it to DSPy.

ooai-llm should not implement:

  • DSPy program builders.

  • DSPy signatures for particular tasks.

  • Agent workflows or graph topology.

  • Optimizer jobs and dataset management.

  • ARC-specific DSPy modules.

ooai-agents should implement the richer layer:

  • DSPyProgramSpec: declarative program/signature/module definition.

  • DSPyRunnable: LangChain-style runnable wrapper around a DSPy program.

  • DSPyNode: LangGraph node wrapper around a DSPy program.

  • DSPyOutputPolicy: conversion policy from dspy.Prediction to messages, state updates, JSON, or Pydantic models.

  • DSPyOptimizationJob: optimizer configuration, datasets, metrics, budgets, and artifact paths.

  • DSPyArtifactRegistry: saved compiled programs, run metadata, and eval reports.

  • Domain packages such as ARC, coding, research, and extraction programs.

Model Flow

The model flow should stay boring:

from ooai_llm import ChatModelProfile, configure_dspy_lm

profile = ChatModelProfile(
    id="dspy-coding",
    model="openai:gpt-5-mini",
    temperature=0,
    max_tokens=2000,
    cache={"namespace": "dspy", "key": "coding-v1"},
)

lm = configure_dspy_lm(profile, model_type="responses")

For downstream wrappers that need trace metadata, use the bundle helper:

from ooai_llm import create_dspy_lm_bundle

bundle = create_dspy_lm_bundle(profile, model_type="responses")

print(bundle.lm)
print(bundle.model.as_litellm())
print(bundle.trace_metadata["ooai_profile_id"])

The same bridge is available from an LLM runtime:

runtime = profile.create_runtime(id="coding-runtime")
lm = runtime.create_dspy_lm(model_type="responses")

Internally this should resolve:

ChatModelProfile(model="openai:gpt-5-mini")
`-- ModelString("openai:gpt-5-mini")
    `-- "openai/gpt-5-mini"
        `-- dspy.LM("openai/gpt-5-mini", ...)

ChatModelProfile remains the single source of model configuration. DSPy-only options should live in DSPyLMConfig.lm_kwargs or explicit DSPy config fields.
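The name-resolution step in the tree above can be sketched as a small pure function. This is only an illustration under stated assumptions: to_litellm_name is a hypothetical helper, not the actual resolve_dspy_model_name(...), which may also consult the catalog or profile layer.

```python
def to_litellm_name(model: str) -> str:
    """Convert 'provider:model' into LiteLLM-style 'provider/model'.

    Names that already use a slash before any colon are passed through.
    """
    colon = model.find(":")
    slash = model.find("/")
    # Already LiteLLM-style, e.g. "openai/gpt-5-mini".
    if slash != -1 and (colon == -1 or slash < colon):
        return model
    if colon == -1:
        raise ValueError(
            f"Model string {model!r} has no provider prefix; expected 'provider:model'."
        )
    provider, name = model[:colon], model[colon + 1:]
    if not provider or not name:
        raise ValueError(f"Model string {model!r} is missing a provider or model name.")
    return f"{provider}/{name}"


resolved = to_litellm_name("openai:gpt-5-mini")
unchanged = to_litellm_name("gemini/gemini-2.5-flash")
```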

Output Compatibility

DSPy returns dspy.Prediction objects. LangChain and LangGraph usually want messages, runnable outputs, or graph state updates. The compatibility layer belongs in ooai-agents, not ooai-llm.

Planned output modes:

from typing import Literal

# The `type` alias statement (PEP 695) requires Python 3.12+.
type DSPyOutputMode = Literal[
    "prediction",
    "dict",
    "ai_message",
    "state_update",
    "pydantic",
]

Recommended conversions:

  • prediction: return the raw dspy.Prediction for pure DSPy workflows.

  • dict: return JSON-safe field values from the prediction.

  • ai_message: return a LangChain AIMessage whose content is one selected field or a rendered JSON summary.

  • state_update: return a LangGraph state update such as {"messages": [message], "dspy": metadata}.

  • pydantic: validate prediction fields into a configured Pydantic model.

The AIMessage conversion should preserve provenance:

AIMessage(
    content=rendered_output,
    additional_kwargs={
        "dspy": {
            "program_id": "arc-hypothesis",
            "signature": "task, examples -> hypothesis: str, confidence: float",
            "module": "chain_of_thought",
            "fields": prediction_dict,
        }
    },
    response_metadata={
        "ooai_runtime_id": runtime.id,
        "ooai_runtime_uuid": str(runtime.uuid),
        "dspy_program_id": "arc-hypothesis",
    },
    usage_metadata=usage_metadata,
)

The LangGraph node conversion should return only serializable state:

{
    "messages": [ai_message],
    "dspy_results": {
        "arc-hypothesis": {
            "fields": prediction_dict,
            "confidence": prediction_dict.get("confidence"),
            "usage": usage_snapshot,
        }
    },
}

Live DSPy objects such as modules, LMs, adapters, or compiled programs should not be stored in graph state. Keep them on the wrapper object or registry.

Traceability Contract

DSPy calls should be traceable by default when they are routed through OOAI wrappers. The trace boundary should be the runnable or graph node, not a hidden tool call, so LangChain and LangGraph can show named DSPy program steps.

Recommended trace shape:

agent_or_graph
|-- load_task
|-- dspy.arc_hypothesis
|-- model.call
|-- verify_candidate
`-- dspy.adjudicate

Every DSPy program call should preserve:

{
    "ooai_runtime_id": runtime.id,
    "ooai_runtime_uuid": str(runtime.uuid),
    "ooai_profile_id": runtime.profile.id,
    "ooai_model": resolved_model.as_langchain(),
    "ooai_litellm_model": resolved_model.as_litellm(),
    "dspy_program_id": spec.id,
    "dspy_module": spec.module,
    "dspy_signature": spec.signature,
    "dspy_output_mode": spec.output.mode,
}

Usage and cost should be recorded after every DSPy prediction when DSPy or the underlying LM exposes usage data. Missing usage should not fail the agent run: record a usage event, tagged with its count source, only when a real count is available, and otherwise record nothing.
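The usage rule can be sketched as a tolerant extractor that returns an empty list instead of raising when no counts exist. This is a sketch under assumptions: it expects a prediction-like object that may offer get_lm_usage() returning a per-model mapping, and an LM-like object that may carry a history list of dicts with "usage" and "model" keys. The event shape, the count_source labels, and the fake classes are hypothetical.

```python
from typing import Any


def extract_usage(prediction: Any, lm: Any = None) -> list[dict[str, Any]]:
    """Return usage events, or an empty list when no real counts exist."""
    getter = getattr(prediction, "get_lm_usage", None)
    per_model = getter() if callable(getter) else None
    if per_model:
        return [
            {"model": model, "usage": dict(usage), "count_source": "prediction"}
            for model, usage in per_model.items()
        ]
    # Fall back to the most recent LM history entry, if one carries usage.
    history = getattr(lm, "history", None) or []
    if history:
        last = history[-1]
        usage = last.get("usage")
        if usage:
            return [
                {"model": last.get("model"), "usage": dict(usage), "count_source": "lm_history"}
            ]
    return []


class FakePrediction:
    def __init__(self, usage: Any) -> None:
        self._usage = usage

    def get_lm_usage(self) -> Any:
        return self._usage


class FakeLM:
    history = [{"model": "openai/gpt-5-mini", "usage": {"total_tokens": 42}}]


from_prediction = extract_usage(FakePrediction({"openai/gpt-5-mini": {"prompt_tokens": 10}}))
from_history = extract_usage(FakePrediction(None), lm=FakeLM())
nothing = extract_usage(object())
```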

Preferred tracing rules:

  • Use DSPyRunnable for reusable, traceable program calls.

  • Use create_dspy_node(...) for LangGraph state updates.

  • Use DSPy as a LangChain tool only when the language model should choose whether to call that program.

  • Keep runtime id, runtime UUID, program id, signature, module, and output mode in every converted output.

Async, Batch, And Streaming Contract

The OOAI wrapper should support the LangChain runnable surface even when DSPy’s native support varies by version, program, or LM:

DSPyRunnable.invoke(...)
DSPyRunnable.ainvoke(...)
DSPyRunnable.batch(...)
DSPyRunnable.abatch(...)
DSPyRunnable.stream(...)
DSPyRunnable.astream(...)

Implementation rules:

  • If DSPy exposes native async for the program, use it.

  • If DSPy exposes native streaming for the program, convert streamed chunks through the configured DSPyOutputPolicy.

  • If async is unavailable, run the sync invocation in a worker thread or executor so LangGraph async paths still work.

  • If streaming is unavailable, yield one final converted output chunk.

  • Preserve the same metadata and usage recording behavior across sync, async, batch, and streaming paths.

This means a DSPy program can be used as a normal LangGraph node even before every DSPy module has first-class streaming semantics.
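The fallback rules above can be sketched with the standard library alone. ProgramRunner is a hypothetical stand-in for DSPyRunnable, not its planned implementation; it assumes a program is a callable taking keyword inputs and that native async, when present, is exposed as an acall coroutine method.

```python
import asyncio
import inspect
from collections.abc import AsyncIterator, Callable
from typing import Any


class ProgramRunner:
    """Hypothetical wrapper around a sync or async program callable."""

    def __init__(self, program: Callable[..., Any], convert: Callable[[Any], Any]) -> None:
        self.program = program
        self.convert = convert

    def invoke(self, inputs: dict[str, Any]) -> Any:
        return self.convert(self.program(**inputs))

    async def ainvoke(self, inputs: dict[str, Any]) -> Any:
        native = getattr(self.program, "acall", None)
        if native is not None and inspect.iscoroutinefunction(native):
            return self.convert(await native(**inputs))
        # No native async: run the sync call in a worker thread.
        return await asyncio.to_thread(self.invoke, inputs)

    async def astream(self, inputs: dict[str, Any]) -> AsyncIterator[Any]:
        # No native streaming assumed here: yield one final converted chunk.
        yield await self.ainvoke(inputs)


def sync_program(question: str) -> dict[str, str]:
    return {"answer": question.upper()}


class AsyncProgram:
    def __call__(self, question: str) -> dict[str, str]:
        return {"answer": question.upper()}

    async def acall(self, question: str) -> dict[str, str]:
        return {"answer": question.lower()}


async def demo() -> tuple[Any, Any, list[Any]]:
    sync_runner = ProgramRunner(sync_program, convert=dict)
    async_runner = ProgramRunner(AsyncProgram(), convert=dict)
    fallback = await sync_runner.ainvoke({"question": "Hi"})
    native = await async_runner.ainvoke({"question": "Hi"})
    chunks = [chunk async for chunk in sync_runner.astream({"question": "Hi"})]
    return fallback, native, chunks


fallback, native, chunks = asyncio.run(demo())
```

The same convert callable is applied on every path, which is what keeps metadata and output shape consistent between sync, async, and streaming calls.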

Serialization And Artifacts

Serialize declarations and artifacts, not live DSPy clients or modules.

Serializable:

  • DSPyProgramSpec

  • DSPyOutputPolicy

  • DSPy LM binding reference, usually an OOAI runtime key

  • optimizer config

  • dataset and metric references

  • artifact metadata

  • compiled program path, hash, and version metadata

Do not put these into config or graph state:

  • live dspy.Module objects

  • live dspy.LM clients

  • provider SDK clients

  • callbacks

  • open files

  • LangGraph runtime objects

The durable shape should be:

Config serializes what to build.
Artifact metadata serializes where compiled DSPy output lives.
Runtime wrappers reconstruct live objects when needed.
Graph state stores only JSON-safe outputs and provenance.

Planned artifact reference:

from datetime import datetime
from pathlib import Path

from pydantic import BaseModel


class DSPyArtifactRef(BaseModel):
    id: str
    program_id: str
    path: Path
    sha256: str
    dspy_version: str
    ooai_agents_version: str
    created_at: datetime
    optimizer: str | None = None
    metric: str | None = None

Loading pickle or cloudpickle-based DSPy artifacts should be treated as a trusted-only operation. The metadata file should be safe to inspect without loading executable artifact payloads.
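A sketch of the "inspect without loading" idea: metadata lives in a JSON sidecar next to the artifact, and verification only hashes bytes, never unpickling them. The helper names, the .meta.json suffix, and the placeholder version string are assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def metadata_path(artifact: Path) -> Path:
    return artifact.with_name(artifact.name + ".meta.json")


def write_metadata(artifact: Path, *, artifact_id: str, program_id: str, dspy_version: str) -> Path:
    meta = {
        "id": artifact_id,
        "program_id": program_id,
        "path": artifact.name,
        "sha256": sha256_file(artifact),
        "dspy_version": dspy_version,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    target = metadata_path(artifact)
    target.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return target


def verify_artifact(artifact: Path) -> dict:
    """Read the sidecar and compare hashes; the payload itself is never unpickled."""
    meta = json.loads(metadata_path(artifact).read_text(encoding="utf-8"))
    if sha256_file(artifact) != meta["sha256"]:
        raise ValueError(f"Artifact {artifact.name} does not match its recorded sha256")
    return meta


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "arc-hypothesis.pkl"
    artifact.write_bytes(b"opaque compiled payload")
    write_metadata(
        artifact,
        artifact_id="a1",
        program_id="arc-hypothesis",
        dspy_version="0.0.0-example",  # placeholder value, not a real version
    )
    meta = verify_artifact(artifact)
    artifact.write_bytes(b"tampered")
    try:
        verify_artifact(artifact)
        tampered_detected = False
    except ValueError:
        tampered_detected = True
```

A hash match only shows the file is the one that was recorded; it does not make an untrusted pickle safe to load.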

Target ooai-agents API

The target program declaration should be compact:

from ooai_agents.dspy import DSPyOutputPolicy, DSPyProgramSpec

spec = DSPyProgramSpec(
    id="arc-hypothesis",
    llm="reasoning",
    module="chain_of_thought",
    signature="task, examples -> hypothesis: str, confidence: float",
    output=DSPyOutputPolicy(
        mode="state_update",
        message_field="hypothesis",
        state_key="dspy_results",
    ),
)

Compile it into a LangGraph node:

from ooai_agents.dspy import create_dspy_node

node = create_dspy_node(spec, llms=registry)
graph.add_node("hypothesis", node)

Compile it into a LangChain-style runnable:

from ooai_agents.dspy import create_dspy_runnable

runnable = create_dspy_runnable(spec, llms=registry)
result = runnable.invoke({"task": task_text, "examples": examples})

Use it inside a normal OOAI agent:

agent = create_agent(
    id="arc-agent",
    runtimes=registry,
    tools=[load_arc_task, verify_candidate],
    middleware=[
        create_dspy_node_middleware(
            programs=[spec],
            run_before_model=["arc-hypothesis"],
        )
    ],
)

ooai-agents Module Layout

Target package layout:

src/ooai_agents/dspy/
|-- __init__.py
|-- config.py          # DSPyProgramSpec, DSPyOutputPolicy, optimizer specs
|-- programs.py        # builders for Predict, CoT, ReAct, custom modules
|-- runnable.py        # LangChain Runnable wrappers
|-- nodes.py           # LangGraph node wrappers
|-- output.py          # Prediction -> AIMessage/dict/state/Pydantic
|-- usage.py           # DSPy usage -> ooai UsageRecorder
|-- optimization.py    # optimizer job specs and runners
|-- artifacts.py       # save/load compiled programs and metadata
`-- testing.py         # fake DSPy objects for tests

Optional domain packages can then consume this:

src/ooai_agents/agents/arc/dspy_programs.py
src/ooai_agents/agents/coding/dspy_programs.py
src/ooai_agents/agents/research/dspy_programs.py

Task Breakdown

Phase 1: ooai-llm DSPy Substrate

  1. Add the optional dspy extra.

  2. Add DSPyLMConfig.

  3. Add resolve_dspy_model_name(...).

  4. Add create_dspy_lm(...).

  5. Add configure_dspy_lm(...).

  6. Map ChatModelProfile fields to DSPy safely: temperature, max_tokens, cache, num_retries, timeout, parallel_tool_calls, reasoning kwargs, and pass-through lm_kwargs.

  7. Add DSPy usage extraction helpers.

  8. Add unit tests with fake DSPy modules.

  9. Add docs and examples.

Acceptance criteria:

  • Importing ooai_llm does not import DSPy.

  • Missing DSPy raises an actionable optional-extra error.

  • Profile-based model names resolve to LiteLLM format.

  • Unit tests do not require network, provider keys, or real DSPy.

  • Docs clearly mark DSPy program support as owned by ooai-agents.
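The first two criteria are commonly met with a lazy import helper that is only called inside the DSPy-specific functions. The sketch below is an assumption about how that could look; require_module and MissingOptionalDependency are hypothetical names, and the error wording is illustrative.

```python
import importlib
from types import ModuleType


class MissingOptionalDependency(ImportError):
    """Raised when an optional extra is needed but not installed."""


def require_module(name: str, *, extra: str, package: str = "ooai-llm") -> ModuleType:
    """Import `name` on first use, or fail with an install hint."""
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError as exc:
        if exc.name != name:
            # The module exists but one of its own dependencies is missing.
            raise
        raise MissingOptionalDependency(
            f"{name!r} is not installed. Install the optional extra with: "
            f"pip install '{package}[{extra}]'"
        ) from exc


# A stdlib module stands in for an installed optional dependency.
json_module = require_module("json", extra="dspy")

try:
    require_module("ooai_example_missing_module", extra="dspy")
    message = ""
except MissingOptionalDependency as error:
    message = str(error)
```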

Phase 2: ooai-agents DSPy Program Specs

  1. Add an optional dspy extra to ooai-agents.

  2. Add DSPyProgramSpec.

  3. Add DSPyOutputPolicy.

  4. Add builders for common modules: predict, chain_of_thought, react, program_of_thought, and code_act.

  5. Add support for raw inline signatures and imported class-based signatures.

  6. Add tests using fake DSPy modules.

Acceptance criteria:

  • A spec can be serialized to JSON/YAML-like data.

  • Program construction stays lazy and does not require DSPy at import time.

  • Unsupported module names fail with clear messages.
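A sketch of how spec module names could map to DSPy classes while staying lazy and failing clearly. The mapping uses the module class names listed earlier on this page; resolve_module_class is a hypothetical helper, and the dspy module is passed in as an argument so nothing is imported at definition time and a fake can be substituted in tests.

```python
from types import SimpleNamespace
from typing import Any

# Spec name -> attribute name on the dspy module.
MODULE_ATTRS = {
    "predict": "Predict",
    "chain_of_thought": "ChainOfThought",
    "react": "ReAct",
    "program_of_thought": "ProgramOfThought",
    "code_act": "CodeAct",
}


def resolve_module_class(name: str, dspy_module: Any) -> Any:
    """Look up the class for a spec module name on the given dspy-like module."""
    try:
        attr = MODULE_ATTRS[name]
    except KeyError:
        supported = ", ".join(sorted(MODULE_ATTRS))
        raise ValueError(
            f"Unsupported DSPy module {name!r}; expected one of: {supported}"
        ) from None
    return getattr(dspy_module, attr)


# A namespace with placeholder classes stands in for the real dspy module.
fake_dspy = SimpleNamespace(Predict=object, ChainOfThought=dict)
resolved = resolve_module_class("chain_of_thought", fake_dspy)

try:
    resolve_module_class("tree_of_thought", fake_dspy)
    error_message = ""
except ValueError as error:
    error_message = str(error)
```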

Phase 3: Runnable And LangGraph Compatibility

  1. Implement DSPyRunnable.

  2. Implement create_dspy_runnable(...).

  3. Implement create_dspy_node(...).

  4. Convert dspy.Prediction to: dict, AIMessage, state update, raw prediction, and Pydantic output.

  5. Preserve program id, runtime id, runtime UUID, signature, module type, and usage metadata.

  6. Add sync and async paths, using DSPy’s native async when available and a worker-thread fallback otherwise.

Acceptance criteria:

  • The runnable supports .invoke(...) and .ainvoke(...).

  • The node returns a serializable LangGraph state update.

  • No live DSPy object is written into graph state.

  • Usage can be recorded into an existing UsageRecorder.

Phase 4: Optimizer And Artifact Workflows

  1. Add DSPyOptimizationJob.

  2. Add optimizer registry names for bootstrap_few_shot, mipro_v2, gepa, simba, and future optimizers.

  3. Add dataset and metric references instead of embedding large datasets in config.

  4. Add budget fields: max examples, max trials, max LM calls, max estimated cost, and timeout.

  5. Add artifact save/load metadata around DSPy’s native save/load.

  6. Add CLI commands:

ooai-agents dspy run arc-hypothesis --input task.json
ooai-agents dspy optimize arc-hypothesis --optimizer mipro_v2
ooai-agents dspy inspect artifacts/dspy/arc-hypothesis

Acceptance criteria:

  • Optimizer jobs are explicit, budgeted, and opt-in.

  • Artifacts include spec, model profile id, optimizer config, dataset refs, metrics, timestamp, and package versions.

  • Loading pickle/cloudpickle artifacts is clearly marked trusted-only.

Phase 5: Domain Integrations

  1. ARC: hypothesis generation, rule proposal, transformation proposal, verification critique, and ensemble adjudication.

  2. Coding: bug classification, repair-plan generation, patch critique, and generated-test evaluation.

  3. Research: query decomposition, citation faithfulness, extraction, and synthesis.

  4. Multi-agent: DSPy program nodes as specialist subagents or pre-model evidence generators.

Acceptance criteria:

  • Each domain has at least one runnable example.

  • DSPy program output can be compared against LangChain-agent output.

  • Usage/cost is visible in the same OOAI recorder summaries.

Testing Strategy

ooai-llm tests:

  • Fake dspy.LM constructor receives expected model name and kwargs.

  • ChatModelProfile maps to DSPyLMConfig.

  • configure_dspy_lm calls dspy.configure.

  • Missing dependency raises an optional-extra error.

  • Usage extraction handles absent usage gracefully.
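A sketch of the fake-module approach for the first and third checks, using only the standard library. make_fake_dspy and configure_lm_for_test are hypothetical; the latter stands in for the function under test and is not the real configure_dspy_lm. The fake mimics only the two entry points this page relies on, dspy.LM and dspy.configure.

```python
import sys
import types
from typing import Any
from unittest.mock import patch


def make_fake_dspy() -> types.ModuleType:
    """Build a fake `dspy` module that records constructor and configure calls."""
    fake = types.ModuleType("dspy")
    fake.calls = []
    fake.configured = {}

    class LM:
        def __init__(self, model: str, **kwargs: Any) -> None:
            self.model = model
            self.kwargs = kwargs
            fake.calls.append((model, kwargs))

    def configure(**kwargs: Any) -> None:
        fake.configured.update(kwargs)

    fake.LM = LM
    fake.configure = configure
    return fake


def configure_lm_for_test(model: str, **kwargs: Any) -> Any:
    """Stand-in for the function under test; imports dspy lazily."""
    import dspy  # resolved from sys.modules, so the patched fake is used

    lm = dspy.LM(model, **kwargs)
    dspy.configure(lm=lm)
    return lm


fake = make_fake_dspy()
# patch.dict restores sys.modules afterwards, so a real dspy install is untouched.
with patch.dict(sys.modules, {"dspy": fake}):
    lm = configure_lm_for_test("openai/gpt-5-mini", temperature=0, max_tokens=2000)
```

Because the import happens inside the function, the fake is picked up without network access, provider keys, or a real DSPy install.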

ooai-agents tests:

  • DSPyProgramSpec serializes.

  • Program builders call the expected DSPy module constructor.

  • Runnable adapters return the configured output mode.

  • LangGraph nodes return serializable state updates.

  • Usage recording works with fake predictions.

  • Optimizer job specs validate budget and artifact paths.

Live tests should be opt-in only:

OOAI_REQUIRE_LIVE=true OOAI_LIVE_PROVIDERS=openai pdm run pytest -m live --no-cov

Open Questions

  • Should DSPyRunnable live in ooai-agents only, or should a tiny protocol live in ooai-llm for reuse outside agent workflows?

  • Which DSPy output field should become AIMessage.content by default: first output field, explicitly configured field, or JSON rendering?

  • Should optimizer artifacts live under ooai-agents storage policies or a separate experiment/artifact package later?

  • How much of DSPy’s native cache configuration should be mapped from CacheKeyPolicy versus left as DSPy-native kwargs?

  • Should ooai-agents expose CLI commands for DSPy immediately, or only after the program/runnable/node layer is stable?

References