RFD 007: Llamacpp Reasoning Support
- Status: Implemented
- Category: Design
- Authors: Jean Mertz git@jeanmertz.com
- Date: 2025-07-18
Summary
The Llamacpp provider silently drops reasoning content when the llama.cpp server uses its default --reasoning-format deepseek setting. This RFD proposes fixing reasoning extraction to work across all three llama.cpp reasoning formats.
Motivation
The llama.cpp server exposes an OpenAI-compatible /v1/chat/completions endpoint. When models produce thinking tokens, the server can return them in three ways depending on the --reasoning-format flag:
| Format | delta.content | delta.reasoning_content |
|---|---|---|
deepseek (default) | Regular content only | Reasoning content |
none | Everything (with <think> tags) | Not set |
deepseek-legacy | Everything (with <think> tags) | Reasoning content |
Our Llamacpp provider uses the openai Rust crate, which only reads delta.content and delta.tool_calls. The reasoning_content field is a non-standard DeepSeek extension that the crate ignores.
The existing ReasoningExtractor parses <think> tags from delta.content, which works for none and deepseek-legacy but does nothing for the default deepseek format — there are no tags to extract because the server already stripped them.
This is the same class of bug we fixed in the Ollama provider, where message.content and message.thinking were never processed. The Ollama fix was straightforward because ollama-rs exposes the thinking field natively. For Llamacpp, the openai crate does not expose reasoning_content.
Design
From a user's perspective, reasoning content from llama.cpp models appears in conversations the same way it does for Ollama or Anthropic — no configuration changes, no new flags. The fix is entirely internal to the Llamacpp provider.
Drop the openai Crate Entirely
The openai crate's ChatCompletionMessageDelta struct does not include reasoning_content, and the field is unlikely to be added upstream since it's not part of the OpenAI spec. Rather than forking or patching the crate, we replace all of its usage — both streaming and request building — with our own implementation using reqwest, reqwest_eventsource, and serde_json.
The openai crate is used for four things, all replaced:
ChatCompletionDelta::builder()+create_stream()— replaced with manual JSON body construction viaserde_json::json!and SSE streaming viareqwest_eventsource::EventSource.ChatCompletionMessage/ChatCompletionMessageRole— conversation history is serialized directly toserde_json::Valuemessages.ChatCompletionChoiceDelta/ChatCompletionMessageDelta— replaced with local serde types (StreamChunk,StreamChoice,StreamDelta) that include thereasoning_contentfield.Credentials— removed; base URL comes fromLlamacppConfigdirectly.
The Llamacpp struct simplifies to just reqwest_client and base_url. Model listing already uses reqwest directly and is unaffected.
SSE Streaming
The llama.cpp server sends SSE events in this shape:
data: {"choices":[{"delta":{"reasoning_content":"Let me think..."},"index":0}]}
data: {"choices":[{"delta":{"content":"The answer is..."},"index":0}]}We parse each SSE event's JSON ourselves, reading both reasoning_content and content from the delta object. The llama.cpp server source (common_chat_msg_diff_to_json_oaicompat in common/chat.cpp) is the authoritative reference for the streaming delta format.
The SSE loop handles:
Event::Open— ignoredEvent::Messagewithdata: [DONE]— finalize and flushEvent::Messagewith JSON — parse asStreamChunk, route fields to events- Errors — delegated to the existing
From<reqwest_eventsource::Error> for StreamErrorclassifier inerror.rs, which handles retry-after extraction,x-should-retry, and status code classification
Reasoning Routing
The implementation auto-detects which llama.cpp format is active based on which fields are present in each delta, with no user configuration required:
| Format | reasoning_content | content | Routing |
|---|---|---|---|
deepseek | Present | Regular text | reasoning_content → reasoning events |
| directly | |||
none | Absent | Everything with tags | content → ReasoningExtractor parses |
<think> | |||
deepseek-legacy | Present | Everything with tags | reasoning_content → reasoning; |
content → text |
For deepseek-legacy, when both fields are present, reasoning_content goes directly to reasoning events and content is treated as regular text (bypassing the extractor, since the server already separates them).
Event Indexing
Following the Ollama provider's convention:
- Index 0: reasoning content
- Index 1: message content (or structured output)
- Index 2+: tool calls
The ToolCallRequestAggregator is reused unchanged to accumulate partial tool call JSON chunks across multiple SSE events.
Flush Strategy
Same as Ollama — reasoning (index 0) flushes when the first content or tool call chunk arrives. This guarantees reasoning events precede content and tool call events in the conversation history. On finish_reason or [DONE], all remaining indices flush.
Request Building
Conversation history serialization produces serde_json::Value messages directly. Reasoning responses from previous turns are wrapped in <think> tags so the model can pick up its own chain-of-thought. Tool definitions use parameters_with_strict_mode from the OpenAI provider module (shared utility, not the openai crate).
Alternatives
- Patch the
openaicrate to addreasoning_contenttoChatCompletionMessageDelta. Rejected because the field is a non-standard DeepSeek extension unlikely to be accepted upstream, and maintaining a fork for a single field is more burden than owning the SSE loop directly. - Keep the
openaicrate for request building, replace only streaming. Rejected because the crate's builder, message types, and credential handling are tightly coupled — replacing streaming alone leaves dead code and a confusing split of responsibilities.
Non-Goals
- Adding user-facing configuration for
--reasoning-format. The routing auto-detects the format from the delta fields. - Changing other providers. The Ollama and OpenRouter providers already handle reasoning correctly.
Risks
- Owning the SSE loop: Edge cases like malformed events and the
[DONE]sentinel are our responsibility. Mitigated by keeping the loop thin and delegating error classification to the existingreqwest_eventsourceclassifier. ReasoningExtractorbuffering latency: The extractor holds a small tail buffer (< 8 bytes) to detect split<think>tags. Only affects thenoneformat path; latency is negligible.
Implementation
- Define local serde types for the OpenAI Chat Completions streaming delta format:
StreamChunk,StreamChoice,StreamDelta,ToolCallDelta,FunctionDelta. The key addition isreasoning_contentonStreamDelta. - Replace request building with
build_request()producingserde_json::Value. Conversation history, tool definitions, and structured output schemas serialize to JSON directly. - Replace
openaicrate streaming withreqwest_eventsource::EventSource. Thehandle_sse_eventfunction processes each SSE event and routes reasoning/content/tool-call fields to provider-agnosticEventvalues. - Wire up reasoning routing:
reasoning_contentis used directly when present,ReasoningExtractoras fallback when onlycontentis available. - Remove all
openaicrate imports fromllamacpp.rs. - Add unit tests for delta parsing across the three formats, request building (tool call merging, reasoning wrapping), and tool choice conversion.
- Re-record VCR cassettes against a live llama.cpp server.
References
- llama.cpp server reasoning format docs:
--reasoning-formatflag intools/server/README.md - llama.cpp source:
common_chat_msg_diff_to_json_oaicompatincommon/chat.cpp - Ollama provider (this codebase): reads
message.thinkingandmessage.contentdirectly; uses the same index convention (0/1/2+) and flush strategy that this implementation mirrors. - yoagent
openai_compat.rs: Reference implementation of OpenAI-compatible SSE streaming withreqwest_eventsource. Provides the serde type patterns and SSE loop structure adapted for this implementation.