RFD 007: Llamacpp Reasoning Support

Status: Implemented
Category: Design
Authors: Jean Mertz <git@jeanmertz.com>
Date: 2025-07-18

Summary

The Llamacpp provider silently drops reasoning content when the llama.cpp server uses its default --reasoning-format deepseek setting. This RFD proposes fixing reasoning extraction to work across all three llama.cpp reasoning formats.

Motivation

The llama.cpp server exposes an OpenAI-compatible /v1/chat/completions endpoint. When models produce thinking tokens, the server can return them in three ways depending on the --reasoning-format flag:

Format	`delta.content`	`delta.reasoning_content`
`deepseek` (default)	Regular content only	Reasoning content
`none`	Everything (with `<think>` tags)	Not set
`deepseek-legacy`	Everything (with `<think>` tags)	Reasoning content

Our Llamacpp provider uses the openai Rust crate, which only reads delta.content and delta.tool_calls. The reasoning_content field is a non-standard DeepSeek extension that the crate ignores.

The existing ReasoningExtractor parses <think> tags from delta.content, which works for none and deepseek-legacy but does nothing for the default deepseek format — there are no tags to extract because the server already stripped them.

This is the same class of bug we fixed in the Ollama provider, where message.content and message.thinking were never processed. The Ollama fix was straightforward because ollama-rs exposes the thinking field natively. For Llamacpp, the openai crate does not expose reasoning_content.

Design

From a user's perspective, reasoning content from llama.cpp models appears in conversations the same way it does for Ollama or Anthropic — no configuration changes, no new flags. The fix is entirely internal to the Llamacpp provider.

Drop the `openai` Crate Entirely

The openai crate's ChatCompletionMessageDelta struct does not include reasoning_content, and the field is unlikely to be added upstream since it's not part of the OpenAI spec. Rather than forking or patching the crate, we replace all of its usage — both streaming and request building — with our own implementation using reqwest, reqwest_eventsource, and serde_json.

The provider uses the openai crate for four things, and all four are replaced:

ChatCompletionDelta::builder() + create_stream() — replaced with manual JSON body construction via serde_json::json! and SSE streaming via reqwest_eventsource::EventSource.
ChatCompletionMessage / ChatCompletionMessageRole — conversation history is serialized directly to serde_json::Value messages.
ChatCompletionChoiceDelta / ChatCompletionMessageDelta — replaced with local serde types (StreamChunk, StreamChoice, StreamDelta) that include the reasoning_content field.
Credentials — removed; base URL comes from LlamacppConfig directly.

The Llamacpp struct simplifies to just reqwest_client and base_url. Model listing already uses reqwest directly and is unaffected.

SSE Streaming

The llama.cpp server sends SSE events in this shape:

```json
data: {"choices":[{"delta":{"reasoning_content":"Let me think..."},"index":0}]}
data: {"choices":[{"delta":{"content":"The answer is..."},"index":0}]}
```

We parse each SSE event's JSON ourselves, reading both reasoning_content and content from the delta object. The llama.cpp server source (common_chat_msg_diff_to_json_oaicompat in common/chat.cpp) is the authoritative reference for the streaming delta format.

The SSE loop handles:

Event::Open — ignored
Event::Message with data: [DONE] — finalize and flush
Event::Message with JSON — parse as StreamChunk, route fields to events
Errors — delegated to the existing From<reqwest_eventsource::Error> for StreamError classifier in error.rs, which handles retry-after extraction, x-should-retry, and status code classification
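
The branches of this loop can be sketched without any networking. In the sketch below, `SseEvent` is a hypothetical local stand-in for `reqwest_eventsource::Event`, and `Action` and `classify` are illustrative names rather than the provider's real API. Ignoring empty payloads is an assumption of the sketch, and JSON parsing into `StreamChunk` and error classification are left out.

```rust
/// Hypothetical stand-in for `reqwest_eventsource::Event`, so the sketch
/// compiles without external crates.
#[derive(Debug)]
pub enum SseEvent {
    Open,
    Message { data: String },
}

/// What the loop should do next (illustrative type, not the provider's).
#[derive(Debug, PartialEq)]
pub enum Action {
    /// Nothing to do: the connection opened, or the payload was empty.
    Ignore,
    /// `[DONE]` sentinel: flush all buffered indices and stop.
    Finish,
    /// JSON payload that would be deserialized as a `StreamChunk`.
    Chunk(String),
}

pub fn classify(event: SseEvent) -> Action {
    match event {
        SseEvent::Open => Action::Ignore,
        SseEvent::Message { data } => {
            let data = data.trim();
            if data.is_empty() {
                // Assumption: empty payloads carry no information.
                Action::Ignore
            } else if data == "[DONE]" {
                Action::Finish
            } else {
                Action::Chunk(data.to_owned())
            }
        }
    }
}
```

Keeping the classification separate from the parsing is one way to keep the loop thin: the `[DONE]` check happens before any JSON parsing, so the sentinel never reaches the deserializer.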

Reasoning Routing

The implementation auto-detects which llama.cpp format is active based on which fields are present in each delta, with no user configuration required:

Format	`reasoning_content`	`content`	Routing
`deepseek`	Present	Regular text	`reasoning_content` → reasoning events directly
`none`	Absent	Everything with tags	`content` → `ReasoningExtractor` parses `<think>`
`deepseek-legacy`	Present	Everything with tags	`reasoning_content` → reasoning; `content` → text

For deepseek-legacy, when both fields are present, reasoning_content goes directly to reasoning events and content is treated as regular text (bypassing the extractor, since the server already separates them).
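
The routing rules can be expressed as a small pure function. This is a minimal sketch under stated assumptions, not the provider's code: `Routed` and `route` are hypothetical names, the decision is made per delta exactly as the prose describes, empty strings are treated like absent fields, and the `StreamDelta` here omits the `serde::Deserialize` derive and the tool call fields so that it needs no external crates.

```rust
/// Illustrative mirror of the two text fields read from each delta.
/// The real `StreamDelta` would derive `serde::Deserialize`.
#[derive(Debug, Default)]
pub struct StreamDelta {
    pub reasoning_content: Option<String>,
    pub content: Option<String>,
}

/// Where a piece of text should go (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Routed {
    /// Emit directly as a reasoning event.
    Reasoning(String),
    /// Emit directly as message text, bypassing the extractor.
    Text(String),
    /// Hand to `ReasoningExtractor`, which splits `<think>` blocks from text.
    NeedsExtraction(String),
}

pub fn route(delta: &StreamDelta) -> Vec<Routed> {
    // Assumption: an empty string is equivalent to an absent field.
    let reasoning = delta.reasoning_content.as_deref().filter(|s| !s.is_empty());
    let content = delta.content.as_deref().filter(|s| !s.is_empty());

    let mut out = Vec::new();
    if let Some(r) = reasoning {
        out.push(Routed::Reasoning(r.to_owned()));
    }
    match (reasoning, content) {
        // Both fields present: content is regular text.
        (Some(_), Some(c)) => out.push(Routed::Text(c.to_owned())),
        // Only content present: fall back to tag extraction.
        (None, Some(c)) => out.push(Routed::NeedsExtraction(c.to_owned())),
        (_, None) => {}
    }
    out
}
```

Under this per-delta rule, a content-only delta in the `deepseek` format also goes through the extractor, which finds no tags and passes the text through unchanged.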

Event Indexing

Following the Ollama provider's convention:

Index 0: reasoning content
Index 1: message content (or structured output)
Index 2+: tool calls

The ToolCallRequestAggregator is reused unchanged to accumulate partial tool call JSON chunks across multiple SSE events.

Flush Strategy

The strategy matches the Ollama provider: reasoning (index 0) flushes when the first content or tool call chunk arrives. This guarantees that reasoning events precede content and tool call events in the conversation history. On finish_reason or [DONE], all remaining indices flush.
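
The index convention and the flush ordering can be modelled as a small state machine. The sketch below is illustrative only: `Out`, `FlushState`, and its methods are hypothetical names, and tool call chunks are emitted as plain parts rather than being accumulated by `ToolCallRequestAggregator`.

```rust
use std::collections::BTreeSet;

pub const REASONING_INDEX: usize = 0;
pub const CONTENT_INDEX: usize = 1;
pub const FIRST_TOOL_CALL_INDEX: usize = 2;

/// Simplified stand-in for the provider-agnostic events.
#[derive(Debug, PartialEq)]
pub enum Out {
    /// A partial chunk for the given index.
    Part { index: usize, text: String },
    /// The given index is complete.
    Flush { index: usize },
}

/// Tracks which indices have received parts but have not been flushed.
#[derive(Debug, Default)]
pub struct FlushState {
    open: BTreeSet<usize>,
}

impl FlushState {
    pub fn on_reasoning(&mut self, text: &str, out: &mut Vec<Out>) {
        self.push(REASONING_INDEX, text, out);
    }

    pub fn on_content(&mut self, text: &str, out: &mut Vec<Out>) {
        // First content chunk closes reasoning so it precedes content.
        self.flush(REASONING_INDEX, out);
        self.push(CONTENT_INDEX, text, out);
    }

    pub fn on_tool_call(&mut self, nth: usize, json_part: &str, out: &mut Vec<Out>) {
        // First tool call chunk closes reasoning as well.
        self.flush(REASONING_INDEX, out);
        self.push(FIRST_TOOL_CALL_INDEX + nth, json_part, out);
    }

    /// Called on `finish_reason` or `[DONE]`: flush everything still open,
    /// in ascending index order.
    pub fn on_finish(&mut self, out: &mut Vec<Out>) {
        for index in std::mem::take(&mut self.open) {
            out.push(Out::Flush { index });
        }
    }

    fn push(&mut self, index: usize, text: &str, out: &mut Vec<Out>) {
        self.open.insert(index);
        out.push(Out::Part { index, text: text.to_owned() });
    }

    fn flush(&mut self, index: usize, out: &mut Vec<Out>) {
        if self.open.remove(&index) {
            out.push(Out::Flush { index });
        }
    }
}
```

Feeding a reasoning chunk, a content chunk, and a finish signal through this state yields the reasoning part, a flush of index 0, the content part, and then a flush of index 1. The sketch does not handle reasoning that arrives after content has started; how the real provider treats that case is not specified here.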

Request Building

Conversation history serialization produces serde_json::Value messages directly. Reasoning responses from previous turns are wrapped in <think> tags so the model can pick up its own chain-of-thought. Tool definitions use parameters_with_strict_mode from the OpenAI provider module (shared utility, not the openai crate).
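
As a rough illustration of the reasoning wrapping, the helper below builds the text of a previous assistant turn. The function name `assistant_content` is hypothetical, and both the choice to prepend the wrapped reasoning to the same message and the exact whitespace around the tags are assumptions of this sketch; the provider's actual layout may differ.

```rust
/// Wrap a previous turn's reasoning in `<think>` tags and prepend it to the
/// assistant text. Blank reasoning is skipped so no empty tags are sent.
pub fn assistant_content(reasoning: Option<&str>, text: &str) -> String {
    match reasoning.map(str::trim).filter(|r| !r.is_empty()) {
        Some(r) => format!("<think>\n{}\n</think>\n\n{}", r, text),
        None => text.to_owned(),
    }
}
```

The resulting string would then become the `content` of an assistant message in the serialized history.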

Alternatives

Patch the openai crate to add reasoning_content to ChatCompletionMessageDelta. Rejected because the field is a non-standard DeepSeek extension unlikely to be accepted upstream, and maintaining a fork for a single field is more burden than owning the SSE loop directly.
Keep the openai crate for request building, replace only streaming. Rejected because the crate's builder, message types, and credential handling are tightly coupled — replacing streaming alone leaves dead code and a confusing split of responsibilities.

Non-Goals

Adding user-facing configuration for --reasoning-format. The routing auto-detects the format from the delta fields.
Changing other providers. The Ollama and OpenRouter providers already handle reasoning correctly.

Risks

Owning the SSE loop: Edge cases like malformed events and the [DONE] sentinel are our responsibility. Mitigated by keeping the loop thin and delegating error classification to the existing reqwest_eventsource classifier.
ReasoningExtractor buffering latency: The extractor holds a small tail buffer (< 8 bytes) to detect split <think> tags. Only affects the none format path; latency is negligible.
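
To see why the tail buffer stays under 8 bytes, consider how a tag split across two chunks can be detected. The sketch below is not the `ReasoningExtractor` implementation; `held_back_len` and `split_safe` are hypothetical helpers that hold back only the longest suffix of the buffer that could still grow into the tag.

```rust
/// Number of trailing bytes of `buf` that form a proper prefix of `tag`
/// and therefore must be held until the next chunk arrives.
pub fn held_back_len(buf: &str, tag: &str) -> usize {
    // A complete tag is handled elsewhere, so only proper prefixes count.
    let max = tag.len().saturating_sub(1).min(buf.len());
    (1..=max)
        .rev()
        .find(|&n| buf.as_bytes().ends_with(&tag.as_bytes()[..n]))
        .unwrap_or(0)
}

/// Split `buf` into the part that is safe to emit now and the held tail.
/// The split point is always a char boundary because the tag is ASCII.
pub fn split_safe<'a>(buf: &'a str, tag: &str) -> (&'a str, &'a str) {
    buf.split_at(buf.len() - held_back_len(buf, tag))
}
```

For the 8-byte tag `</think>`, the held tail is at most 7 bytes, which is consistent with the bound stated above.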

Implementation

Define local serde types for the OpenAI Chat Completions streaming delta format: StreamChunk, StreamChoice, StreamDelta, ToolCallDelta, FunctionDelta. The key addition is reasoning_content on StreamDelta.
Replace request building with build_request() producing serde_json::Value. Conversation history, tool definitions, and structured output schemas serialize to JSON directly.
Replace openai crate streaming with reqwest_eventsource::EventSource. The handle_sse_event function processes each SSE event and routes reasoning/content/tool-call fields to provider-agnostic Event values.
Wire up reasoning routing: reasoning_content is used directly when present, ReasoningExtractor as fallback when only content is available.
Remove all openai crate imports from llamacpp.rs.
Add unit tests for delta parsing across the three formats, request building (tool call merging, reasoning wrapping), and tool choice conversion.
Re-record VCR cassettes against a live llama.cpp server.

References

llama.cpp server reasoning format docs: --reasoning-format flag in tools/server/README.md
llama.cpp source: common_chat_msg_diff_to_json_oaicompat in common/chat.cpp
Ollama provider (this codebase): reads message.thinking and message.content directly; uses the same index convention (0/1/2+) and flush strategy that this implementation mirrors.
yoagent openai_compat.rs: Reference implementation of OpenAI-compatible SSE streaming with reqwest_eventsource. Provides the serde type patterns and SSE loop structure adapted for this implementation.
