Nikketryhard b1bd57ab5e feat: forward generation params via MITM + add usageMetadata to Gemini

- Add GenerationParams struct to MitmStore for temperature, top_p,
  top_k, max_output_tokens, stop_sequences, frequency/presence_penalty
- MITM modify_request injects params into request.generationConfig
- All 3 endpoints (Completions, Responses, Gemini) store client params
- Add usageMetadata to Gemini sync responses (promptTokenCount,
  candidatesTokenCount, totalTokenCount, thoughtsTokenCount)
- Add generation param fields to GeminiRequest (temperature, topP, etc.)
- Completions stream_options.include_usage emits final usage chunk
- Completions reasoning_tokens in completion_tokens_details
- Update endpoint gap analysis doc (all high-priority gaps resolved)

2026-02-15 14:23:05 -06:00

Endpoint Gap Analysis

Generated: 2026-02-15 (updated)
Proxy Version: 3.1.0
Scope: All three API endpoints vs official OpenAI / Gemini specifications

Endpoint Overview
Feature Parity Matrix
Detailed Endpoint Analysis
Priority Gaps
Architecture Notes

Endpoint Overview

The proxy exposes three main API endpoints, each serving different client ecosystems:

Endpoint	Protocol	Primary Clients	Spec Reference
`POST /v1/responses`	OpenAI Responses API	Claude Code, Antigravity-native clients	platform.openai.com/docs/api-reference/responses
`POST /v1/chat/completions`	OpenAI Chat Completions API	OpenCode, Vercel AI SDK, any OpenAI-compatible client	platform.openai.com/docs/api-reference/chat
`POST /v1/gemini`	Custom Gemini-native API	Direct Gemini-format consumers	ai.google.dev/api (loosely based)

All three endpoints share the same backend pipeline:

Client Request → Proxy Endpoint → LS (Language Server) → Google API
                                ↓
                          MITM Proxy (captures real usage + injects generation params + tool calls)

Feature Parity Matrix

Core Features

Feature	Responses	Completions	Gemini
Sync mode	✅	✅	✅
Streaming mode (SSE)	✅	✅	✅
Model selection	✅	✅	✅
Model validation	✅	✅	✅
Auth check (OAuth)	✅	✅	✅
Timeout control	✅	✅	✅

Generation Parameters (MITM-injected)

Feature	Responses	Completions	Gemini
`temperature`	✅	✅	✅
`top_p` / `topP`	✅	✅	✅
`top_k` / `topK`	❌	❌	✅
`max_output_tokens`	✅	✅	✅
`stop_sequences`	❌	❌	✅
`frequency_penalty`	❌	✅	❌
`presence_penalty`	❌	✅	❌

Note: Every parameter marked ✅ above is forwarded to Google's API via MITM injection into `request.generationConfig`, where it overrides the LS default. Parameters marked ❌ are not accepted by that endpoint.

Thinking / Reasoning

Feature	Responses	Completions	Gemini
Thinking — LS path (streaming)	✅ `reasoning_summary_text.delta`	✅ `reasoning_content` delta	✅ `thought: true` part
Thinking — LS path (sync)	✅ `reasoning` output item	✅ `reasoning_content` in message	✅ `thought: true` part
Thinking — Bypass path (streaming)	✅	✅	✅
Thinking — Bypass path (sync)	✅	✅	✅
Thinking signature (multi-turn)	✅ `thinking_signature` field	❌ Not applicable	❌ Not applicable

Tool Calls

Feature	Responses	Completions	Gemini
Tool definitions input	✅ OpenAI format → Gemini	✅ OpenAI format → Gemini	✅ Native Gemini format
Tool choice control	✅ `tool_choice`	✅ `tool_choice`	✅ `tool_config`
Tool call output (streaming)	✅ `function_call` items	✅ `tool_calls` in delta	✅ `functionCall` parts
Tool call output (sync)	✅ `function_call` items	✅ `tool_calls` in message	✅ `functionCall` parts
Tool result input	✅ `function_call_output` items	✅ `tool` role messages	✅ `functionResponse` in `tool_results`
MITM bypass (custom tools)	✅	✅	✅
Stale state protection	✅	✅	✅

Session Management

Feature	Responses	Completions	Gemini
Session/conversation reuse	✅ `conversation` field	❌ Not supported	✅ `conversation` field
Session listing (`GET /v1/sessions`)	✅ Shared	✅ Shared	✅ Shared
Session deletion	✅ Shared	✅ Shared	✅ Shared

Usage / Token Tracking

Feature	Responses	Completions	Gemini
Usage in sync response	✅ MITM real tokens	✅ MITM real tokens	✅ `usageMetadata`
Usage in streaming (final chunk)	❌ Not emitted	✅ `stream_options.include_usage`	❌ Not emitted
`reasoning_tokens` in usage	✅ In `output_tokens_details`	✅ In `completion_tokens_details`	✅ `thoughtsTokenCount`
Cache tokens	✅ `cached_tokens`	✅ `cached_tokens`	✅ `cachedContentTokenCount`
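
The usage rows above describe one set of captured counts reported under three different field layouts. A minimal sketch of that correspondence follows. `CapturedUsage`, `Endpoint`, and `usage_fields` are hypothetical names for illustration, not the proxy's actual types, and the Responses field names `input_tokens` / `output_tokens` are taken from the OpenAI spec rather than from this document.

```rust
/// Token counts captured by the MITM layer (hypothetical struct for illustration).
struct CapturedUsage {
    input: u64,
    output: u64,
    thinking: u64,
    cached: u64,
}

#[derive(Clone, Copy)]
enum Endpoint {
    Responses,
    Completions,
    Gemini,
}

/// Returns (dotted field path, value) pairs in the layout each endpoint reports.
fn usage_fields(endpoint: Endpoint, u: &CapturedUsage) -> Vec<(&'static str, u64)> {
    // Assumption: the total is simply input + output.
    let total = u.input + u.output;
    match endpoint {
        Endpoint::Responses => vec![
            ("usage.input_tokens", u.input),
            ("usage.output_tokens", u.output),
            ("usage.total_tokens", total),
            ("usage.input_tokens_details.cached_tokens", u.cached),
            ("usage.output_tokens_details.reasoning_tokens", u.thinking),
        ],
        Endpoint::Completions => vec![
            ("usage.prompt_tokens", u.input),
            ("usage.completion_tokens", u.output),
            ("usage.total_tokens", total),
            ("usage.prompt_tokens_details.cached_tokens", u.cached),
            ("usage.completion_tokens_details.reasoning_tokens", u.thinking),
        ],
        Endpoint::Gemini => vec![
            ("usageMetadata.promptTokenCount", u.input),
            ("usageMetadata.candidatesTokenCount", u.output),
            ("usageMetadata.totalTokenCount", total),
            ("usageMetadata.thoughtsTokenCount", u.thinking),
            ("usageMetadata.cachedContentTokenCount", u.cached),
        ],
    }
}
```

Keeping the counts in one neutral struct and formatting per endpoint at the edge is one way to keep the three layouts from drifting apart.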

Detailed Endpoint Analysis

Responses API (`/v1/responses`)

Spec: OpenAI Responses API

Request Fields

Field	Spec	Status	Implementation Details
`model`	Required	✅	Mapped to internal model enum via `lookup_model()`
`input`	Required	✅	String or array. Array supports `message` items and `function_call_output` items
`instructions`	Optional	✅	Prepended to user text as system instructions
`stream`	Optional	✅	SSE stream with `response.*` events
`tools`	Optional	✅	OpenAI function format → auto-converted to Gemini `functionDeclarations` via `openai_tools_to_gemini()`
`tool_choice`	Optional	✅	`"auto"`, `"required"`, `"none"`, or `{"type":"function","function":{"name":"X"}}` → converted to Gemini `functionCallingConfig`
`store`	Optional	✅	Accepted, echoed in response. Not actually persisted.
`temperature`	Optional	✅	Forwarded to Google via MITM `generationConfig` injection.
`top_p`	Optional	✅	Forwarded to Google via MITM.
`max_output_tokens`	Optional	✅	Forwarded to Google via MITM.
`previous_response_id`	Optional	✅	Accepted, echoed. Not used for chaining (use `conversation` instead).
`metadata`	Optional	✅	Accepted, echoed back in response.
`user`	Optional	✅	Accepted, echoed.
`conversation`	Extension	✅	Proxy-specific: session ID for multi-turn cascade reuse.
`timeout`	Extension	✅	Proxy-specific: request timeout in seconds (default 120).
`reasoning.effort`	Optional	❌	Could map to model variant selection (e.g., `"high"` → Opus, `"low"` → Flash).
`reasoning.generate_summary`	Optional	❌	Not implemented. Could control thinking output inclusion.
`truncation`	Optional	❌	Not applicable — LS manages context window.
`parallel_tool_calls`	Optional	✅	Hardcoded `true` in response.
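
The `tool_choice` row above says the OpenAI values are converted to a Gemini `functionCallingConfig`. A minimal sketch of one plausible mapping is shown below. The Gemini modes `AUTO`, `ANY`, and `NONE` and the allow-list of function names are part of the Gemini API, but the exact mapping the proxy applies is an assumption, and `ToolChoice`, `FunctionCallingConfig`, and `to_function_calling_config` are hypothetical names.

```rust
/// OpenAI-style `tool_choice` values accepted by the endpoint.
#[derive(Debug, Clone, PartialEq)]
enum ToolChoice {
    Auto,
    Required,
    None,
    /// `{"type":"function","function":{"name":"X"}}`
    Function(String),
}

/// Gemini `functionCallingConfig`: a mode plus an optional allow-list.
#[derive(Debug, Clone, PartialEq)]
struct FunctionCallingConfig {
    mode: &'static str,
    allowed_function_names: Vec<String>,
}

/// Assumed mapping: "required" and a named function both force a call (`ANY`);
/// the named form additionally restricts which function may be called.
fn to_function_calling_config(choice: &ToolChoice) -> FunctionCallingConfig {
    let (mode, allowed) = match choice {
        ToolChoice::Auto => ("AUTO", Vec::new()),
        ToolChoice::Required => ("ANY", Vec::new()),
        ToolChoice::None => ("NONE", Vec::new()),
        ToolChoice::Function(name) => ("ANY", vec![name.clone()]),
    };
    FunctionCallingConfig {
        mode,
        allowed_function_names: allowed,
    }
}
```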

Response Object

Field	Spec	Status	Notes
`id`	Required	✅	`resp_` + UUID
`object`	Required	✅	Always `"response"`
`created_at`	Required	✅	Unix timestamp
`status`	Required	✅	`"completed"` or `"incomplete"`
`completed_at`	Required	✅	Unix timestamp or null
`error`	Required	✅	null on success
`incomplete_details`	Required	✅	null
`instructions`	Required	✅	Echoed from request
`max_output_tokens`	Required	✅	Echoed or null
`model`	Required	✅	Model name string
`output`	Required	✅	Array of `reasoning`, `message`, and/or `function_call` items
`parallel_tool_calls`	Required	✅	`true`
`previous_response_id`	Required	✅	Echoed or null
`reasoning`	Required	✅	`{effort: null, summary: null}`
`store`	Required	✅	Echoed
`temperature`	Required	✅	Echoed (default 1.0)
`text`	Required	✅	`{format: {type: "text"}}`
`tool_choice`	Required	✅	`"auto"`
`tools`	Required	✅	Echoed or `[]`
`top_p`	Required	✅	Echoed (default 1.0)
`truncation`	Required	✅	`"disabled"`
`usage`	Required	✅	MITM-intercepted real tokens when available, estimated otherwise
`user`	Required	✅	Echoed or null
`metadata`	Required	✅	Echoed or `{}`
`thinking_signature`	Extension	✅	Proxy-specific: opaque blob for multi-turn thinking chain

Streaming Events

Event	Spec	Status	Notes
`response.created`	Required	✅	Initial response shell
`response.in_progress`	Required	✅
`response.output_item.added`	Required	✅	For reasoning + message items
`response.content_part.added`	Required	✅
`response.output_text.delta`	Required	✅	Progressive text deltas
`response.output_text.done`	Required	✅
`response.content_part.done`	Required	✅
`response.output_item.done`	Required	✅
`response.completed`	Required	✅	Final event with full response
`response.reasoning_summary_text.delta`	Required	✅	Progressive thinking deltas
`response.reasoning_summary_text.done`	Required	✅
`response.function_call_arguments.delta`	For tools	✅	Tool call argument streaming
`response.function_call_arguments.done`	For tools	✅
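
For a text-only response without thinking or tool calls, the events in the table arrive in a fixed order. The sketch below lists that order and frames a single event. It assumes the proxy follows the OpenAI convention of an `event:` line followed by a `data:` line; both function names are hypothetical.

```rust
/// Event names for one text-only streamed response, in emission order.
/// `delta_count` is the number of text deltas the model produced.
fn text_only_event_sequence(delta_count: usize) -> Vec<&'static str> {
    let mut events = vec![
        "response.created",
        "response.in_progress",
        "response.output_item.added",
        "response.content_part.added",
    ];
    events.extend(std::iter::repeat("response.output_text.delta").take(delta_count));
    events.extend([
        "response.output_text.done",
        "response.content_part.done",
        "response.output_item.done",
        "response.completed",
    ]);
    events
}

/// Frames one SSE event: an `event:` line, a `data:` line, then a blank line.
fn sse_frame(event: &str, json_payload: &str) -> String {
    format!("event: {}\ndata: {}\n\n", event, json_payload)
}
```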

Chat Completions API (`/v1/chat/completions`)

Spec: OpenAI Chat Completions API

Request Fields

Field	Spec	Status	Implementation Details
`model`	Required	✅	Mapped to internal model enum
`messages`	Required	✅	Supports `system`, `developer`, `user`, `assistant`, `tool` roles
`messages[].content`	Required	✅	String or array of `{type: "text", text: "..."}` objects
`messages[].tool_calls`	Optional	✅	For assistant messages with tool calls
`messages[].tool_call_id`	Optional	✅	For tool result messages
`stream`	Optional	✅	SSE with `chat.completion.chunk` events
`stream_options`	Optional	✅	`include_usage: true` emits final usage chunk before `[DONE]`
`tools`	Optional	✅	OpenAI function format → auto-converted to Gemini
`tool_choice`	Optional	✅	`"auto"`, `"none"`, `"required"`, or specific function
`timeout`	Extension	✅	Proxy-specific (default 120s)
`temperature`	Optional	✅	Forwarded to Google via MITM `generationConfig.temperature`
`top_p`	Optional	✅	Forwarded to Google via MITM `generationConfig.topP`
`max_tokens`	Optional	✅	Forwarded to Google via MITM `generationConfig.maxOutputTokens`
`max_completion_tokens`	Optional	✅	Forwarded (same as `max_tokens`, newer OpenAI param)
`frequency_penalty`	Optional	✅	Forwarded to Google via MITM `generationConfig.frequencyPenalty`
`presence_penalty`	Optional	✅	Forwarded to Google via MITM `generationConfig.presencePenalty`
`user`	Optional	✅	Accepted, not used
`n`	Optional	❌	N/A — single generation only
`logprobs`	Optional	❌	N/A
`top_logprobs`	Optional	❌	N/A
`logit_bias`	Optional	❌	N/A
`response_format`	Optional	❌	Could be useful for JSON mode
`seed`	Optional	❌	N/A
`stop`	Optional	❌	Could be forwarded as `stopSequences`

Sync Response Object

Field	Spec	Status	Notes
`id`	Required	✅	`chatcmpl-` + UUID
`object`	Required	✅	`"chat.completion"`
`created`	Required	✅	Unix timestamp
`model`	Required	✅	Model name
`choices[0].index`	Required	✅	`0`
`choices[0].message.role`	Required	✅	`"assistant"`
`choices[0].message.content`	Required	✅	Response text
`choices[0].message.reasoning_content`	Extension	✅	Thinking text (when model produces thinking)
`choices[0].message.tool_calls`	Conditional	✅	When model returns tool calls
`choices[0].message.refusal`	Optional	❌	Not implemented
`choices[0].message.annotations`	Optional	❌	Not implemented
`choices[0].logprobs`	Optional	❌	Not implemented
`choices[0].finish_reason`	Required	✅	`"stop"` or `"tool_calls"`
`usage.prompt_tokens`	Required	✅	MITM real or estimated
`usage.completion_tokens`	Required	✅	MITM real or estimated
`usage.total_tokens`	Required	✅	Sum
`usage.prompt_tokens_details.cached_tokens`	Optional	✅	MITM cache read tokens
`usage.completion_tokens_details.reasoning_tokens`	Optional	✅	MITM thinking token count
`usage.completion_tokens_details.accepted_prediction_tokens`	Optional	❌	N/A
`usage.completion_tokens_details.rejected_prediction_tokens`	Optional	❌	N/A
`system_fingerprint`	Deprecated	❌	Cosmetic, not needed
`service_tier`	Optional	❌	Cosmetic, not needed

Streaming Chunk Object

Field	Spec	Status	Notes
`id`	Required	✅	Same across all chunks
`object`	Required	✅	`"chat.completion.chunk"`
`created`	Required	✅	Same across all chunks
`model`	Required	✅
`choices[0].index`	Required	✅	`0`
`choices[0].delta.role`	First chunk	✅	`"assistant"` in first chunk
`choices[0].delta.content`	Text chunks	✅	Progressive text deltas
`choices[0].delta.reasoning_content`	Thinking chunks	✅	Progressive thinking deltas
`choices[0].delta.tool_calls`	Tool chunks	✅	Tool call data
`choices[0].delta`	Final chunk	✅	Empty `{}`
`choices[0].finish_reason`	Final chunk	✅	`"stop"` or `"tool_calls"`
`choices[0].logprobs`	Optional	❌	Not implemented
`usage` (final chunk)	Optional	✅	Emitted when `stream_options.include_usage` is `true`
`data: [DONE]`	Required	✅	Stream termination signal
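
The chunk table implies a fixed frame order: a role chunk, content chunks, a final chunk with `finish_reason`, an optional usage chunk, then `[DONE]`. The sketch below builds that sequence by hand with the standard library only. `Usage`, `json_escape`, `chunk`, and `stream_frames` are hypothetical names, and the empty `choices` array in the usage chunk follows the OpenAI spec rather than anything stated in this document.

```rust
/// Token counts reported in the optional usage chunk.
struct Usage {
    prompt_tokens: u64,
    completion_tokens: u64,
}

/// Escapes characters that must not appear raw inside a JSON string.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Formats one `chat.completion.chunk` frame; `delta` and `finish_reason`
/// are already-serialized JSON fragments.
fn chunk(id: &str, model: &str, created: u64, delta: &str, finish_reason: &str) -> String {
    format!(
        "data: {{\"id\":\"{}\",\"object\":\"chat.completion.chunk\",\"created\":{},\"model\":\"{}\",\"choices\":[{{\"index\":0,\"delta\":{},\"finish_reason\":{}}}]}}\n\n",
        id, created, model, delta, finish_reason
    )
}

/// Builds the full frame sequence for a text-only stream.
fn stream_frames(
    id: &str,
    model: &str,
    created: u64,
    deltas: &[&str],
    usage: Option<&Usage>,
) -> Vec<String> {
    let mut frames = vec![chunk(id, model, created, "{\"role\":\"assistant\"}", "null")];
    for d in deltas {
        let delta = format!("{{\"content\":\"{}\"}}", json_escape(d));
        frames.push(chunk(id, model, created, &delta, "null"));
    }
    // Final content chunk: empty delta plus the finish reason.
    frames.push(chunk(id, model, created, "{}", "\"stop\""));
    // Only present when the client set `stream_options.include_usage`.
    if let Some(u) = usage {
        frames.push(format!(
            "data: {{\"id\":\"{}\",\"object\":\"chat.completion.chunk\",\"created\":{},\"model\":\"{}\",\"choices\":[],\"usage\":{{\"prompt_tokens\":{},\"completion_tokens\":{},\"total_tokens\":{}}}}}\n\n",
            id, created, model, u.prompt_tokens, u.completion_tokens,
            u.prompt_tokens + u.completion_tokens
        ));
    }
    frames.push("data: [DONE]\n\n".to_string());
    frames
}
```

A production implementation would normally serialize with a JSON library; the manual formatting here only keeps the sketch dependency-free.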

Gemini API (`/v1/gemini`)

Spec: Custom endpoint loosely based on Gemini REST API

Note: This is NOT a 1:1 Gemini API replica. It is a simplified proxy-native endpoint that uses Gemini's `functionDeclarations` / `functionCall` / `functionResponse` format directly, avoiding the overhead of OpenAI ↔ Gemini format conversion.

Request Fields

Field	Spec	Status	Implementation Details
`model`	Required	✅	Mapped to internal model enum
`input`	Required	✅	String only (no array/multipart)
`tools`	Optional	✅	Native Gemini `[{functionDeclarations: [...]}]` format
`tool_config`	Optional	✅	Native Gemini `{functionCallingConfig: {mode: "AUTO"}}`
`tool_results`	Optional	✅	Array of `{functionResponse: {name, response}}`
`conversation`	Optional	✅	Session ID for cascade reuse
`stream`	Optional	✅	SSE streaming
`timeout`	Optional	✅	Default 120s
`temperature`	Optional	✅	Forwarded to Google via MITM `generationConfig.temperature`
`top_p` / `topP`	Optional	✅	Forwarded to Google via MITM `generationConfig.topP`
`top_k` / `topK`	Optional	✅	Forwarded to Google via MITM `generationConfig.topK`
`max_output_tokens` / `maxOutputTokens`	Optional	✅	Forwarded via MITM `generationConfig.maxOutputTokens`
`stop_sequences` / `stopSequences`	Optional	✅	Forwarded via MITM `generationConfig.stopSequences`

Sync Response Object

Field	Spec	Status	Notes
`candidates[0].content.parts`	Required	✅	Array of text/functionCall parts
`candidates[0].content.parts[].text`	Text	✅	Response text
`candidates[0].content.parts[].thought`	Extension	✅	`true` for thinking parts
`candidates[0].content.parts[].functionCall`	Tool call	✅	`{name, args}`
`candidates[0].content.role`	Required	✅	`"model"`
`candidates[0].finishReason`	Required	✅	`"STOP"`
`modelVersion`	Required	✅	Model name string
`usageMetadata`	Optional	✅	MITM-intercepted token counts

usageMetadata fields:

Field	Status	Notes
`promptTokenCount`	✅	Input tokens
`candidatesTokenCount`	✅	Output tokens
`totalTokenCount`	✅	Input + output
`thoughtsTokenCount`	✅	Thinking/reasoning tokens
`cachedContentTokenCount`	✅	Cache read tokens

Streaming Format

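Each SSE `data:` line carries a complete Gemini-format JSON object whose `candidates[0].content.parts` holds the next increment of output. The last object sets `finishReason`, and `data: [DONE]` terminates the stream: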

data: {"candidates":[{"content":{"parts":[{"text":"thinking...","thought":true}],"role":"model"}}],"modelVersion":"opus-4.6"}

data: {"candidates":[{"content":{"parts":[{"text":"Hello!"}],"role":"model"}}],"modelVersion":"opus-4.6"}

data: {"candidates":[{"content":{"parts":[{"text":""}],"role":"model"},"finishReason":"STOP"}],"modelVersion":"opus-4.6"}

data: [DONE]
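
On the client side, consuming this stream amounts to collecting `data:` payloads until the `[DONE]` sentinel. A minimal sketch, assuming the whole body is already buffered as a string; `sse_payloads` is a hypothetical helper and leaves JSON decoding to the caller.

```rust
/// Extracts the JSON payloads from an SSE body, stopping at `[DONE]`.
fn sse_payloads(body: &str) -> Vec<&str> {
    let mut payloads = Vec::new();
    for line in body.lines() {
        // Only `data:` lines carry payloads; blank separator lines are skipped.
        if let Some(rest) = line.strip_prefix("data:") {
            let payload = rest.trim();
            if payload == "[DONE]" {
                break; // stream termination signal
            }
            if !payload.is_empty() {
                payloads.push(payload);
            }
        }
    }
    payloads
}
```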

Priority Gaps

🔴 High Priority — RESOLVED ✅

All high-priority gaps have been addressed:

~~Completions: stream_options.include_usage~~ → ✅ Implemented
~~Completions: completion_tokens_details.reasoning_tokens~~ → ✅ Implemented
~~Completions: Accept temperature, top_p, max_tokens~~ → ✅ Forwarded via MITM
~~Gemini: usageMetadata~~ → ✅ Implemented

🟡 Medium Priority

- Responses: `reasoning.effort`
  - What: Map reasoning effort levels (`"high"`, `"medium"`, `"low"`) to model variant selection.
  - Why: Could automatically select Opus vs Flash based on reasoning needs.
  - Effort: Medium. Needs changes to the model selection logic.
- Completions: session/conversation support
  - What: Add session reuse similar to the Responses and Gemini endpoints.
  - Why: Would allow multi-turn conversations via the Completions API.
  - Effort: Medium. Needs a way to pass the session ID (possibly via the `user` field or a custom header).
- Completions: `stop` sequences
  - What: Forward `stop` to Google as `stopSequences` in `generationConfig`.
  - Why: Some clients use stop sequences to control generation.
  - Effort: Trivial. Add `stop` to `CompletionRequest` and pass it through `GenerationParams`.
- Completions: `response_format` (JSON mode)
  - What: Map `response_format: {"type": "json_object"}` to Google's `responseMimeType`.
  - Why: Useful for structured output.
  - Effort: Low. Inject `responseMimeType: "application/json"` into `generationConfig`.
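
The last two gaps are both small value translations. A minimal sketch of what they might look like follows. `Stop`, `to_stop_sequences`, and `response_mime_type` are hypothetical names, and the handling of format types other than `json_object` is an assumption.

```rust
/// OpenAI `stop`: either a single string or an array of strings.
enum Stop {
    One(String),
    Many(Vec<String>),
}

/// Normalizes `stop` into the list shape Gemini's `stopSequences` expects.
fn to_stop_sequences(stop: Stop) -> Vec<String> {
    match stop {
        Stop::One(s) => vec![s],
        Stop::Many(v) => v,
    }
}

/// Maps an OpenAI `response_format.type` to a Gemini `responseMimeType`.
/// Returns `None` when nothing should be injected (assumed for "text"
/// and for any unrecognized type).
fn response_mime_type(format_type: &str) -> Option<&'static str> {
    match format_type {
        "json_object" => Some("application/json"),
        _ => None,
    }
}
```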

🟢 Low Priority

Cosmetic or not applicable to our architecture:

- `system_fingerprint`: OpenAI-specific field, meaningless for our proxy
- `service_tier`: OpenAI billing concept, not applicable
- `n > 1`: multiple completions per request; our backend only generates one
- `logprobs`: would require token-level access we don't have
- `seed`: deterministic sampling is not controllable through our proxy

Architecture Notes

Generation Parameter Injection

Client-specified sampling parameters are forwarded to Google's API via the MITM request modification pipeline:

Client sends temperature=0.5  →  API handler stores in MitmStore.generation_params
                                        ↓
                                  LS sends request to Google API
                                        ↓
                                  MITM intercepts request
                                        ↓
                                  modify_request() reads generation_params
                                        ↓
                                  Injects into request.generationConfig:
                                    temperature, topP, topK, maxOutputTokens,
                                    stopSequences, frequencyPenalty, presencePenalty
                                        ↓
                                  Forwards modified request to Google

This approach overrides whatever defaults the LS sets, giving clients direct control over sampling parameters.
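
The override behaviour can be sketched as a merge in which only the fields the client actually set replace the LS defaults. This is an illustrative model, not the proxy's `modify_request()`: the `Value` enum and `inject` function are hypothetical, and `generationConfig` is modelled as a plain map to avoid a JSON dependency. The field list mirrors the `GenerationParams` fields named in this document.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON value (hypothetical, for illustration).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Number(f64),
    Integer(u64),
    Strings(Vec<String>),
}

/// Client-supplied sampling parameters; `None` means "not specified".
#[derive(Debug, Default, Clone)]
struct GenerationParams {
    temperature: Option<f64>,
    top_p: Option<f64>,
    top_k: Option<u64>,
    max_output_tokens: Option<u64>,
    stop_sequences: Option<Vec<String>>,
    frequency_penalty: Option<f64>,
    presence_penalty: Option<f64>,
}

/// Writes every specified parameter into `config` under its camelCase key,
/// replacing any LS default; unspecified parameters leave `config` untouched.
fn inject(params: &GenerationParams, config: &mut BTreeMap<&'static str, Value>) {
    let numbers = [
        ("temperature", params.temperature),
        ("topP", params.top_p),
        ("frequencyPenalty", params.frequency_penalty),
        ("presencePenalty", params.presence_penalty),
    ];
    for (key, value) in numbers {
        if let Some(v) = value {
            config.insert(key, Value::Number(v));
        }
    }
    let integers = [
        ("topK", params.top_k),
        ("maxOutputTokens", params.max_output_tokens),
    ];
    for (key, value) in integers {
        if let Some(v) = value {
            config.insert(key, Value::Integer(v));
        }
    }
    if let Some(stops) = &params.stop_sequences {
        config.insert("stopSequences", Value::Strings(stops.clone()));
    }
}
```

Using `Option` for every field is what lets the merge distinguish "client asked for 0.0" from "client said nothing".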

Dual Path Architecture

All three endpoints share a dual-path architecture:

                      ┌─────────────────┐
                      │  Has custom      │
Request ────────────► │  tools?          │
                      └────────┬────────┘
                               │
                    ┌──── Yes ──┴── No ────┐
                    │                       │
              ┌─────▼─────┐          ┌─────▼─────┐
              │  MITM      │          │  LS Steps  │
              │  Bypass    │          │  Polling   │
              │  Path      │          │  Path      │
              └─────┬─────┘          └─────┬─────┘
                    │                       │
              ┌─────▼─────┐          ┌─────▼─────┐
              │  Poll      │          │  Poll      │
              │  MitmStore │          │  get_steps │
              │  directly  │          │  from LS   │
              └─────┬─────┘          └─────┬─────┘
                    │                       │
                    └──────────┬────────────┘
                               │
                         ┌─────▼─────┐
                         │  Response  │
                         │  to client │
                         └───────────┘

Bypass Path: When custom tools are present, the handler polls MitmStore directly for response text, thinking text, and function calls. The MITM proxy captures these from the Google API response before the LS processes them.
LS Path: When no custom tools are present, the handler polls the LS's get_steps API for progressive response data (text, thinking, status).
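
The branch at the top of the diagram reduces to a single check on the request's tools. A minimal sketch, with `ResponsePath` and `select_path` as hypothetical names and tool names standing in for full tool definitions:

```rust
/// Which data source the handler polls for the response.
#[derive(Debug, PartialEq)]
enum ResponsePath {
    /// Poll MitmStore directly for text, thinking, and function calls.
    MitmBypass,
    /// Poll the LS `get_steps` API for progressive response data.
    LsSteps,
}

/// Custom tools present: bypass path. No custom tools: LS path.
fn select_path(custom_tools: &[&str]) -> ResponsePath {
    if custom_tools.is_empty() {
        ResponsePath::LsSteps
    } else {
        ResponsePath::MitmBypass
    }
}
```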

Stale State Protection

All bypass paths include protection against stale response_complete flags from previous requests:

if complete && text.is_empty() && thinking.is_none() {
    warn!("stale response_complete detected — clearing");
    state.mitm_store.clear_response_async().await;
    continue; // or retry
}

This handles the race condition where a previous request's MITM handler calls mark_response_complete() after the new request has already called clear_response_async().
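
To make the check concrete, here is a synchronous simulation of the polling loop over a fixed series of store snapshots. It is a sketch only: `Snapshot`, `is_stale`, and `poll_until_complete` are hypothetical, and the real handler is async and clears the store rather than counting.

```rust
/// One observation of the store's response state.
#[derive(Default, Clone)]
struct Snapshot {
    complete: bool,
    text: String,
    thinking: Option<String>,
}

/// A completion flag with no text and no thinking cannot belong to the
/// current request, so it is treated as left over from a previous one.
fn is_stale(s: &Snapshot) -> bool {
    s.complete && s.text.is_empty() && s.thinking.is_none()
}

/// Walks the snapshots in order. Returns the first complete, non-stale
/// snapshot (if any) and how many stale flags were skipped on the way.
fn poll_until_complete(snapshots: &[Snapshot]) -> (Option<Snapshot>, usize) {
    let mut cleared = 0;
    for snap in snapshots {
        if is_stale(snap) {
            cleared += 1; // the real handler would clear the store here
            continue;
        }
        if snap.complete {
            return (Some(snap.clone()), cleared);
        }
    }
    (None, cleared)
}
```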

Tool Format Conversion

OpenAI tools ──► openai_tools_to_gemini() ──► Gemini functionDeclarations
                                                    │
                                              MitmStore.set_tools()
                                                    │
                                              MITM proxy injects into
                                              outgoing LS request

The Gemini endpoint skips this conversion entirely — tools are stored in native Gemini format.
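
The conversion itself is mostly a re-wrapping: each OpenAI function tool becomes one entry in a single `functionDeclarations` list. The sketch below illustrates the shape change only and is not the proxy's `openai_tools_to_gemini()`. All type names are hypothetical, and the parameter schema is carried as an opaque JSON string to keep the example dependency-free.

```rust
/// An OpenAI tool entry: `{"type": ..., "function": {name, description, parameters}}`.
#[derive(Debug, Clone, PartialEq)]
struct OpenAiTool {
    kind: String,
    name: String,
    description: Option<String>,
    parameters_json: String,
}

/// One Gemini function declaration.
#[derive(Debug, Clone, PartialEq)]
struct FunctionDeclaration {
    name: String,
    description: Option<String>,
    parameters_json: String,
}

/// A Gemini tool entry: `{functionDeclarations: [...]}`.
#[derive(Debug, PartialEq)]
struct GeminiTool {
    function_declarations: Vec<FunctionDeclaration>,
}

/// Groups all function-type tools into one Gemini tool entry.
/// Non-function tools are dropped (an assumption of this sketch).
fn openai_tools_to_gemini_sketch(tools: &[OpenAiTool]) -> Vec<GeminiTool> {
    let declarations: Vec<FunctionDeclaration> = tools
        .iter()
        .filter(|t| t.kind == "function")
        .map(|t| FunctionDeclaration {
            name: t.name.clone(),
            description: t.description.clone(),
            parameters_json: t.parameters_json.clone(),
        })
        .collect();
    if declarations.is_empty() {
        Vec::new()
    } else {
        vec![GeminiTool { function_declarations: declarations }]
    }
}
```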
