Endpoint Gap Analysis
Generated: 2026-02-15 (updated)
Proxy Version: 3.1.0
Scope: All three API endpoints vs official OpenAI / Gemini specifications
Endpoint Overview
The proxy exposes three main API endpoints, each serving different client ecosystems:
| Endpoint | Protocol | Primary Clients | Spec Reference |
|---|---|---|---|
| POST /v1/responses | OpenAI Responses API | Claude Code, Antigravity-native clients | platform.openai.com/docs/api-reference/responses |
| POST /v1/chat/completions | OpenAI Chat Completions API | OpenCode, Vercel AI SDK, any OpenAI-compatible client | platform.openai.com/docs/api-reference/chat |
| POST /v1/gemini | Custom Gemini-native API | Direct Gemini-format consumers | ai.google.dev/api (loosely based) |
All three endpoints share the same backend pipeline:
Client Request → Proxy Endpoint → LS (Language Server) → Google API
↓
MITM Proxy (captures real usage + injects generation params + tool calls)
Feature Parity Matrix
Core Features
| Feature | Responses | Completions | Gemini |
|---|---|---|---|
| Sync mode | ✅ | ✅ | ✅ |
| Streaming mode (SSE) | ✅ | ✅ | ✅ |
| Model selection | ✅ | ✅ | ✅ |
| Model validation | ✅ | ✅ | ✅ |
| Auth check (OAuth) | ✅ | ✅ | ✅ |
| Timeout control | ✅ | ✅ | ✅ |
Generation Parameters (MITM-injected)
| Feature | Responses | Completions | Gemini |
|---|---|---|---|
| temperature | ✅ | ✅ | ✅ |
| top_p / topP | ✅ | ✅ | ✅ |
| top_k / topK | ❌ | ❌ | ✅ |
| max_output_tokens | ✅ | ✅ | ✅ |
| stop_sequences | ❌ | ❌ | ✅ |
| frequency_penalty | ❌ | ✅ | ❌ |
| presence_penalty | ❌ | ✅ | ❌ |
Note: Every generation parameter that an endpoint accepts is forwarded to Google's API via MITM injection into request.generationConfig, where it overrides the LS default.
Thinking / Reasoning
| Feature | Responses | Completions | Gemini |
|---|---|---|---|
| Thinking: LS path (streaming) | ✅ reasoning_summary_text.delta | ✅ reasoning_content delta | ✅ thought: true part |
| Thinking: LS path (sync) | ✅ reasoning output item | ✅ reasoning_content in message | ✅ thought: true part |
| Thinking: Bypass path (streaming) | ✅ | ✅ | ✅ |
| Thinking: Bypass path (sync) | ✅ | ✅ | ✅ |
| Thinking signature (multi-turn) | ✅ thinking_signature field | ❌ Not applicable | ❌ Not applicable |
Tool Calls
| Feature | Responses | Completions | Gemini |
|---|---|---|---|
| Tool definitions input | ✅ OpenAI format → Gemini | ✅ OpenAI format → Gemini | ✅ Native Gemini format |
| Tool choice control | ✅ tool_choice | ✅ tool_choice | ✅ tool_config |
| Tool call output (streaming) | ✅ function_call items | ✅ tool_calls in delta | ✅ functionCall parts |
| Tool call output (sync) | ✅ function_call items | ✅ tool_calls in message | ✅ functionCall parts |
| Tool result input | ✅ function_call_output items | ✅ tool role messages | ✅ functionResponse in tool_results |
| MITM bypass (custom tools) | ✅ | ✅ | ✅ |
| Stale state protection | ✅ | ✅ | ✅ |
Session Management
| Feature | Responses | Completions | Gemini |
|---|---|---|---|
| Session/conversation reuse | ✅ conversation field | ❌ Not supported | ✅ conversation field |
| Session listing (GET /v1/sessions) | ✅ Shared | ✅ Shared | ✅ Shared |
| Session deletion | ✅ Shared | ✅ Shared | ✅ Shared |
Usage / Token Tracking
| Feature | Responses | Completions | Gemini |
|---|---|---|---|
| Usage in sync response | ✅ MITM real tokens | ✅ MITM real tokens | ✅ usageMetadata |
| Usage in streaming (final chunk) | ❌ Not emitted | ✅ stream_options.include_usage | ❌ Not emitted |
| reasoning_tokens in usage | ✅ In output_tokens_details | ✅ In completion_tokens_details | ✅ thoughtsTokenCount |
| Cache tokens | ✅ cached_tokens | ✅ cached_tokens | ✅ cachedContentTokenCount |
Detailed Endpoint Analysis
Responses API (/v1/responses)
Spec: OpenAI Responses API
Request Fields
| Field | Spec | Status | Implementation Details |
|---|---|---|---|
| model | Required | ✅ | Mapped to internal model enum via lookup_model() |
| input | Required | ✅ | String or array. Array supports message items and function_call_output items |
| instructions | Optional | ✅ | Prepended to user text as system instructions |
| stream | Optional | ✅ | SSE stream with response.* events |
| tools | Optional | ✅ | OpenAI function format → auto-converted to Gemini functionDeclarations via openai_tools_to_gemini() |
| tool_choice | Optional | ✅ | "auto", "required", "none", or {"type":"function","function":{"name":"X"}} → converted to Gemini functionCallingConfig |
| store | Optional | ✅ | Accepted, echoed in response. Not actually persisted. |
| temperature | Optional | ✅ | Forwarded to Google via MITM generationConfig injection. |
| top_p | Optional | ✅ | Forwarded to Google via MITM. |
| max_output_tokens | Optional | ✅ | Forwarded to Google via MITM. |
| previous_response_id | Optional | ✅ | Accepted, echoed. Not used for chaining (use conversation instead). |
| metadata | Optional | ✅ | Accepted, echoed back in response. |
| user | Optional | ✅ | Accepted, echoed. |
| conversation | Extension | ✅ | Proxy-specific: session ID for multi-turn cascade reuse. |
| timeout | Extension | ✅ | Proxy-specific: request timeout in seconds (default 120). |
| reasoning.effort | Optional | ❌ | Could map to model variant selection (e.g., "high" → Opus, "low" → Flash). |
| reasoning.generate_summary | Optional | ❌ | Not implemented. Could control thinking output inclusion. |
| truncation | Optional | ❌ | Not applicable: the LS manages the context window. |
| parallel_tool_calls | Optional | ✅ | Hardcoded true in response. |
Response Object
| Field | Spec | Status | Notes |
|---|---|---|---|
| id | Required | ✅ | resp_ + UUID |
| object | Required | ✅ | Always "response" |
| created_at | Required | ✅ | Unix timestamp |
| status | Required | ✅ | "completed" or "incomplete" |
| completed_at | Required | ✅ | Unix timestamp or null |
| error | Required | ✅ | null on success |
| incomplete_details | Required | ✅ | null |
| instructions | Required | ✅ | Echoed from request |
| max_output_tokens | Required | ✅ | Echoed or null |
| model | Required | ✅ | Model name string |
| output | Required | ✅ | Array of reasoning and/or message items |
| parallel_tool_calls | Required | ✅ | true |
| previous_response_id | Required | ✅ | Echoed or null |
| reasoning | Required | ✅ | {effort: null, summary: null} |
| store | Required | ✅ | Echoed |
| temperature | Required | ✅ | Echoed (default 1.0) |
| text | Required | ✅ | {format: {type: "text"}} |
| tool_choice | Required | ✅ | "auto" |
| tools | Required | ✅ | Echoed or [] |
| top_p | Required | ✅ | Echoed (default 1.0) |
| truncation | Required | ✅ | "disabled" |
| usage | Required | ✅ | MITM-intercepted real tokens when available, estimated otherwise |
| user | Required | ✅ | Echoed or null |
| metadata | Required | ✅ | Echoed or {} |
| thinking_signature | Extension | ✅ | Proxy-specific: opaque blob for multi-turn thinking chain |
Streaming Events
| Event | Spec | Status | Notes |
|---|---|---|---|
| response.created | Required | ✅ | Initial response shell |
| response.in_progress | Required | ✅ | |
| response.output_item.added | Required | ✅ | For reasoning + message items |
| response.content_part.added | Required | ✅ | |
| response.output_text.delta | Required | ✅ | Progressive text deltas |
| response.output_text.done | Required | ✅ | |
| response.content_part.done | Required | ✅ | |
| response.output_item.done | Required | ✅ | |
| response.completed | Required | ✅ | Final event with full response |
| response.reasoning_summary_text.delta | Required | ✅ | Progressive thinking deltas |
| response.reasoning_summary_text.done | Required | ✅ | |
| response.function_call_arguments.delta | For tools | ✅ | Tool call argument streaming |
| response.function_call_arguments.done | For tools | ✅ | |
Chat Completions API (/v1/chat/completions)
Spec: OpenAI Chat Completions API
Request Fields
| Field | Spec | Status | Implementation Details |
|---|---|---|---|
| model | Required | ✅ | Mapped to internal model enum |
| messages | Required | ✅ | Supports system, developer, user, assistant, tool roles |
| messages[].content | Required | ✅ | String or array of {type: "text", text: "..."} objects |
| messages[].tool_calls | Optional | ✅ | For assistant messages with tool calls |
| messages[].tool_call_id | Optional | ✅ | For tool result messages |
| stream | Optional | ✅ | SSE with chat.completion.chunk events |
| stream_options | Optional | ✅ | include_usage: true emits final usage chunk before [DONE] |
| tools | Optional | ✅ | OpenAI function format → auto-converted to Gemini |
| tool_choice | Optional | ✅ | "auto", "none", "required", or specific function |
| timeout | Extension | ✅ | Proxy-specific (default 120s) |
| temperature | Optional | ✅ | Forwarded to Google via MITM generationConfig.temperature |
| top_p | Optional | ✅ | Forwarded to Google via MITM generationConfig.topP |
| max_tokens | Optional | ✅ | Forwarded to Google via MITM generationConfig.maxOutputTokens |
| max_completion_tokens | Optional | ✅ | Forwarded (same as max_tokens, newer OpenAI param) |
| frequency_penalty | Optional | ✅ | Forwarded to Google via MITM generationConfig.frequencyPenalty |
| presence_penalty | Optional | ✅ | Forwarded to Google via MITM generationConfig.presencePenalty |
| user | Optional | ✅ | Accepted, not used |
| n | Optional | ❌ | N/A: single generation only |
| logprobs | Optional | ❌ | N/A |
| top_logprobs | Optional | ❌ | N/A |
| logit_bias | Optional | ❌ | N/A |
| response_format | Optional | ❌ | Could be useful for JSON mode |
| seed | Optional | ❌ | N/A |
| stop | Optional | ❌ | Could be forwarded as stopSequences |
Sync Response Object
| Field | Spec | Status | Notes |
|---|---|---|---|
| id | Required | ✅ | chatcmpl- + UUID |
| object | Required | ✅ | "chat.completion" |
| created | Required | ✅ | Unix timestamp |
| model | Required | ✅ | Model name |
| choices[0].index | Required | ✅ | 0 |
| choices[0].message.role | Required | ✅ | "assistant" |
| choices[0].message.content | Required | ✅ | Response text |
| choices[0].message.reasoning_content | Extension | ✅ | Thinking text (when model produces thinking) |
| choices[0].message.tool_calls | Conditional | ✅ | When model returns tool calls |
| choices[0].message.refusal | Optional | ❌ | Not implemented |
| choices[0].message.annotations | Optional | ❌ | Not implemented |
| choices[0].logprobs | Optional | ❌ | Not implemented |
| choices[0].finish_reason | Required | ✅ | "stop" or "tool_calls" |
| usage.prompt_tokens | Required | ✅ | MITM real or estimated |
| usage.completion_tokens | Required | ✅ | MITM real or estimated |
| usage.total_tokens | Required | ✅ | Sum |
| usage.prompt_tokens_details.cached_tokens | Optional | ✅ | MITM cache read tokens |
| usage.completion_tokens_details.reasoning_tokens | Optional | ✅ | MITM thinking token count |
| usage.completion_tokens_details.accepted_prediction_tokens | Optional | ❌ | N/A |
| usage.completion_tokens_details.rejected_prediction_tokens | Optional | ❌ | N/A |
| system_fingerprint | Deprecated | ❌ | Cosmetic, not needed |
| service_tier | Optional | ❌ | Cosmetic, not needed |
Streaming Chunk Object
| Field | Spec | Status | Notes |
|---|---|---|---|
| id | Required | ✅ | Same across all chunks |
| object | Required | ✅ | "chat.completion.chunk" |
| created | Required | ✅ | Same across all chunks |
| model | Required | ✅ | |
| choices[0].index | Required | ✅ | 0 |
| choices[0].delta.role | First chunk | ✅ | "assistant" in first chunk |
| choices[0].delta.content | Text chunks | ✅ | Progressive text deltas |
| choices[0].delta.reasoning_content | Thinking chunks | ✅ | Progressive thinking deltas |
| choices[0].delta.tool_calls | Tool chunks | ✅ | Tool call data |
| choices[0].delta | Final chunk | ✅ | Empty {} |
| choices[0].finish_reason | Final chunk | ✅ | "stop" or "tool_calls" |
| choices[0].logprobs | Optional | ❌ | Not implemented |
| usage (final chunk) | Optional | ✅ | Emitted when stream_options.include_usage is true |
| data: [DONE] | Required | ✅ | Stream termination signal |
Gemini API (/v1/gemini)
Spec: Custom endpoint loosely based on Gemini REST API
Note: This is NOT a 1:1 Gemini API replica. It's a simplified proxy-native endpoint that uses Gemini's functionDeclarations/functionCall/functionResponse format directly, avoiding OpenAI ↔ Gemini format conversion overhead.
Request Fields
| Field | Spec | Status | Implementation Details |
|---|---|---|---|
| model | Required | ✅ | Mapped to internal model enum |
| input | Required | ✅ | String only (no array/multipart) |
| tools | Optional | ✅ | Native Gemini [{functionDeclarations: [...]}] format |
| tool_config | Optional | ✅ | Native Gemini {functionCallingConfig: {mode: "AUTO"}} |
| tool_results | Optional | ✅ | Array of {functionResponse: {name, response}} |
| conversation | Optional | ✅ | Session ID for cascade reuse |
| stream | Optional | ✅ | SSE streaming |
| timeout | Optional | ✅ | Default 120s |
| temperature | Optional | ✅ | Forwarded to Google via MITM generationConfig.temperature |
| top_p / topP | Optional | ✅ | Forwarded to Google via MITM generationConfig.topP |
| top_k / topK | Optional | ✅ | Forwarded to Google via MITM generationConfig.topK |
| max_output_tokens / maxOutputTokens | Optional | ✅ | Forwarded via MITM generationConfig.maxOutputTokens |
| stop_sequences / stopSequences | Optional | ✅ | Forwarded via MITM generationConfig.stopSequences |
Sync Response Object
| Field | Spec | Status | Notes |
|---|---|---|---|
| candidates[0].content.parts | Required | ✅ | Array of text/functionCall parts |
| candidates[0].content.parts[].text | Text | ✅ | Response text |
| candidates[0].content.parts[].thought | Extension | ✅ | true for thinking parts |
| candidates[0].content.parts[].functionCall | Tool call | ✅ | {name, args} |
| candidates[0].content.role | Required | ✅ | "model" |
| candidates[0].finishReason | Required | ✅ | "STOP" |
| modelVersion | Required | ✅ | Model name string |
| usageMetadata | Optional | ✅ | MITM-intercepted token counts |
usageMetadata fields:
| Field | Status | Notes |
|---|---|---|
| promptTokenCount | ✅ | Input tokens |
| candidatesTokenCount | ✅ | Output tokens |
| totalTokenCount | ✅ | Input + output |
| thoughtsTokenCount | ✅ | Thinking/reasoning tokens |
| cachedContentTokenCount | ✅ | Cache read tokens |
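The usageMetadata fields listed above can be derived from four captured counters. The following is a minimal sketch, not the proxy's code: the `CapturedUsage` struct and its field names are hypothetical, and the total follows the "Input + output" rule stated in the table.

```rust
/// Hypothetical container for the token counts the MITM layer captures.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct CapturedUsage {
    pub input_tokens: u64,
    pub output_tokens: u64,
    pub thinking_tokens: u64,
    pub cache_read_tokens: u64,
}

impl CapturedUsage {
    /// totalTokenCount as documented in the table: input + output.
    pub fn total_tokens(&self) -> u64 {
        self.input_tokens + self.output_tokens
    }

    /// Renders the Gemini-style usageMetadata object as a JSON string.
    /// All values are integers, so no string escaping is needed.
    pub fn to_usage_metadata_json(&self) -> String {
        format!(
            "{{\"promptTokenCount\":{},\"candidatesTokenCount\":{},\"totalTokenCount\":{},\"thoughtsTokenCount\":{},\"cachedContentTokenCount\":{}}}",
            self.input_tokens,
            self.output_tokens,
            self.total_tokens(),
            self.thinking_tokens,
            self.cache_read_tokens
        )
    }
}
```

The same four counters would also feed the OpenAI-style usage objects (prompt_tokens, completion_tokens, reasoning_tokens, cached_tokens) on the other two endpoints.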
Streaming Format
Each SSE data: chunk is a complete Gemini-format JSON object with progressive candidates[0].content.parts:
data: {"candidates":[{"content":{"parts":[{"text":"thinking...","thought":true}],"role":"model"}}],"modelVersion":"opus-4.6"}
data: {"candidates":[{"content":{"parts":[{"text":"Hello!"}],"role":"model"}}],"modelVersion":"opus-4.6"}
data: {"candidates":[{"content":{"parts":[{"text":""}],"role":"model"},"finishReason":"STOP"}],"modelVersion":"opus-4.6"}
data: [DONE]
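To make the chunk layout concrete, here is a small sketch that produces lines in the format shown above. It is illustrative only: `gemini_sse_chunk` and `json_escape` are hypothetical helpers, and a real implementation would more likely serialize a typed struct with a JSON library.

```rust
/// Escapes a string for embedding inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds one SSE `data:` line carrying a single text part.
/// `thought` marks a thinking part; `finished` adds finishReason "STOP".
pub fn gemini_sse_chunk(text: &str, thought: bool, finished: bool, model: &str) -> String {
    let thought_field = if thought { ",\"thought\":true" } else { "" };
    let finish_field = if finished { ",\"finishReason\":\"STOP\"" } else { "" };
    format!(
        "data: {{\"candidates\":[{{\"content\":{{\"parts\":[{{\"text\":\"{}\"{}}}],\"role\":\"model\"}}{}}}],\"modelVersion\":\"{}\"}}",
        json_escape(text),
        thought_field,
        finish_field,
        json_escape(model)
    )
}
```

A stream would then be a sequence of such lines, each followed by a blank line, ending with the literal `data: [DONE]` terminator.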
Priority Gaps
🔴 High Priority — RESOLVED ✅
All high-priority gaps have been addressed:
- Completions: stream_options.include_usage → ✅ Implemented
- Completions: completion_tokens_details.reasoning_tokens → ✅ Implemented
- Completions: Accept temperature, top_p, max_tokens → ✅ Forwarded via MITM
- Gemini: usageMetadata → ✅ Implemented
🟡 Medium Priority
- Responses: reasoning.effort
  - What: Map reasoning effort levels ("high", "medium", "low") to model variant selection
  - Why: Could automatically select Opus vs Flash based on reasoning needs
  - Effort: Medium; needs model selection logic changes
- Completions: Session/conversation support
  - What: Add session reuse similar to Responses and Gemini endpoints
  - Why: Would allow multi-turn conversations via the completions API
  - Effort: Medium; need a way to pass session ID (maybe via user field or custom header)
- Completions: stop sequences
  - What: Forward stop to Google as stopSequences in generationConfig
  - Why: Some clients use stop sequences to control generation
  - Effort: Trivial; just add to CompletionRequest and GenerationParams
- Completions: response_format (JSON mode)
  - What: Forward response_format: {"type": "json_object"} to Google's responseMimeType
  - Why: Useful for structured output
  - Effort: Low; inject responseMimeType: "application/json" in generationConfig
🟢 Low Priority
Cosmetic or not applicable to our architecture:
- system_fingerprint: OpenAI-specific field, meaningless for our proxy
- service_tier: OpenAI billing concept, not applicable
- n > 1: Multiple completions per request; our backend only generates one
- logprobs: Would require token-level access we don't have
- seed: Deterministic sampling not controllable through our proxy
Architecture Notes
Generation Parameter Injection
Client-specified sampling parameters are forwarded to Google's API via the MITM request modification pipeline:
Client sends temperature=0.5 → API handler stores in MitmStore.generation_params
↓
LS sends request to Google API
↓
MITM intercepts request
↓
modify_request() reads generation_params
↓
Injects into request.generationConfig:
temperature, topP, topK, maxOutputTokens,
stopSequences, frequencyPenalty, presencePenalty
↓
Forwards modified request to Google
This approach overrides whatever defaults the LS sets, giving clients direct control over sampling parameters.
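The override behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions: the field types of `GenerationParams` are guesses, `inject_into` is a hypothetical helper rather than the real modify_request(), and a `BTreeMap` of pre-rendered JSON values stands in for the actual request.generationConfig object.

```rust
use std::collections::BTreeMap;

/// Hypothetical shape of the params kept in MitmStore (types are assumed).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct GenerationParams {
    pub temperature: Option<f64>,
    pub top_p: Option<f64>,
    pub top_k: Option<u32>,
    pub max_output_tokens: Option<u32>,
    pub stop_sequences: Option<Vec<String>>,
    pub frequency_penalty: Option<f64>,
    pub presence_penalty: Option<f64>,
}

impl GenerationParams {
    /// Overwrites `config` entries (camelCase key, JSON-encoded value) for
    /// every parameter the client set; unset ones keep the LS default.
    pub fn inject_into(&self, config: &mut BTreeMap<String, String>) {
        let mut put = |key: &str, value: Option<String>| {
            if let Some(v) = value {
                config.insert(key.to_string(), v);
            }
        };
        put("temperature", self.temperature.map(|v| v.to_string()));
        put("topP", self.top_p.map(|v| v.to_string()));
        put("topK", self.top_k.map(|v| v.to_string()));
        put("maxOutputTokens", self.max_output_tokens.map(|v| v.to_string()));
        put("frequencyPenalty", self.frequency_penalty.map(|v| v.to_string()));
        put("presencePenalty", self.presence_penalty.map(|v| v.to_string()));
        // Debug quoting matches JSON for plain ASCII text only; a real
        // implementation would use a JSON library here.
        put(
            "stopSequences",
            self.stop_sequences.as_ref().map(|seqs| {
                let quoted: Vec<String> = seqs.iter().map(|s| format!("{:?}", s)).collect();
                format!("[{}]", quoted.join(","))
            }),
        );
    }
}
```

Using `Option` for every field keeps "not specified by the client" distinct from an explicit value, which is what lets LS defaults survive when a parameter is omitted.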
Dual Path Architecture
All three endpoints share a dual-path architecture:
                          ┌─────────────────┐
                          │   Has custom    │
    Request ────────────► │     tools?      │
                          └────────┬────────┘
                                   │
                       ┌──── Yes ──┴── No ─────┐
                       │                       │
                 ┌─────▼─────┐           ┌─────▼─────┐
                 │   MITM    │           │ LS Steps  │
                 │  Bypass   │           │  Polling  │
                 │   Path    │           │   Path    │
                 └─────┬─────┘           └─────┬─────┘
                       │                       │
                 ┌─────▼─────┐           ┌─────▼─────┐
                 │   Poll    │           │   Poll    │
                 │ MitmStore │           │ get_steps │
                 │ directly  │           │  from LS  │
                 └─────┬─────┘           └─────┬─────┘
                       │                       │
                       └───────────┬───────────┘
                                   │
                             ┌─────▼─────┐
                             │ Response  │
                             │ to client │
                             └───────────┘
- Bypass Path: When custom tools are present, the handler polls MitmStore directly for response text, thinking text, and function calls. The MITM proxy captures these from the Google API response before the LS processes them.
- LS Path: When no custom tools are present, the handler polls the LS's get_steps API for progressive response data (text, thinking, status).
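The branch between the two paths reduces to a single check on whether the request carries custom tools. A minimal sketch, assuming a hypothetical `ResponsePath` enum and representing tools by name only (the handler's real types are not shown in this document):

```rust
/// Which polling strategy a handler uses for a request (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResponsePath {
    /// Custom tools present: read text, thinking, and function calls from MitmStore.
    MitmBypass,
    /// No custom tools: poll the LS get_steps API.
    LsSteps,
}

impl ResponsePath {
    /// Chooses the path from the custom tool names attached to the request.
    pub fn select(custom_tools: &[&str]) -> Self {
        if custom_tools.is_empty() {
            ResponsePath::LsSteps
        } else {
            ResponsePath::MitmBypass
        }
    }

    /// Names the data source that the chosen path polls.
    pub fn poll_source(self) -> &'static str {
        match self {
            ResponsePath::MitmBypass => "MitmStore",
            ResponsePath::LsSteps => "LS get_steps",
        }
    }
}
```

Because the decision depends only on the request, all three endpoints can share it regardless of their wire format.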
Stale State Protection
All bypass paths include protection against stale response_complete flags from previous requests:
if complete && text.is_empty() && thinking.is_none() {
    warn!("stale response_complete detected — clearing");
    state.mitm_store.clear_response_async().await;
    continue; // or retry
}
This handles the race condition where a previous request's MITM handler calls mark_response_complete() after the new request has already called clear_response_async().
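The check can be exercised in isolation with a synchronous stand-in for the polling loop. This is a sketch under assumptions: `Snapshot`, `is_stale`, and `first_fresh_completion` are hypothetical names, and an iterator of snapshots replaces the real async polling of MitmStore.

```rust
/// One observation of the store during polling (hypothetical type).
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Snapshot {
    pub complete: bool,
    pub text: String,
    pub thinking: Option<String>,
}

/// Same condition as the guard above: complete, but nothing was captured.
pub fn is_stale(s: &Snapshot) -> bool {
    s.complete && s.text.is_empty() && s.thinking.is_none()
}

/// Walks successive polls, skipping stale completion flags, and returns the
/// first genuine completion together with the number of stale flags seen.
pub fn first_fresh_completion<I>(polls: I) -> (Option<Snapshot>, usize)
where
    I: IntoIterator<Item = Snapshot>,
{
    let mut cleared = 0;
    for snap in polls {
        if is_stale(&snap) {
            // The real handler would clear the store here before polling again.
            cleared += 1;
            continue;
        }
        if snap.complete {
            return (Some(snap), cleared);
        }
    }
    (None, cleared)
}
```

Treating an empty completed response as stale rather than final is what lets the new request recover when the late mark_response_complete() call lands after its own clear.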
Tool Format Conversion
OpenAI tools ──► openai_tools_to_gemini() ──► Gemini functionDeclarations
│
MitmStore.set_tools()
│
MITM proxy injects into
outgoing LS request
The Gemini endpoint skips this conversion entirely — tools are stored in native Gemini format.
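The conversion step for the two OpenAI-format endpoints can be sketched with plain structs. This is not the actual openai_tools_to_gemini() implementation: the type names and `convert_tools` are hypothetical, and parameter schemas are kept as raw JSON text so the example needs no JSON library.

```rust
/// The inner `function` object of an OpenAI tool definition (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct OpenAiFunctionTool {
    pub name: String,
    pub description: Option<String>,
    /// JSON Schema for the arguments, passed through unchanged.
    pub parameters_json: String,
}

/// One entry of a Gemini functionDeclarations array (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct FunctionDeclaration {
    pub name: String,
    pub description: String,
    pub parameters_json: String,
}

/// A Gemini tool object: {functionDeclarations: [...]}.
#[derive(Debug, Clone, PartialEq)]
pub struct GeminiTool {
    pub function_declarations: Vec<FunctionDeclaration>,
}

/// Groups all OpenAI function tools into a single Gemini tool object.
/// Returns an empty list when the request has no tools.
pub fn convert_tools(tools: &[OpenAiFunctionTool]) -> Vec<GeminiTool> {
    if tools.is_empty() {
        return Vec::new();
    }
    let function_declarations = tools
        .iter()
        .map(|t| FunctionDeclaration {
            name: t.name.clone(),
            description: t.description.clone().unwrap_or_default(),
            parameters_json: t.parameters_json.clone(),
        })
        .collect();
    vec![GeminiTool { function_declarations }]
}
```

The structural difference is that OpenAI lists one object per function while Gemini nests every declaration inside one tool object, which is why the output has at most one element.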