zerogravity/docs/endpoint-gap-analysis.md
Nikketryhard b1bd57ab5e feat: forward generation params via MITM + add usageMetadata to Gemini
- Add GenerationParams struct to MitmStore for temperature, top_p,
  top_k, max_output_tokens, stop_sequences, frequency/presence_penalty
- MITM modify_request injects params into request.generationConfig
- All 3 endpoints (Completions, Responses, Gemini) store client params
- Add usageMetadata to Gemini sync responses (promptTokenCount,
  candidatesTokenCount, totalTokenCount, thoughtsTokenCount)
- Add generation param fields to GeminiRequest (temperature, topP, etc.)
- Completions stream_options.include_usage emits final usage chunk
- Completions reasoning_tokens in completion_tokens_details
- Update endpoint gap analysis doc (all high-priority gaps resolved)
2026-02-15 14:23:05 -06:00

Endpoint Gap Analysis

Generated: 2026-02-15 (updated)
Proxy Version: 3.1.0
Scope: All three API endpoints vs official OpenAI / Gemini specifications


Endpoint Overview

The proxy exposes three main API endpoints, each serving different client ecosystems:

| Endpoint | Protocol | Primary Clients | Spec Reference |
| --- | --- | --- | --- |
| POST /v1/responses | OpenAI Responses API | Claude Code, Antigravity-native clients | platform.openai.com/docs/api-reference/responses |
| POST /v1/chat/completions | OpenAI Chat Completions API | OpenCode, Vercel AI SDK, any OpenAI-compatible client | platform.openai.com/docs/api-reference/chat |
| POST /v1/gemini | Custom Gemini-native API | Direct Gemini-format consumers | ai.google.dev/api (loosely based) |

All three endpoints share the same backend pipeline:

Client Request → Proxy Endpoint → LS (Language Server) → Google API
                                ↓
                          MITM Proxy (captures real usage + injects generation params + tool calls)

Feature Parity Matrix

Core Features

Feature Responses Completions Gemini
Sync mode
Streaming mode (SSE)
Model selection
Model validation
Auth check (OAuth)
Timeout control

Generation Parameters (MITM-injected)

Feature Responses Completions Gemini
temperature
top_p / topP
top_k / topK
max_output_tokens
stop_sequences
frequency_penalty
presence_penalty

Note: All generation parameters are forwarded to Google's API via MITM injection into request.generationConfig. They override the LS defaults.

Thinking / Reasoning

| Feature | Responses | Completions | Gemini |
| --- | --- | --- | --- |
| Thinking, LS path (streaming) | reasoning_summary_text.delta | reasoning_content delta | thought: true part |
| Thinking, LS path (sync) | reasoning output item | reasoning_content in message | thought: true part |
| Thinking, Bypass path (streaming) | | | |
| Thinking, Bypass path (sync) | | | |
| Thinking signature (multi-turn) | thinking_signature field | Not applicable | Not applicable |

Tool Calls

| Feature | Responses | Completions | Gemini |
| --- | --- | --- | --- |
| Tool definitions input | OpenAI format → Gemini | OpenAI format → Gemini | Native Gemini format |
| Tool choice control | tool_choice | tool_choice | tool_config |
| Tool call output (streaming) | function_call items | tool_calls in delta | functionCall parts |
| Tool call output (sync) | function_call items | tool_calls in message | functionCall parts |
| Tool result input | function_call_output items | tool role messages | functionResponse in tool_results |
| MITM bypass (custom tools) | | | |
| Stale state protection | | | |

Session Management

| Feature | Responses | Completions | Gemini |
| --- | --- | --- | --- |
| Session/conversation reuse | conversation field | Not supported | conversation field |
| Session listing (GET /v1/sessions) | Shared | Shared | Shared |
| Session deletion | Shared | Shared | Shared |

Usage / Token Tracking

| Feature | Responses | Completions | Gemini |
| --- | --- | --- | --- |
| Usage in sync response | MITM real tokens | MITM real tokens | usageMetadata |
| Usage in streaming (final chunk) | Not emitted | stream_options.include_usage | Not emitted |
| reasoning_tokens in usage | In output_tokens_details | In completion_tokens_details | thoughtsTokenCount |
| Cache tokens | cached_tokens | cached_tokens | cachedContentTokenCount |
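
The usage rows above imply a direct field mapping between Gemini's usageMetadata and the Completions usage block. A minimal sketch of that mapping follows; the struct and function names are hypothetical and not taken from the proxy's source, and the total is computed as input plus output, as the tables in this document describe.

```rust
/// Token counts in Gemini's usageMetadata shape.
/// Field names follow the usageMetadata rows in this document.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct UsageMetadata {
    pub prompt_token_count: u64,
    pub candidates_token_count: u64,
    pub thoughts_token_count: u64,
    pub cached_content_token_count: u64,
}

/// Chat Completions style usage block (hypothetical struct for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CompletionsUsage {
    pub prompt_tokens: u64,
    pub completion_tokens: u64,
    pub total_tokens: u64,
    /// Reported under prompt_tokens_details.cached_tokens.
    pub cached_tokens: u64,
    /// Reported under completion_tokens_details.reasoning_tokens.
    pub reasoning_tokens: u64,
}

/// Map Gemini-style counts onto the Completions usage fields.
pub fn to_completions_usage(m: &UsageMetadata) -> CompletionsUsage {
    CompletionsUsage {
        prompt_tokens: m.prompt_token_count,
        completion_tokens: m.candidates_token_count,
        // Assumption: total is input + output, matching the "Sum" note.
        total_tokens: m.prompt_token_count + m.candidates_token_count,
        cached_tokens: m.cached_content_token_count,
        reasoning_tokens: m.thoughts_token_count,
    }
}
```

Keeping one internal representation and converting at the edge would let all three endpoints report the same MITM-captured numbers under their own field names.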

Detailed Endpoint Analysis

Responses API (/v1/responses)

Spec: OpenAI Responses API

Request Fields

Field Spec Status Implementation Details
model Required Mapped to internal model enum via lookup_model()
input Required String or array. Array supports message items and function_call_output items
instructions Optional Prepended to user text as system instructions
stream Optional SSE stream with response.* events
tools Optional OpenAI function format → auto-converted to Gemini functionDeclarations via openai_tools_to_gemini()
tool_choice Optional "auto", "required", "none", or {"type":"function","function":{"name":"X"}} → converted to Gemini functionCallingConfig
store Optional Accepted, echoed in response. Not actually persisted.
temperature Optional Forwarded to Google via MITM generationConfig injection.
top_p Optional Forwarded to Google via MITM.
max_output_tokens Optional Forwarded to Google via MITM.
previous_response_id Optional Accepted, echoed. Not used for chaining (use conversation instead).
metadata Optional Accepted, echoed back in response.
user Optional Accepted, echoed.
conversation Extension Proxy-specific: session ID for multi-turn cascade reuse.
timeout Extension Proxy-specific: request timeout in seconds (default 120).
reasoning.effort Optional Could map to model variant selection (e.g., "high" → Opus, "low" → Flash).
reasoning.generate_summary Optional Not implemented. Could control thinking output inclusion.
truncation Optional Not applicable — LS manages context window.
parallel_tool_calls Optional Hardcoded true in response.
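
The tool_choice row above says the OpenAI value is converted to a Gemini functionCallingConfig. One plausible mapping is sketched below. The mode names AUTO, ANY, and NONE and the allowedFunctionNames field come from the public Gemini API; the Rust types and the exact mapping the proxy applies are assumptions.

```rust
/// Parsed form of the OpenAI `tool_choice` request field (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum ToolChoice {
    Auto,
    Required,
    None,
    /// {"type":"function","function":{"name":"X"}}
    Function(String),
}

/// Subset of Gemini's functionCallingConfig used by this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct FunctionCallingConfig {
    pub mode: &'static str,
    pub allowed_function_names: Vec<String>,
}

/// Assumed mapping: "required" becomes ANY, and a named function becomes
/// ANY restricted to that single function name.
pub fn tool_choice_to_gemini(choice: &ToolChoice) -> FunctionCallingConfig {
    match choice {
        ToolChoice::Auto => FunctionCallingConfig {
            mode: "AUTO",
            allowed_function_names: Vec::new(),
        },
        ToolChoice::Required => FunctionCallingConfig {
            mode: "ANY",
            allowed_function_names: Vec::new(),
        },
        ToolChoice::None => FunctionCallingConfig {
            mode: "NONE",
            allowed_function_names: Vec::new(),
        },
        ToolChoice::Function(name) => FunctionCallingConfig {
            mode: "ANY",
            allowed_function_names: vec![name.clone()],
        },
    }
}
```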

Response Object

Field Spec Status Notes
id Required resp_ + UUID
object Required Always "response"
created_at Required Unix timestamp
status Required "completed" or "incomplete"
completed_at Required Unix timestamp or null
error Required null on success
incomplete_details Required null
instructions Required Echoed from request
max_output_tokens Required Echoed or null
model Required Model name string
output Required Array of reasoning and/or message items
parallel_tool_calls Required true
previous_response_id Required Echoed or null
reasoning Required {effort: null, summary: null}
store Required Echoed
temperature Required Echoed (default 1.0)
text Required {format: {type: "text"}}
tool_choice Required "auto"
tools Required Echoed or []
top_p Required Echoed (default 1.0)
truncation Required "disabled"
usage Required MITM-intercepted real tokens when available, estimated otherwise
user Required Echoed or null
metadata Required Echoed or {}
thinking_signature Extension Proxy-specific: opaque blob for multi-turn thinking chain

Streaming Events

Event Spec Status Notes
response.created Required Initial response shell
response.in_progress Required
response.output_item.added Required For reasoning + message items
response.content_part.added Required
response.output_text.delta Required Progressive text deltas
response.output_text.done Required
response.content_part.done Required
response.output_item.done Required
response.completed Required Final event with full response
response.reasoning_summary_text.delta Required Progressive thinking deltas
response.reasoning_summary_text.done Required
response.function_call_arguments.delta For tools Tool call argument streaming
response.function_call_arguments.done For tools

Chat Completions API (/v1/chat/completions)

Spec: OpenAI Chat Completions API

Request Fields

Field Spec Status Implementation Details
model Required Mapped to internal model enum
messages Required Supports system, developer, user, assistant, tool roles
messages[].content Required String or array of {type: "text", text: "..."} objects
messages[].tool_calls Optional For assistant messages with tool calls
messages[].tool_call_id Optional For tool result messages
stream Optional SSE with chat.completion.chunk events
stream_options Optional include_usage: true emits final usage chunk before [DONE]
tools Optional OpenAI function format → auto-converted to Gemini
tool_choice Optional "auto", "none", "required", or specific function
timeout Extension Proxy-specific (default 120s)
temperature Optional Forwarded to Google via MITM generationConfig.temperature
top_p Optional Forwarded to Google via MITM generationConfig.topP
max_tokens Optional Forwarded to Google via MITM generationConfig.maxOutputTokens
max_completion_tokens Optional Forwarded (same as max_tokens, newer OpenAI param)
frequency_penalty Optional Forwarded to Google via MITM generationConfig.frequencyPenalty
presence_penalty Optional Forwarded to Google via MITM generationConfig.presencePenalty
user Optional Accepted, not used
n Optional N/A — single generation only
logprobs Optional N/A
top_logprobs Optional N/A
logit_bias Optional N/A
response_format Optional Could be useful for JSON mode
seed Optional N/A
stop Optional Could be forwarded as stopSequences

Sync Response Object

Field Spec Status Notes
id Required chatcmpl- + UUID
object Required "chat.completion"
created Required Unix timestamp
model Required Model name
choices[0].index Required 0
choices[0].message.role Required "assistant"
choices[0].message.content Required Response text
choices[0].message.reasoning_content Extension Thinking text (when model produces thinking)
choices[0].message.tool_calls Conditional When model returns tool calls
choices[0].message.refusal Optional Not implemented
choices[0].message.annotations Optional Not implemented
choices[0].logprobs Optional Not implemented
choices[0].finish_reason Required "stop" or "tool_calls"
usage.prompt_tokens Required MITM real or estimated
usage.completion_tokens Required MITM real or estimated
usage.total_tokens Required Sum
usage.prompt_tokens_details.cached_tokens Optional MITM cache read tokens
usage.completion_tokens_details.reasoning_tokens Optional MITM thinking token count
usage.completion_tokens_details.accepted_prediction_tokens Optional N/A
usage.completion_tokens_details.rejected_prediction_tokens Optional N/A
system_fingerprint Deprecated Cosmetic, not needed
service_tier Optional Cosmetic, not needed

Streaming Chunk Object

Field Spec Status Notes
id Required Same across all chunks
object Required "chat.completion.chunk"
created Required Same across all chunks
model Required
choices[0].index Required 0
choices[0].delta.role First chunk "assistant" in first chunk
choices[0].delta.content Text chunks Progressive text deltas
choices[0].delta.reasoning_content Thinking chunks Progressive thinking deltas
choices[0].delta.tool_calls Tool chunks Tool call data
choices[0].delta Final chunk Empty {}
choices[0].finish_reason Final chunk "stop" or "tool_calls"
choices[0].logprobs Optional Not implemented
usage (final chunk) Optional Emitted when stream_options.include_usage is true
data: [DONE] Required Stream termination signal
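
The end of a Completions stream, as described in the table above, can be sketched as follows: an optional usage chunk when stream_options.include_usage is true, then the [DONE] terminator. The function and struct names are hypothetical. The empty choices array in the usage chunk follows the OpenAI spec. The sketch assumes id and model contain no characters that need JSON escaping.

```rust
/// Token counts for the final usage chunk (hypothetical struct).
pub struct Usage {
    pub prompt_tokens: u64,
    pub completion_tokens: u64,
    pub reasoning_tokens: u64,
}

/// Build the SSE frames that close a stream. `usage` is Some only when the
/// client sent stream_options.include_usage = true.
pub fn closing_frames(id: &str, created: u64, model: &str, usage: Option<&Usage>) -> Vec<String> {
    let mut frames = Vec::new();
    if let Some(u) = usage {
        let total = u.prompt_tokens + u.completion_tokens;
        // Usage object, including the reasoning token breakdown.
        let usage_json = format!(
            "{{\"prompt_tokens\":{},\"completion_tokens\":{},\"total_tokens\":{},\"completion_tokens_details\":{{\"reasoning_tokens\":{}}}}}",
            u.prompt_tokens, u.completion_tokens, total, u.reasoning_tokens
        );
        // The usage chunk carries an empty `choices` array.
        frames.push(format!(
            "data: {{\"id\":\"{}\",\"object\":\"chat.completion.chunk\",\"created\":{},\"model\":\"{}\",\"choices\":[],\"usage\":{}}}\n\n",
            id, created, model, usage_json
        ));
    }
    // Stream termination signal, always last.
    frames.push("data: [DONE]\n\n".to_string());
    frames
}
```

Clients that do not request usage see only the terminator, so the extra chunk stays opt-in.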

Gemini API (/v1/gemini)

Spec: Custom endpoint loosely based on Gemini REST API

Note: This is NOT a 1:1 Gemini API replica. It's a simplified proxy-native endpoint that uses Gemini's functionDeclarations / functionCall / functionResponse format directly, avoiding OpenAI ↔ Gemini format conversion overhead.

Request Fields

Field Spec Status Implementation Details
model Required Mapped to internal model enum
input Required String only (no array/multipart)
tools Optional Native Gemini [{functionDeclarations: [...]}] format
tool_config Optional Native Gemini {functionCallingConfig: {mode: "AUTO"}}
tool_results Optional Array of {functionResponse: {name, response}}
conversation Optional Session ID for cascade reuse
stream Optional SSE streaming
timeout Optional Default 120s
temperature Optional Forwarded to Google via MITM generationConfig.temperature
top_p / topP Optional Forwarded to Google via MITM generationConfig.topP
top_k / topK Optional Forwarded to Google via MITM generationConfig.topK
max_output_tokens / maxOutputTokens Optional Forwarded via MITM generationConfig.maxOutputTokens
stop_sequences / stopSequences Optional Forwarded via MITM generationConfig.stopSequences

Sync Response Object

Field Spec Status Notes
candidates[0].content.parts Required Array of text/functionCall parts
candidates[0].content.parts[].text Text Response text
candidates[0].content.parts[].thought Extension true for thinking parts
candidates[0].content.parts[].functionCall Tool call {name, args}
candidates[0].content.role Required "model"
candidates[0].finishReason Required "STOP"
modelVersion Required Model name string
usageMetadata Optional MITM-intercepted token counts

usageMetadata fields:

Field Status Notes
promptTokenCount Input tokens
candidatesTokenCount Output tokens
totalTokenCount Input + output
thoughtsTokenCount Thinking/reasoning tokens
cachedContentTokenCount Cache read tokens

Streaming Format

Each SSE data: chunk is a complete Gemini-format JSON object with progressive candidates[0].content.parts:

data: {"candidates":[{"content":{"parts":[{"text":"thinking...","thought":true}],"role":"model"}}],"modelVersion":"opus-4.6"}

data: {"candidates":[{"content":{"parts":[{"text":"Hello!"}],"role":"model"}}],"modelVersion":"opus-4.6"}

data: {"candidates":[{"content":{"parts":[{"text":""}],"role":"model"},"finishReason":"STOP"}],"modelVersion":"opus-4.6"}

data: [DONE]
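
A small sketch of how one such data: line could be assembled without a JSON library. Both function names are hypothetical, and the field order simply mirrors the sample chunks above.

```rust
/// Escape a string for embedding inside a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build one SSE `data:` line in the Gemini-style chunk format shown above.
/// `thought` marks a thinking part; `finished` adds finishReason "STOP".
pub fn gemini_chunk(text: &str, thought: bool, finished: bool, model: &str) -> String {
    let thought_field = if thought { ",\"thought\":true" } else { "" };
    let finish_field = if finished { ",\"finishReason\":\"STOP\"" } else { "" };
    format!(
        "data: {{\"candidates\":[{{\"content\":{{\"parts\":[{{\"text\":\"{}\"{}}}],\"role\":\"model\"}}{}}}],\"modelVersion\":\"{}\"}}",
        escape_json(text),
        thought_field,
        finish_field,
        escape_json(model)
    )
}
```

Calling gemini_chunk("Hello!", false, false, "opus-4.6") reproduces the second sample line above.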

Priority Gaps

🔴 High Priority — RESOLVED

All high-priority gaps have been addressed:

  1. Completions: stream_options.include_usage (implemented)
  2. Completions: completion_tokens_details.reasoning_tokens (implemented)
  3. Completions: Accept temperature, top_p, max_tokens (forwarded via MITM)
  4. Gemini: usageMetadata (implemented)

🟡 Medium Priority

  1. Responses: reasoning.effort

    • What: Map reasoning effort levels ("high", "medium", "low") to model variant selection
    • Why: Could automatically select Opus vs Flash based on reasoning needs
    • Effort: Medium — needs model selection logic changes
  2. Completions: Session/conversation support

    • What: Add session reuse similar to Responses and Gemini endpoints
    • Why: Would allow multi-turn conversations via the completions API
    • Effort: Medium — need a way to pass session ID (maybe via user field or custom header)
  3. Completions: stop sequences

    • What: Forward stop to Google as stopSequences in generationConfig
    • Why: Some clients use stop sequences to control generation
    • Effort: Trivial. Add stop to CompletionRequest and map it onto the stop_sequences field that GenerationParams already carries
  4. Completions: response_format (JSON mode)

    • What: Forward response_format: {"type": "json_object"} to Google's responseMimeType
    • Why: Useful for structured output
    • Effort: Low — inject responseMimeType: "application/json" in generationConfig
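
Items 3 and 4 above are small request-side translations. A sketch of both follows, with hypothetical type and function names. OpenAI's stop accepts either a single string or an array, so it is normalized to a list before forwarding. Only the json_object format is mapped here; treating every other format as "no override" is an assumption.

```rust
/// OpenAI `stop` accepts either one string or an array of strings.
#[derive(Debug, Clone, PartialEq)]
pub enum Stop {
    One(String),
    Many(Vec<String>),
}

/// Normalize `stop` into a list suitable for generationConfig.stopSequences.
/// Returns None when there is nothing to forward.
pub fn stop_to_sequences(stop: Option<Stop>) -> Option<Vec<String>> {
    match stop {
        None => None,
        Some(Stop::One(s)) => Some(vec![s]),
        Some(Stop::Many(v)) if v.is_empty() => None,
        Some(Stop::Many(v)) => Some(v),
    }
}

/// Map response_format.type to a responseMimeType value.
/// Only "json_object" is handled; anything else leaves the default in place.
pub fn response_mime_type(format_type: Option<&str>) -> Option<&'static str> {
    match format_type {
        Some("json_object") => Some("application/json"),
        _ => None,
    }
}
```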

🟢 Low Priority

Cosmetic or not applicable to our architecture:

  1. system_fingerprint — OpenAI-specific field, meaningless for our proxy
  2. service_tier — OpenAI billing concept, not applicable
  3. n > 1 — Multiple completions per request; our backend only generates one
  4. logprobs — Would require token-level access we don't have
  5. seed — Deterministic sampling not controllable through our proxy

Architecture Notes

Generation Parameter Injection

Client-specified sampling parameters are forwarded to Google's API via the MITM request modification pipeline:

Client sends temperature=0.5  →  API handler stores in MitmStore.generation_params
                                        ↓
                                  LS sends request to Google API
                                        ↓
                                  MITM intercepts request
                                        ↓
                                  modify_request() reads generation_params
                                        ↓
                                  Injects into request.generationConfig:
                                    temperature, topP, topK, maxOutputTokens,
                                    stopSequences, frequencyPenalty, presencePenalty
                                        ↓
                                  Forwards modified request to Google

This approach overrides whatever defaults the LS sets, giving clients direct control over sampling parameters.
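
The injection step in the flow above can be sketched as a merge in which client values win and untouched keys keep the LS defaults. The field list matches the parameters named in this document. The ConfigValue enum, the map-based generationConfig, and the function signature are simplifying assumptions, not the proxy's actual types.

```rust
use std::collections::BTreeMap;

/// Client-supplied sampling parameters; None means "leave the LS default".
#[derive(Debug, Clone, Default, PartialEq)]
pub struct GenerationParams {
    pub temperature: Option<f64>,
    pub top_p: Option<f64>,
    pub top_k: Option<u32>,
    pub max_output_tokens: Option<u32>,
    pub stop_sequences: Option<Vec<String>>,
    pub frequency_penalty: Option<f64>,
    pub presence_penalty: Option<f64>,
}

/// Simplified stand-in for a JSON value inside generationConfig.
#[derive(Debug, Clone, PartialEq)]
pub enum ConfigValue {
    Float(f64),
    Int(u32),
    Strings(Vec<String>),
}

/// Overwrite generationConfig keys with any values the client provided.
pub fn inject_generation_params(
    config: &mut BTreeMap<&'static str, ConfigValue>,
    params: &GenerationParams,
) {
    if let Some(v) = params.temperature {
        config.insert("temperature", ConfigValue::Float(v));
    }
    if let Some(v) = params.top_p {
        config.insert("topP", ConfigValue::Float(v));
    }
    if let Some(v) = params.top_k {
        config.insert("topK", ConfigValue::Int(v));
    }
    if let Some(v) = params.max_output_tokens {
        config.insert("maxOutputTokens", ConfigValue::Int(v));
    }
    if let Some(v) = &params.stop_sequences {
        config.insert("stopSequences", ConfigValue::Strings(v.clone()));
    }
    if let Some(v) = params.frequency_penalty {
        config.insert("frequencyPenalty", ConfigValue::Float(v));
    }
    if let Some(v) = params.presence_penalty {
        config.insert("presencePenalty", ConfigValue::Float(v));
    }
}
```

Using Option for every field keeps "not specified" distinct from an explicit zero, which matters for parameters such as temperature where 0 is a meaningful value.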

Dual Path Architecture

All three endpoints share a dual-path architecture:

                      ┌─────────────────┐
                      │  Has custom      │
Request ────────────► │  tools?          │
                      └────────┬────────┘
                               │
                    ┌──── Yes ──┴── No ────┐
                    │                       │
              ┌─────▼─────┐          ┌─────▼─────┐
              │  MITM      │          │  LS Steps  │
              │  Bypass    │          │  Polling   │
              │  Path      │          │  Path      │
              └─────┬─────┘          └─────┬─────┘
                    │                       │
              ┌─────▼─────┐          ┌─────▼─────┐
              │  Poll      │          │  Poll      │
              │  MitmStore │          │  get_steps │
              │  directly  │          │  from LS   │
              └─────┬─────┘          └─────┬─────┘
                    │                       │
                    └──────────┬────────────┘
                               │
                         ┌─────▼─────┐
                         │  Response  │
                         │  to client │
                         └───────────┘
  • Bypass Path: When custom tools are present, the handler polls MitmStore directly for response text, thinking text, and function calls. The MITM proxy captures these from the Google API response before the LS processes them.

  • LS Path: When no custom tools are present, the handler polls the LS's get_steps API for progressive response data (text, thinking, status).
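
The branch at the top of the diagram reduces to a single decision on whether the request carries custom tools. A minimal sketch, with hypothetical enum and function names:

```rust
/// Which source a handler polls for response data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResponsePath {
    /// Custom tools present: read text, thinking, and function calls from MitmStore.
    MitmBypass,
    /// No custom tools: poll the LS get_steps API.
    LsSteps,
}

/// Choose the path from the number of custom tools in the request.
pub fn select_path(custom_tool_count: usize) -> ResponsePath {
    if custom_tool_count > 0 {
        ResponsePath::MitmBypass
    } else {
        ResponsePath::LsSteps
    }
}

impl ResponsePath {
    /// Human-readable name of the polled source, useful for logging.
    pub fn poll_source(self) -> &'static str {
        match self {
            ResponsePath::MitmBypass => "MitmStore",
            ResponsePath::LsSteps => "LS get_steps",
        }
    }
}
```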

Stale State Protection

All bypass paths include protection against stale response_complete flags from previous requests:

if complete && text.is_empty() && thinking.is_none() {
    warn!("stale response_complete detected — clearing");
    state.mitm_store.clear_response_async().await;
    continue; // or retry
}

This handles the race condition where a previous request's MITM handler calls mark_response_complete() after the new request has already called clear_response_async().
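
The check can be isolated into a pure function, which makes the race easier to test without a running MITM proxy. The sketch below uses a hypothetical Snapshot type in place of the real store. It also treats a captured function call as real content, which goes beyond the condition shown above and is an assumption.

```rust
/// One observation of the store during polling (hypothetical type).
#[derive(Debug, Clone, Default)]
pub struct Snapshot {
    pub complete: bool,
    pub text: String,
    pub thinking: Option<String>,
    /// Number of captured function calls (assumed extra signal).
    pub function_calls: usize,
}

/// A completion flag with no captured content is treated as left over
/// from a previous request.
pub fn is_stale(s: &Snapshot) -> bool {
    s.complete && s.text.is_empty() && s.thinking.is_none() && s.function_calls == 0
}

/// Index of the first poll that reports a genuine completion, skipping
/// stale flags and in-progress snapshots.
pub fn first_real_completion(polls: &[Snapshot]) -> Option<usize> {
    polls.iter().position(|s| s.complete && !is_stale(s))
}
```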

Tool Format Conversion

OpenAI tools ──► openai_tools_to_gemini() ──► Gemini functionDeclarations
                                                    │
                                              MitmStore.set_tools()
                                                    │
                                              MITM proxy injects into
                                              outgoing LS request

The Gemini endpoint skips this conversion entirely — tools are stored in native Gemini format.
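
For the OpenAI-format endpoints, the conversion step might look like the sketch below. Only the function name openai_tools_to_gemini appears in this document; the signature, the struct shapes, and carrying the JSON Schema through as an opaque string are assumptions made to keep the example dependency-free.

```rust
/// The `function` object of one OpenAI tool definition (simplified).
#[derive(Debug, Clone, PartialEq)]
pub struct OpenAiTool {
    pub name: String,
    pub description: Option<String>,
    /// JSON Schema for the arguments, kept as raw JSON text in this sketch.
    pub parameters: String,
}

/// One entry in Gemini's functionDeclarations array.
#[derive(Debug, Clone, PartialEq)]
pub struct FunctionDeclaration {
    pub name: String,
    pub description: String,
    pub parameters: String,
}

/// A Gemini tool object: {functionDeclarations: [...]}.
#[derive(Debug, Clone, PartialEq)]
pub struct GeminiTool {
    pub function_declarations: Vec<FunctionDeclaration>,
}

/// Collect all OpenAI function tools into a single Gemini tool object.
pub fn openai_tools_to_gemini(tools: &[OpenAiTool]) -> Vec<GeminiTool> {
    if tools.is_empty() {
        return Vec::new();
    }
    let function_declarations = tools
        .iter()
        .map(|t| FunctionDeclaration {
            name: t.name.clone(),
            // A missing description becomes an empty string.
            description: t.description.clone().unwrap_or_default(),
            parameters: t.parameters.clone(),
        })
        .collect();
    vec![GeminiTool { function_declarations }]
}
```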