zerogravity/docs/endpoint-gap-analysis.md
Nikketryhard 4e4d8e9474 chore: code cleanup and documentation overhaul
- Remove debug header dump from MITM proxy (was temp debugging code)
- Suppress dead_code warnings for intentional OpenAI compat fields
- Rewrite README with styled mermaid architecture diagrams, full
  feature listing, usage examples, and CLI reference
- Update endpoint-gap-analysis: images implemented, audio only stretch
- Update mitm-interception-status: add request modification and error
  capture components
- Update standalone-ls-todo: add new endpoints to test results
- Zero compiler warnings
2026-02-15 18:27:53 -06:00

Endpoint Gap Analysis

Updated: 2026-02-15
Sources: OpenAI Chat Completions API, OpenAI Responses API, Gemini Thinking Mode, proxy source code
Method: Full source audit cross-referenced against context7 OpenAI API docs


What's Implemented

All Endpoints

  • Sync + streaming modes
  • Model selection + validation
  • OAuth auth check
  • Timeout control
  • Tool definitions, tool choice, tool results (OpenAI → Gemini auto-conversion)
  • MITM bypass path for custom tools
  • Thinking/reasoning in both sync and streaming
  • Generation params forwarded via MITM (temperature, top_p, top_k, max_output_tokens, stop_sequences, frequency_penalty, presence_penalty)
  • reasoning_effort / thinkingLevel — forwarded as generationConfig.thinkingConfig.thinkingLevel
  • response_format: {type: "json_object"} — injected as responseMimeType: "application/json"
  • Google Search grounding — web_search: true (Completions), tools: [{type: "web_search_preview"}] (Responses), google_search: true (Gemini)
  • /v1/search endpoint — dedicated web search via Google Search grounding, returns structured results + citations
  • Image uploads — input_image / image_url with base64 data URIs, injected via MITM as inlineData
  • Upstream error propagation — Google API errors (400, 429, 500) returned to client instantly instead of hanging
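
The image-upload item above relies on splitting a base64 data URI into a MIME type and a payload before it can be injected as inlineData. A minimal sketch of that step follows. `InlineData` and `parse_data_uri` are hypothetical names chosen for illustration, and the proxy's actual parsing may differ.

```rust
/// Hypothetical mirror of a Gemini `inlineData` part: MIME type plus base64 payload.
#[derive(Debug, PartialEq)]
pub struct InlineData {
    pub mime_type: String,
    pub data: String,
}

/// Splits `data:<mime>;base64,<payload>` into its two parts.
/// Returns None for anything that is not a base64 data URI (for example an https URL).
pub fn parse_data_uri(uri: &str) -> Option<InlineData> {
    let rest = uri.strip_prefix("data:")?;
    let (header, payload) = rest.split_once(',')?;
    let mime_type = header.strip_suffix(";base64")?;
    if mime_type.is_empty() || payload.is_empty() {
        return None;
    }
    Some(InlineData {
        mime_type: mime_type.to_string(),
        data: payload.to_string(),
    })
}
```

The payload is passed through untouched here; whether it is validated or re-encoded before injection is not covered by this sketch.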

Reasoning Effort → Thinking Level Mapping

| OpenAI reasoning_effort | Google thinkingLevel | Gemini 3 Pro | Gemini 3 Flash |
|---|---|---|---|
| "low" | "low" | | |
| "medium" | "medium" | | |
| "high" | "high" | (default) | (default) |
| "minimal" | | | |

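The mapping above can be sketched as a small lookup. This is an illustration under assumptions, not the proxy's code: `thinking_level` and `thinking_config_json` are hypothetical names, only the "low", "medium", and "high" pairs are mapped, and every other value (including "minimal", whose forwarding is not spelled out here) yields None so that the upstream default applies.

```rust
/// Maps an OpenAI-style `reasoning_effort` to a Google `thinkingLevel`.
/// Only "low", "medium", and "high" are mapped; anything else yields None
/// (an assumption of this sketch).
pub fn thinking_level(reasoning_effort: Option<&str>) -> Option<&'static str> {
    match reasoning_effort? {
        "low" => Some("low"),
        "medium" => Some("medium"),
        "high" => Some("high"),
        _ => None,
    }
}

/// Renders the fragment that would sit under `generationConfig`,
/// or None when there is nothing to forward.
pub fn thinking_config_json(reasoning_effort: Option<&str>) -> Option<String> {
    thinking_level(reasoning_effort)
        .map(|level| format!(r#"{{"thinkingConfig":{{"thinkingLevel":"{}"}}}}"#, level))
}
```

Returning None instead of a hard-coded default keeps the per-model default decision with the upstream API.
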
Completions-Specific

  • stream_options.include_usage — final chunk with usage before [DONE]
  • completion_tokens_details.reasoning_tokens — thinking token count
  • prompt_tokens_details.cached_tokens — cache read tokens
  • temperature, top_p, max_tokens, max_completion_tokens, frequency_penalty, presence_penalty
  • reasoning_effort
  • stop — string or array, forwarded as generationConfig.stopSequences
  • response_format: {type: "json_object"} — injects responseMimeType
  • response_format: {type: "json_schema", json_schema: {...}} — injects responseMimeType + responseSchema via MITM
  • n (multiple choices) — fires N parallel cascades, collects into choices[] (sync only, capped at 5)
  • conversation — session ID for multi-turn cascade reuse (custom extension)
  • reasoning_content — thinking text in assistant message
  • system_fingerprint: "fp_<version>" in sync + all streaming chunks
  • service_tier: "default" in sync + all streaming chunks
  • logprobs: null — in every choice (sync + streaming)
  • metadata — accepted in request, ignored
  • finish_reason: correctly maps Google's MAX_TOKENS → "length", SAFETY → "content_filter", etc.
  • Full messages[] history — all user, assistant, system, tool messages forwarded

Responses-Specific

  • Full streaming event set (all response.* events including reasoning summary)
  • temperature, top_p, max_output_tokens
  • reasoning_effort — echoed from client request
  • thinking_signature for multi-turn thinking chains
  • instructions, metadata, user — echoed in response
  • Usage with MITM-intercepted real tokens
  • max_tool_calls — limits tool calls returned per response
  • conversation — session reuse
  • previous_response_id, store, parallel_tool_calls, truncation, text.format, tool_choice — echoed
  • tools — echoed from client request (was previously always [])
  • text.format: {format: {type: "json_schema", ...}} injects responseMimeType + responseSchema via MITM, echoed in response
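
The max_tool_calls item above amounts to trimming the list of tool calls before the response is built. A minimal sketch, assuming a hypothetical `limit_tool_calls` helper that keeps the first calls in their original order and treats a missing limit as unlimited:

```rust
/// Keeps at most `max_tool_calls` entries, preserving order.
/// Assumption: None means no limit is applied.
pub fn limit_tool_calls<T>(mut calls: Vec<T>, max_tool_calls: Option<usize>) -> Vec<T> {
    if let Some(max) = max_tool_calls {
        calls.truncate(max);
    }
    calls
}
```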

Gemini-Specific

  • Native tool format (no conversion needed)
  • usageMetadata in sync and streaming responses
  • temperature, topP, topK, maxOutputTokens, stopSequences
  • thinkingLevel
  • Session/conversation reuse
  • Array/multipart input — strings, string arrays, {text: "..."} object arrays

Fixed Bugs

| # | Bug | Fix |
|---|---|---|
| B1 | Messages history dropped | extract_chat_input now calls build_conversation_with_tools with ALL messages, so full multi-turn via messages[] works. |
| B2 | finish_reason never "length" | google_to_openai_finish_reason() helper maps MAX_TOKENS → "length", SAFETY/RECITATION/etc. → "content_filter". Applied to all paths. |
| B3 | reasoning always null | build_response_object now echoes client's reasoning_effort from RequestParams. |
| B4 | tool_choice always "auto" | Changed from &'static str to serde_json::Value. Echoes whatever the client sent. |
| B5 | tools always [] | Echoes the client's tools array in the response. |
| B7 | temperature/top_p wrong | Already defaults to 1.0 via unwrap_or(1.0). Was a false positive, no fix needed. |
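
The B2 fix can be sketched as a single match. The function name comes from the entry above, but the signature and the fallback arm are assumptions: only MAX_TOKENS, SAFETY, and RECITATION are named in this document, so every other reason (including Google's STOP) falls through to "stop" here.

```rust
/// Maps a Google finish reason to its OpenAI counterpart.
/// Signature and fallback are assumptions of this sketch.
pub fn google_to_openai_finish_reason(reason: &str) -> &'static str {
    match reason {
        "MAX_TOKENS" => "length",
        "SAFETY" | "RECITATION" => "content_filter",
        // Assumption: any reason not listed above is reported as a normal stop.
        _ => "stop",
    }
}
```

Keeping the mapping in one helper is what lets it be applied uniformly across the sync and streaming paths.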

Acceptable / Won't Fix

| # | Bug | Status |
|---|---|---|
| B6 | Usage::estimate fake tokens as fallback | Only triggers on timeout/error paths. Heuristic len/4 is reasonable for timeouts where output tokens = 0. |
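
The len/4 heuristic in B6 is simple enough to show directly. This is a sketch, not the body of Usage::estimate: `UsageEstimate`, `estimate_tokens`, and `estimate_usage` are hypothetical names, and measuring length in bytes rather than characters is an assumption.

```rust
/// Hypothetical container for estimated token counts.
#[derive(Debug, PartialEq)]
pub struct UsageEstimate {
    pub prompt_tokens: u64,
    pub completion_tokens: u64,
    pub total_tokens: u64,
}

/// Rough token count: one token per four bytes of text, rounded down.
/// Assumption: length is measured in bytes.
pub fn estimate_tokens(text: &str) -> u64 {
    (text.len() / 4) as u64
}

/// Fallback usage for paths where no real token counts were intercepted.
pub fn estimate_usage(prompt: &str, completion: &str) -> UsageEstimate {
    let prompt_tokens = estimate_tokens(prompt);
    let completion_tokens = estimate_tokens(completion);
    UsageEstimate {
        prompt_tokens,
        completion_tokens,
        total_tokens: prompt_tokens + completion_tokens,
    }
}
```

On a timeout the completion is empty, so only the prompt side of the estimate is non-zero.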

TODO — New Features

Trivial (all done)

All trivial response shape fixes have been implemented.

Medium (schema injection via MITM) — all done

All structured output features have been implemented.

Hard (new features)

| # | Gap | API | Notes |
|---|---|---|---|
| 7 | parallel_tool_calls | Both | Accept param, echo in response. Can't enforce server-side. |

Stretch (research needed)

| # | Gap | API | Notes |
|---|---|---|---|
| 12 | Audio input | Both | Audio modalities not yet supported. Vision/images work via MITM. |

Won't Implement

| # | Gap | Reason |
|---|---|---|
| 9 | prediction (Predicted Output) | Inference-level speculative decoding optimization. No Gemini equivalent. |
| 10 | logprobs / top_logprobs | Gemini never exposes token-level log probabilities. |