# Endpoint Gap Analysis

> **Updated:** 2026-02-15
> **Sources:** [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create), [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses), [Gemini Thinking Mode](https://ai.google.dev/gemini-api/docs/thinking-mode), proxy source code
> **Method:** Full source audit cross-referenced against context7 OpenAI API docs

---

## What's Implemented

### All Endpoints

- ✅ Sync + streaming modes
- ✅ Model selection + validation
- ✅ OAuth auth check
- ✅ Timeout control
- ✅ Tool definitions, tool choice, tool results (OpenAI → Gemini auto-conversion)
- ✅ MITM bypass path for custom tools
- ✅ Thinking/reasoning in both sync and streaming
- ✅ Generation params forwarded via MITM (`temperature`, `top_p`, `top_k`, `max_output_tokens`, `stop_sequences`, `frequency_penalty`, `presence_penalty`)
- ✅ `reasoning_effort` / `thinkingLevel` — forwarded as `generationConfig.thinkingConfig.thinkingLevel`
- ✅ `response_format: {type: "json_object"}` — injected as `responseMimeType: "application/json"`
- ✅ Google Search grounding — `web_search: true` (Completions), `tools: [{type: "web_search_preview"}]` (Responses), `google_search: true` (Gemini)
- ✅ `/v1/search` endpoint — dedicated web search via Google Search grounding, returns structured results + citations

### Reasoning Effort → Thinking Level Mapping

| OpenAI `reasoning_effort` | Google `thinkingLevel` | Gemini 3 Pro | Gemini 3 Flash |
| :-----------------------: | :--------------------: | :----------: | :------------: |
|          `"low"`          |        `"low"`         |      ✅      |       ✅       |
|        `"medium"`         |       `"medium"`       |      ❌      |       ✅       |
|         `"high"`          |        `"high"`        | ✅ (default) |  ✅ (default)  |
|             —             |      `"minimal"`       |      ❌      |       ✅       |
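
The table above can be read as two small rules: the OpenAI effort names pass through unchanged, and support depends on the model family. The following is a minimal sketch of that reading, not the proxy's actual code. `ModelFamily`, `effort_to_thinking_level`, and `is_level_supported` are hypothetical names introduced for illustration, and how the proxy reacts to an unsupported combination is not stated in this document.

```rust
/// Gemini 3 model families from the table above (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelFamily {
    Gemini3Pro,
    Gemini3Flash,
}

/// Maps an OpenAI `reasoning_effort` value to a Google `thinkingLevel`.
/// Returns `None` for values that have no row in the table.
pub fn effort_to_thinking_level(effort: &str) -> Option<&'static str> {
    match effort {
        "low" => Some("low"),
        "medium" => Some("medium"),
        "high" => Some("high"),
        _ => None,
    }
}

/// Reports whether a `thinkingLevel` is marked as supported for a family.
pub fn is_level_supported(level: &str, family: ModelFamily) -> bool {
    match (level, family) {
        // "low" and "high" are supported by both families.
        ("low", _) | ("high", _) => true,
        // "medium" and "minimal" are Flash-only in the table.
        ("medium", ModelFamily::Gemini3Flash) | ("minimal", ModelFamily::Gemini3Flash) => true,
        _ => false,
    }
}
```

A caller would typically map the effort first and then check support, for example rejecting or downgrading `"medium"` on Gemini 3 Pro; which of those the proxy does is an open assumption here.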

### Completions-Specific

- ✅ `stream_options.include_usage` — final chunk with usage before `[DONE]`
- ✅ `completion_tokens_details.reasoning_tokens` — thinking token count
- ✅ `prompt_tokens_details.cached_tokens` — cache read tokens
- ✅ `temperature`, `top_p`, `max_tokens`, `max_completion_tokens`, `frequency_penalty`, `presence_penalty`
- ✅ `reasoning_effort`
- ✅ `stop` — string or array, forwarded as `generationConfig.stopSequences`
- ✅ `response_format: {type: "json_object"}` — injects `responseMimeType`
- ✅ `response_format: {type: "json_schema", json_schema: {...}}` — injects `responseMimeType` + `responseSchema` via MITM
- ✅ `n` (multiple choices) — fires N parallel cascades, collects into `choices[]` (sync only, capped at 5)
- ✅ `conversation` — session ID for multi-turn cascade reuse (custom extension)
- ✅ `reasoning_content` — thinking text in assistant message
- ✅ `system_fingerprint` — `fp_<version>` in sync + all streaming chunks
- ✅ `service_tier` — `"default"` in sync + all streaming chunks
- ✅ `logprobs: null` — in every choice (sync + streaming)
- ✅ `metadata` — accepted in request, ignored
- ✅ `finish_reason` — correctly maps Google's `MAX_TOKENS`→`"length"`, `SAFETY`→`"content_filter"`, etc.
- ✅ Full `messages[]` history — all user, assistant, system, tool messages forwarded
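
Two of the items above involve normalizing request parameters: `stop` may be a string or an array, and `n` is limited to 5 in sync mode. The sketch below illustrates one way to express both rules. `Stop`, `to_stop_sequences`, `MAX_CHOICES`, and `effective_n` are hypothetical names, and two behaviors are assumptions rather than documented facts: that an out-of-range `n` is clamped instead of rejected, and that streaming requests always yield a single choice.

```rust
/// Client-supplied `stop`: either one string or an array of strings.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Stop {
    Single(String),
    Many(Vec<String>),
}

/// Flattens `stop` into the list forwarded as `generationConfig.stopSequences`.
pub fn to_stop_sequences(stop: Option<Stop>) -> Vec<String> {
    match stop {
        None => Vec::new(),
        Some(Stop::Single(s)) => vec![s],
        Some(Stop::Many(v)) => v,
    }
}

/// Upper bound on `n` stated in the list above.
pub const MAX_CHOICES: u32 = 5;

/// Number of parallel cascades to fire for one request.
/// Assumption: values outside 1..=5 are clamped rather than rejected,
/// and streaming requests always produce exactly one choice.
pub fn effective_n(n: Option<u32>, stream: bool) -> u32 {
    if stream {
        return 1;
    }
    n.unwrap_or(1).clamp(1, MAX_CHOICES)
}
```

With these helpers, a sync request with `n: 8` would fire five cascades, while the same request with `stream: true` would fire one.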

### Responses-Specific

- ✅ Full streaming event set (all `response.*` events including reasoning summary)
- ✅ `temperature`, `top_p`, `max_output_tokens`
- ✅ `reasoning_effort` — echoed from client request
- ✅ `thinking_signature` for multi-turn thinking chains
- ✅ `instructions`, `metadata`, `user` — echoed in response
- ✅ Usage with MITM-intercepted real tokens
- ✅ `max_tool_calls` — limits tool calls returned per response
- ✅ `conversation` — session reuse
- ✅ `previous_response_id`, `store`, `parallel_tool_calls`, `truncation`, `text.format`, `tool_choice` — echoed
- ✅ `tools` — echoed from client request (was previously always `[]`)
- ✅ `text.format` — `{format: {type: "json_schema", ...}}` injects `responseMimeType` + `responseSchema` via MITM, echoed in response

### Gemini-Specific

- ✅ Native tool format (no conversion needed)
- ✅ `usageMetadata` in sync **and streaming** responses
- ✅ `temperature`, `topP`, `topK`, `maxOutputTokens`, `stopSequences`
- ✅ `thinkingLevel`
- ✅ Session/conversation reuse
- ✅ Array/multipart `input` — strings, string arrays, `{text: "..."}` object arrays

---

## Fixed Bugs

| #   | Bug                              | Fix                                                                                                                                         |
| --- | -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| B1  | Messages history dropped         | `extract_chat_input` now calls `build_conversation_with_tools` with ALL messages — full multi-turn via `messages[]` works.                  |
| B2  | `finish_reason` never `"length"` | `google_to_openai_finish_reason()` helper maps `MAX_TOKENS`→`"length"`, `SAFETY`/`RECITATION`/etc→`"content_filter"`. Applied to all paths. |
| B3  | `reasoning` always null          | `build_response_object` now echoes client's `reasoning_effort` from `RequestParams`.                                                        |
| B4  | `tool_choice` always `"auto"`    | Changed from `&'static str` to `serde_json::Value`. Echoes whatever the client sent.                                                        |
| B5  | `tools` always `[]`              | Echoes the client's tools array in the response.                                                                                            |
| B7  | `temperature`/`top_p` wrong      | False positive: both already default to `1.0` via `unwrap_or(1.0)`. No fix needed.                                                          |
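
The mapping described for B2 can be sketched as a single `match`. This is an illustration under stated assumptions, not the proxy's source: the `&str` signature is a guess, only `MAX_TOKENS`, `SAFETY`, and `RECITATION` are named in the table (the other reasons covered by "etc" are left out rather than guessed), and treating every remaining value as `"stop"` is an assumption.

```rust
/// Maps a Google finish reason to an OpenAI `finish_reason`.
pub fn google_to_openai_finish_reason(reason: &str) -> &'static str {
    match reason {
        // Output was cut off by the token limit.
        "MAX_TOKENS" => "length",
        // Content-policy style terminations named in the table.
        "SAFETY" | "RECITATION" => "content_filter",
        // Assumption: anything else, including "STOP", is a normal stop.
        _ => "stop",
    }
}
```

Keeping the mapping in one helper is what allows it to be applied to the sync and streaming paths alike.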

### Acceptable / Won't Fix

| #   | Bug                                       | Status                                                                                                      |
| --- | ----------------------------------------- | ----------------------------------------------------------------------------------------------------------- |
| B6  | `Usage::estimate` returns estimated (not real) token counts as a fallback | Only triggers on timeout/error paths. The `len/4` heuristic is reasonable for timeouts, where output tokens = 0. |
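
For reference, a `len/4` fallback of the kind described for B6 might look like the sketch below. The `Usage` struct shape and the `estimate_usage` name are hypothetical, and the sketch assumes `len` means byte length; the real `Usage::estimate` may differ on all three points.

```rust
/// Token usage in the OpenAI response shape (hypothetical struct).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Usage {
    pub prompt_tokens: u64,
    pub completion_tokens: u64,
    pub total_tokens: u64,
}

/// Fallback estimate: roughly one token per four bytes of text.
/// Assumption: byte length is used, with integer division rounding down.
pub fn estimate_usage(prompt: &str, output: &str) -> Usage {
    let prompt_tokens = (prompt.len() / 4) as u64;
    let completion_tokens = (output.len() / 4) as u64;
    Usage {
        prompt_tokens,
        completion_tokens,
        total_tokens: prompt_tokens + completion_tokens,
    }
}
```

On a timeout the output string is empty, so `completion_tokens` is 0 and only the prompt side is estimated.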

---

## TODO — New Features

### Trivial (all done ✅)

All trivial response shape fixes have been implemented.

### Medium: schema injection via MITM (all done ✅)

All structured output features have been implemented.

### Hard (new features)

| #   | Gap                       | API  | Notes                                                      |
| --- | ------------------------- | ---- | ---------------------------------------------------------- |
| 7   | **`parallel_tool_calls`** | Both | Accept the param and echo it in the response (the Responses endpoint already echoes it). It cannot be enforced server-side. |

### Stretch (research needed)

| #   | Gap                        | API  | Notes                                                                                                                        |
| --- | -------------------------- | ---- | ---------------------------------------------------------------------------------------------------------------------------- |
| 12  | **Image/audio modalities** | Both | LS `sendMessage` is text-only. Need to reverse-engineer proto format for binary payloads. Gemini 3 supports vision natively. |

---

## Won't Implement

| #   | Gap                             | Reason                                                                   |
| --- | ------------------------------- | ------------------------------------------------------------------------ |
| 9   | `prediction` (Predicted Output) | Inference-level speculative decoding optimization. No Gemini equivalent. |
| 10  | `logprobs` / `top_logprobs`     | Gemini never exposes token-level log probabilities.                      |