
# Endpoint Gap Analysis
> **Updated:** 2026-02-15
> **Sources:** [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create), [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses), [Gemini Thinking Mode](https://ai.google.dev/gemini-api/docs/thinking-mode), proxy source code
> **Method:** Full source audit cross-referenced against context7 OpenAI API docs
---
## What's Implemented
### All Endpoints
- ✅ Sync + streaming modes
- ✅ Model selection + validation
- ✅ OAuth auth check
- ✅ Timeout control
- ✅ Tool definitions, tool choice, tool results (OpenAI → Gemini auto-conversion)
- ✅ MITM bypass path for custom tools
- ✅ Thinking/reasoning in both sync and streaming
- ✅ Generation params forwarded via MITM (`temperature`, `top_p`, `top_k`, `max_output_tokens`, `stop_sequences`, `frequency_penalty`, `presence_penalty`)
-`reasoning_effort` / `thinkingLevel` — forwarded as `generationConfig.thinkingConfig.thinkingLevel`
-`response_format: {type: "json_object"}` — injected as `responseMimeType: "application/json"`
- ✅ Google Search grounding — `web_search: true` (Completions), `tools: [{type: "web_search_preview"}]` (Responses), `google_search: true` (Gemini)
-`/v1/search` endpoint — dedicated web search via Google Search grounding, returns structured results + citations
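
Most of the parameter forwarding above is renaming (snake_case OpenAI names to camelCase `generationConfig` fields), plus normalizing `stop`, which OpenAI accepts as either a string or an array. The Rust sketch below shows that translation under stated assumptions. `ChatParams`, `Stop`, `GenerationConfig`, and `to_generation_config` are hypothetical names, not the proxy's actual types, and letting `max_completion_tokens` take precedence over `max_tokens` is a guess.

```rust
/// OpenAI `stop` accepts either a single string or an array of strings.
pub enum Stop {
    One(String),
    Many(Vec<String>),
}

/// Hypothetical subset of an OpenAI Chat Completions request.
#[derive(Default)]
pub struct ChatParams {
    pub temperature: Option<f64>,
    pub top_p: Option<f64>,
    pub max_tokens: Option<u32>,
    pub max_completion_tokens: Option<u32>,
    pub stop: Option<Stop>,
    /// true when `response_format` is `{type: "json_object"}`
    pub json_object: bool,
}

/// Hypothetical Gemini-side `generationConfig`; comments give the wire names.
#[derive(Debug, Default, PartialEq)]
pub struct GenerationConfig {
    pub temperature: Option<f64>,           // temperature
    pub top_p: Option<f64>,                 // topP
    pub max_output_tokens: Option<u32>,     // maxOutputTokens
    pub stop_sequences: Vec<String>,        // stopSequences
    pub response_mime_type: Option<String>, // responseMimeType
}

/// Renames OpenAI parameters to their Gemini equivalents.
pub fn to_generation_config(p: &ChatParams) -> GenerationConfig {
    // `stop` may be a lone string or an array; Gemini always takes an array.
    let stop_sequences = match &p.stop {
        None => Vec::new(),
        Some(Stop::One(s)) => vec![s.clone()],
        Some(Stop::Many(v)) => v.clone(),
    };
    GenerationConfig {
        temperature: p.temperature,
        top_p: p.top_p,
        // Assumption: the newer `max_completion_tokens` wins when both are sent.
        max_output_tokens: p.max_completion_tokens.or(p.max_tokens),
        stop_sequences,
        // JSON mode is expressed as a MIME type rather than a separate flag.
        response_mime_type: p.json_object.then(|| "application/json".to_string()),
    }
}
```

Unset parameters stay `None` so that Gemini's own defaults apply instead of values invented by the proxy.
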
### Reasoning Effort → Thinking Level Mapping
| OpenAI `reasoning_effort` | Google `thinkingLevel` | Gemini 3 Pro | Gemini 3 Flash |
| :-----------------------: | :--------------------: | :----------: | :------------: |
| `"low"` | `"low"` | ✅ | ✅ |
| `"medium"` | `"medium"` | ❌ | ✅ |
| `"high"` | `"high"` | ✅ (default) | ✅ (default) |
| — | `"minimal"` | ❌ | ✅ |
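
The table implies a per-model validity check in addition to passing the level string through. Below is a minimal sketch. It assumes the requested level has already been pulled from `reasoning_effort` or `thinkingLevel`, and that an unsupported combination is rejected rather than silently downgraded (the table does not say which the proxy does). `Model` and `thinking_level` are hypothetical names.

```rust
/// Hypothetical model families from the table above.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Model {
    Gemini3Pro,
    Gemini3Flash,
}

/// Returns the `thinkingLevel` to send, or an error for an unsupported combination.
/// A missing level falls back to "high", the default for both models.
pub fn thinking_level(model: Model, requested: Option<&str>) -> Result<&'static str, String> {
    let level = match requested.unwrap_or("high") {
        // "minimal" has no OpenAI `reasoning_effort` counterpart; it can only
        // arrive as a native Gemini `thinkingLevel`.
        "minimal" => "minimal",
        "low" => "low",
        "medium" => "medium",
        "high" => "high",
        other => return Err(format!("unknown thinking level: {other}")),
    };
    let supported = match model {
        Model::Gemini3Pro => matches!(level, "low" | "high"),
        Model::Gemini3Flash => true,
    };
    // Assumption: reject instead of mapping, e.g., "medium" to "high" on Pro.
    if supported {
        Ok(level)
    } else {
        Err(format!("{level} is not supported by {model:?}"))
    }
}
```

Rejecting unsupported levels surfaces a client mistake early. A proxy aiming for maximum compatibility might instead clamp to the nearest supported level.
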
### Completions-Specific
-`stream_options.include_usage` — final chunk with usage before `[DONE]`
-`completion_tokens_details.reasoning_tokens` — thinking token count
-`prompt_tokens_details.cached_tokens` — cache read tokens
-`temperature`, `top_p`, `max_tokens`, `max_completion_tokens`, `frequency_penalty`, `presence_penalty`
-`reasoning_effort`
-`stop` — string or array, forwarded as `generationConfig.stopSequences`
-`response_format: {type: "json_object"}` — injects `responseMimeType`
-`response_format: {type: "json_schema", json_schema: {...}}` — injects `responseMimeType` + `responseSchema` via MITM
-`n` (multiple choices) — fires N parallel cascades, collects into `choices[]` (sync only, capped at 5)
-`conversation` — session ID for multi-turn cascade reuse (custom extension)
-`reasoning_content` — thinking text in assistant message
-`system_fingerprint``fp_<version>` in sync + all streaming chunks
-`service_tier``"default"` in sync + all streaming chunks
-`logprobs: null` — in every choice (sync + streaming)
-`metadata` — accepted in request, ignored
-`finish_reason` — correctly maps Google's `MAX_TOKENS``"length"`, `SAFETY``"content_filter"`, etc.
- ✅ Full `messages[]` history — all user, assistant, system, tool messages forwarded
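
`stream_options.include_usage` follows OpenAI's documented contract: one extra chunk, with an empty `choices` array and a populated `usage` object, is sent just before `data: [DONE]`. The sketch below builds that stream tail with hand-formatted JSON so it needs no dependencies. `Usage` and `stream_tail` are hypothetical names. `id` and `model` are assumed to need no JSON escaping, and other chunk fields (`created`, `system_fingerprint`, `service_tier`) are omitted for brevity.

```rust
/// Hypothetical token counts for the final usage chunk.
pub struct Usage {
    pub prompt_tokens: u64,
    pub completion_tokens: u64,
    pub cached_tokens: u64,    // prompt_tokens_details.cached_tokens
    pub reasoning_tokens: u64, // completion_tokens_details.reasoning_tokens
}

/// Builds the SSE events that close a Chat Completions stream.
pub fn stream_tail(id: &str, model: &str, usage: &Usage, include_usage: bool) -> Vec<String> {
    let mut events = Vec::new();
    if include_usage {
        // Usage-only chunk: `choices` is empty and `usage` covers the whole request.
        // Reasoning tokens are part of completion_tokens, so total = prompt + completion.
        let json = format!(
            r#"{{"id":"{id}","object":"chat.completion.chunk","model":"{model}","choices":[],"usage":{{"prompt_tokens":{p},"completion_tokens":{c},"total_tokens":{t},"prompt_tokens_details":{{"cached_tokens":{k}}},"completion_tokens_details":{{"reasoning_tokens":{r}}}}}}}"#,
            p = usage.prompt_tokens,
            c = usage.completion_tokens,
            t = usage.prompt_tokens + usage.completion_tokens,
            k = usage.cached_tokens,
            r = usage.reasoning_tokens,
        );
        events.push(format!("data: {json}\n\n"));
    }
    // The stream always ends with the literal [DONE] sentinel.
    events.push("data: [DONE]\n\n".to_string());
    events
}
```
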
### Responses-Specific
- ✅ Full streaming event set (all `response.*` events including reasoning summary)
-`temperature`, `top_p`, `max_output_tokens`
-`reasoning_effort` — echoed from client request
-`thinking_signature` for multi-turn thinking chains
-`instructions`, `metadata`, `user` — echoed in response
- ✅ Usage with MITM-intercepted real tokens
-`max_tool_calls` — limits tool calls returned per response
-`conversation` — session reuse
-`previous_response_id`, `store`, `parallel_tool_calls`, `truncation`, `text.format`, `tool_choice` — echoed
-`tools` — echoed from client request (was previously always `[]`)
- ✅ `text.format` — `text: {format: {type: "json_schema", ...}}` injects `responseMimeType` + `responseSchema` via MITM; echoed in response
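
Both structured-output paths reduce to the same two `generationConfig` fields. Completions nests the schema under `response_format.json_schema`, while Responses puts it directly under `text.format`, so only the extraction step differs. In this hedged sketch, `OutputFormat`, `StructuredOutput`, and `inject` are hypothetical names. The schema is carried as raw JSON text instead of a parsed value and is forwarded verbatim. Whether the proxy rewrites schemas to fit Gemini's schema dialect is not covered here.

```rust
/// Hypothetical output format, normalized from either Completions
/// `response_format` or Responses `text.format`.
pub enum OutputFormat {
    Text,
    JsonObject,
    /// Schema kept as raw JSON text in this sketch.
    JsonSchema { schema: String },
}

/// The two fields injected into `generationConfig` via MITM.
#[derive(Debug, Default, PartialEq)]
pub struct StructuredOutput {
    pub response_mime_type: Option<String>, // responseMimeType
    pub response_schema: Option<String>,    // responseSchema
}

pub fn inject(format: &OutputFormat) -> StructuredOutput {
    match format {
        // Plain text: inject nothing and let Gemini default to text output.
        OutputFormat::Text => StructuredOutput::default(),
        // JSON mode: MIME type only, no schema constraint.
        OutputFormat::JsonObject => StructuredOutput {
            response_mime_type: Some("application/json".to_string()),
            response_schema: None,
        },
        // JSON schema: MIME type plus the schema, passed through unchanged.
        OutputFormat::JsonSchema { schema } => StructuredOutput {
            response_mime_type: Some("application/json".to_string()),
            response_schema: Some(schema.clone()),
        },
    }
}
```
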
### Gemini-Specific
- ✅ Native tool format (no conversion needed)
-`usageMetadata` in sync **and streaming** responses
-`temperature`, `topP`, `topK`, `maxOutputTokens`, `stopSequences`
-`thinkingLevel`
- ✅ Session/conversation reuse
- ✅ Array/multipart `input` — strings, string arrays, `{text: "..."}` object arrays
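
The three accepted `input` shapes can be modeled as one enum and flattened before the upstream request is built. This sketch uses hypothetical `Input`, `TextPart`, and `normalize_input` names and keeps each element as a separate text part. Whether the proxy instead joins the elements into a single string is an open assumption.

```rust
/// Hypothetical model of the accepted `input` shapes.
pub enum Input {
    /// "hello"
    Text(String),
    /// ["hello", "world"]
    Texts(Vec<String>),
    /// [{text: "hello"}, {text: "world"}]
    Parts(Vec<TextPart>),
}

pub struct TextPart {
    pub text: String,
}

/// Flattens any accepted shape into an ordered list of text parts.
pub fn normalize_input(input: Input) -> Vec<String> {
    match input {
        Input::Text(s) => vec![s],
        Input::Texts(v) => v,
        Input::Parts(parts) => parts.into_iter().map(|p| p.text).collect(),
    }
}
```
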
---
## Fixed Bugs
| # | Bug | Fix |
| --- | -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| B1 | Message history dropped | `extract_chat_input` now calls `build_conversation_with_tools` with all messages, so full multi-turn history via `messages[]` works. |
| B2 | `finish_reason` never `"length"` | `google_to_openai_finish_reason()` helper maps `MAX_TOKENS` → `"length"` and `SAFETY`/`RECITATION`/etc. → `"content_filter"`. Applied to all paths. |
| B3 | `reasoning` always null | `build_response_object` now echoes the client's `reasoning_effort` from `RequestParams`. |
| B4 | `tool_choice` always `"auto"` | Changed from `&'static str` to `serde_json::Value`. Echoes whatever the client sent. |
| B5 | `tools` always `[]` | Echoes the client's tools array in the response. |
| B7 | `temperature`/`top_p` wrong | Already defaults to `1.0` via `unwrap_or(1.0)`. Was a false positive — no fix needed. |
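
B2's mapping can be sketched as a single match. Only `MAX_TOKENS` → `"length"` and `SAFETY`/`RECITATION` → `"content_filter"` come from the table. The extra blocking reasons (`BLOCKLIST`, `PROHIBITED_CONTENT`, `SPII`), the `"stop"` fallback, and the `has_tool_calls` override are assumptions; the override reflects the assumption that a response carrying function calls should report `"tool_calls"`, as OpenAI clients expect. `map_finish_reason` is a hypothetical stand-in, not the proxy's `google_to_openai_finish_reason()`.

```rust
/// Hypothetical mapping from a Gemini `finishReason` to an OpenAI `finish_reason`.
pub fn map_finish_reason(reason: &str, has_tool_calls: bool) -> &'static str {
    // Assumption: tool calls take priority over whatever reason Gemini reports.
    if has_tool_calls {
        return "tool_calls";
    }
    match reason {
        // Output was cut off by the token limit.
        "MAX_TOKENS" => "length",
        // Output was blocked or truncated by a safety or policy filter.
        "SAFETY" | "RECITATION" | "BLOCKLIST" | "PROHIBITED_CONTENT" | "SPII" => "content_filter",
        // STOP and anything unrecognized are treated as a normal stop.
        _ => "stop",
    }
}
```
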
### Acceptable / Won't Fix
| # | Bug | Status |
| --- | ----------------------------------------- | ----------------------------------------------------------------------------------------------------------- |
| B6 | `Usage::estimate` returns estimated (not upstream) token counts as a fallback | Only triggers on timeout/error paths. The `len/4` heuristic is reasonable there, since output tokens are 0 on a timeout. |
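
For reference, the fallback heuristic amounts to something like the following minimal sketch. It assumes `len` means UTF-8 byte length (Rust's `str::len`) and that the division floors. `EstimatedUsage` and `estimate_usage` are hypothetical stand-ins for `Usage::estimate`, whose real signature is not shown in this document.

```rust
/// Hypothetical fallback usage, used only when no real counts were intercepted.
#[derive(Debug, PartialEq)]
pub struct EstimatedUsage {
    pub prompt_tokens: usize,
    pub completion_tokens: usize,
    pub total_tokens: usize,
}

/// `len / 4` heuristic: roughly four bytes of text per token (floored).
pub fn estimate_usage(prompt: &str, output: &str) -> EstimatedUsage {
    let prompt_tokens = prompt.len() / 4;
    // On a timeout no output arrived, so this is 0.
    let completion_tokens = output.len() / 4;
    EstimatedUsage {
        prompt_tokens,
        completion_tokens,
        total_tokens: prompt_tokens + completion_tokens,
    }
}
```
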
---
## TODO — New Features
### Trivial (all done ✅)
All trivial response shape fixes have been implemented.
### Medium: schema injection via MITM (all done ✅)
All structured output features have been implemented.
### Hard (new features)
| # | Gap | API | Notes |
| --- | ------------------------- | ---- | ---------------------------------------------------------- |
| 7 | **`parallel_tool_calls`** | Both | Accept param, echo in response. Can't enforce server-side. |
### Stretch (research needed)
| # | Gap | API | Notes |
| --- | -------------------------- | ---- | ---------------------------------------------------------------------------------------------------------------------------- |
| 12 | **Image/audio modalities** | Both | LS `sendMessage` is text-only. Need to reverse-engineer proto format for binary payloads. Gemini 3 supports vision natively. |
---
## Won't Implement
| # | Gap | Reason |
| --- | ------------------------------- | ------------------------------------------------------------------------ |
| 9 | `prediction` (Predicted Output) | Inference-level speculative decoding optimization. No Gemini equivalent. |
| 10 | `logprobs` / `top_logprobs` | Token-level log probabilities are not exposed on the upstream Gemini path the proxy uses. |