- Responses API (streaming): MITM bypass path polls MitmStore directly when custom tools are active, skipping LS step polling entirely. Streams thinking text deltas in real-time as they arrive from the MITM. Handles function calls, text response, and thinking/reasoning events.
- Responses API (sync): Same MITM bypass for non-streaming responses. Polls MitmStore for function calls or completed text before falling back to LS path.
- Gemini endpoint: MITM bypass polls MitmStore directly for tool call responses, eliminating LS overhead.
- MitmStore: Added `captured_thinking_text` field with set/peek/take methods for real-time thinking text capture from MITM SSE.
- MITM proxy: Now captures both `thinking_text` and `response_text` from `StreamingAccumulator` into MitmStore when bypass mode is active.
Sync All Endpoints + Latency + Thinking Streaming
Phase 1: Sync Responses API (/v1/responses) with LS bypass
Current state:
- `handle_responses_stream` (lines 529-859) polls LS steps for text
- Doesn't use MitmStore bypass at all
- Still suffers from LS multi-turn overhead when tools are active
Fix:
- Add MITM bypass path (same as completions): check MitmStore for text + function calls
- For function calls: emit `response.output_item.added` (function_call type) + done events
- For text: stream from MitmStore `captured_response_text` + `response_complete`
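
The fix above could be sketched as a single poll iteration that maps store contents to Responses-style events. This is a minimal sketch, not the actual handler: the `MitmStore` struct here is a simplified stand-in, the `Event` enum and `poll_once` are hypothetical names, and it assumes the proxy places only not-yet-streamed text into `captured_response_text`.

```rust
use std::collections::VecDeque;

/// Simplified stand-in for the real MitmStore (field layout is an assumption).
#[derive(Default)]
pub struct MitmStore {
    pub function_calls: VecDeque<FunctionCall>,
    pub captured_response_text: Option<String>,
    pub response_complete: bool,
}

#[derive(Debug, Clone, PartialEq)]
pub struct FunctionCall {
    pub name: String,
    pub arguments: String,
}

/// Hypothetical event type mirroring the events named in the plan.
#[derive(Debug, PartialEq)]
pub enum Event {
    /// Corresponds to response.output_item.added with a function_call item.
    OutputItemAdded { name: String },
    /// The matching "done" event carrying the final arguments.
    OutputItemDone { name: String, arguments: String },
    TextDelta(String),
    Completed,
}

/// One bypass poll iteration. Returns the events to emit and whether the
/// response is finished. Function calls take priority over text.
pub fn poll_once(store: &mut MitmStore) -> (Vec<Event>, bool) {
    let mut events = Vec::new();

    if !store.function_calls.is_empty() {
        for call in store.function_calls.drain(..) {
            events.push(Event::OutputItemAdded { name: call.name.clone() });
            events.push(Event::OutputItemDone {
                name: call.name,
                arguments: call.arguments,
            });
        }
        // A tool call ends the turn in this sketch.
        return (events, true);
    }

    // Take pending text so the same chunk is never emitted twice.
    if let Some(text) = store.captured_response_text.take() {
        if !text.is_empty() {
            events.push(Event::TextDelta(text));
        }
    }

    if store.response_complete {
        events.push(Event::Completed);
        return (events, true);
    }
    (events, false)
}
```

A real handler would call something like `poll_once` in a loop with a short sleep between iterations and serialize each event to SSE.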
Phase 2: Sync Gemini endpoint (/v1/gemini) with LS bypass
Current state:
- `handle_gemini` (lines 57-236) uses `poll_for_response`, then checks MitmStore
- Already checks `take_any_function_calls()` after polling
- But `poll_for_response` still goes through LS steps
Fix:
- When tools are active, poll MitmStore directly instead of `poll_for_response`
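
As a rough illustration of polling the store directly, the sketch below loops on `take_any_function_calls()` until a deadline passes. It uses blocking `std` primitives to stay self-contained, whereas the real handler is presumably async. The store layout, the `Option<Vec<String>>` return type, `push_function_call`, and `poll_store_directly` are all assumptions made for the example.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Simplified stand-in for the real MitmStore; calls are plain strings here.
#[derive(Default)]
pub struct MitmStore {
    calls: Mutex<Vec<String>>,
}

impl MitmStore {
    /// Hypothetical writer used by the proxy side of this sketch.
    pub fn push_function_call(&self, call: &str) {
        self.calls.lock().unwrap().push(call.to_string());
    }

    /// Named after the method in the plan; the signature is assumed.
    /// Returns and clears all captured calls, or None if there are none.
    pub fn take_any_function_calls(&self) -> Option<Vec<String>> {
        let mut guard = self.calls.lock().unwrap();
        if guard.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut *guard))
        }
    }
}

/// Poll the store every `interval` until calls appear or `timeout` elapses.
/// None means the caller should fall back to the LS path.
pub fn poll_store_directly(
    store: &MitmStore,
    interval: Duration,
    timeout: Duration,
) -> Option<Vec<String>> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(calls) = store.take_any_function_calls() {
            return Some(calls);
        }
        if Instant::now() >= deadline {
            return None;
        }
        std::thread::sleep(interval);
    }
}
```

Because the store is checked before the deadline test, a call that is already captured is returned without any sleep, which is where the latency saving over LS step polling would come from.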
Phase 3: Latency improvements
- Reduce poll intervals across all handlers
- Add MITM store thinking_text capture for real-time streaming
Phase 4: Real-time thinking streaming investigation
Current state:
- Google SSE includes `thought: true` parts with thinking text
- `streaming_acc.thinking_text` accumulates this
- Currently only used for final usage stats, not streamed in real-time
Investigation needed:
- The MITM intercept already captures thinking_text per-chunk
- Need to store thinking_text updates in MitmStore incrementally
- Responses handler can then stream thinking deltas in real-time
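
One possible shape for the incremental capture is sketched below. It assumes the proxy overwrites the stored value with the full accumulated thinking text on every chunk, and that the handler keeps a byte cursor to work out what is new. The method names follow the set/peek/take pattern, but their exact names and signatures, and the `thinking_delta` helper, are hypothetical.

```rust
use std::sync::Mutex;

/// Simplified stand-in for the real MitmStore, showing only the thinking field.
#[derive(Default)]
pub struct MitmStore {
    captured_thinking_text: Mutex<Option<String>>,
}

impl MitmStore {
    /// Proxy side: store the full thinking text accumulated so far.
    pub fn set_thinking_text(&self, text: &str) {
        *self.captured_thinking_text.lock().unwrap() = Some(text.to_string());
    }

    /// Handler side: read the current value without clearing it.
    pub fn peek_thinking_text(&self) -> Option<String> {
        self.captured_thinking_text.lock().unwrap().clone()
    }

    /// Handler side: read and clear, e.g. once the response is finished.
    pub fn take_thinking_text(&self) -> Option<String> {
        self.captured_thinking_text.lock().unwrap().take()
    }
}

/// Return only the text added since the last call, advancing the cursor.
/// `sent` is a byte offset into the accumulated text.
pub fn thinking_delta(store: &MitmStore, sent: &mut usize) -> Option<String> {
    let text = store.peek_thinking_text()?;
    // get() returns None if the cursor is past the end or not on a char
    // boundary, which would mean the stored text was reset.
    let delta = text.get(*sent..)?;
    if delta.is_empty() {
        return None;
    }
    *sent = text.len();
    Some(delta.to_string())
}
```

Peeking rather than taking leaves the accumulated text in the store, so it remains available for the final usage stats after streaming finishes.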