Bump to gpt5 models by Qard · Pull Request #169 · braintrustdata/autoevals

Qard · 2026-01-30T17:06:11Z

No description provided.

github-actions · 2026-02-04T18:13:10Z

Braintrust eval report

Autoevals (gpt5-1771970348)

Score	Average	Improvements	Regressions
NumericDiff	37.1% (-1pp)	6 🟢	7 🔴
Time_to_first_token	7.27tok (-1.06tok)	76 🟢	43 🔴
Llm_calls	1.09 (+0)	-	-
Tool_calls	0 (+0)	-	-
Errors	0 (+0)	-	-
Llm_errors	0 (+0)	-	-
Tool_errors	0 (+0)	-	-
Prompt_tokens	317.7tok (+0tok)	-	-
Prompt_cached_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_tokens	0tok (+0tok)	-	-
Completion_tokens	251.92tok (-2.2tok)	57 🟢	51 🔴
Completion_reasoning_tokens	0tok (+0tok)	-	-
Total_tokens	569.62tok (-2.2tok)	57 🟢	51 🔴
Estimated_cost	0$ (0$)	54 🟢	45 🔴
Duration	4.55s (-0.53s)	92 🟢	127 🔴
Llm_duration	8.75s (-1.03s)	76 🟢	43 🔴

ibolmo · 2026-02-11T22:40:29Z

js/llm.fixtures.ts

    object: "chat.completion",
    created: 1741135832,
-    model: "gpt-4o-2024-08-06",
+    model: "gpt-5-mini-2025-08-07",


you should send up a monolith repo PR. you'll likely need to update various expect tests

- Remove temperature=0 from ragas tests (gpt-5 models don't support custom temperature) - Add division by zero guard in ContextRecall for both JS and Python - Mark ContextEntityRecall test as can_fail due to LLM non-determinism Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

GPT-5 models don't support custom temperature values. Removed the default temperature=0 from parseArgs in ragas.ts and marked ContextRecall test as can_fail due to LLM non-determinism. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

GPT-5 models require the Responses API instead of the Chat Completions API. This change automatically detects GPT-5 models (by checking if the model name starts with "gpt-5") and routes them to the appropriate API. Changes: - TypeScript: Added isGPT5Model() helper and conditional routing in cachedChatCompletion() - Python: Added is_gpt5_model() helper and complete_wrapper() in LLMClient.__post_init__() - Both implementations convert Chat Completions params to Responses params (messages → input) - Preserves all optional parameters (tools, temperature, max_tokens, etc.) This fixes the "404 unknown model 'gpt-5-mini'" error that was occurring when trying to use GPT-5 models with the Chat Completions API. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The Responses API expects tools in a flatter format than Chat Completions API. Chat Completions format: { type: "function", function: { name, description, parameters } } Responses API format: { name, description, parameters } Also transform tool_choice from { type: "function", function: { name } } to just the name string. This fixes the "400 Missing required parameter: 'tools[0].name'" error. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The Responses API expects a flattened format but still requires the type field: { type: "function", name, description, parameters } Previously was missing the type field, causing "Missing required parameter: 'tools[0].type'" error. Also fixed TypeScript type conversion by using double cast to avoid build errors. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The Responses API only accepts "none", "auto", or "required" for tool_choice, not specific function names. When Chat Completions API passes { type: "function", function: { name: "..." } }, we now map it to "required" which forces the model to call a tool. This fixes the error: "Invalid value: 'select_choice'. Supported values are: 'none', 'auto', and 'required'." Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The Responses API has a different response structure: - Uses `output` array instead of `choices` - Different field names (stop_reason vs finish_reason) Also removed max_tokens parameter as it's not supported by Responses API. This fixes the error: "Cannot read properties of undefined (reading 'length')" Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Based on OpenAI documentation, the Responses API returns: - output array with separate items for text and tool calls - Tool calls have type "custom_tool_call" with fields: call_id, name, input - Text output has type "output_text" or "text" Conversion now: 1. Iterates through output array 2. Extracts text content from output_text items 3. Converts custom_tool_call items to Chat Completions tool_calls format 4. Maps: call_id → id, input → arguments This fixes "No tool calls in response" error. Sources: - https://developers.openai.com/cookbook/examples/gpt-5/gpt-5_new_params_and_tools - https://platform.openai.com/docs/guides/migrate-to-responses Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- post_process_response now checks if response is already a dict before calling .dict() - Handle 'function_call' type in addition to 'custom_tool_call' - Use 'arguments' field directly (not 'input') for function calls

- Handle 'function_call' type in addition to 'custom_tool_call' - Use 'arguments' field directly (not 'input') for function calls - post_process_response now checks if response is already a dict - Remove debug output from TypeScript implementation

TypeScript build was failing because the ChatCompletionMessage type requires a refusal property. Set it to null for Responses API conversions.

- Fix Makefile to use python3 instead of python for venv creation - Update JS and Python tests to expect gpt-5-mini as default model - Apply prettier formatting to js/oai.ts Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The Responses API can return either response objects or dicts depending on the context (e.g., when mocked or proxied). The previous code only checked hasattr(response, "output") which fails for dict responses. Now properly handles both cases: - Check isinstance(response, dict) and "output" in response for dicts - Check hasattr(response, "output") for response objects This fixes KeyError: 'choices' in ragas tests where the unconverted Responses API response was being returned instead of the Chat Completions format. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Refactored the GPT-5 Responses API wrapper to properly handle both sync and async modes: - Extracted conversion logic into convert_responses_to_chat_completion() - Extracted parameter preparation into prepare_responses_params() - Created separate sync and async wrappers based on is_async flag - Async wrapper properly awaits responses_create() call This fixes async test failures where coroutines were not being awaited. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Updated JS tests to mock the Responses API endpoint (/v1/responses) instead of Chat Completions endpoint for GPT-5 models. Changes: - Added default Responses API handler in beforeAll - Updated LLMClassifierFromTemplate tests to use Responses API format - Mocked responses include output array with function_call items Fixes test timeouts when using gpt-5-mini default model. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Added null checks for output field: - Check that output value is not None when checking has_output - Handle case where output_list could be None before iterating Fixes TypeError: 'NoneType' object is not iterable when response has output key but value is None. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The Chat Completions API expects tool_calls to either be present with a list of tool calls, or absent entirely - not None. Changed to only include tool_calls in message dict when there are actual tool calls present. This fixes: TypeError: 'NoneType' object is not subscriptable at resp["tool_calls"][0] Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Created conftest.py with default Responses API mock fixture - Updated test_factuality to mock Responses API endpoint - Converts Chat Completions test expectations to Responses API format This fixes tests that were failing with "RESPX: not mocked!" errors when calling the Responses API endpoint for GPT-5 models. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Update test mocks and assertions to work with Responses API: Python changes: - Update parameter validation tests to mock /v1/responses endpoint - Verify temperature parameter passes through, max_tokens excluded (not supported by Responses API) - Update wrapper type assertions to check for callable functions instead of bound methods - Add Responses API mock for ragas embedding model test - Remove autouse from conftest fixture to prevent interference with test-specific mocks JavaScript changes: - Update parameter validation test to verify temperature only, not max_tokens - Add explicit model="gpt-4o-mini" to chain of thought and battle tests to use Chat Completions API with existing fixtures All tests now pass: Python 65 passed/2 xfailed, JavaScript 69 passed Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Qard requested review from ankrgyl and ibolmo January 30, 2026 17:06

Qard self-assigned this Jan 30, 2026

Qard added the enhancement New feature or request label Jan 30, 2026

Qard force-pushed the gpt5 branch from 3f0c510 to 11d9f77 Compare February 4, 2026 18:12

Qard force-pushed the gpt5 branch 2 times, most recently from 37bbe98 to 2a023f7 Compare February 5, 2026 20:01

ibolmo reviewed Feb 11, 2026

View reviewed changes

ibolmo approved these changes Feb 11, 2026

View reviewed changes

Qard force-pushed the gpt5 branch 2 times, most recently from 6318654 to 95ee24e Compare February 21, 2026 01:23

Qard and others added 14 commits February 24, 2026 22:00

Bump to gpt5 models

da2ad2e

Fix black formatting

3c86586

Add debug logging to understand Responses API structure

7f6b704

Fix Responses API bugs: handle dict responses and function_call type

fc5c75d

- post_process_response now checks if response is already a dict before calling .dict() - Handle 'function_call' type in addition to 'custom_tool_call' - Use 'arguments' field directly (not 'input') for function calls

Add missing refusal property to ChatCompletionMessage

35c0dc7

TypeScript build was failing because the ChatCompletionMessage type requires a refusal property. Set it to null for Responses API conversions.

Qard force-pushed the gpt5 branch from 4d9dbea to 35c0dc7 Compare February 24, 2026 15:33

Qard and others added 3 commits February 24, 2026 23:56

Qard and others added 5 commits February 25, 2026 00:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bump to gpt5 models#169

Bump to gpt5 models#169
Qard wants to merge 22 commits intomainfrom
gpt5

Qard commented Jan 30, 2026

Uh oh!

github-actions bot commented Feb 4, 2026 •

edited

Loading

Uh oh!

ibolmo Feb 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Qard commented Jan 30, 2026

Uh oh!

github-actions bot commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Braintrust eval report

Uh oh!

ibolmo Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions bot commented Feb 4, 2026 •

edited

Loading