2 changes: 2 additions & 0 deletions AGENTS.md
@@ -17,6 +17,8 @@ Lingua is a universal message format that compiles to provider-specific formats
- **Ask when non-lossy mapping is unclear**: If the universal type cannot represent a provider feature non-lossily, stop and ask for clarification on the intended canonical representation before implementing a workaround.
- **No unapproved fallback logic**: Do not add ad-hoc fallback parsing/translation paths (for example `fallback_*` helpers) without checking with the programmer first.
- **Typed boundaries only**: At provider boundaries, parse into well-defined typed structs/enums. Do not add lenient raw-JSON parsing that guesses defaults for required fields (for example defaulting missing `role` to `user`, lowercasing unknown roles, or inventing empty `content`).
- **Do not handwrite provider-format structs**: Do not manually define Rust structs/enums that represent provider wire formats when generated or canonical provider types already exist. Fix generation or add typed adapters around canonical types instead.
- **Do not inspect `serde_json::Value` directly for provider semantics**: Do not branch on provider-format fields via ad-hoc `Value` map access. Deserialize into typed provider or typed compatibility structs first, then convert.
- **Fix via types or explicit errors**: If fuzzing finds unsupported/ambiguous shapes, either model them explicitly in types/converters or return a clear error. Do not silently coerce invalid input into a "best effort" shape.
- **Typed-boundary CI gate**: CI enforces `make typed-boundary-check-branch BASE=origin/<base-branch>` on pull requests. Running `make typed-boundary-check` locally is recommended for faster feedback, but not required as a pre-commit hook.
- **Typed extras views over raw map access**: If provider extras must be read, deserialize extras into a typed view struct first; do not pluck fields ad-hoc with `map.get(...)`.
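
The "typed boundaries" and "explicit errors" rules above can be illustrated with a minimal, standard-library-only sketch. The `Role`, `RoleError`, and `parse_role` names are hypothetical and are not Lingua's actual types; the point is only that a missing or unknown role becomes an error instead of being defaulted or lowercased.

```rust
use std::str::FromStr;

/// Hypothetical role enum, for illustration only (not Lingua's real type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    System,
    User,
    Assistant,
    Tool,
}

/// Hypothetical error type: every rejected shape is named explicitly.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RoleError {
    /// The required `role` field was absent.
    Missing,
    /// The `role` field held a value that is not modeled.
    Unknown(String),
}

impl FromStr for Role {
    type Err = RoleError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Exact matches only: no lowercasing and no guessing.
        match s {
            "system" => Ok(Role::System),
            "user" => Ok(Role::User),
            "assistant" => Ok(Role::Assistant),
            "tool" => Ok(Role::Tool),
            other => Err(RoleError::Unknown(other.to_string())),
        }
    }
}

/// Parse an optional raw role string without inventing a default.
pub fn parse_role(raw: Option<&str>) -> Result<Role, RoleError> {
    raw.ok_or(RoleError::Missing)?.parse()
}
```

Under this sketch, `parse_role(None)` and `parse_role(Some("User"))` both return an `Err`, so the caller has to decide what to do rather than receiving a silently coerced `Role::User`.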
54 changes: 54 additions & 0 deletions TODO.md
@@ -0,0 +1,54 @@
# Import fixtures follow-up TODO

## Failing new cases

- [x] openai-responses-function-call-output-input
- [x] openai-responses-image-attachments
- [x] openai-responses-image-generation-call
- [x] openai-responses-mixed-input-order
- [x] openai-responses-mixed-output-order
- [x] openai-responses-real-world-tool-loop
- [x] openai-responses-reasoning-blocks
- [x] openai-responses-reasoning-only-output
- [x] openai-responses-web-search

## Work items

- [x] Loosen importer pre-check for Responses item arrays
  - Current gap: `has_message_structure` rejects arrays that do not have `role` or nested `message.role`.
  - Goal: allow Responses-only arrays (`reasoning`, `function_call_output`, `web_search_call`, etc.) to reach typed OpenAI conversion logic.
  - Acceptance: mixed/output-only Responses fixtures are not dropped at import pre-check.

- [x] Handle raw string span input as user messages
  - Current gap: string-valued `span.input` is ignored.
  - Goal: map string input to `Message::User` with string content.
  - Acceptance: string-input fixtures (for example image generation and web search) include the expected leading user message.

- [x] Expand lenient text-type parsing for content blocks
  - Current gap: lenient parser only accepts `type: "text"`.
  - Goal: also accept OpenAI Responses block types `input_text` and `output_text`.
  - Acceptance: fixtures containing these block types parse into expected user/assistant text messages.

- [x] Add typed compatibility for `callId` aliasing
  - Current gap: some fixtures use `callId` while generated OpenAI types expect `call_id`.
  - Goal: normalize or alias `callId` at import boundary before typed conversion.
  - Acceptance: tool call and tool result linkage is preserved for both `call_id` and `callId`.

- [x] Add typed compatibility for `function_call_result`
  - Current gap: fixtures include `type: "function_call_result"` which is not represented in generated enums.
  - Goal: normalize this to the canonical supported shape before conversion, without raw fallback parsing.
  - Acceptance: output/input-order and tool-loop fixtures parse tool result messages correctly.

- [x] Add typed compatibility for non-string tool output payloads
  - Current gap: fixtures include object-valued `output`, while generated OpenAI types model output as string.
  - Goal: normalize object payloads to canonical representation for strict typed conversion.
  - Acceptance: function/tool-result fixtures preserve structured output content in imported tool messages.

- [x] Decide and implement reasoning message aggregation behavior
  - Current gap: reasoning output items become standalone assistant messages; some fixtures expect reasoning merged with adjacent assistant text.
  - Goal: define canonical import behavior for reasoning-plus-message sequences and implement consistently.
  - Acceptance: `openai-responses-reasoning-blocks` and `openai-responses-reasoning-only-output` match expected message counts and roles.

- [x] Re-run and verify fixture suite after each fix
  - Command: `cargo test -p lingua --test import_fixtures -- --nocapture`
  - Process: update one behavior at a time and confirm no regressions in previously passing fixtures.
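
For the reasoning aggregation item above, one possible policy can be sketched with the standard library alone. This is an assumption for discussion, not the behavior this PR implements: reasoning items are folded into the next assistant text item, and trailing reasoning with no following text becomes a reasoning-only assistant message. `Item`, `AssistantMessage`, and `merge_reasoning` are hypothetical names.

```rust
/// Hypothetical, simplified stand-in for Responses output items.
#[derive(Debug, Clone, PartialEq)]
pub enum Item {
    Reasoning(String),
    AssistantText(String),
}

/// Hypothetical merged assistant message.
#[derive(Debug, Clone, PartialEq)]
pub struct AssistantMessage {
    /// Reasoning blocks that preceded the text, in order.
    pub reasoning: Vec<String>,
    /// The assistant text, or `None` for reasoning-only output.
    pub text: Option<String>,
}

/// Fold each run of reasoning items into the assistant text that follows it.
pub fn merge_reasoning(items: Vec<Item>) -> Vec<AssistantMessage> {
    let mut out = Vec::new();
    let mut pending: Vec<String> = Vec::new();

    for item in items {
        match item {
            // Buffer reasoning until we see the text it belongs to.
            Item::Reasoning(r) => pending.push(r),
            // Attach all buffered reasoning to this message and reset the buffer.
            Item::AssistantText(t) => out.push(AssistantMessage {
                reasoning: std::mem::take(&mut pending),
                text: Some(t),
            }),
        }
    }

    // Reasoning with no following text still yields one assistant message.
    if !pending.is_empty() {
        out.push(AssistantMessage {
            reasoning: pending,
            text: None,
        });
    }
    out
}
```

With this policy, a `[Reasoning, AssistantText]` sequence yields one message and a lone `[Reasoning]` also yields one message, which is the kind of message-count expectation the acceptance criterion refers to. Whether the fixtures expect exactly this grouping is not stated here.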
96 changes: 34 additions & 62 deletions crates/lingua/src/processing/import.rs
@@ -1,6 +1,7 @@
use crate::providers::anthropic::generated as anthropic;
use crate::providers::openai::convert::ChatCompletionRequestMessageExt;
use crate::providers::openai::generated as openai;
use crate::providers::openai::convert::{
try_parse_responses_items_for_import, ChatCompletionRequestMessageExt,
};
use crate::serde_json;
use crate::serde_json::Value;
use crate::universal::convert::TryFromLLM;
@@ -22,43 +23,37 @@ pub struct Span {
pub other: serde_json::Map<String, Value>,
}

/// Cheap check to see if a value looks like it might contain messages
/// Returns early to avoid expensive deserialization attempts on non-message data
fn has_message_structure(data: &Value) -> bool {
match data {
// Check if it's an array where ANY element has "role" field or is a choice object
Value::Array(arr) => {
if arr.is_empty() {
return false;
}
// Check if ANY element in the array looks like a message (not just the first)
// This handles mixed-type arrays from Responses API
for item in arr {
if let Value::Object(obj) = item {
// Direct message format: has "role" field
if obj.contains_key("role") {
/// Try to convert a value to lingua messages by attempting multiple format conversions
fn try_converting_to_messages(data: &Value) -> Vec<Message> {
if let Some(messages) = try_parse_responses_items_for_import(data) {
return messages;
}

// Cheap check to see if a value looks like it might contain messages.
// Returns early to avoid expensive deserialization attempts on non-message data.
let has_message_structure = match data {
// Check if it's an array where any element has "role" or nested "message.role".
Value::Array(arr) => arr.iter().any(|item| match item {
Value::Object(obj) => {
if obj.contains_key("role") {
return true;
}
if let Some(Value::Object(msg)) = obj.get("message") {
if msg.contains_key("role") {
return true;
}
// Chat completions response choices format: has "message" field with role inside
if let Some(Value::Object(msg)) = obj.get("message") {
if msg.contains_key("role") {
return true;
}
}
}
false
}
false
}
_ => false,
}),
// Check if it's an object with "role" field (single message)
Value::Object(obj) => obj.contains_key("role"),
_ => false,
}
}
};

/// Try to convert a value to lingua messages by attempting multiple format conversions
fn try_converting_to_messages(data: &Value) -> Vec<Message> {
// Early bailout: if data doesn't have message structure, skip expensive deserializations
if !has_message_structure(data) {
if !has_message_structure {
// Still try nested object search (for wrapped messages like {messages: [...]})
if let Value::Object(obj) = data {
for key in [
@@ -104,32 +99,6 @@ fn try_converting_to_messages(data: &Value) -> Vec<Message> {
}
}

// Try Responses API format
if let Ok(provider_messages) =
serde_json::from_value::<Vec<openai::InputItem>>(data_to_parse.clone())
{
if let Ok(messages) =
<Vec<Message> as TryFromLLM<Vec<openai::InputItem>>>::try_from(provider_messages)
{
if !messages.is_empty() {
return messages;
}
}
}

// Try Responses API output format
if let Ok(provider_messages) =
serde_json::from_value::<Vec<openai::OutputItem>>(data_to_parse.clone())
{
if let Ok(messages) =
<Vec<Message> as TryFromLLM<Vec<openai::OutputItem>>>::try_from(provider_messages)
{
if !messages.is_empty() {
return messages;
}
}
}

// Try Anthropic format (including role-based system/developer messages).
if let Some(anthropic_messages) = try_anthropic_or_system_messages(data_to_parse) {
if !anthropic_messages.is_empty() {
@@ -265,7 +234,7 @@ fn parse_user_content(value: &Value) -> Option<UserContent> {
for item in arr {
if let Some(obj) = item.as_object() {
if let Some(Value::String(text_type)) = obj.get("type") {
if text_type == "text" {
if matches!(text_type.as_str(), "text" | "input_text" | "output_text") {
if let Some(Value::String(text)) = obj.get("text") {
parts.push(UserContentPart::Text(TextContentPart {
text: text.clone(),
@@ -296,7 +265,7 @@ fn parse_assistant_content(value: &Value) -> Option<AssistantContent> {
for item in arr {
if let Some(obj) = item.as_object() {
if let Some(Value::String(text_type)) = obj.get("type") {
if text_type == "text" {
if matches!(text_type.as_str(), "text" | "input_text" | "output_text") {
if let Some(Value::String(text)) = obj.get("text") {
parts.push(crate::universal::AssistantContentPart::Text(
TextContentPart {
@@ -389,9 +358,8 @@ fn try_choices_array_parsing(data: &Value) -> Option<Vec<Message>> {
for item in arr {
let obj = item.as_object()?;

// Check if this looks like a choice object (has "message" or "finish_reason")
// Note: has_message_structure only checks the first element, so we need to validate
// each element here to ensure the entire array is a valid choices array
// Check if this looks like a choice object (has "message" or "finish_reason").
// We still validate each element here to ensure the entire array is a valid choices array.
if !obj.contains_key("message") && !obj.contains_key("finish_reason") {
return None; // Not a choices array
}
@@ -426,7 +394,11 @@ pub fn import_messages_from_spans(spans: Vec<Span>) -> Vec<Message> {

for span in spans {
// Try to extract messages from input
if let Some(input) = &span.input {
if let Some(Value::String(input_text)) = &span.input {
messages.push(Message::User {
content: UserContent::String(input_text.clone()),
});
} else if let Some(input) = &span.input {
let input_messages = try_converting_to_messages(input);
messages.extend(input_messages);
}