Expose trace interface in scorers#99

Open
Qard wants to merge 3 commits into main from trace-in-scorer

Conversation


@Qard Qard commented Feb 10, 2026

The only possibly controversial thing in here is exposing the btql API publicly while it only supports AST/object mode; there's no builder yet.

Refs braintrustdata/braintrust-sdk#1060
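For readers unfamiliar with the distinction: "AST/object mode" means a caller passes the query as a plain data structure rather than composing it through a fluent builder. A hypothetical illustration (the exact query shape is assumed, not taken from the SDK):

```ruby
# Hypothetical illustration of AST/object mode: the query is a plain hash
# (no builder DSL yet), handed directly to the btql API as data.
query = {
  select: ["span_id", "name", "duration"],
  from: "spans",
  filter: { op: "eq", field: "trace_id", value: "abc123" },
  limit: 100
}

# A builder, if one were added later, might instead read something like:
#   Btql.from("spans").select(:span_id).where(trace_id: "abc123")
```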

@Qard Qard requested review from clutchski and delner February 10, 2026 18:59
@Qard Qard self-assigned this Feb 10, 2026
@Qard Qard added the enhancement New feature or request label Feb 10, 2026
@Qard Qard force-pushed the trace-in-scorer branch 3 times, most recently from dd33e0e to 29d05d5 on February 10, 2026 19:24
```ruby
# Reconstruct message thread from LLM spans.
# Deduplicates input messages by content hash, always includes output messages.
# @return [Array<Hash>] Array of message hashes
def get_thread
```
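A minimal sketch of the behavior the doc comment describes, with hypothetical helper and data shapes (spans assumed to carry `:input` and `:output` message arrays); this is an editorial illustration, not the PR's implementation:

```ruby
require "digest"
require "json"

# Hypothetical sketch: rebuild a message thread from LLM span payloads,
# deduplicating input messages by a content hash while always keeping outputs.
def build_thread(spans)
  seen = {}
  thread = []
  spans.each do |span|
    Array(span[:input]).each do |msg|
      key = Digest::SHA256.hexdigest(JSON.generate(msg))
      next if seen[key]  # skip input messages already present by content
      seen[key] = true
      thread << msg
    end
    # Output messages are always included, even if their content repeats.
    thread.concat(Array(span[:output]))
  end
  thread
end
```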


I don't think we want to do it like this anymore. The way these are generated in the main SDK and in the data plane is by calling invoke on a preprocessor:

I don't know if we have support for these invoke calls in this or other non-main repo SDKs so we might want to skip this method for now?

@delner delner (Collaborator) left a comment


This adds significant complexity in that it:

  1. Adds caching to global state
  2. Adds multi-threading/synchronization behavior between workers
  3. Couples tracing to Evals and reaches deeply into internals in the process

Many of these have the potential for significant impact on performance and stability, as well as on our ability to scale the software without harming those two. Before we merge, I would like to review the requirements, design goals, and patterns more thoroughly. Let's talk.

```ruby
# Create a TraceContext for scorers to access span data
# @param eval_span [OpenTelemetry::Trace::Span] The eval span
# @return [TraceContext]
def create_trace_context(eval_span)
```
Collaborator


I don't think this belongs here: Eval::Runner's concern is running evals, not building trace contexts.
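One possible shape for the refactor the reviewer is asking for, as a hedged sketch: move trace-context construction into its own builder so the runner only orchestrates eval execution. All names here (`Trace::ContextBuilder`, the `TraceContext` struct) are hypothetical; the span is assumed to expose an OTel-style `#context` with `#hex_trace_id` and `#hex_span_id`:

```ruby
# Hypothetical sketch: a TraceContext value object plus a builder that owns
# the construction logic, keeping it out of Eval::Runner.
TraceContext = Struct.new(:trace_id, :root_span_id, keyword_init: true)

module Trace
  class ContextBuilder
    # `eval_span` is assumed to respond to #context (OTel span API),
    # which exposes #hex_trace_id and #hex_span_id.
    def self.from_span(eval_span)
      TraceContext.new(
        trace_id: eval_span.context.hex_trace_id,
        root_span_id: eval_span.context.hex_span_id
      )
    end
  end
end
```

The runner would then call `Trace::ContextBuilder.from_span(eval_span)` instead of owning the construction itself.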

```diff
 # @return [Float, Hash, Array] Score value(s)
-def call(input, expected, output, metadata = {})
+def call(input, expected, output, metadata = {}, trace = nil)
   @wrapped_callable.call(input, expected, output, metadata)
```
Collaborator


Why does a scorer take a trace context?
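Going by the PR title, the intent appears to be letting scorers inspect span data from the eval's trace. A hypothetical example of what the extra argument enables (the scorer name and the `get_spans` span shape are assumptions, not the SDK's API):

```ruby
# Hypothetical sketch: a scorer that uses the optional trace context to
# score on span data (here, total span duration) rather than only on output.
def latency_scorer(input, expected, output, metadata = {}, trace = nil)
  return 0.0 if trace.nil?             # older call sites pass no trace
  spans = trace.get_spans              # assumed TraceContext method
  total_ms = spans.sum { |s| s[:duration_ms].to_f }
  total_ms < 1000 ? 1.0 : 0.0          # pass if the whole trace ran under 1s
end
```

Since `trace` defaults to `nil`, existing four-argument scorers keep working unchanged.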

@delner delner (Collaborator) left a comment


Per our conversation yesterday with @CLowbrow, we should not use a span cache to pipe trace content between tracing and evals locally, because:

  1. Spans may be mutated once they reach Braintrust (become stale locally)
  2. Distributed tracing means the set of spans will be incomplete (need to query the API anyways)
  3. Eval for the trace may not run locally at all (wasted memory)
  4. Performance impact of caching spans is potentially high
  5. This creates a hard coupling between tracing and evals (which are meant to be severable)
  6. The overall complexity of the feature makes it difficult to reason about and refactor, and may affect the stability of the SDK.

Per my analysis, we should:

  1. Remove write_to_cache from SpanProcessor entirely. The tracer generates and submits spans via OTel only (no local buffering for Evals)
  2. Remove SpanRegistry — it only exists to bridge Trace → Eval.
  3. TraceContext.get_spans should always go through BTQL. It can cache the BTQL response locally, but that cache is populated by API results, not by the tracer.
  4. Eval::Context should not own a SpanCache that the tracer writes to. If caching is needed, it should be scoped to TraceContext and populated from BTQL only.
  5. eval.rb should not import anything from trace/.
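Item 3 above could be sketched roughly as follows, assuming a hypothetical BTQL client that responds to `#query(ast)`; this is an illustration of the caching direction being proposed, not the SDK's implementation:

```ruby
# Hypothetical sketch: a TraceContext whose span cache is populated only
# from BTQL query results, never written to by the tracer.
class TraceContext
  def initialize(trace_id:, btql_client:)
    @trace_id = trace_id
    @btql_client = btql_client  # assumed to respond to #query(ast)
    @cached_spans = nil
  end

  def get_spans
    # The cache is filled from the API response on first access;
    # local span processors never write into it.
    @cached_spans ||= @btql_client.query(
      { select: :spans, filter: { trace_id: @trace_id } }
    )
  end
end
```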


```ruby
# Run scorers
# Create TraceContext for scorers (if scorers exist)
trace = scorers.empty? ? nil : create_trace_context(eval_span)
```
Collaborator


When would scorers not exist? (I would think that if you run an Eval you would want to score it?)

@Qard (Author)


Might not be using local code scorers though, right? Could just be getting scored in our backend.

