Conversation
Force-pushed from dd33e0e to 29d05d5
```ruby
# Reconstruct message thread from LLM spans.
# Deduplicates input messages by content hash; always includes output messages.
# @return [Array<Hash>] Array of message hashes
def get_thread
```
I don't think we want to do it like this anymore. The way these are generated in the main SDK and in the data plane is by calling invoke on a preprocessor:
- https://github.com/braintrustdata/braintrust-sdk/blob/main/js/src/trace.ts#L381
- https://github.com/braintrustdata/braintrust/blob/main/api-ts/src/wrapper/trace.ts#L122
I don't know if we have support for these invoke calls in this or other non-main-repo SDKs, so we might want to skip this method for now?
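For reference, the deduplication behavior described in the quoted doc comment could be sketched like this. This is a hedged illustration, not the SDK's actual implementation: the span shape (`:input_messages` / `:output_messages` keys) and the SHA-256 content key are assumptions.

```ruby
require "digest"
require "json"

# Hypothetical sketch: merge input messages from several LLM spans,
# skipping duplicates by content hash, then append output messages.
def build_thread(spans)
  seen = {}
  thread = []
  spans.each do |span|
    Array(span[:input_messages]).each do |msg|
      key = Digest::SHA256.hexdigest(JSON.generate(msg))
      next if seen[key] # already included an identical message
      seen[key] = true
      thread << msg
    end
    # Output messages are always included, never deduplicated.
    thread.concat(Array(span[:output_messages]))
  end
  thread
end
```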
delner
left a comment
This adds significant complexity in that it:
- Adds caching to global state
- Adds multi-threading/synchronization behavior between workers
- Couples tracing to Evals and reaches deeply into internals in the process
Many of these things can significantly impact performance and stability, as well as our ability to scale the software without harming either. Before we merge, I would like to review the requirements, design goals, and patterns more thoroughly. Let's talk.
```ruby
# Create a TraceContext for scorers to access span data
# @param eval_span [OpenTelemetry::Trace::Span] The eval span
# @return [TraceContext]
def create_trace_context(eval_span)
```
I don't think this belongs here: `Eval::Runner` is concerned with running evals, not with building trace contexts.
```diff
  # @return [Float, Hash, Array] Score value(s)
- def call(input, expected, output, metadata = {})
+ def call(input, expected, output, metadata = {}, trace = nil)
    @wrapped_callable.call(input, expected, output, metadata)
```
Why does a scorer take a trace context?
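If the optional `trace` parameter does stay, one backward-compatible way to handle it is to forward `trace` only to callables that declare a fifth parameter, so existing four-argument scorers keep working. This is a sketch under that assumption, not the SDK's actual code:

```ruby
# Hypothetical wrapper: inspect the wrapped callable's parameter list and
# only pass `trace` through when the callable can accept it.
class Scorer
  def initialize(callable)
    @wrapped_callable = callable
  end

  def call(input, expected, output, metadata = {}, trace = nil)
    if @wrapped_callable.parameters.length >= 5
      @wrapped_callable.call(input, expected, output, metadata, trace)
    else
      @wrapped_callable.call(input, expected, output, metadata)
    end
  end
end
```

This keeps the public signature change additive: old scorers never see the new argument, and new scorers can opt in simply by declaring it.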
Force-pushed from 81e4512 to 267357e
delner
left a comment
Per our conversation yesterday with @CLowbrow, we should not use a span cache for the purposes of piping trace content between tracing and evals locally because:
- Spans may be mutated once they reach Braintrust (become stale locally)
- Distributed tracing means the set of spans will be incomplete (need to query the API anyways)
- Eval for the trace may not run locally at all (wasted memory)
- Performance impact of caching spans is potentially high
- This creates a hard coupling between tracing and evals (which are meant to be severable)
- The overall complexity of the feature makes it difficult to reason about, refactor, and may affect the stability of the SDK.
Per my analysis, we should:
- Remove `write_to_cache` from `SpanProcessor` entirely. The tracer generates and submits spans via OTel only (no local buffering for Evals).
- Remove `SpanRegistry` — it only exists to bridge Trace → Eval.
- `TraceContext.get_spans` should always go through BTQL. It can cache the BTQL response locally, but that cache is populated by API results, not by the tracer.
- `Eval::Context` should not own a `SpanCache` that the tracer writes to. If caching is needed, it should be scoped to `TraceContext` and populated from BTQL only.
- `eval.rb` should not import anything from `trace/`.
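The proposed shape — a span cache scoped to `TraceContext` and filled only by API results — might look roughly like this. The `btql_client` interface and the query string are assumptions for illustration, not the actual SDK API:

```ruby
# Hypothetical TraceContext: spans are always fetched via BTQL and
# memoized locally. The tracer never writes into this cache, so there
# is no Trace -> Eval coupling and no risk of serving locally stale spans
# that were mutated after reaching Braintrust.
class TraceContext
  def initialize(trace_id, btql_client)
    @trace_id = trace_id
    @btql_client = btql_client
  end

  # Cache is populated from the API response only (not by the tracer).
  def get_spans
    @spans ||= @btql_client.query(
      "select * from spans where trace_id = '#{@trace_id}'"
    )
  end
end
```

Because the memoized value is just the API response, repeated scorer calls within one eval reuse a single BTQL round trip without any global registry.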
```ruby
# Run scorers
# Create TraceContext for scorers (if scorers exist)
trace = scorers.empty? ? nil : create_trace_context(eval_span)
```
When would scorers not exist? (I would think that if you run an Eval you would want to score it?)
Might not be using local code scorers though, right? Could just be getting scored in our backend.
Force-pushed from 0f19d0b to a77a931
The only possibly controversial thing in here is exposing the BTQL API publicly when it only supports AST/object mode; there's no query builder yet.
Refs braintrustdata/braintrust-sdk#1060