(Apologies if this isn't the right forum to report ~production issues)
Our eval pipeline is regularly seeing 500 errors reported from the /logs3 endpoint when running evals:
log request failed. Elapsed time: 4.979 seconds. Payload size: 979.
Error: 500 (Internal Server Error): {"Code":"InternalServerError","InternalTraceId":"69a146240000000016ef77ef2c5f16ed","Path":"/logs3","Service":"api"}
Sleeping for 1s
log request failed. Elapsed time: 3.2 seconds. Payload size: 339.
Sleeping for 1s
Error: 500 (Internal Server Error): {"Code":"InternalServerError","InternalTraceId":"69a146270000000006d6dba09f562bc5","Path":"/logs3","Service":"api"}
log request failed. Elapsed time: 2.165 seconds. Payload size: 339.
Sleeping for 1s
Error: TypeError: fetch failed
log request failed. Elapsed time: 2.163 seconds. Payload size: 1322.
Sleeping for 1s
Error: TypeError: fetch failed
log request failed. Elapsed time: 2.16 seconds. Payload size: 339.
It looks like this is causing retries and adding latency. This seems to happen much more often when our evals are very fast (I'm experimenting with adding a caching layer to our LLM calls, which increases the rate of these errors). Could we be overloading a logs server?
Note: We're also experimenting with parallelizing logging, since we found that the switch to serialized logging in eval created a large bottleneck (making our evals take 300% longer). Details are in #1394, which @CLowbrow is looking at.
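For illustration, here is a minimal sketch of the client-side shape this could take: a bounded number of log requests in flight, each retried with exponential backoff and jitter instead of a fixed 1s sleep. All names and defaults here (`retryWithBackoff`, `mapWithConcurrency`, the attempt count and base delay) are hypothetical and not taken from the SDK; this is not how the SDK currently logs, just an assumption about what a gentler retry pattern might look like.

```typescript
// Hypothetical sketch: none of these names or defaults come from the SDK.
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry `send` up to `maxAttempts` times, sleeping a random duration
// between 0 and baseDelayMs * 2^attempt after each failure ("full jitter").
async function retryWithBackoff<T>(
  send: () => Promise<T>,
  maxAttempts: number = 5,
  baseDelayMs: number = 250,
  sleep: Sleep = defaultSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await send();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts - 1) break; // no sleep after the final attempt
      const cap = baseDelayMs * Math.pow(2, attempt);
      await sleep(Math.random() * cap);
    }
  }
  throw lastError;
}

// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in the same order as `items`.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  const runners: Promise<void>[] = [];
  for (let i = 0; i < runnerCount; i++) runners.push(runner());
  await Promise.all(runners);
  return results;
}
```

Usage would be something like `mapWithConcurrency(batches, 4, (b) => retryWithBackoff(() => postLogs(b)))`, where `postLogs` stands in for whatever actually sends a batch to /logs3. The jitter is meant to keep many fast evals from retrying in lockstep, which might matter if the errors really are load-related.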