@hyperc54 hyperc54 commented Oct 28, 2025

This PR fixes the issue highlighted in a different PR by @avan06 (thanks!).

The original lines

    n_output_frames_original = int(np.floor(audio_original_length * (ANNOTATIONS_FPS / AUDIO_SAMPLE_RATE)))
    return unwrapped_output[:n_output_frames_original, :]  # trim to original audio length

trim slightly more output frames than they should.

The root cause is that ANNOTATIONS_FPS, which represents the expected number of output frames per second, is rounded down to the nearest integer, dropping the few samples of the last incomplete frame. As a result, audio_original_length * (ANNOTATIONS_FPS / AUDIO_SAMPLE_RATE) underestimates the number of frames needed in the output. The error per frame is small, but it compounds to several seconds of audio when analysing files longer than 10 minutes.
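To make the compounding concrete, here is a minimal sketch of the two frame-count computations side by side. The constant values and the hop-size name FFT_HOP are assumptions chosen for illustration (they are not stated in this PR), and the "exact" variant is just one way to avoid the rounding, not necessarily the change made in this PR.

```python
import math

# Assumed, illustrative values; not taken from this PR.
AUDIO_SAMPLE_RATE = 22050  # audio samples per second
FFT_HOP = 256              # audio samples between consecutive output frames
# Integer frame rate: the fractional part of 22050 / 256 is discarded.
ANNOTATIONS_FPS = AUDIO_SAMPLE_RATE // FFT_HOP


def n_frames_rounded_fps(n_samples: int) -> int:
    """Frame count derived from the rounded-down frames-per-second value."""
    return int(math.floor(n_samples * (ANNOTATIONS_FPS / AUDIO_SAMPLE_RATE)))


def n_frames_exact(n_samples: int) -> int:
    """Frame count derived directly from the hop size (no rounded FPS)."""
    return n_samples // FFT_HOP


def missing_seconds(n_samples: int) -> float:
    """Duration of audio covered by the frames that the rounded FPS drops."""
    missing = n_frames_exact(n_samples) - n_frames_rounded_fps(n_samples)
    return missing * FFT_HOP / AUDIO_SAMPLE_RATE


if __name__ == "__main__":
    for minutes in (1, 10, 30):
        n_samples = minutes * 60 * AUDIO_SAMPLE_RATE
        print(minutes, "min:", round(missing_seconds(n_samples), 3), "s trimmed too early")
```

Under these assumed constants the shortfall grows linearly with the audio length, which is why it goes unnoticed on short clips.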

The PR adds a test to make sure that the timestamp of the last frame, in seconds (as computed downstream in the pipeline), is close to the actual audio length. Additionally, it would be nice to add a test for the model_frames_to_time method.
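The kind of check described above could look roughly like the sketch below. The helper last_frame_time_s is a hypothetical stand-in for the downstream frame-to-time conversion, and the constants and tolerance are assumed values for illustration only; the actual test in this PR may differ.

```python
# Assumed, illustrative values; not taken from this PR.
AUDIO_SAMPLE_RATE = 22050  # audio samples per second
FFT_HOP = 256              # audio samples between consecutive output frames


def last_frame_time_s(n_frames: int) -> float:
    """Hypothetical stand-in for the downstream frame-to-time conversion."""
    return n_frames * FFT_HOP / AUDIO_SAMPLE_RATE


def last_frame_is_close_to_audio_end(
    n_samples: int, n_frames: int, tol_s: float = 0.05
) -> bool:
    """True if the last output frame ends within tol_s seconds of the audio end."""
    audio_length_s = n_samples / AUDIO_SAMPLE_RATE
    return abs(audio_length_s - last_frame_time_s(n_frames)) <= tol_s
```

With a 10-minute input, a frame count computed from the hop size passes this check, while one computed from an integer frame rate of 86 falls short by most of a second and fails it.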

@hyperc54 hyperc54 added the bug Something isn't working label Oct 28, 2025

avan06 commented Oct 29, 2025

Thank you @hyperc54 and the team for identifying the root cause and fixing it!

@hyperc54 hyperc54 merged commit e989e40 into spotify:main Nov 4, 2025
10 checks passed
3 participants