feat: add voxtral-server HTTP transcription server (OpenAI Whisper-compatible)#6

Open
kikduck wants to merge 2 commits into andrijdavid:main from kikduck:feat/http-server

kikduck commented Mar 8, 2026

Summary

Add a standalone HTTP server (voxtral-server) with an OpenAI Whisper-compatible API for audio transcription, similar to what whisper.cpp and llama.cpp offer.

The model is loaded once at startup and reused across requests. Inference is serialized via std::mutex (voxtral_context is not thread-safe). This avoids the overhead of loading the ~2.8 GB model for each transcription.
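The load-once, serialize-with-a-mutex pattern described above can be sketched as follows. This is a minimal illustration, not the code in src/server.cpp: `FakeContext`, `Server`, `handle_request`, and `run_concurrent_requests` are hypothetical names, and `FakeContext` only stands in for the real, non-thread-safe voxtral_context.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for voxtral_context: created once, not thread-safe.
struct FakeContext {
    int active = 0;      // callers currently inside transcribe()
    int max_active = 0;  // highest concurrency ever observed
    int calls = 0;       // total number of transcriptions

    std::string transcribe(const std::string& path) {
        ++active;
        if (active > max_active) max_active = active;
        ++calls;
        std::string text = "transcript of " + path;
        --active;
        return text;
    }
};

struct Server {
    FakeContext ctx;          // "loaded" once, reused by every request
    std::mutex inference_mu;  // guards every access to ctx

    // Each request handler takes the lock, so inference runs one at a time.
    std::string handle_request(const std::string& path) {
        std::lock_guard<std::mutex> lock(inference_mu);
        return ctx.transcribe(path);
    }
};

// Fires n requests from n threads and returns the highest concurrency
// that the context observed. With the mutex in place this is always 1.
int run_concurrent_requests(Server& server, int n) {
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([&server, i] {
            server.handle_request("audio" + std::to_string(i) + ".wav");
        });
    }
    for (auto& w : workers) w.join();
    return server.ctx.max_active;
}
```

The trade-off is throughput: concurrent clients queue behind the lock, which is acceptable when a single inference already saturates the device.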

New and modified files

  • src/server.cpp — HTTP server (~550 lines)
  • CMakeLists.txt — New voxtral-server target + FetchContent for cpp-httplib v0.20.0

API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | /health | Health check → {"status":"ok"} |
| GET | /v1/models | List loaded model |
| POST | /v1/audio/transcriptions | Transcribe audio (OpenAI Whisper-compatible) |

Transcription endpoint

Accepts two input methods:

  • Multipart file upload (curl -F "file=@audio.wav") — standard OpenAI Whisper format
  • JSON with base64 ({"audio_base64": "..."}) — convenient for programmatic clients

Returns {"text": "...", "duration": 2.301} (JSON) or plain text.
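For the JSON input path, the server has to turn the `audio_base64` string back into raw audio bytes. The sketch below shows one way to decode standard (RFC 4648) base64 without an extra dependency. The function name `base64_decode` and its signature are assumptions for illustration; the PR's actual decoder may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes standard base64 into `out`. Returns false on malformed input
// (characters outside the alphabet, or data after the padding).
bool base64_decode(const std::string& in, std::vector<uint8_t>& out) {
    auto value_of = [](unsigned char c) -> int {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= 'a' && c <= 'z') return c - 'a' + 26;
        if (c >= '0' && c <= '9') return c - '0' + 52;
        if (c == '+') return 62;
        if (c == '/') return 63;
        return -1;
    };

    out.clear();
    uint32_t buffer = 0;  // accumulates 6-bit groups
    int bits = 0;         // number of not-yet-emitted bits in buffer
    size_t i = 0;
    for (; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c == '=') break;  // start of padding
        const int v = value_of(c);
        if (v < 0) return false;
        buffer = (buffer << 6) | static_cast<uint32_t>(v);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<uint8_t>((buffer >> bits) & 0xFF));
        }
    }

    // Whatever remains must be at most two '=' padding characters.
    if (in.size() - i > 2) return false;
    for (; i < in.size(); ++i) {
        if (in[i] != '=') return false;
    }
    return true;
}
```

A handler would run this on the `audio_base64` field and reject the request with an error response when it returns false.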

Additional changes

Build

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release   # VOXTRAL_BUILD_SERVER=ON by default
cmake --build . -j$(nproc)
# Produces: voxtral, voxtral-server, voxtral-quantize

To disable: cmake .. -DVOXTRAL_BUILD_SERVER=OFF

Usage

./voxtral-server --model path/to/Q4_K_M.gguf --gpu auto --port 8090

# Test
curl http://localhost:8090/health
curl -X POST http://localhost:8090/v1/audio/transcriptions -F "file=@audio.wav"

Design decisions

  • No external dependencies besides cpp-httplib (header-only, fetched at build time)
  • RAII temp files for uploaded audio (auto-cleaned)
  • CORS enabled for browser clients
  • Signal handling (SIGINT/SIGTERM) for graceful shutdown
  • Configurable: host, port, threads, max-tokens, GPU backend, log level
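The RAII temp-file item in the list above can be illustrated with a small guard class. This is a sketch under assumptions: the class name `TempFile`, the helper `file_exists`, and the caller-supplied path are hypothetical, and the real server may generate unique names differently.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Writes uploaded bytes to `path` and deletes the file when the object
// goes out of scope, including on early returns and exceptions.
class TempFile {
public:
    TempFile(const std::string& path, const std::vector<char>& bytes)
        : path_(path) {
        std::ofstream out(path_, std::ios::binary);
        out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    }

    ~TempFile() { std::remove(path_.c_str()); }

    // Non-copyable: exactly one owner is responsible for the cleanup.
    TempFile(const TempFile&) = delete;
    TempFile& operator=(const TempFile&) = delete;

    const std::string& path() const { return path_; }

private:
    std::string path_;
};

// Helper used only to demonstrate that the file is gone afterwards.
bool file_exists(const std::string& path) {
    std::ifstream in(path);
    return in.good();
}
```

Tying the file's lifetime to a stack object means a request handler cannot forget the cleanup, whichever branch it exits through.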

kikduck added 2 commits March 8, 2026 13:28
clear_kv_cache() and kv_cache_shift_left() used memset/memmove (CPU ops)
on pointers returned by ggml_get_data(). When the KV cache is allocated
on a GPU backend (CUDA, Metal, Vulkan) via ggml_backend_alloc_ctx_tensors,
these pointers are device addresses -- accessing them from the CPU causes
an immediate SIGSEGV.

The encoder was unaffected because it does not use a KV cache
(non-autoregressive). The crash occurred systematically at the decoder
prefill step when calling clear_kv_cache().

Replace:
- clear_kv_cache: memset -> ggml_backend_tensor_memset
- kv_cache_shift_left: memmove/memset -> ggml_backend_tensor_get/set/memset

These ggml backend-agnostic APIs handle CPU and GPU transfers correctly.

Tested on RTX 5090 (Blackwell, SM 12.0) with CUDA 12.8.

Made-with: Cursor

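The shift-left fix in the commit message above goes through explicit get/set/memset transfers instead of raw pointer access. The sketch below mirrors that idea without linking against ggml: `DeviceBuffer` is a hypothetical stand-in for a backend tensor, and its `get`, `set`, and `fill` methods only imitate the roles of ggml_backend_tensor_get, ggml_backend_tensor_set, and ggml_backend_tensor_memset. The real KV cache layout (per layer, per head) is not modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a tensor in backend memory. Callers never see
// a raw pointer; every access is an explicit transfer.
class DeviceBuffer {
public:
    explicit DeviceBuffer(size_t size) : storage_(size, 0) {}
    size_t size() const { return storage_.size(); }

    // Device -> host copy.
    void get(void* dst, size_t offset, size_t size) const {
        assert(offset + size <= storage_.size());
        if (size > 0) std::memcpy(dst, storage_.data() + offset, size);
    }
    // Host -> device copy.
    void set(const void* src, size_t offset, size_t size) {
        assert(offset + size <= storage_.size());
        if (size > 0) std::memcpy(storage_.data() + offset, src, size);
    }
    // Device-side fill.
    void fill(uint8_t value, size_t offset, size_t size) {
        assert(offset + size <= storage_.size());
        if (size > 0) std::memset(storage_.data() + offset, value, size);
    }

private:
    std::vector<uint8_t> storage_;
};

// Drops the first shift_bytes of the cache, moves the rest to the front
// through a host staging buffer, and zeroes the freed tail.
void shift_left(DeviceBuffer& cache, size_t shift_bytes) {
    assert(shift_bytes <= cache.size());
    const size_t keep = cache.size() - shift_bytes;
    std::vector<uint8_t> staging(keep);
    cache.get(staging.data(), shift_bytes, keep);
    cache.set(staging.data(), 0, keep);
    cache.fill(0, keep, shift_bytes);
}
```

The staging copy costs a round trip through host memory, but it behaves the same whether the buffer lives on the CPU or on a GPU.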
Add a standalone HTTP server (voxtral-server) with an OpenAI
Whisper-compatible API for audio transcription.

Features:
- POST /v1/audio/transcriptions (multipart file upload + JSON base64)
- GET /health, GET /v1/models
- Model loaded once at startup, inference serialized via mutex
- CORS support for browser clients
- Temporary files auto-cleaned via RAII
- Signal handling for graceful shutdown
- cpp-httplib (header-only) fetched via CMake FetchContent

Also adds --stdin interactive mode to the CLI (voxtral), allowing
the model to stay loaded between transcriptions when reading audio
paths from stdin.

Build: cmake .. -DVOXTRAL_BUILD_SERVER=ON (default: ON)
Made-with: Cursor
kikduck mentioned this pull request Mar 11, 2026
@khimaros

cheers for this!

khimaros commented Mar 29, 2026

when working with a very long audio file, it only seems to convert the first chunk and then spins for a long time as if it's converting the rest, but doesn't generate tokens:

voxtral_I: audio loaded: 48984320 samples (3061.5 s)
voxtral_I: padded audio: 49047040 samples (left=40960, right=21760)
voxtral_I: mel spectrogram: 306544 frames
voxtral_I: encoder chunked: 306544 mel frames, 153270 alloc enc tokens, mel_stride=1500
voxtral_I: encoder chunk 0: mel[0..3000) enc_tokens=1500 skip=0 stride=1500 rope_offset=0
voxtral_I: encoder chunk 1: mel[1500..4500) enc_tokens=1500 skip=750 stride=750 rope_offset=750
voxtral_I: encoder chunk 2: mel[3000..6000) enc_tokens=1500 skip=750 stride=750 rope_offset=1500
voxtral_I: encoder chunk 3: mel[4500..7500) enc_tokens=1500 skip=750 stride=750 rope_offset=2250
voxtral_I: encoder chunk 4: mel[6000..9000) enc_tokens=1500 skip=750 stride=750 rope_offset=3000
voxtral_I: encoder chunk 5: mel[7500..10500) enc_tokens=1500 skip=750 stride=750 rope_offset=3750
voxtral_I: encoder chunk 6: mel[9000..12000) enc_tokens=1500 skip=750 stride=750 rope_offset=4500
voxtral_I: encoder chunk 7: mel[10500..13500) enc_tokens=1500 skip=750 stride=750 rope_offset=5250
voxtral_I: encoder chunk 8: mel[12000..15000) enc_tokens=1500 skip=750 stride=750 rope_offset=6000
voxtral_I: encoder chunk 9: mel[13500..16500) enc_tokens=1500 skip=750 stride=750 rope_offset=6750
voxtral_I: encoder chunk 10: mel[15000..18000) enc_tokens=1500 skip=750 stride=750 rope_offset=7500
voxtral_I: encoder chunk 11: mel[16500..19500) enc_tokens=1500 skip=750 stride=750 rope_offset=8250
voxtral_I: encoder chunk 12: mel[18000..21000) enc_tokens=1500 skip=750 stride=750 rope_offset=9000
voxtral_I: encoder chunk 13: mel[19500..22500) enc_tokens=1500 skip=750 stride=750 rope_offset=9750
voxtral_I: encoder chunk 14: mel[21000..24000) enc_tokens=1500 skip=750 stride=750 rope_offset=10500
voxtral_I: encoder chunk 15: mel[22500..25500) enc_tokens=1500 skip=750 stride=750 rope_offset=11250
voxtral_I: encoder chunk 16: mel[24000..27000) enc_tokens=1500 skip=750 stride=750 rope_offset=12000
voxtral_I: encoder chunk 17: mel[25500..28500) enc_tokens=1500 skip=750 stride=750 rope_offset=12750
voxtral_I: encoder chunk 18: mel[27000..30000) enc_tokens=1500 skip=750 stride=750 rope_offset=13500
voxtral_I: encoder chunk 19: mel[28500..31500) enc_tokens=1500 skip=750 stride=750 rope_offset=14250
voxtral_I: encoder chunk 20: mel[30000..33000) enc_tokens=1500 skip=750 stride=750 rope_offset=15000
voxtral_I: encoder chunk 21: mel[31500..34500) enc_tokens=1500 skip=750 stride=750 rope_offset=15750
voxtral_I: encoder chunk 22: mel[33000..36000) enc_tokens=1500 skip=750 stride=750 rope_offset=16500
voxtral_I: encoder chunk 23: mel[34500..37500) enc_tokens=1500 skip=750 stride=750 rope_offset=17250
voxtral_I: encoder chunk 24: mel[36000..39000) enc_tokens=1500 skip=750 stride=750 rope_offset=18000
voxtral_I: encoder chunk 25: mel[37500..40500) enc_tokens=1500 skip=750 stride=750 rope_offset=18750
voxtral_I: encoder chunk 26: mel[39000..42000) enc_tokens=1500 skip=750 stride=750 rope_offset=19500
voxtral_I: encoder chunk 27: mel[40500..43500) enc_tokens=1500 skip=750 stride=750 rope_offset=20250
voxtral_I: encoder chunk 28: mel[42000..45000) enc_tokens=1500 skip=750 stride=750 rope_offset=21000
voxtral_I: encoder chunk 29: mel[43500..46500) enc_tokens=1500 skip=750 stride=750 rope_offset=21750
voxtral_I: encoder chunk 30: mel[45000..48000) enc_tokens=1500 skip=750 stride=750 rope_offset=22500
voxtral_I: encoder chunk 31: mel[46500..49500) enc_tokens=1500 skip=750 stride=750 rope_offset=23250
voxtral_I: encoder chunk 32: mel[48000..51000) enc_tokens=1500 skip=750 stride=750 rope_offset=24000
voxtral_I: encoder chunk 33: mel[49500..52500) enc_tokens=1500 skip=750 stride=750 rope_offset=24750
voxtral_I: encoder chunk 34: mel[51000..54000) enc_tokens=1500 skip=750 stride=750 rope_offset=25500
voxtral_I: encoder chunk 35: mel[52500..55500) enc_tokens=1500 skip=750 stride=750 rope_offset=26250
voxtral_I: encoder chunk 36: mel[54000..57000) enc_tokens=1500 skip=750 stride=750 rope_offset=27000
voxtral_I: encoder chunk 37: mel[55500..58500) enc_tokens=1500 skip=750 stride=750 rope_offset=27750
voxtral_I: encoder chunk 38: mel[57000..60000) enc_tokens=1500 skip=750 stride=750 rope_offset=28500
voxtral_I: encoder chunk 39: mel[58500..61500) enc_tokens=1500 skip=750 stride=750 rope_offset=29250
voxtral_I: encoder chunk 40: mel[60000..63000) enc_tokens=1500 skip=750 stride=750 rope_offset=30000
voxtral_I: encoder chunk 41: mel[61500..64500) enc_tokens=1500 skip=750 stride=750 rope_offset=30750
voxtral_I: encoder chunk 42: mel[63000..66000) enc_tokens=1500 skip=750 stride=750 rope_offset=31500
voxtral_I: encoder chunk 43: mel[64500..67500) enc_tokens=1500 skip=750 stride=750 rope_offset=32250
voxtral_I: encoder chunk 44: mel[66000..69000) enc_tokens=1500 skip=750 stride=750 rope_offset=33000
voxtral_I: encoder chunk 45: mel[67500..70500) enc_tokens=1500 skip=750 stride=750 rope_offset=33750
voxtral_I: encoder chunk 46: mel[69000..72000) enc_tokens=1500 skip=750 stride=750 rope_offset=34500
voxtral_I: encoder chunk 47: mel[70500..73500) enc_tokens=1500 skip=750 stride=750 rope_offset=35250
voxtral_I: encoder chunk 48: mel[72000..75000) enc_tokens=1500 skip=750 stride=750 rope_offset=36000
voxtral_I: encoder chunk 49: mel[73500..76500) enc_tokens=1500 skip=750 stride=750 rope_offset=36750
voxtral_I: encoder chunk 50: mel[75000..78000) enc_tokens=1500 skip=750 stride=750 rope_offset=37500
voxtral_I: encoder chunk 51: mel[76500..79500) enc_tokens=1500 skip=750 stride=750 rope_offset=38250
voxtral_I: encoder chunk 52: mel[78000..81000) enc_tokens=1500 skip=750 stride=750 rope_offset=39000
voxtral_I: encoder chunk 53: mel[79500..82500) enc_tokens=1500 skip=750 stride=750 rope_offset=39750
voxtral_I: encoder chunk 54: mel[81000..84000) enc_tokens=1500 skip=750 stride=750 rope_offset=40500
voxtral_I: encoder chunk 55: mel[82500..85500) enc_tokens=1500 skip=750 stride=750 rope_offset=41250
voxtral_I: encoder chunk 56: mel[84000..87000) enc_tokens=1500 skip=750 stride=750 rope_offset=42000
voxtral_I: encoder chunk 57: mel[85500..88500) enc_tokens=1500 skip=750 stride=750 rope_offset=42750
voxtral_I: encoder chunk 58: mel[87000..90000) enc_tokens=1500 skip=750 stride=750 rope_offset=43500
voxtral_I: encoder chunk 59: mel[88500..91500) enc_tokens=1500 skip=750 stride=750 rope_offset=44250
voxtral_I: encoder chunk 60: mel[90000..93000) enc_tokens=1500 skip=750 stride=750 rope_offset=45000
voxtral_I: encoder chunk 61: mel[91500..94500) enc_tokens=1500 skip=750 stride=750 rope_offset=45750
voxtral_I: encoder chunk 62: mel[93000..96000) enc_tokens=1500 skip=750 stride=750 rope_offset=46500
voxtral_I: encoder chunk 63: mel[94500..97500) enc_tokens=1500 skip=750 stride=750 rope_offset=47250
voxtral_I: encoder chunk 64: mel[96000..99000) enc_tokens=1500 skip=750 stride=750 rope_offset=48000
voxtral_I: encoder chunk 65: mel[97500..100500) enc_tokens=1500 skip=750 stride=750 rope_offset=48750
voxtral_I: encoder chunk 66: mel[99000..102000) enc_tokens=1500 skip=750 stride=750 rope_offset=49500
voxtral_I: encoder chunk 67: mel[100500..103500) enc_tokens=1500 skip=750 stride=750 rope_offset=50250
voxtral_I: encoder chunk 68: mel[102000..105000) enc_tokens=1500 skip=750 stride=750 rope_offset=51000
voxtral_I: encoder chunk 69: mel[103500..106500) enc_tokens=1500 skip=750 stride=750 rope_offset=51750
voxtral_I: encoder chunk 70: mel[105000..108000) enc_tokens=1500 skip=750 stride=750 rope_offset=52500
voxtral_I: encoder chunk 71: mel[106500..109500) enc_tokens=1500 skip=750 stride=750 rope_offset=53250
voxtral_I: encoder chunk 72: mel[108000..111000) enc_tokens=1500 skip=750 stride=750 rope_offset=54000
voxtral_I: encoder chunk 73: mel[109500..112500) enc_tokens=1500 skip=750 stride=750 rope_offset=54750
voxtral_I: encoder chunk 74: mel[111000..114000) enc_tokens=1500 skip=750 stride=750 rope_offset=55500
voxtral_I: encoder chunk 75: mel[112500..115500) enc_tokens=1500 skip=750 stride=750 rope_offset=56250
voxtral_I: encoder chunk 76: mel[114000..117000) enc_tokens=1500 skip=750 stride=750 rope_offset=57000
voxtral_I: encoder chunk 77: mel[115500..118500) enc_tokens=1500 skip=750 stride=750 rope_offset=57750
voxtral_I: encoder chunk 78: mel[117000..120000) enc_tokens=1500 skip=750 stride=750 rope_offset=58500
voxtral_I: encoder chunk 79: mel[118500..121500) enc_tokens=1500 skip=750 stride=750 rope_offset=59250
voxtral_I: encoder chunk 80: mel[120000..123000) enc_tokens=1500 skip=750 stride=750 rope_offset=60000
voxtral_I: encoder chunk 81: mel[121500..124500) enc_tokens=1500 skip=750 stride=750 rope_offset=60750
voxtral_I: encoder chunk 82: mel[123000..126000) enc_tokens=1500 skip=750 stride=750 rope_offset=61500
voxtral_I: encoder chunk 83: mel[124500..127500) enc_tokens=1500 skip=750 stride=750 rope_offset=62250
voxtral_I: encoder chunk 84: mel[126000..129000) enc_tokens=1500 skip=750 stride=750 rope_offset=63000
voxtral_I: encoder chunk 85: mel[127500..130500) enc_tokens=1500 skip=750 stride=750 rope_offset=63750
voxtral_I: encoder chunk 86: mel[129000..132000) enc_tokens=1500 skip=750 stride=750 rope_offset=64500
voxtral_I: encoder chunk 87: mel[130500..133500) enc_tokens=1500 skip=750 stride=750 rope_offset=65250
voxtral_I: encoder chunk 88: mel[132000..135000) enc_tokens=1500 skip=750 stride=750 rope_offset=66000
voxtral_I: encoder chunk 89: mel[133500..136500) enc_tokens=1500 skip=750 stride=750 rope_offset=66750
voxtral_I: encoder chunk 90: mel[135000..138000) enc_tokens=1500 skip=750 stride=750 rope_offset=67500
voxtral_I: encoder chunk 91: mel[136500..139500) enc_tokens=1500 skip=750 stride=750 rope_offset=68250
voxtral_I: encoder chunk 92: mel[138000..141000) enc_tokens=1500 skip=750 stride=750 rope_offset=69000
voxtral_I: encoder chunk 93: mel[139500..142500) enc_tokens=1500 skip=750 stride=750 rope_offset=69750
voxtral_I: encoder chunk 94: mel[141000..144000) enc_tokens=1500 skip=750 stride=750 rope_offset=70500
voxtral_I: encoder chunk 95: mel[142500..145500) enc_tokens=1500 skip=750 stride=750 rope_offset=71250
voxtral_I: encoder chunk 96: mel[144000..147000) enc_tokens=1500 skip=750 stride=750 rope_offset=72000
voxtral_I: encoder chunk 97: mel[145500..148500) enc_tokens=1500 skip=750 stride=750 rope_offset=72750
voxtral_I: encoder chunk 98: mel[147000..150000) enc_tokens=1500 skip=750 stride=750 rope_offset=73500
voxtral_I: encoder chunk 99: mel[148500..151500) enc_tokens=1500 skip=750 stride=750 rope_offset=74250
voxtral_I: encoder chunk 100: mel[150000..153000) enc_tokens=1500 skip=750 stride=750 rope_offset=75000
voxtral_I: encoder chunk 101: mel[151500..154500) enc_tokens=1500 skip=750 stride=750 rope_offset=75750
voxtral_I: encoder chunk 102: mel[153000..156000) enc_tokens=1500 skip=750 stride=750 rope_offset=76500
voxtral_I: encoder chunk 103: mel[154500..157500) enc_tokens=1500 skip=750 stride=750 rope_offset=77250
voxtral_I: encoder chunk 104: mel[156000..159000) enc_tokens=1500 skip=750 stride=750 rope_offset=78000
voxtral_I: encoder chunk 105: mel[157500..160500) enc_tokens=1500 skip=750 stride=750 rope_offset=78750
voxtral_I: encoder chunk 106: mel[159000..162000) enc_tokens=1500 skip=750 stride=750 rope_offset=79500
voxtral_I: encoder chunk 107: mel[160500..163500) enc_tokens=1500 skip=750 stride=750 rope_offset=80250
voxtral_I: encoder chunk 108: mel[162000..165000) enc_tokens=1500 skip=750 stride=750 rope_offset=81000
voxtral_I: encoder chunk 109: mel[163500..166500) enc_tokens=1500 skip=750 stride=750 rope_offset=81750
voxtral_I: encoder chunk 110: mel[165000..168000) enc_tokens=1500 skip=750 stride=750 rope_offset=82500
voxtral_I: encoder chunk 111: mel[166500..169500) enc_tokens=1500 skip=750 stride=750 rope_offset=83250
voxtral_I: encoder chunk 112: mel[168000..171000) enc_tokens=1500 skip=750 stride=750 rope_offset=84000
voxtral_I: encoder chunk 113: mel[169500..172500) enc_tokens=1500 skip=750 stride=750 rope_offset=84750
voxtral_I: encoder chunk 114: mel[171000..174000) enc_tokens=1500 skip=750 stride=750 rope_offset=85500
voxtral_I: encoder chunk 115: mel[172500..175500) enc_tokens=1500 skip=750 stride=750 rope_offset=86250
voxtral_I: encoder chunk 116: mel[174000..177000) enc_tokens=1500 skip=750 stride=750 rope_offset=87000
voxtral_I: encoder chunk 117: mel[175500..178500) enc_tokens=1500 skip=750 stride=750 rope_offset=87750
voxtral_I: encoder chunk 118: mel[177000..180000) enc_tokens=1500 skip=750 stride=750 rope_offset=88500
voxtral_I: encoder chunk 119: mel[178500..181500) enc_tokens=1500 skip=750 stride=750 rope_offset=89250
voxtral_I: encoder chunk 120: mel[180000..183000) enc_tokens=1500 skip=750 stride=750 rope_offset=90000
voxtral_I: encoder chunk 121: mel[181500..184500) enc_tokens=1500 skip=750 stride=750 rope_offset=90750
voxtral_I: encoder chunk 122: mel[183000..186000) enc_tokens=1500 skip=750 stride=750 rope_offset=91500
voxtral_I: encoder chunk 123: mel[184500..187500) enc_tokens=1500 skip=750 stride=750 rope_offset=92250
voxtral_I: encoder chunk 124: mel[186000..189000) enc_tokens=1500 skip=750 stride=750 rope_offset=93000
voxtral_I: encoder chunk 125: mel[187500..190500) enc_tokens=1500 skip=750 stride=750 rope_offset=93750
voxtral_I: encoder chunk 126: mel[189000..192000) enc_tokens=1500 skip=750 stride=750 rope_offset=94500
voxtral_I: encoder chunk 127: mel[190500..193500) enc_tokens=1500 skip=750 stride=750 rope_offset=95250
voxtral_I: encoder chunk 128: mel[192000..195000) enc_tokens=1500 skip=750 stride=750 rope_offset=96000
voxtral_I: encoder chunk 129: mel[193500..196500) enc_tokens=1500 skip=750 stride=750 rope_offset=96750
voxtral_I: encoder chunk 130: mel[195000..198000) enc_tokens=1500 skip=750 stride=750 rope_offset=97500
voxtral_I: encoder chunk 131: mel[196500..199500) enc_tokens=1500 skip=750 stride=750 rope_offset=98250
voxtral_I: encoder chunk 132: mel[198000..201000) enc_tokens=1500 skip=750 stride=750 rope_offset=99000
voxtral_I: encoder chunk 133: mel[199500..202500) enc_tokens=1500 skip=750 stride=750 rope_offset=99750
voxtral_I: encoder chunk 134: mel[201000..204000) enc_tokens=1500 skip=750 stride=750 rope_offset=100500
voxtral_I: encoder chunk 135: mel[202500..205500) enc_tokens=1500 skip=750 stride=750 rope_offset=101250
voxtral_I: encoder chunk 136: mel[204000..207000) enc_tokens=1500 skip=750 stride=750 rope_offset=102000
voxtral_I: encoder chunk 137: mel[205500..208500) enc_tokens=1500 skip=750 stride=750 rope_offset=102750
voxtral_I: encoder chunk 138: mel[207000..210000) enc_tokens=1500 skip=750 stride=750 rope_offset=103500
voxtral_I: encoder chunk 139: mel[208500..211500) enc_tokens=1500 skip=750 stride=750 rope_offset=104250
voxtral_I: encoder chunk 140: mel[210000..213000) enc_tokens=1500 skip=750 stride=750 rope_offset=105000
voxtral_I: encoder chunk 141: mel[211500..214500) enc_tokens=1500 skip=750 stride=750 rope_offset=105750
voxtral_I: encoder chunk 142: mel[213000..216000) enc_tokens=1500 skip=750 stride=750 rope_offset=106500
voxtral_I: encoder chunk 143: mel[214500..217500) enc_tokens=1500 skip=750 stride=750 rope_offset=107250
voxtral_I: encoder chunk 144: mel[216000..219000) enc_tokens=1500 skip=750 stride=750 rope_offset=108000
voxtral_I: encoder chunk 145: mel[217500..220500) enc_tokens=1500 skip=750 stride=750 rope_offset=108750
voxtral_I: encoder chunk 146: mel[219000..222000) enc_tokens=1500 skip=750 stride=750 rope_offset=109500
voxtral_I: encoder chunk 147: mel[220500..223500) enc_tokens=1500 skip=750 stride=750 rope_offset=110250
voxtral_I: encoder chunk 148: mel[222000..225000) enc_tokens=1500 skip=750 stride=750 rope_offset=111000
voxtral_I: encoder chunk 149: mel[223500..226500) enc_tokens=1500 skip=750 stride=750 rope_offset=111750
voxtral_I: encoder chunk 150: mel[225000..228000) enc_tokens=1500 skip=750 stride=750 rope_offset=112500
voxtral_I: encoder chunk 151: mel[226500..229500) enc_tokens=1500 skip=750 stride=750 rope_offset=113250
voxtral_I: encoder chunk 152: mel[228000..231000) enc_tokens=1500 skip=750 stride=750 rope_offset=114000
voxtral_I: encoder chunk 153: mel[229500..232500) enc_tokens=1500 skip=750 stride=750 rope_offset=114750
voxtral_I: encoder chunk 154: mel[231000..234000) enc_tokens=1500 skip=750 stride=750 rope_offset=115500
voxtral_I: encoder chunk 155: mel[232500..235500) enc_tokens=1500 skip=750 stride=750 rope_offset=116250
voxtral_I: encoder chunk 156: mel[234000..237000) enc_tokens=1500 skip=750 stride=750 rope_offset=117000
voxtral_I: encoder chunk 157: mel[235500..238500) enc_tokens=1500 skip=750 stride=750 rope_offset=117750
voxtral_I: encoder chunk 158: mel[237000..240000) enc_tokens=1500 skip=750 stride=750 rope_offset=118500
voxtral_I: encoder chunk 159: mel[238500..241500) enc_tokens=1500 skip=750 stride=750 rope_offset=119250
voxtral_I: encoder chunk 160: mel[240000..243000) enc_tokens=1500 skip=750 stride=750 rope_offset=120000
voxtral_I: encoder chunk 161: mel[241500..244500) enc_tokens=1500 skip=750 stride=750 rope_offset=120750
voxtral_I: encoder chunk 162: mel[243000..246000) enc_tokens=1500 skip=750 stride=750 rope_offset=121500
voxtral_I: encoder chunk 163: mel[244500..247500) enc_tokens=1500 skip=750 stride=750 rope_offset=122250
voxtral_I: encoder chunk 164: mel[246000..249000) enc_tokens=1500 skip=750 stride=750 rope_offset=123000
voxtral_I: encoder chunk 165: mel[247500..250500) enc_tokens=1500 skip=750 stride=750 rope_offset=123750
voxtral_I: encoder chunk 166: mel[249000..252000) enc_tokens=1500 skip=750 stride=750 rope_offset=124500
voxtral_I: encoder chunk 167: mel[250500..253500) enc_tokens=1500 skip=750 stride=750 rope_offset=125250
voxtral_I: encoder chunk 168: mel[252000..255000) enc_tokens=1500 skip=750 stride=750 rope_offset=126000
voxtral_I: encoder chunk 169: mel[253500..256500) enc_tokens=1500 skip=750 stride=750 rope_offset=126750
voxtral_I: encoder chunk 170: mel[255000..258000) enc_tokens=1500 skip=750 stride=750 rope_offset=127500
voxtral_I: encoder chunk 171: mel[256500..259500) enc_tokens=1500 skip=750 stride=750 rope_offset=128250
voxtral_I: encoder chunk 172: mel[258000..261000) enc_tokens=1500 skip=750 stride=750 rope_offset=129000
voxtral_I: encoder chunk 173: mel[259500..262500) enc_tokens=1500 skip=750 stride=750 rope_offset=129750
voxtral_I: encoder chunk 174: mel[261000..264000) enc_tokens=1500 skip=750 stride=750 rope_offset=130500
voxtral_I: encoder chunk 175: mel[262500..265500) enc_tokens=1500 skip=750 stride=750 rope_offset=131250
voxtral_I: encoder chunk 176: mel[264000..267000) enc_tokens=1500 skip=750 stride=750 rope_offset=132000
voxtral_I: encoder chunk 177: mel[265500..268500) enc_tokens=1500 skip=750 stride=750 rope_offset=132750
voxtral_I: encoder chunk 178: mel[267000..270000) enc_tokens=1500 skip=750 stride=750 rope_offset=133500
voxtral_I: encoder chunk 179: mel[268500..271500) enc_tokens=1500 skip=750 stride=750 rope_offset=134250
voxtral_I: encoder chunk 180: mel[270000..273000) enc_tokens=1500 skip=750 stride=750 rope_offset=135000
voxtral_I: encoder chunk 181: mel[271500..274500) enc_tokens=1500 skip=750 stride=750 rope_offset=135750
voxtral_I: encoder chunk 182: mel[273000..276000) enc_tokens=1500 skip=750 stride=750 rope_offset=136500
voxtral_I: encoder chunk 183: mel[274500..277500) enc_tokens=1500 skip=750 stride=750 rope_offset=137250
voxtral_I: encoder chunk 184: mel[276000..279000) enc_tokens=1500 skip=750 stride=750 rope_offset=138000
voxtral_I: encoder chunk 185: mel[277500..280500) enc_tokens=1500 skip=750 stride=750 rope_offset=138750
voxtral_I: encoder chunk 186: mel[279000..282000) enc_tokens=1500 skip=750 stride=750 rope_offset=139500
voxtral_I: encoder chunk 187: mel[280500..283500) enc_tokens=1500 skip=750 stride=750 rope_offset=140250
voxtral_I: encoder chunk 188: mel[282000..285000) enc_tokens=1500 skip=750 stride=750 rope_offset=141000
voxtral_I: encoder chunk 189: mel[283500..286500) enc_tokens=1500 skip=750 stride=750 rope_offset=141750
voxtral_I: encoder chunk 190: mel[285000..288000) enc_tokens=1500 skip=750 stride=750 rope_offset=142500
voxtral_I: encoder chunk 191: mel[286500..289500) enc_tokens=1500 skip=750 stride=750 rope_offset=143250
voxtral_I: encoder chunk 192: mel[288000..291000) enc_tokens=1500 skip=750 stride=750 rope_offset=144000
voxtral_I: encoder chunk 193: mel[289500..292500) enc_tokens=1500 skip=750 stride=750 rope_offset=144750
voxtral_I: encoder chunk 194: mel[291000..294000) enc_tokens=1500 skip=750 stride=750 rope_offset=145500
voxtral_I: encoder chunk 195: mel[292500..295500) enc_tokens=1500 skip=750 stride=750 rope_offset=146250
voxtral_I: encoder chunk 196: mel[294000..297000) enc_tokens=1500 skip=750 stride=750 rope_offset=147000
voxtral_I: encoder chunk 197: mel[295500..298500) enc_tokens=1500 skip=750 stride=750 rope_offset=147750
voxtral_I: encoder chunk 198: mel[297000..300000) enc_tokens=1500 skip=750 stride=750 rope_offset=148500
voxtral_I: encoder chunk 199: mel[298500..301500) enc_tokens=1500 skip=750 stride=750 rope_offset=149250
voxtral_I: encoder chunk 200: mel[300000..303000) enc_tokens=1500 skip=750 stride=750 rope_offset=150000
voxtral_I: encoder chunk 201: mel[301500..304500) enc_tokens=1500 skip=750 stride=750 rope_offset=150750
voxtral_I: encoder chunk 202: mel[303000..306000) enc_tokens=1500 skip=750 stride=750 rope_offset=151500
voxtral_I: encoder chunk 203: mel[304500..306544) enc_tokens=1020 skip=750 stride=270 rope_offset=152250
voxtral_I: encoder done: 204 chunks, enc_seq_used=153268 (raw=153270)
voxtral_I: encoder time: 74228.4 ms
voxtral_I: running adapter: enc_seq=153268 -> dec_seq=38317
voxtral_I: adapter graph: size=2048 nodes=7
voxtral_I: adapter done: dec_seq_len=38317 (470.84 MB on device)
voxtral_I: adapter time: 169.6 ms
voxtral_I: prompt: 39 tokens, audio_tokens: 38317
voxtral_I: decoder prefill: 38 tokens
voxtral_I: decoder prefill graph: size=8192 nodes=1152
voxtral_I: decoder prefill done
voxtral_I: prefill time: 59.7 ms
voxtral_I: first token: 32
voxtral_I: early stop: 17 consecutive pad tokens after text
voxtral_I: decode time: 1773.6 ms (96 steps, 18.5 ms/step)
voxtral_I: generated 97 tokens

and the returned JSON:

{"text":" <one and a half sentences from the audio>","duration":80.427}

the part that it does transcribe is considerably more accurate than whisper.cpp, but unfortunately not usable in this form.

the duration doesn't match the audio duration.
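The numbers in the log above can be cross-checked with a little arithmetic. The sketch below assumes a 16 kHz sample rate (implied by 48984320 samples over 3061.5 s) and takes the window of 3000 mel frames and stride of 1500 from the chunk lines. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Number of overlapping windows needed to cover total_frames when each
// window spans `window` frames and consecutive windows start `stride` apart.
int64_t chunk_count(int64_t total_frames, int64_t window, int64_t stride) {
    assert(window > 0 && stride > 0);
    if (total_frames <= window) return 1;
    return (total_frames - window + stride - 1) / stride + 1;
}

// Audio length in seconds for a given sample count and sample rate.
double audio_seconds(int64_t samples, int sample_rate) {
    return static_cast<double>(samples) / sample_rate;
}

// Sum of the stage timings printed in the log, converted to seconds.
double logged_stage_seconds() {
    const double encoder_ms = 74228.4;
    const double adapter_ms = 169.6;
    const double prefill_ms = 59.7;
    const double decode_ms = 1773.6;
    return (encoder_ms + adapter_ms + prefill_ms + decode_ms) / 1000.0;
}
```

With these inputs the chunk count comes out at 204, matching "encoder done: 204 chunks". The audio length is about 3061.5 s, and the logged stages sum to roughly 76.2 s. The returned duration of 80.427 is therefore much closer to the processing time than to the audio length. Whether the field is meant to report processing time is not confirmed by anything in this thread.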
