The Transcriber

Transcriber & translator for audio files. Like Otter.ai, but open-source and almost free.

Otter.ai

An Otter.ai monthly subscription costs $16.99 per user.
You get:

1200 monthly transcription minutes; 90 minutes per conversation

The Transcriber app

Transcription: Replicate cloud hosting for AI models. At current prices and with the models used, 1200 minutes cost approximately $1.60 - $5.50.
At least three times cheaper with the same or even better quality of transcription, in my opinion.
And you pay as you go.
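
As a rough check of the "three times cheaper" claim, the sketch below divides the Otter.ai price by the low and high ends of the Replicate estimate. All figures are the ones quoted above; the function and constant names are illustrative only.

```python
OTTER_MONTHLY_USD = 16.99           # Otter.ai price per user, as quoted above
REPLICATE_RANGE_USD = (1.60, 5.50)  # estimated cost of 1200 minutes on Replicate
MINUTES = 1200


def savings_factor(otter_usd: float, replicate_usd: float) -> float:
    """How many times cheaper the Replicate estimate is than Otter.ai."""
    return otter_usd / replicate_usd


low, high = REPLICATE_RANGE_USD
worst_case = savings_factor(OTTER_MONTHLY_USD, high)  # most expensive model
best_case = savings_factor(OTTER_MONTHLY_USD, low)    # cheapest model

print(f"Otter.ai:  ${OTTER_MONTHLY_USD / MINUTES:.4f} per minute")
print(f"Replicate: ${low / MINUTES:.4f} - ${high / MINUTES:.4f} per minute")
print(f"Savings:   {worst_case:.1f}x to {best_case:.1f}x")
```

The comparison covers transcription cost only and ignores hosting, which is listed separately below.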

Translation (summarization is deprecated¹; use vasiliadi/ai-summarizer-telegram-bot): Gemini Pro/Flash is free if you use the Gemini API from a project that has billing disabled, without the benefits available in the paid plan.

Hosting: Free tiers or trials from Render, Google Cloud, Oracle Cloud, AWS, Azure, IBM Cloud, or low-cost DigitalOcean, or any provider you like.

Total:²
Pay-as-you-go cost for 10 hours of audio:
Replicate with whisper-diarization + free Gemini API + DigitalOcean = $2.00 + $0.00 + $0.10 = $2.10
Replicate with incredibly-fast-whisper + free Gemini API + DigitalOcean = $0.70 + $0.00 + $0.10 = $0.80
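
The totals above are plain sums of three components. A minimal sketch that reproduces them, and derives a per-hour figure, might look like this; the `Estimate` class is a hypothetical helper, not part of the app.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """Cost components in USD for 10 hours of audio, as listed above."""
    model: str
    replicate: float
    gemini: float
    hosting: float

    @property
    def total(self) -> float:
        # Round to cents to avoid floating-point noise such as 0.7999999.
        return round(self.replicate + self.gemini + self.hosting, 2)


ESTIMATES = [
    Estimate("whisper-diarization", replicate=2.00, gemini=0.00, hosting=0.10),
    Estimate("incredibly-fast-whisper", replicate=0.70, gemini=0.00, hosting=0.10),
]

for e in ESTIMATES:
    print(f"{e.model}: ${e.total:.2f} per 10 hours, ${e.total / 10:.3f} per hour")
```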

Note

Prices are subject to change without notice.

How to start

AI agent setup (one-liner):

Read https://raw.githubusercontent.com/vasiliadi/transcriber/main/llms.txt and follow the instructions to run transcriber locally.

Docker Compose:

Install Docker Desktop first if Docker is not already installed.

docker compose up --build

The app will be available at http://localhost:80. Secrets are loaded from src/.streamlit/secrets.toml if it exists (see Config).

To stop:

docker compose down

Docker (pre-built image from Docker Hub):

docker run -p 8080:8080 \
  -e REPLICATE_API_TOKEN=your_token \
  -e GEMINI_API_KEY=your_key \
  -e HF_ACCESS_TOKEN=your_token \
  vasiliadi/transcriber:latest

The app will be available at http://localhost:8080.

To stop: press Ctrl+C.

Manual setup:

Install pixi:

# macOS (Homebrew)
brew install pixi

# or via install script
curl -fsSL https://pixi.sh/install.sh | sh

Clone the repo, copy .env.example to .streamlit/secrets.toml, and fill in your API keys (see Config), then run:

pixi run start

Technical details

Running the Whisper model on Replicate is much cheaper than using the OpenAI API for Whisper.

I use four models:

vaibhavs10/incredibly-fast-whisper best for speed
thomasmol/whisper-diarization best for dialogs
openai/gpt-4o-transcribe best for accuracy
victor-upmeet/whisperx best overall

Comparison of the same 45-minute audio file (6 speakers) by model (example)

Limitations

OpenAI Whisper model

OpenAI Speech to text Whisper model

File uploads are currently limited to 25 MB.

To avoid this limitation, I compress the audio before uploading. The models I use apply their own compression too, but in practice I have still hit the limit when relying on a model's built-in compression. Without compression, 45 minutes of audio is 63 MB; after compression, the same audio is 4 MB. Compression therefore avoids splitting audio into chunks and raises the limit to approximately 3 hours and 45 minutes of audio without losing transcription quality.
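
The effect of compression on the maximum duration can be estimated from the file sizes above. This sketch assumes file size grows linearly with duration (a constant bitrate), which is an assumption rather than something measured here; the function names are illustrative.

```python
UPLOAD_LIMIT_MB = 25   # upload limit quoted above
SAMPLE_MINUTES = 45    # duration of the test recording


def mb_per_minute(size_mb: float, minutes: float) -> float:
    """Average file size per minute of audio."""
    return size_mb / minutes


def max_minutes(limit_mb: float, size_mb: float, minutes: float) -> float:
    """Longest recording that stays under limit_mb at the same size-per-minute rate."""
    return limit_mb / mb_per_minute(size_mb, minutes)


uncompressed = max_minutes(UPLOAD_LIMIT_MB, 63, SAMPLE_MINUTES)
compressed = max_minutes(UPLOAD_LIMIT_MB, 4, SAMPLE_MINUTES)

print(f"Uncompressed: about {uncompressed:.0f} minutes fit in {UPLOAD_LIMIT_MB} MB")
print(f"Compressed:   about {compressed / 60:.1f} hours fit in {UPLOAD_LIMIT_MB} MB")
```

The linear estimate comes out higher than the roughly 3 hours 45 minutes quoted above, so treat it as an upper bound rather than a guarantee.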

If you still need to transcribe more, you can split the file using pydub's silence.split_on_silence(), silence.detect_silence(), or silence.detect_nonsilent(). Their speed depends on your hardware, but they run about 10 times faster than playing the entire file in real time.

In my tests, I ran into three main problems:

These functions do not work as I expect.
If you split only by time, you can cut in the middle of a word.
Post-processing becomes a challenge. It's hard to identify the speaker smoothly, and timestamps may be lost.

All this applies only to very long audio.
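
One way to soften the second and third problems is to cut only inside detected silences and keep each chunk's start offset so timestamps can be restored afterwards. The sketch below is not the app's code: it takes silent ranges as (start_ms, end_ms) pairs, the shape pydub's silence.detect_silence() returns, and the function names and sample data are hypothetical.

```python
from typing import List, Sequence, Tuple


def choose_cut_points(
    silences: Sequence[Tuple[int, int]],
    total_ms: int,
    max_chunk_ms: int,
) -> List[int]:
    """Pick cut points (ms) inside silent ranges so no chunk exceeds max_chunk_ms.

    Falls back to a hard cut at max_chunk_ms when no silence is available.
    """
    cuts: List[int] = []
    chunk_start = 0
    while total_ms - chunk_start > max_chunk_ms:
        limit = chunk_start + max_chunk_ms
        # Midpoints of silent ranges that fall inside the current window.
        candidates = [
            (start + end) // 2
            for start, end in silences
            if chunk_start < (start + end) // 2 <= limit
        ]
        cut = max(candidates) if candidates else limit
        cuts.append(cut)
        chunk_start = cut
    return cuts


def chunk_ranges(cuts: Sequence[int], total_ms: int) -> List[Tuple[int, int]]:
    """Turn cut points into (start_ms, end_ms) ranges; start_ms is the timestamp offset."""
    edges = [0, *cuts, total_ms]
    return list(zip(edges[:-1], edges[1:]))


# Hypothetical silence ranges for a 60-second clip, cut into chunks of at most 25 s.
silences = [(9_000, 10_000), (21_000, 23_000), (40_000, 41_000), (58_000, 59_000)]
cuts = choose_cut_points(silences, total_ms=60_000, max_chunk_ms=25_000)
print(chunk_ranges(cuts, 60_000))
```

Adding each chunk's start_ms to the timestamps returned for that chunk restores positions relative to the original file. Matching speakers across chunks is not addressed here.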

Gemini Pro/Flash (no longer current³)

Gemini Pro/Flash model names and properties

Max output tokens: 8,192

At roughly 0.75 words per token, 8,192 tokens is about 6,144 words, or about 35 minutes of speech. For non-English languages, most words are counted as two or more tokens, so the limit is reached sooner.

Because of this output limit, audio post-processing (correction and translation) works in a single pass only for files up to approximately 35 minutes long. Other models have a maximum output of 4,096 tokens or fewer. If you need to process more than 8,192 tokens, you can do it in batches, but this significantly increases the processing time.

Translation in chunks still works, but the quality is a little lower.
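
The token arithmetic above can be turned into a quick estimate of how many batches a transcript needs. This is a minimal sketch that assumes the output has about as many words as the input, which is only approximately true for translation; the function names are illustrative.

```python
import math

MAX_OUTPUT_TOKENS = 8_192  # output limit quoted above
WORDS_PER_TOKEN = 0.75     # rule of thumb for English quoted above


def max_words(tokens: int = MAX_OUTPUT_TOKENS, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Approximate number of words that fit in the output limit."""
    return int(tokens * words_per_token)


def batches_needed(word_count: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Number of requests needed so each response stays under the output limit."""
    return max(1, math.ceil(word_count / max_words(words_per_token=words_per_token)))


print(max_words())                                  # 6144
print(batches_needed(5_000))                        # 1
print(batches_needed(15_000))                       # 3
# Hypothetical ratio for a language where most words take two tokens.
print(batches_needed(15_000, words_per_token=0.5))  # 4
```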

Max audio length: approximately 8.4 hours

~~It still works well for summarization.~~

Rate limits: 2 queries per minute and 1,000 per day for Gemini-1.5-pro; 15 per minute and 1,500 per day for Gemini-1.5-flash.
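
When requests are sent in batches, a per-minute limit like this can be respected by spacing calls evenly. The class below is a hypothetical stdlib-only sketch, not the app's implementation, and it does not track the daily quota.

```python
import time
from typing import Callable


class MinIntervalLimiter:
    """Spaces calls so that at most `per_minute` requests start per minute."""

    def __init__(
        self,
        per_minute: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next request may start; return seconds waited."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return delay


# Limits quoted above: 2 requests per minute for Pro, 15 for Flash.
pro_limiter = MinIntervalLimiter(per_minute=2)
flash_limiter = MinIntervalLimiter(per_minute=15)
print(pro_limiter.interval, flash_limiter.interval)  # 30.0 4.0
```

Calling `pro_limiter.wait()` before each request keeps consecutive requests at least 30 seconds apart.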

Language support for translation.

Optional settings

HuggingFace.co

For diarization, all models rely on pyannote.audio solutions. As a user, you must agree to the terms for accessing the models offered by pyannote. Therefore, it is necessary to accept the terms for pyannote/segmentation-3.0 and pyannote/speaker-diarization-3.1 and obtain a HuggingFace API token.

The thomasmol/whisper-diarization model also uses the same models for diarization, but the developer uses his own HuggingFace API token. This means that an additional token is not required.
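
The rule above, together with the comment on HF_ACCESS_TOKEN in the Config example, can be written as a small lookup. The set contents follow this README; the function name is hypothetical.

```python
# Models that need the caller's own HuggingFace token when diarization is on,
# per the notes in this README.
NEEDS_OWN_HF_TOKEN = {
    "vaibhavs10/incredibly-fast-whisper",
    "victor-upmeet/whisperx",
}


def hf_token_required(model: str, diarization: bool) -> bool:
    """Return True when HF_ACCESS_TOKEN must be a real token for this run."""
    return diarization and model in NEEDS_OWN_HF_TOKEN


print(hf_token_required("victor-upmeet/whisperx", diarization=True))
print(hf_token_required("thomasmol/whisper-diarization", diarization=True))
```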

Text to Speech (deprecated¹)

By default, I use the ElevenLabs eleven_turbo_v2_5 model to generate high-quality audio for summaries in various languages. It's very fast and 50% cheaper than the eleven_multilingual_v2 model. You get 10,000 credits per month for free, which is enough for about 15 generated audio files. If you need more, you'll need to purchase a plan or use OpenAI TTS.

OpenAI TTS is a pay-as-you-go service that costs $0.015 / 1K characters.
OpenAI's input is limited to a maximum of 4096 characters. To overcome this limitation, I split the text into chunks using semantic_text_splitter and pydub.
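
To illustrate the chunking step, here is a minimal stdlib sketch that splits text at sentence boundaries so each chunk stays within the 4096-character limit, plus a cost estimate from the price above. It swaps in a simple regular expression in place of semantic_text_splitter, so it is not the app's splitter, and joining the generated audio is not shown.

```python
import re
from typing import List

MAX_CHARS = 4096          # OpenAI TTS input limit quoted above
USD_PER_1K_CHARS = 0.015  # OpenAI TTS price quoted above


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split text into chunks of at most max_chars, breaking after sentences.

    A single sentence longer than max_chars is cut at the limit as a last resort.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        while len(sentence) > max_chars:  # oversized sentence: hard cut
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def tts_cost_usd(text: str) -> float:
    """Estimated OpenAI TTS cost for the whole text."""
    return len(text) / 1000 * USD_PER_1K_CHARS


print(split_for_tts("One. Two! Three?", max_chars=9))
```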

Additionally, the xtts-v2 model is another high-quality multilingual model, but Coqui, the developer of this model, is shutting down. As a result, I use ElevenLabs or OpenAI.

Config

Example of .env file:

GEMINI_API_KEY="your_api_key"
REPLICATE_API_TOKEN="your_api_key"
HF_ACCESS_TOKEN="your_api_key" # only for incredibly-fast-whisper and whisperx models with enabled diarization
PROXY="" # only if you need to use proxy

All keys are mandatory, but you can fill some of them with placeholder or incorrect values to complete the setup. Using features that require a specific key with an incorrect value will result in an error.
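
A quick pre-flight check can tell missing keys apart from keys that still hold a placeholder. This is a sketch under assumptions: it treats the three API keys as required and PROXY as optional, and the placeholder strings are taken from the examples in this README. The helper names are hypothetical.

```python
import os
from typing import List, Mapping

REQUIRED_KEYS = ("GEMINI_API_KEY", "REPLICATE_API_TOKEN", "HF_ACCESS_TOKEN")
# Placeholder values used in the examples in this README (assumed list).
PLACEHOLDERS = {"", "your_api_key", "your_token", "your_key"}


def missing_keys(env: Mapping[str, str]) -> List[str]:
    """Keys that are absent, which would block setup."""
    return [key for key in REQUIRED_KEYS if key not in env]


def placeholder_keys(env: Mapping[str, str]) -> List[str]:
    """Keys that are present but still hold a placeholder value."""
    return [key for key in REQUIRED_KEYS if env.get(key) in PLACEHOLDERS]


# Only key names are printed, never the values.
print("missing:", missing_keys(os.environ))
print("placeholders:", placeholder_keys(os.environ))
```

Features that depend on a key listed under placeholders would be expected to fail at call time, as described above.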

You need to replace the path to the env_file in compose.yaml.

Get Gemini API key
Get Replicate API token
Get HF API tokens, and don't forget to accept the terms for pyannote/segmentation-3.0 and pyannote/speaker-diarization-3.1. This is needed only for the incredibly-fast-whisper and whisperx models with diarization enabled.

Streamlit Secrets management

PS

Your transcription and Google NotebookLM are a very powerful combination.
Using context caching, you can ask a ton of questions about the topic.

Docs

	Links
Libraries	streamlit replicate Google Gen AI SDK yt-dlp ~~elevenlabs~~ bs4 curl_cffi ~~openai~~ ~~pydub~~ ~~requests~~ ~~semantic_text_splitter~~
Docker	Docker Best Practices Docker Dockerfile reference Dockerfile Linter .dockerignore .dockerignore validator Docker Compose Syntax for environment files in Docker Compose Ways to set environment variables with Compose Compose file version 3 reference
GitHub Actions	Workflow syntax for GitHub Actions Publishing images to Docker Hub and GitHub Packages
Dev Containers	An open specification for enriching containers with development specific content and settings Developing inside a Container
uv	uv pip
pixi	pixi
direnv	direnv
AI	codesight
Speech to Text AI Model Leaderboard	Artificial Analysis

Deploy

Platform	Links
Render	Deploy from GitHub / GitLab / Bitbucket
Google Cloud	Quickstart: Deploy to Cloud Run Tutorial: Deploy your dockerized application on Google Cloud
Oracle Cloud	Container Instances
IBM Cloud	IBM Cloud® Code Engine
AWS	AWS App Runner
Azure	Web App for Containers Deploy a containerized app to Azure
Digital Ocean	How to Deploy from Container Images

¹ Last supported version is 0.1.0
² As of August 2024
³ As of May 2026
