elevenlabs / scribe-v1

Transcribe speech audio into accurate text with word-level timestamps, speaker labels, and audio-event tags across 99 languages.

0.03

Model Input

- `audio_url`: URL of the audio file to transcribe. Supported formats: mp3, wav, m4a, ogg, aac.
- `language_code`: ISO-639 language code of the audio (e.g. `eng`, `spa`, `fra`, `deu`, `jpn`). Leave unset to auto-detect the spoken language.
- `tag_audio_events`: Tag non-speech audio events like laughter and applause inline in the transcript.
- `diarize`: Annotate which speaker said each word.




Model Details

ElevenLabs Scribe v1 turns spoken-audio files into accurate written text. Pass a URL to an audio recording (mp3, wav, m4a, ogg, or aac) and it returns the full transcript as a plain string, with state-of-the-art accuracy across 99 languages. It auto-detects the spoken language, can label who is speaking (diarization), and tags non-speech audio events like laughter and applause. That combination makes it a strong default for turning recordings into usable, searchable text.
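Because only the listed formats are accepted, it can help to check a URL on the client before submitting it. The following is a minimal sketch, assuming the file extension in the URL path reflects the actual format; `hasSupportedExtension` is a hypothetical helper and not part of any ModelRunner or ElevenLabs API, and the service may decide support differently (for example, from the file contents).

```js
// Formats listed on this page as supported for audio_url.
const SUPPORTED_EXTENSIONS = ["mp3", "wav", "m4a", "ogg", "aac"];

// Hypothetical helper: returns true when the URL path ends in a listed extension.
// This is a client-side pre-check only; it does not inspect the file itself.
function hasSupportedExtension(audioUrl) {
  let pathname;
  try {
    // Parsing drops the query string and fragment from the comparison.
    pathname = new URL(audioUrl).pathname;
  } catch {
    return false; // not a parseable absolute URL
  }
  const match = /\.([a-z0-9]+)$/i.exec(pathname);
  return match !== null && SUPPORTED_EXTENSIONS.includes(match[1].toLowerCase());
}
```

A URL that fails this check is worth inspecting by hand before a run, since a signed or extensionless URL could still point at valid audio.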

## Best for

- Transcribing meetings, interviews, podcasts, and voice notes into text
- Captioning and subtitling source audio with reliable word boundaries
- Multilingual transcription where the spoken language is unknown or mixed (99 languages, auto-detected)
- Speaker-attributed transcripts of multi-person conversations using diarization
- Building searchable archives or downstream NLP from spoken-audio content

## Choose another model when

- You want to generate speech from text rather than transcribe it. Use a text-to-speech model instead.
- You need to translate audio into a different language's text. This model transcribes in the spoken language; it is not a speech translator.
- You need live, streaming transcription of an in-progress call. This model processes a complete uploaded file and returns a finished transcript.

## Tips

- Leave `language_code` unset to auto-detect the spoken language; set it to an ISO-639 code (e.g. `eng`, `spa`, `fra`, `deu`, `jpn`) only when you already know the language and want to skip detection.
- Keep `diarize` enabled (default) for multi-speaker recordings; the model attributes each word to a speaker. Set it to `false` for single-speaker audio to skip speaker labeling.
- Keep `tag_audio_events` enabled (default) to mark non-speech sounds (laughter, applause) inline; set it to `false` for a clean speech-only transcript.
- Use clear, reasonably loud source audio. Heavy background noise and overlapping speech reduce accuracy.

## Advanced Configuration

- `language_code` (default auto-detect): an ISO-639 language code that forces the transcription language instead of detecting it. Useful when the audio is short or the language is known in advance.
- `tag_audio_events` (boolean, default `true`): when `true`, non-speech events such as laughter and applause are tagged inline in the transcript.
- `diarize` (boolean, default `true`): when `true`, annotates which speaker said each word.
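The parameter rules above can be collected into a small input builder. This is a sketch under stated assumptions: `buildScribeInput` and `DEFAULTS` are hypothetical names, not part of the ModelRunner client, and it assumes that sending the documented default values explicitly behaves the same as omitting them.

```js
// Defaults as documented on this page (assumed to match server-side behavior).
const DEFAULTS = { diarize: true, tag_audio_events: true };

// Hypothetical helper: builds the `input` object for elevenlabs/scribe-v1.
function buildScribeInput(audioUrl, options = {}) {
  const input = { audio_url: audioUrl, ...DEFAULTS };

  // Override a boolean flag only when the caller passes an actual boolean.
  if (typeof options.diarize === "boolean") {
    input.diarize = options.diarize;
  }
  if (typeof options.tag_audio_events === "boolean") {
    input.tag_audio_events = options.tag_audio_events;
  }

  // Send language_code only when the language is known;
  // leaving the key out keeps auto-detection on.
  if (options.language_code) {
    input.language_code = options.language_code;
  }
  return input;
}
```

For a single-speaker Spanish voice note, `buildScribeInput(url, { language_code: "spa", diarize: false })` yields an input with speaker labeling off and detection skipped, while `buildScribeInput(url)` keeps every default and omits `language_code`.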

To run via the ModelRunner JavaScript client:

```js
import { modelrunner } from "@modelrunner/client";

const result = await modelrunner.subscribe("elevenlabs/scribe-v1", {
  input: {
    audio_url: "https://storage.googleapis.com/falserverless/web-examples/elevenlabs/sample.mp3",
    diarize: true,
    tag_audio_events: true,
  },
});
```