Whisper Large v3 is OpenAI's open-source, multilingual speech recognition model. Pass a URL to an audio file (mp3, mp4, mpeg, mpga, m4a, wav, or webm) and it returns the full transcript as plain text. It handles 99 languages with automatic language detection, can translate any spoken language into English text, and returns segment- or word-level timestamps plus optional speaker diarization. It is a robust, general-purpose default for turning recordings into searchable, usable text.
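Since the input is a URL rather than an upload, it can help to check the file extension before sending a request. The following is a minimal sketch of such a check; `hasSupportedExtension` is a hypothetical helper (not part of any ModelRunner client), and it only inspects the URL path, so a URL without an extension may still point to valid audio.

```js
// Extensions listed in the description above.
const SUPPORTED_EXTENSIONS = ["mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"];

// Hypothetical helper: returns true when the URL path ends in a supported extension.
// Query strings are ignored because only the pathname is inspected.
function hasSupportedExtension(audioUrl) {
  let pathname;
  try {
    pathname = new URL(audioUrl).pathname;
  } catch {
    return false; // not a parseable absolute URL
  }
  const dot = pathname.lastIndexOf(".");
  if (dot === -1) return false;
  const ext = pathname.slice(dot + 1).toLowerCase();
  return SUPPORTED_EXTENSIONS.includes(ext);
}
```

This is a client-side heuristic only; the service itself decides whether a file can be decoded.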

## Best for

- Transcribing meetings, interviews, podcasts, lectures, and voice notes into text
- Multilingual transcription where the spoken language is unknown or mixed (99 languages, auto-detected)
- Translating non-English audio directly into English text in one call (`task: "translate"`)
- Generating timestamped segments or word-level chunks for captioning and subtitling
- Speaker-attributed transcripts of multi-person recordings via diarization

## Choose another model when

- You want to generate speech from text rather than transcribe it: use a text-to-speech model
- You need live, streaming transcription of an in-progress call: this model processes a complete uploaded file and returns a finished transcript
- You want maximum-accuracy English transcription with built-in audio-event tagging (laughter, applause): consider `elevenlabs/scribe-v1`

## Tips

- Leave `language` unset to auto-detect the spoken language; set an ISO code (e.g. `en`, `es`, `fr`, `de`, `ja`) only when the language is known, to skip detection.
- Set `task` to `translate` to force English output regardless of the source language; keep `transcribe` (default) to output in the spoken language.
- Use `chunk_level` to control timestamp granularity: `segment` (default) returns sentence-level chunks, `word` returns per-word timestamps, `none` skips timestamp tokens for a small speed-up.
- Enable `diarize` only for multi-speaker audio; it labels who spoke each chunk but adds processing time (and therefore cost, since billing tracks compute time).
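The tips above can be sketched as a small function that maps a few high-level choices onto input fields. `buildInput` and its option names (`toEnglish`, `timestamps`, `multiSpeaker`) are hypothetical and exist only for illustration; the emitted field names (`audio_url`, `language`, `task`, `chunk_level`, `diarize`) are the ones described in this document.

```js
// Hypothetical helper: builds an input object following the tips above.
function buildInput(
  audioUrl,
  { language, toEnglish = false, timestamps = "segment", multiSpeaker = false } = {}
) {
  const input = {
    audio_url: audioUrl,
    // "translate" forces English output; "transcribe" keeps the spoken language.
    task: toEnglish ? "translate" : "transcribe",
    // "segment", "word", or "none".
    chunk_level: timestamps,
  };
  // Omit `language` entirely so the model auto-detects it.
  if (language) input.language = language;
  // Only enable diarization for multi-speaker audio, since it adds processing time.
  if (multiSpeaker) input.diarize = true;
  return input;
}

// Example: a Spanish panel discussion, translated to English with speaker labels.
const panelInput = buildInput("https://example.com/panel.mp3", {
  language: "es",
  toEnglish: true,
  multiSpeaker: true,
});
```

The resulting object would be passed as the `input` field of a request; fields left out fall back to the documented defaults.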

## Advanced Configuration

- `task` (default `transcribe`): `transcribe` keeps the spoken language; `translate` outputs English.
- `chunk_level` (default `segment`): timestamp granularity, one of `none`, `segment`, or `word`.
- `diarize` (boolean, default `false`): annotate which speaker said each chunk. Requires more compute time.
- `num_speakers` (default auto): hint the expected speaker count; only used when `diarize` is `true`.
- `prompt` (default empty): a text hint to bias transcription toward specific terms or spelling.
- `batch_size` (default 64): internal batching; leave at the default unless tuning throughput.
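A minimal sketch of a client-side check against the settings listed above may help catch typos before a request is sent. `validateInput` is a hypothetical helper, and two of its rules are assumptions rather than documented behavior: that `num_speakers` and `batch_size` must be positive integers.

```js
const TASKS = ["transcribe", "translate"];
const CHUNK_LEVELS = ["none", "segment", "word"];

// Hypothetical helper: returns a list of human-readable problems (empty when none found).
function validateInput(input) {
  const problems = [];
  if (typeof input.audio_url !== "string" || input.audio_url === "") {
    problems.push("audio_url is required");
  }
  if (input.task !== undefined && !TASKS.includes(input.task)) {
    problems.push(`task must be one of: ${TASKS.join(", ")}`);
  }
  if (input.chunk_level !== undefined && !CHUNK_LEVELS.includes(input.chunk_level)) {
    problems.push(`chunk_level must be one of: ${CHUNK_LEVELS.join(", ")}`);
  }
  if (input.diarize !== undefined && typeof input.diarize !== "boolean") {
    problems.push("diarize must be a boolean");
  }
  if (input.num_speakers !== undefined) {
    // Assumption: the speaker-count hint is a positive integer.
    if (!Number.isInteger(input.num_speakers) || input.num_speakers < 1) {
      problems.push("num_speakers must be a positive integer");
    } else if (input.diarize !== true) {
      problems.push("num_speakers is only used when diarize is true");
    }
  }
  if (input.prompt !== undefined && typeof input.prompt !== "string") {
    problems.push("prompt must be a string");
  }
  // Assumption: batch_size is a positive integer.
  if (
    input.batch_size !== undefined &&
    (!Number.isInteger(input.batch_size) || input.batch_size < 1)
  ) {
    problems.push("batch_size must be a positive integer");
  }
  return problems;
}
```

Returning a list instead of throwing on the first issue lets a caller report every problem at once.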

Note on cost: this model bills per transcription request. Pricing is flat per output regardless of audio length.

To run via the ModelRunner JavaScript client:

```js
import { modelrunner } from "@modelrunner/client";

const result = await modelrunner.subscribe("openai/whisper", {
  input: {
    audio_url: "https://media.modelrunner.ai/iuneUX0YY4AtcsceV9HHp.mp3",
    task: "transcribe",
    chunk_level: "segment",
  },
});
```
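For captioning, timestamped chunks can be converted to SubRip (SRT) text. The sketch below assumes each chunk looks like `{ timestamp: [startSeconds, endSeconds], text }`; that shape is an assumption, not something this page confirms, so check the actual response and adapt the field names. `toSrtTime` and `chunksToSrt` are hypothetical helpers.

```js
// Formats a time in seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n, width = 2) => String(n).padStart(width, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

// Assumed chunk shape (not confirmed by the source):
//   { timestamp: [startSeconds, endSeconds], text: "..." }
function chunksToSrt(chunks) {
  return chunks
    .map((chunk, i) => {
      const [start, end] = chunk.timestamp;
      // Each cue: index, time range, text, then a blank line between cues.
      return `${i + 1}\n${toSrtTime(start)} --> ${toSrtTime(end)}\n${chunk.text.trim()}\n`;
    })
    .join("\n");
}

// Example with hand-written chunks in the assumed shape.
const srt = chunksToSrt([
  { timestamp: [0, 2.5], text: " Hello." },
  { timestamp: [2.5, 4], text: " World." },
]);
```

With `chunk_level: "segment"` this would yield one cue per segment; `word` would yield one cue per word, which is usually too fine for subtitles without regrouping.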
