Audio & Music Models

Generate full songs or instrumental music from genre tags and optional lyrics, with a controllable duration of up to 4 minutes.

meta / musicgen

A fast, controllable autoregressive Transformer for high-fidelity music generation.

MiniMax Speech-02 HD

minimax

Turn text into natural, high-fidelity speech in 30+ languages with 300+ voices plus emotion, speed, pitch, and volume control.

LTX-2.3 Text-to-Audio

lightricks

Generate sound effects, ambience, and spoken-style audio from a text prompt, with a duration you can set down to the frame.

ElevenLabs Scribe v1

elevenlabs

Transcribe speech audio into accurate text with word-level timestamps, speaker labels, and audio-event tags across 99 languages.

Lyria 2

google

Generate ~30 seconds of high-fidelity instrumental music from a text prompt, as a 48 kHz WAV file.
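
For planning storage or bandwidth, the size of a clip like this can be estimated from its format. The sketch below is only an illustration: the 48 kHz sample rate and 30-second length come from the description above, while stereo output and 16-bit PCM are assumptions, not documented properties of the model. It uses only Python's standard `wave` module and does not call any model API.

```python
import io
import wave

SAMPLE_RATE_HZ = 48_000   # from the description above
DURATION_S = 30           # "~30 seconds", treated as exactly 30 here
CHANNELS = 2              # assumption: stereo output
SAMPLE_WIDTH_BYTES = 2    # assumption: 16-bit PCM samples


def pcm_payload_bytes(duration_s, sample_rate_hz, channels, sample_width_bytes):
    """Size of the raw PCM audio data, excluding the WAV header."""
    return duration_s * sample_rate_hz * channels * sample_width_bytes


def silent_wav_bytes(duration_s, sample_rate_hz, channels, sample_width_bytes):
    """Build an in-memory WAV of silence in the same format to check the estimate."""
    payload = pcm_payload_bytes(duration_s, sample_rate_hz, channels, sample_width_bytes)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(bytes(payload))  # all-zero bytes, i.e. silence
    return buf.getvalue()


payload = pcm_payload_bytes(DURATION_S, SAMPLE_RATE_HZ, CHANNELS, SAMPLE_WIDTH_BYTES)
wav_file = silent_wav_bytes(DURATION_S, SAMPLE_RATE_HZ, CHANNELS, SAMPLE_WIDTH_BYTES)
print(f"PCM payload: {payload:,} bytes")
print(f"WAV file:    {len(wav_file):,} bytes")
```

Under these assumptions a 30-second clip is a little under 6 MB uncompressed. Mono output or a different bit depth would scale the figure proportionally.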

ElevenLabs Sound Effects V2

elevenlabs

Generate sound effects, Foley, and ambience from a text prompt, returning a hosted MP3.

Whisper Large v3

openai

Transcribe or translate speech audio into text across 99 languages, with segment/word timestamps and optional speaker diarization.
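
Segment timestamps are most often used to build subtitles. The following is a minimal sketch of that conversion, assuming each segment is a dictionary with `start` and `end` times in seconds and a `text` string. That shape is an assumption for illustration and may not match the response format of any particular hosted endpoint.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Turn a list of {'start', 'end', 'text'} segments into SRT subtitle text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{time_range}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


# Hypothetical transcript segments, not real model output.
example_segments = [
    {"start": 0.0, "end": 1.25, "text": " Hello there."},
    {"start": 1.25, "end": 3.0, "text": " Welcome back."},
]
print(segments_to_srt(example_segments))
```

Word-level timestamps can be handled the same way by grouping words into lines before formatting.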

Chatterbox TTS

resemble-ai

Turn text into expressive speech and clone any voice from a short reference recording, with fine control over emotional intensity.

ElevenLabs Multilingual v2

elevenlabs

Turn text into natural, expressive speech in 29 languages, with controls for stability, similarity, and style.