stability-ai / stable-audio-2.5/text-to-audio

Generate long-form music and sound effects from a text prompt — up to ~190 seconds of WAV audio in a single call.

0.2

Model Input

- `prompt`: The prompt to generate audio from. Describe genre, instrumentation, mood, and tempo for music, or the source, environment, and materials for sound effects.
- `seconds_total` (min 1, max 190): Duration of the generated audio in seconds. Billing is a flat rate per generation regardless of length.

Additional Settings

- `num_inference_steps` (min 4, max 8): Number of denoising steps. More steps can improve quality at the cost of speed.
- `guidance_scale` (min 1, max 25): Classifier-free guidance scale; higher values follow the prompt more strictly.
- Seed: Random seed for reproducible generation. Leave empty for a random result.
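
The documented ranges can be checked on the client before a request is sent. The following is a minimal sketch, not part of any official client: `buildInput` and `LIMITS` are hypothetical names, the ranges are the ones listed on this page, and the `seed` field name is an assumption because the page describes a seed without naming the parameter.

```javascript
// Ranges as documented on this page.
const LIMITS = {
  seconds_total: { min: 1, max: 190 },
  num_inference_steps: { min: 4, max: 8 },
  guidance_scale: { min: 1, max: 25 },
};

// Hypothetical helper: builds an input object and throws on out-of-range values.
function buildInput({ prompt, seed, ...numeric }) {
  if (typeof prompt !== "string" || prompt.trim() === "") {
    throw new Error("prompt must be a non-empty string");
  }
  const input = { prompt: prompt.trim() };
  for (const [name, { min, max }] of Object.entries(LIMITS)) {
    const value = numeric[name];
    if (value === undefined) continue; // unset fields are left out of the payload
    if (typeof value !== "number" || !Number.isFinite(value) || value < min || value > max) {
      throw new RangeError(`${name} must be a number between ${min} and ${max}`);
    }
    input[name] = value;
  }
  // Assumed field name: the page does not state what the seed parameter is called.
  if (seed !== undefined) input.seed = seed;
  return input;
}
```

The returned object is intended to be passed as the `input` of a request. Failing early on a bad value avoids a round trip to the service.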

Model Details

Stable Audio 2.5 turns a written description into a finished audio clip (a piece of music, a soundscape, ambience, or a sound effect) up to about 190 seconds long, returned as a WAV file. Write what you want in the `prompt` (for example, "driving synthwave with a punchy kick and arpeggiated bass", "gentle rain on a window with distant thunder", or "upbeat corporate acoustic bed, no vocals") and set `seconds_total` to control the length. Its standout strength is long-form generation: unlike short sound-effect models capped at a few seconds, a single call can produce a track of just over three minutes, long enough for full backing music. It was trained on a fully licensed dataset for commercially safe output.

## Best for

- Background music and instrumental beds for videos, ads, podcasts, and games
- Long-form tracks and loops up to about three minutes from a single text prompt
- Ambience and soundscapes (rain, cafe noise, forest, room tone) for scenes
- One-off sound effects and foley described in plain language
- Royalty-conscious audio where a commercially safe, licensed-data model matters

## Choose another model when

- You want to transform or restyle an existing audio clip rather than generate from text: use an audio-to-audio model
- You need natural spoken narration or a specific voice: use a text-to-speech model
- You only need a very short one-shot effect and want per-second billing on tiny clips: a per-second sound-effect model may be cheaper
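
The flat-rate versus per-second trade-off comes down to a break-even duration: a per-second model costs less only while `seconds * ratePerSecond` stays below the flat price. A small sketch of that arithmetic follows. The function names are hypothetical and the prices in any call are placeholders, since this page does not give rates for either model.

```javascript
// Duration (in seconds) below which per-second billing is cheaper than a flat rate.
// Both prices are caller-supplied placeholders, not published rates.
function breakEvenSeconds(flatRatePerGeneration, ratePerSecond) {
  if (!(flatRatePerGeneration > 0) || !(ratePerSecond > 0)) {
    throw new RangeError("prices must be positive numbers");
  }
  return flatRatePerGeneration / ratePerSecond;
}

// Returns which billing scheme costs less for a clip of the given length.
function cheaperOption(seconds, flatRatePerGeneration, ratePerSecond) {
  const perSecondCost = seconds * ratePerSecond;
  return perSecondCost < flatRatePerGeneration ? "per-second" : "flat-rate";
}
```

With a hypothetical flat price of 2 units and a per-second price of 0.5 units, the break-even point is 4 seconds, so only clips shorter than that would favor the per-second model.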

## Tips

- `seconds_total` accepts 1–190 seconds; billing is a flat rate per generation, so longer clips cost the same as short ones
- Describe genre, instrumentation, mood, and tempo in the prompt for music; describe the source, environment, and materials for sound effects
- Say "no vocals" or "instrumental" in the prompt when you want a clean music bed
- Raise `num_inference_steps` (up to 8) for a quality bump; raise `guidance_scale` for stricter prompt adherence
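
The prompt advice above can be turned into small helpers that assemble a prompt from its parts. This is only an illustrative sketch: `buildMusicPrompt` and `buildSfxPrompt` are hypothetical names, and the phrasing they produce is one plausible layout, not a format the model is documented to require.

```javascript
// Assemble a music prompt from genre, instrumentation, mood, and tempo.
function buildMusicPrompt({ genre, instrumentation = [], mood, tempo, instrumental = false }) {
  let head = [mood, genre].filter(Boolean).join(" ");
  if (instrumentation.length > 0) {
    head = [head, `with ${instrumentation.join(" and ")}`].filter(Boolean).join(" ");
  }
  const parts = [
    head,
    tempo || "",
    instrumental ? "instrumental, no vocals" : "", // request a clean music bed
  ];
  return parts.filter(Boolean).join(", ");
}

// Assemble a sound-effect prompt from source, environment, and materials.
function buildSfxPrompt({ source, environment, materials = [] }) {
  return [source, environment, materials.join(" and ")].filter(Boolean).join(", ");
}
```

For example, passing `{ genre: "lofi hip hop", mood: "upbeat", instrumentation: ["warm vinyl texture"], instrumental: true }` to `buildMusicPrompt` yields a string suitable for the `prompt` field.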

## Limitations

- Output is a single WAV clip per call; there is no multi-track or stem separation
- Very short durations can produce less musically developed results than longer clips

To run via the ModelRunner JavaScript client:

```js
import { modelrunner } from "@modelrunner/client";

// Request a 60-second clip; seconds_total accepts 1-190.
const result = await modelrunner.subscribe("stability-ai/stable-audio-2.5/text-to-audio", {
  input: {
    prompt: "upbeat lofi hip hop instrumental with a warm vinyl texture",
    seconds_total: 60,
  },
});
```