microsoft / florence-2-large/caption

Generate a concise one-sentence caption describing any photo — no prompt needed.

Input

`image_url`: URL of the image to caption.

Output

the microsoft logo

Model Details

Florence-2 Large Caption looks at a photo and returns a single concise sentence describing what it shows. Give it one image URL and you get back a short, natural caption of the whole scene — what is in it and roughly what is happening — with no prompt, no question, and nothing to tune. It is zero-shot: one image in, one caption string out. Built on Microsoft's Florence-2 vision-language foundation model, it is a fast, deterministic way to turn an image into searchable, human-readable text for alt-text, indexing, or a quick scene summary.

## Best for

- Generating alt-text or accessibility captions for photos
- Auto-describing images for search, tagging, or content indexing
- Getting a quick one-line summary of what a picture shows
- Bulk-captioning a dataset of images where a short, consistent description per image is enough

## Choose another model when

- You want labeled bounding boxes around the objects in the image: use the Florence-2 Large object-detection variant
- You want to read or transcribe text printed in the image: use an OCR model
- You want to ask a specific question about the image or get a longer, detailed answer: use a visual question-answering model such as Moondream
- You need a long, multi-paragraph description rather than one sentence: use a detailed-captioning model

## Tips

- Feed a clear, reasonably high-resolution photo; the caption summarizes the dominant subject and setting, so a clean composition yields a sharper description.
- There is nothing to prompt or configure, and every run on the same image is deterministic, so you can cache results.
- The output is a plain caption sentence (not a list of objects); use the object-detection variant if you need structured detections.
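Since runs on the same image are deterministic, results can be cached by image URL. The following is a minimal sketch, not part of the ModelRunner client: `createCachedCaptioner` and `captionFn` are hypothetical names, and keying on the URL assumes the image behind that URL does not change.

```js
// Hypothetical helper: wraps any async captioning function with an in-memory cache.
// `captionFn` is an assumption: an async function (imageUrl) => caption string.
function createCachedCaptioner(captionFn) {
  const cache = new Map(); // imageUrl -> Promise<string>

  return function caption(imageUrl) {
    if (!cache.has(imageUrl)) {
      // Store the promise itself so concurrent requests for one URL share a single call.
      const pending = (async () => captionFn(imageUrl))().catch((err) => {
        cache.delete(imageUrl); // do not cache failures, so a later call can retry
        throw err;
      });
      cache.set(imageUrl, pending);
    }
    return cache.get(imageUrl);
  };
}

// Possible wiring with the ModelRunner client (requires the client and credentials):
// const caption = createCachedCaptioner(async (url) => {
//   const result = await modelrunner.subscribe("microsoft/florence-2-large/caption", {
//     input: { image_url: url },
//   });
//   return result.data;
// });
```

The cache lives only in process memory; a persistent store would be needed to reuse captions across restarts.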

## Limitations

- Produces one short sentence focused on the main subject; fine detail, small objects, and secondary elements are often omitted.
- May misread unusual scenes, rare objects, or text-heavy images; it describes, it does not transcribe.

To run via the ModelRunner JavaScript client:

```js
import { modelrunner } from "@modelrunner/client";

const result = await modelrunner.subscribe("microsoft/florence-2-large/caption", {
  input: {
    image_url: "https://media.modelrunner.ai/example-scene.png",
  },
});
// result.data is a caption string, e.g. "a green volkswagen beetle parked in front of a yellow building"
```
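For the bulk-captioning use case, the single call above could be wrapped in a small batch helper that keeps a bounded number of requests in flight and records failures per image instead of aborting the whole run. This is a sketch under assumptions: `captionAll`, `captionFn`, and the default concurrency of 4 are illustrative choices, not part of the ModelRunner client, and any rate limits the service applies are not known here.

```js
// Hypothetical helper: caption many image URLs with bounded concurrency.
// `captionFn` is an assumption: an async function (imageUrl) => caption string,
// for example a thin wrapper around the modelrunner.subscribe call shown above.
async function captionAll(imageUrls, captionFn, concurrency = 4) {
  const results = new Array(imageUrls.length);
  let next = 0; // index of the next URL that no worker has claimed yet

  async function worker() {
    while (next < imageUrls.length) {
      const index = next++; // claim an index; safe because JS runs this synchronously
      const imageUrl = imageUrls[index];
      try {
        results[index] = { imageUrl, caption: await captionFn(imageUrl) };
      } catch (err) {
        // Record the failure and keep going so one bad image does not stop the batch.
        results[index] = { imageUrl, error: String(err) };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, imageUrls.length));
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results; // same order as imageUrls
}
```

Each result entry carries either a `caption` or an `error`, so failed images can be retried later without re-running the successful ones.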