Qwen 2.5 VL 72B Instruct

qwen/qwen2.5-vl-72b-instruct

Qwen2.5-VL, the latest vision-language model in the Qwen2.5 series, delivers enhanced multimodal capabilities, including advanced visual comprehension for object and text recognition, chart and layout analysis, and agent-based dynamic tool orchestration. It processes long-form videos (over 1 hour) with key event detection while enabling precise spatial annotation through bounding boxes or coordinate points. The model specializes in extracting structured data from scanned documents (such as invoices and tables) and achieves state-of-the-art performance across multimodal benchmarks covering image understanding, temporal video analysis, and agent task evaluations.

Price

Input	$0.8 per million tokens
Output	$0.8 per million tokens
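As a rough illustration of the pricing above, the following sketch estimates the cost of a single request from its token counts. The function name `estimate_cost` and the example token counts are hypothetical. How image and video inputs are converted to tokens is not stated on this page, so the counts should come from the usage figures reported for an actual response rather than from a guess.

```python
# Prices from the table above, in US dollars per million tokens.
INPUT_PRICE_PER_M = 0.8
OUTPUT_PRICE_PER_M = 0.8


def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Return the estimated cost in USD of one request (hypothetical helper)."""
    return (
        prompt_tokens * INPUT_PRICE_PER_M
        + completion_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


# Example token counts are made up for illustration.
cost = estimate_cost(prompt_tokens=12_000, completion_tokens=800)
print(f"${cost:.4f}")
```

Because input and output are priced identically here, the cost depends only on the total token count; the two rates are kept separate so the sketch still works if they diverge.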

Use the following code example to integrate our API:

from openai import OpenAI

# Point the OpenAI SDK at the OpenAI-compatible endpoint.
client = OpenAI(
    api_key="<Your API Key>",
    base_url="https://api.highwayapi.ai/openai"
)

response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"}
    ],
    # Keep the prompt plus max_tokens within the 32768-token context length.
    max_tokens=1024,
    temperature=0.7
)

# Print the text of the first (and only) completion.
print(response.choices[0].message.content)
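Since the model accepts image input, a request will often carry an image alongside the text. The sketch below builds such a message list using the content-part format of the OpenAI Chat Completions API (`text` and `image_url` parts, with the image embedded as a base64 data URL). It assumes this endpoint accepts that format, which this page does not confirm. The helper name `build_image_messages` and the placeholder bytes are hypothetical.

```python
import base64


def build_image_messages(
    image_bytes: bytes, question: str, mime_type: str = "image/png"
) -> list:
    """Build a one-turn message list pairing a question with an image.

    Hypothetical helper; assumes the endpoint accepts OpenAI-style
    `image_url` content parts with base64 data URLs.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime_type};base64,{encoded}"
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ]


# Placeholder bytes stand in for a real file, e.g. open("invoice.png", "rb").read().
messages = build_image_messages(
    b"placeholder image bytes", "What text appears in this image?"
)
```

The resulting `messages` list can be passed as the `messages` argument of `client.chat.completions.create(...)` in place of the text-only list.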

Information

Provider

Alibaba

Quantization

bf16

Supported Features

Context length

32768 tokens

Maximum output

32768 tokens

Serverless

Supported

Input Capabilities

text, image, video

Output Capabilities

text
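The context length and maximum output listed above are both 32768 tokens. Assuming, as is common but not stated on this page, that the prompt and the completion share the same context window, the usable `max_tokens` value shrinks as the prompt grows. A minimal sketch of that arithmetic, with the hypothetical helper name `safe_max_tokens`:

```python
# Limits from the specification above.
CONTEXT_LENGTH = 32768
MAX_OUTPUT = 32768


def safe_max_tokens(prompt_tokens: int, wanted: int = MAX_OUTPUT) -> int:
    """Return the largest max_tokens value that fits next to the prompt.

    Assumes prompt and completion tokens share one context window.
    """
    remaining = CONTEXT_LENGTH - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt alone fills or exceeds the context length")
    return min(wanted, MAX_OUTPUT, remaining)


print(safe_max_tokens(prompt_tokens=2_000))
print(safe_max_tokens(prompt_tokens=100, wanted=512))
```

Under this assumption a 2,000-token prompt leaves room for at most 30,768 output tokens, so the full 32768-token maximum output is reachable only in the limit of an empty prompt.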