2026-03-24

Serving Qwen3-ASR with vLLM + Docker

Introduction

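The Qwen3-ASR family comprises Qwen3-ASR-1.7B and Qwen3-ASR-0.6B, which support language identification and automatic speech recognition (ASR) for 52 languages and dialects. Both models draw on large-scale speech training data and on the strong audio understanding of their foundation model, Qwen3-Omni. Experiments show that the 1.7B version achieves state-of-the-art performance among open-source ASR models and is competitive with the strongest proprietary commercial APIs. Key features: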

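All-in-one: Qwen3-ASR-1.7B and Qwen3-ASR-0.6B support language identification and speech recognition for 30 languages and 22 Chinese dialects, and also cover English accents from multiple countries and regions.
Excellent and efficient: The Qwen3-ASR models maintain high-quality, robust recognition in complex acoustic environments and on challenging text patterns. Qwen3-ASR-1.7B performs strongly on both open-source and internal benchmarks, while the 0.6B version balances accuracy and efficiency, reaching 2000x throughput at a concurrency of 128. Both support unified streaming/offline inference with a single model and can transcribe long audio.
Novel and strong forced alignment: We introduce Qwen3-ForcedAligner-0.6B, which predicts timestamps for arbitrary units in up to 5 minutes of speech across 11 languages. Evaluations show that its timestamp accuracy surpasses that of end-to-end (E2E) forced-alignment models.
Comprehensive inference toolkit: In addition to open-sourcing the architecture and weights of the Qwen3-ASR series, we also release a powerful, full-featured inference framework that supports vLLM-based batch inference, asynchronous serving, streaming inference, timestamp prediction, and more.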

Clone the official repository

git clone https://github.com/QwenLM/Qwen3-ASR.git

Write the Dockerfile

Why build the image from source yourself? Because it lets you modify the code and implement your own endpoints.

FROM nvcr.io/nvidia/pytorch:24.05-py3

ARG DEBIAN_FRONTEND=noninteractive
ARG INSTALL_FLASH_ATTN=true
ARG APT_MIRROR=archive.ubuntu.com

ENV CUDA_HOME=/usr/local/cuda
ENV PATH=/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
ENV PIP_NO_CACHE_DIR=1
ENV PYTHONUNBUFFERED=1

RUN sed -i "s|archive.ubuntu.com|${APT_MIRROR}|g; s|security.ubuntu.com|${APT_MIRROR}|g" /etc/apt/sources.list \
    && printf 'Acquire::Retries "5";\nAcquire::http::Timeout "30";\nAcquire::https::Timeout "30";\n' > /etc/apt/apt.conf.d/99-retries \
    && apt-get update \
    && apt-get install -y --no-install-recommends \
    git \
    ffmpeg \
    libsndfile1 \
    ca-certificates \
    software-properties-common \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace/Qwen3-ASR
COPY . /workspace/Qwen3-ASR

# Install Python 3.12 without conda, then create an isolated venv.
RUN apt-get update \
    && add-apt-repository -y ppa:deadsnakes/ppa \
    && apt-get update \
    && apt-get install -y --no-install-recommends \
       python3.12 \
       python3.12-dev \
       python3.12-venv \
    && rm -rf /var/lib/apt/lists/*

RUN python3.12 -m venv /opt/py312
ENV PATH=/opt/py312/bin:${PATH}
ENV MAX_JOBS=1

RUN python -V && pip -V

RUN pip install -U pip setuptools wheel
RUN pip install --index-url https://download.pytorch.org/whl/cu128 torch==2.9.1
RUN pip install -e ".[vllm]"

RUN if [ "${INSTALL_FLASH_ATTN}" = "true" ]; then \
      pip install -U flash-attn --no-build-isolation; \
    fi

EXPOSE 8000

CMD ["qwen-asr-serve", "Qwen/Qwen3-ASR-1.7B", "--gpu-memory-utilization", "0.8", "--host", "0.0.0.0", "--port", "8000"]

Docker build command

docker buildx build \
  --platform linux/amd64 \
  -f Dockerfile \
  -t qwen3-asr:nvcr2405-amd64 \
  --build-arg APT_MIRROR=mirrors.aliyun.com \
  --build-arg INSTALL_FLASH_ATTN=true \
  --load \
  .

Write docker-compose.yml

services:
  qwen3-asr:
    image: qwen3-asr:nvcr2405-amd64
    container_name: qwen-asr
    runtime: nvidia
    environment:
      CUDA_VISIBLE_DEVICES: "5"
    ports:
      - "8023:8000"
    volumes:
      - /mnt/disk2/modelscope/models/Qwen/Qwen3-ASR-1.7B:/models/Qwen3-ASR-1.7B # map your own model path
    command: >
      qwen-asr-serve /models/Qwen3-ASR-1.7B
      --gpu-memory-utilization 0.55
      --host 0.0.0.0
      --port 8000
      --served-model-name Qwen-ASR
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: "unless-stopped"
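
Loading the model can take a while after `docker compose up -d`, so it helps to wait until the service answers before sending audio. The sketch below polls an HTTP route until it returns 200. It assumes that qwen-asr-serve exposes the usual OpenAI-compatible `/v1/models` route; that path, the function names, and the retry defaults are illustrative assumptions, not part of the official tooling.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_probe,
    attempts: int = 60,
    delay: float = 5.0,
) -> bool:
    """Poll `url` until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

With the compose file above, a call such as `wait_until_ready("http://localhost:8023/v1/models")` would target the mapped host port 8023. The `probe` parameter is injectable so the loop can be exercised without a running container.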

Inference endpoints

Endpoint 1: Chat Completions audio transcription

Request example:

curl http://ip:port/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_KEY" \
  -d '{
    "model":"Qwen-ASR",
    "stream":false,
    "messages":[
      {
        "role":"user",
        "content":[
          {
            "type":"audio_url",
            "audio_url":{
              "url":"https://oss.xxxx.cn/sandbox/audio/speakers_example.wav"
            }
          }
        ]
      }
    ]
  }'

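With stream=false, the server returns the complete recognition result in a single response.
With stream=true, the server returns the transcription incrementally as a stream.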
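
For the stream=true case, a client has to reassemble the text from the chunks. The sketch below assumes the server follows the OpenAI-style server-sent-events convention: lines prefixed with `data:`, JSON chunks carrying `choices[].delta.content`, and a final `data: [DONE]` marker. That format is an assumption here, so check it against what your deployment actually emits; `iter_stream_text` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines (assumed format)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            piece = (choice.get("delta") or {}).get("content")
            if piece:
                yield piece
```

Feed it the decoded lines of the HTTP response body and join or print the fragments as they arrive.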

Response example:

{
    "id": "chatcmpl-96e389edd092b463",
    "object": "chat.completion",
    "created": 1773814629,
    "model": "Qwen-ASR",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "language Chinese<asr_text>嗯，那么今天我们就简单的进行一下。"
            }
        }
    ],
    "usage": {
        "prompt_tokens": 687,
        "total_tokens": 825,
        "completion_tokens": 138,
        "prompt_tokens_details": null
    }
}
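
In the sample response, the content field packs the detected language and the transcript into one string of the form `language <name><asr_text><transcript>`. A minimal sketch for splitting it apart, assuming that layout holds for all responses (it is inferred only from the examples in this post); `split_asr_content` and `AsrResult` are hypothetical names:

```python
from typing import NamedTuple, Optional

ASR_TAG = "<asr_text>"
LANG_PREFIX = "language "


class AsrResult(NamedTuple):
    language: Optional[str]  # None when no language prefix is present
    text: str


def split_asr_content(content: str) -> AsrResult:
    """Split 'language X<asr_text>...' into (language, text)."""
    head, sep, tail = content.partition(ASR_TAG)
    if not sep:
        # No tag found: treat the whole string as the transcript.
        return AsrResult(None, content.strip())
    head = head.strip()
    language = None
    if head.startswith(LANG_PREFIX):
        language = head[len(LANG_PREFIX):].strip() or None
    return AsrResult(language, tail.strip())
```

The same helper applies to the text field returned by the Audio Transcriptions endpoint, since its sample response uses the same prefix.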

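Usage notes:

The audio file must be reachable via a URL.
Common audio formats such as wav and mp3 are recommended.
For long audio, decide whether to enable streaming output based on your business needs.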
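
The curl request for this endpoint can also be issued from Python. The sketch below uses only the standard library and mirrors the JSON body of the request example; the function names, the default timeout, and the optional `api_key` argument are illustrative assumptions.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(
    base_url: str,
    audio_url: str,
    model: str = "Qwen-ASR",
    api_key: Optional[str] = None,
    stream: bool = False,
) -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions with one audio_url part."""
    payload = {
        "model": model,
        "stream": stream,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_url}}
                ],
            }
        ],
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def transcribe_url(base_url: str, audio_url: str, timeout: float = 300.0, **kwargs) -> str:
    """Send a non-streaming request and return the raw message content."""
    req = build_chat_request(base_url, audio_url, stream=False, **kwargs)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Keeping request construction separate from the network call makes the payload easy to inspect or log before anything is sent.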

Endpoint 2: Audio Transcriptions (dedicated transcription endpoint)

Request example:

curl http://ip:port/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@/Users/lucent/Desktop/asr_en.wav" \
  -F model="Qwen-ASR" \
  -F stream=true

With stream=true, results are returned in chunks, which suits displaying the transcription in real time.
With stream=false, the complete text is returned.

Response example:

{
    "text": "language English<asr_text>Oh yeah, yeah. He wasn't even that big when I started listening to him, but and his solo music didn't do overly well, but he did very well when he started writing for other people.",
    "usage": {
        "type": "duration",
        "seconds": 16
    }
}

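Usage notes:

The file parameter must point to a readable local audio file.
This endpoint suits scenarios where a backend service uploads files directly.
To display results while recognition is still running, set stream=true.
Compared with the Chat Completions endpoint, this one focuses on audio transcription itself and is simpler to call.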
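
The multipart upload that `curl -F` performs can be reproduced in Python without third-party packages. The sketch below hand-encodes a multipart/form-data body with the same fields as the curl example (file, model, stream). The helper names and the `application/octet-stream` part type are assumptions made for illustration; whether the server inspects the part's content type has not been verified here.

```python
import json
import urllib.request
import uuid
from pathlib import Path
from typing import Dict, Optional, Tuple


def encode_multipart(
    fields: Dict[str, str],
    filename: str,
    file_bytes: bytes,
    file_field: str = "file",
) -> Tuple[bytes, str]:
    """Return (body, content_type) for a multipart/form-data upload."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append((
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        ).encode("utf-8"))
    parts.append((
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8"))
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe_file(
    base_url: str,
    path: str,
    model: str = "Qwen-ASR",
    api_key: Optional[str] = None,
    timeout: float = 300.0,
) -> str:
    """Upload a local audio file (non-streaming) and return the text field."""
    audio = Path(path)
    body, content_type = encode_multipart(
        {"model": model, "stream": "false"}, audio.name, audio.read_bytes()
    )
    headers = {"Content-Type": content_type}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["text"]
```

The boundary is generated per request and sent in the Content-Type header, which is why that header is built by the encoder rather than hard-coded.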

Differences between the two endpoints

Dimension	Chat Completions endpoint	Audio Transcriptions endpoint
Request path	`/v1/chat/completions`	`/v1/audio/transcriptions`
Input	Audio URL	Local file upload
Content-Type	`application/json`	`multipart/form-data`
Streaming support	Yes	Yes
Supported file formats	wav, mp3	wav, mp3
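
The table reduces to one rule: audio given as a URL goes to Chat Completions, and a local file goes to Audio Transcriptions. A small sketch of that routing decision, where `pick_endpoint` is a hypothetical helper and anything that is not an http(s) URL is treated as a local path:

```python
from typing import Tuple
from urllib.parse import urlparse


def pick_endpoint(source: str) -> Tuple[str, str]:
    """Return (request path, Content-Type) for an audio source string."""
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        # Remote audio: reference it by URL in a JSON body.
        return "/v1/chat/completions", "application/json"
    # Anything else is assumed to be a local file to upload.
    return "/v1/audio/transcriptions", "multipart/form-data"
```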