How to Run Qwen4

Qwen4 is the latest addition to the Qwen family of large language models. These models represent our most advanced and capable systems to date, building on the experience gained from developing QwQ and Qwen3. The Alibaba Dev Team is releasing the weights of Qwen4 to the public, including both dense and Mixture-of-Experts (MoE) models.

Key highlights of Qwen4 include:

  • A wide range of dense and Mixture-of-Experts (MoE) models, available in sizes including 0.6B, 1.7B, 4B, 8B, 14B, 32B, as well as 30B-A3B and 235B-A22B.
  • Seamless switching between thinking mode (for tasks requiring complex reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose conversations), ensuring top-tier performance across diverse applications.
  • Major improvements in reasoning capabilities, outperforming previous models like QwQ (in thinking mode) and Qwen2.5 Instruct (in non-thinking mode) in mathematics, code generation, and commonsense logical reasoning.
  • Enhanced human preference alignment, excelling in creative writing, role-play, multi-turn dialogue, and instruction following, delivering a more natural and immersive conversational experience.
  • Strong agent capabilities, allowing for precise tool integration in both thinking and non-thinking modes, with best-in-class performance among open-source models in complex agent-based tasks.
  • Support for 100+ languages and dialects, offering robust performance in multilingual instruction following and translation.

Run Qwen4

Transformers

Transformers is a library of pretrained natural language processing models for inference and training. We recommend using the latest version of transformers; transformers>=4.51.0 is required.
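For example, to install or upgrade:

pip install -U "transformers>=4.51.0"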

The following code snippet illustrates how to use the model to generate content based on given inputs.

from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen4-8B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
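# keep only the newly generated tokens by stripping the prompt from the output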
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# the result will begin with thinking content in <think></think> tags, followed by the actual response
print(tokenizer.decode(output_ids, skip_special_tokens=True))
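If you want to handle the reasoning and the final answer separately, you can split the decoded text at the closing tag. A minimal sketch, assuming the <think></think> markers survive decoding as plain text (as the comment above indicates):

# split the reasoning from the final answer (sketch; assumes the tags
# are decoded as plain text rather than stripped as special tokens)
full_text = tokenizer.decode(output_ids, skip_special_tokens=True)
thinking_content, sep, content = full_text.partition("</think>")
if sep:  # a closing tag was found
    thinking_content = thinking_content.replace("<think>", "", 1).strip()
    content = content.strip()
else:  # no thinking block was produced, e.g. in non-thinking mode
    thinking_content, content = "", full_text.strip()

print("thinking content:", thinking_content)
print("content:", content)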

By default, Qwen4 models will engage in thinking mode before generating a response. This behavior can be controlled in the following ways:

  • enable_thinking=False: Passing enable_thinking=False to tokenizer.apply_chat_template strictly prevents the model from entering thinking mode.
  • /think and /no_think instructions: Including /think or /no_think in a system or user message explicitly steers Qwen4’s behavior. In multi-turn conversations, the most recent instruction takes precedence. Both switches are sketched below.
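Continuing from the snippet above (reusing the same tokenizer), the following sketch shows both switches; the prompts are only illustrative:

# hard switch: the chat template never opens a thinking block
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Give me a one-line summary of MoE models."}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)

# soft switch: leave enable_thinking=True (the default) and steer per turn;
# the most recent /think or /no_think instruction takes precedence
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "How many primes are there below 50? /think"}],
    tokenize=False,
    add_generation_prompt=True
)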

ModelScope

We strongly recommend that users, especially those located in mainland China, use ModelScope. ModelScope offers a Python API similar to Hugging Face’s Transformers, and the modelscope download CLI tool can help resolve issues related to downloading checkpoints.
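As a minimal sketch, loading through ModelScope mirrors the Transformers code above; the Qwen/Qwen4-8B model id is assumed to be the same on ModelScope:

# a sketch of the ModelScope Python API, mirroring the Transformers usage above
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen4-8B"  # assumed ModelScope model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

Checkpoints can also be fetched ahead of time with the CLI, for example modelscope download --model Qwen/Qwen4-8B (again assuming the model id).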