moonshotai/Kimi-K2-Instruct

Kimi's latest and most powerful open-source model.

Kimi-K2-Instruct

1. Model Introduction

Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters. Trained with the Muon optimizer, Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks while being meticulously optimized for agentic capabilities.

Key Features

  • Large-Scale Training: Pre-trained a 1T-parameter MoE model on 15.5T tokens with zero training instability.
  • MuonClip Optimizer: We apply the Muon optimizer at an unprecedented scale and develop novel optimization techniques to resolve instabilities while scaling up.
  • Agentic Intelligence: Specifically designed for tool use, reasoning, and autonomous problem-solving.
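The sparse activation behind these numbers can be illustrated with a toy top-k router using the figures from the model summary (8 of 384 routed experts plus 1 shared expert per token). This is an illustrative sketch of generic MoE gating, not Kimi K2's actual routing code:

```python
import numpy as np

N_EXPERTS = 384   # routed experts (from the model summary)
TOP_K = 8         # experts selected per token
N_SHARED = 1      # shared expert, always active

def route_token(logits: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Pick the top-k routed experts for one token and softmax-normalize
    their gate weights (a common MoE gating scheme; details here are
    illustrative, not Kimi K2's exact formulation)."""
    top_idx = np.argsort(logits)[-TOP_K:]           # indices of the k largest logits
    gate = np.exp(logits[top_idx] - logits[top_idx].max())
    gate /= gate.sum()                              # weights over the selected experts
    return top_idx, gate

rng = np.random.default_rng(0)
idx, gate = route_token(rng.standard_normal(N_EXPERTS))
active = len(idx) + N_SHARED                        # experts that actually run
print(active, "of", N_EXPERTS + N_SHARED, "experts active")
```

Only 9 of 385 expert FFNs execute per token, which is how a 1T-parameter model keeps its per-token compute at roughly the level of a 32B dense model.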

Model Variants

  • Kimi-K2-Base: The foundation model, a strong start for researchers and builders who want full control for fine-tuning and custom solutions.
  • Kimi-K2-Instruct: The post-trained model best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking.
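Since the card highlights tool use and agentic workflows, here is a minimal sketch of the kind of OpenAI-compatible tool-calling request such deployments commonly accept. The endpoint, tool name, and schema below are generic assumptions for illustration, not taken from this card:

```python
import json

# Hypothetical tool definition in the OpenAI-compatible "function" format.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool name, for illustration only
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

payload = {
    "model": "moonshotai/Kimi-K2-Instruct",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",
}

# In a real client this dict would be POSTed to the provider's
# /v1/chat/completions endpoint; here we only show the request shape.
print(json.dumps(payload, indent=2))
```

The model is expected to respond with a `tool_calls` entry naming the function and its arguments, which the agent loop then executes and feeds back as a `tool` message.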

[Figure: Evaluation Results]

2. Model Summary

| Item | Value |
|---|---|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers (Dense layer included) | 61 |
| Number of Dense Layers | 1 |
| Attention Hidden Dimension | 7168 |
| MoE Hidden Dimension (per Expert) | 2048 |
| Number of Attention Heads | 64 |
| Number of Experts | 384 |
| Selected Experts per Token | 8 |
| Number of Shared Experts | 1 |
| Vocabulary Size | 160K |
| Context Length | 128K |
| Attention Mechanism | MLA |
| Activation Function | SwiGLU |
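A back-of-envelope check of how these numbers combine (my own rough arithmetic, counting only the SwiGLU expert FFNs and ignoring attention, embeddings, and the dense layer's FFN):

```python
hidden = 7168        # attention hidden dimension
moe_hidden = 2048    # MoE hidden dimension per expert
n_layers = 61
n_dense = 1
n_moe_layers = n_layers - n_dense   # 60 MoE layers
n_experts = 384
top_k = 8
n_shared = 1

# A SwiGLU FFN has three weight matrices: gate and up projections
# (hidden -> moe_hidden) and a down projection (moe_hidden -> hidden).
params_per_expert = 3 * hidden * moe_hidden          # ~44M per expert

total_expert_params = params_per_expert * n_experts * n_moe_layers
active_expert_params = params_per_expert * (top_k + n_shared) * n_moe_layers

print(f"total expert params:  {total_expert_params / 1e12:.2f}T")   # ~1.01T
print(f"active expert params: {active_expert_params / 1e9:.1f}B")   # ~23.8B
```

The expert FFNs alone account for roughly 1.01T total and 23.8B activated parameters; the gap to the quoted 1T / 32B figures is filled by the MLA attention weights, the dense layer, and the embeddings (a 160K vocabulary at width 7168 is about 1.1B parameters).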

3. Evaluation Results

Instruction model evaluation results

| Benchmark | Metric | Kimi K2 Instruct | DeepSeek-V3-0324 | Qwen3-235B-A22B (non-thinking) | Claude Sonnet 4 (w/o extended thinking) | Claude Opus 4 (w/o extended thinking) | GPT-4.1 | Gemini 2.5 Flash Preview (05-20) |
|---|---|---|---|---|---|---|---|---|
| **Coding Tasks** | | | | | | | | |
| LiveCodeBench v6 (Aug 24 - May 25) | Pass@1 | 53.7 | 46.9 | 37.0 | 48.5 | 47.4 | 44.7 | 44.7 |
| OJBench | Pass@1 | 27.1 | 24.0 | 11.3 | 15.3 | 19.6 | 19.5 | 19.5 |
| MultiPL-E | Pass@1 | 85.7 | 83.1 | 78.2 | 88.6 | 89.6 | 86.7 | 85.6 |
| SWE-bench Verified (Agentless Coding) | Single Patch w/o Test (Acc) | 51.8 | 36.6 | 39.4 | 50.2 | 53.0 | 40.8 | 32.6 |
| SWE-bench Verified (Agentic Coding) | Single Attempt (Acc) | 65.8 | 38.8 | 34.4 | 72.7* | 72.5* | 54.6 | - |
| | Multiple Attempts (Acc) | 71.6 | - | - | 80.2 | 79.4* | - | - |
| SWE-bench Multilingual (Agentic Coding) | Single Attempt (Acc) | 47.3 | 25.8 | 20.9 | 51.0 | - | 31.5 | - |
| TerminalBench | Inhouse Framework (Acc) | 30.0 | - | - | 35.5 | 43.2 | 8.3 | - |
| | Terminus (Acc) | 25.0 | 16.3 | 6.6 | - | - | 30.3 | 16.8 |
| Aider-Polyglot | Acc | 60.0 | 55.1 | 61.8 | 56.4 | 70.7 | 52.4 | 44.0 |
| **Tool Use Tasks** | | | | | | | | |
| Tau2 retail | Avg@4 | 70.6 | 69.1 | 57.0 | 75.0 | 81.8 | 74.8 | 64.3 |
| Tau2 airline | Avg@4 | 56.5 | 39.0 | 26.5 | 55.5 | 60.0 | 54.5 | 42.5 |
| Tau2 telecom | Avg@4 | 65.8 | 32.5 | 22.1 | 45.2 | 57.0 | 38.6 | 16.9 |
| AceBench | Acc | 76.5 | 72.7 | 70.5 | 76.2 | 75.6 | 80.1 | 74.5 |
| **Math & STEM Tasks** | | | | | | | | |
| AIME 2024 | Avg@64 | 69.6 | 59.4* | 40.1* | 43.4 | 48.2 | 46.5 | 61.3 |
| AIME 2025 | Avg@64 | 49.5 | 46.7 | 24.7* | 33.1* | 33.9* | 37.0 | 46.6 |
| MATH-500 | Acc | 97.4 | 94.0* | 91.2* | 94.0 | 94.4 | 92.4 | 95.4 |
| HMMT 2025 | Avg@32 | 38.8 | 27.5 | 11.9 | 15.9 | 15.9 | 19.4 | 34.7 |
| CNMO 2024 | Avg@16 | 74.3 | 74.7 | 48.6 | 60.4 | 57.6 | 56.6 | 75.0 |
| PolyMath-en | Avg@4 | 65.1 | 59.5 | 51.9 | 52.8 | 49.8 | 54.0 | 49.9 |
| ZebraLogic | Acc | 89.0 | 84.0 | 37.7* | 73.7 | 59.3 | 58.5 | 57.9 |
| AutoLogi | Acc | 89.5 | 88.9 | 83.3 | 89.8 | 86.1 | 88.2 | 84.1 |
| GPQA-Diamond | Avg@8 | 75.1 | 68.4* | 62.9* | 70.0* | 74.9* | 66.3 | 68.2 |
| SuperGPQA | Acc | 57.2 | 53.7 | 50.2 | 55.7 | 56.5 | 50.8 | 49.6 |
| Humanity's Last Exam (Text Only) | - | 4.7 | 5.2 | 5.7 | 5.8 | 7.1 | 3.7 | 5.6 |
| **General Tasks** | | | | | | | | |
| MMLU | EM | 89.5 | 89.4 | 87.0 | 91.5 | 92.9 | 90.4 | 90.1 |
| MMLU-Redux | EM | 92.7 | 90.5 | 89.2 | 93.6 | 94.2 | 92.4 | 90.6 |
| MMLU-Pro | EM | 81.1 | 81.2* | 77.3 | 83.7 | 86.6 | 81.8 | 79.4 |
| IFEval | Prompt Strict | 89.8 | 81.1 | 83.2* | 87.6 | 87.4 | 88.0 | 84.3 |
| Multi-Challenge | Acc | 54.1 | 31.4 | 34.0 | 46.8 | 49.0 | 36.4 | 39.5 |
| SimpleQA | Correct | 31.0 | 27.7 | 13.2 | 15.9 | 22.8 | 42.3 | 23.3 |
| Livebench | Pass@1 | 76.4 | 72.4 | 67.6 | 74.8 | 74.6 | 69.8 | 67.8 |

• Bold denotes global SOTA, and underlined denotes open-source SOTA.

• Data points marked with * are taken directly from the model's tech report or blog.

• All metrics, except for SWE-bench Verified (Agentless), are evaluated with an 8k output token length. SWE-bench Verified (Agentless) is limited to a 16k output token length.

• Kimi K2 achieves 65.8% pass@1 on the SWE-bench Verified tests with bash/editor tools (single-attempt patches, no test-time compute). It also achieves a 47.3% pass@1 on the SWE-bench Multilingual tests under the same conditions. Additionally, we report results on SWE-bench Verified tests (71.6%) that leverage parallel test-time compute by sampling multiple sequences and selecting the single best via an internal scoring model.

• To ensure the stability of the evaluation, we employed avg@k on the AIME, HMMT, CNMO, PolyMath-en, GPQA-Diamond, EvalPlus, and Tau2 benchmarks.

• Some data points have been omitted due to prohibitively expensive evaluation costs.


Base model evaluation results

| Benchmark | Metric | Shot | Kimi K2 Base | DeepSeek-V3-Base | Qwen2.5-72B | Llama 4 Maverick |
|---|---|---|---|---|---|---|
| **General Tasks** | | | | | | |
| MMLU | EM | 5-shot | 87.8 | 87.1 | 86.1 | 84.9 |
| MMLU-pro | EM | 5-shot | 69.2 | 60.6 | 62.8 | 63.5 |
| MMLU-redux-2.0 | EM | 5-shot | 90.2 | 89.5 | 87.8 | 88.2 |
| SimpleQA | Correct | 5-shot | 35.3 | 26.5 | 10.3 | 23.7 |
| TriviaQA | EM | 5-shot | 85.1 | 84.1 | 76.0 | 79.3 |
| GPQA-Diamond | Avg@8 | 5-shot | 48.1 | 50.5 | 40.8 | 49.4 |
| SuperGPQA | EM | 5-shot | 44.7 | 39.2 | 34.2 | 38.8 |
| **Coding Tasks** | | | | | | |
| LiveCodeBench v6 | Pass@1 | 1-shot | 26.3 | 22.9 | 21.1 | 25.1 |
| EvalPlus | Pass@1 | - | 80.3 | 65.6 | 66.0 | 65.5 |
| **Mathematics Tasks** | | | | | | |
| MATH | EM | 4-shot | 70.2 | 60.1 | 61.0 | 63.0 |
| GSM8k | EM | 8-shot | 92.1 | 91.7 | 90.4 | 86.3 |
| **Chinese Tasks** | | | | | | |
| C-Eval | EM | 5-shot | 92.5 | 90.0 | 90.9 | 80.9 |
| CSimpleQA | Correct | 5-shot | 77.6 | 72.1 | 50.5 | 53.5 |
