Analyzing DeepSeek-V3 Model Performance

Analyzing DeepSeek: An Introduction

Deep learning models have witnessed remarkable progress in recent years, with large-scale transformer-based models achieving state-of-the-art performance across various natural language processing (NLP) tasks. DeepSeek-R1/V3, as one of the latest iterations of such models, introduces sophisticated architectural improvements and optimized deployment strategies to enhance inference efficiency. However, as model size and complexity continue to grow, understanding the computational and memory access patterns of these models becomes critical for optimizing inference performance. This article aims to provide a comprehensive analysis of the inference efficiency of DeepSeek-R1/V3, focusing on both theoretical and empirical aspects.

The article is organized into three main sections. The first section, Model Architecture, provides a detailed overview of the DeepSeek-R1/V3 model structure, including the key operators it employs, the Multi-head Latent Attention mechanism, and the computation process within the Mixture of Experts (MoE) module. The second section, Roofline Analysis, investigates the computational intensity and memory access characteristics of all operators in the DeepSeek-R1/V3 model, aiming to determine which operators are compute-bound and which are memory-bound. The third section, Distributed Deployment, explores the deployment strategies used to enable efficient inference at scale, covering Expert Parallelism, Tensor Parallelism, and Data Parallelism.

By combining architectural insights, deployment strategies, and low-level performance analysis, this article aims to offer a comprehensive understanding of the inference efficiency of DeepSeek-R1/V3 and provide valuable guidance for optimizing large-scale model deployment and execution.

Model Architecture

Introduction to Model Architecture

DeepSeek-R1/V3 is a large-scale language model designed with a sophisticated transformer-based architecture to achieve high efficiency and state-of-the-art performance. Its architecture integrates several key innovations, including Multi-head Latent Attention (MLA) and Mixture of Experts (MoE), which enable enhanced scalability and computational efficiency. Understanding the internal structure and computation flow of DeepSeek-R1/V3 is essential for optimizing its performance and deployment.

  1. Overview of the Model Structure

The left side of the diagram illustrates the overall structure of the DeepSeek-R1/V3 model. It is composed of multiple layers, listed below (a minimal structural sketch in code follows the list):

  • VocabParallelEmbedding: Handles token embeddings using parallelized operations to enhance efficiency during embedding lookup.
  • Dense Decoder Layers: The model contains a few dense decoder layers that follow a standard transformer block structure, consisting of layer normalization, MLA, and a feedforward MLP layer.
  • MoE Decoder Layers: Most decoder layers are MoE-based, where each layer selects a subset of specialized experts for each token, improving computational efficiency and model capacity.
  • Final RMSNorm: A final layer normalization is applied before generating output logits.
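
To make the layer composition concrete, here is a minimal structural sketch in PyTorch. It is an illustrative stand-in, not the actual DeepSeek implementation: the attention and feedforward blocks are replaced by a plain nn.MultiheadAttention and a simple MLP, the parallelized embedding is replaced by a single nn.Embedding, and the causal mask is omitted. Only the dense-then-MoE layer layout and the final RMSNorm follow the structure described above (nn.RMSNorm requires PyTorch 2.4 or newer).

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, hidden, n_heads, ffn):
        super().__init__()
        self.attn_norm = nn.RMSNorm(hidden)    # pre-attention normalization
        self.attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)  # stand-in for MLA
        self.ffn_norm = nn.RMSNorm(hidden)     # pre-FFN normalization
        self.ffn = ffn                         # dense MLP in early layers, MoE layer afterwards

    def forward(self, x):
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention (mask omitted)
        x = x + self.ffn(self.ffn_norm(x))                  # residual around the FFN
        return x

class DeepseekV3Sketch(nn.Module):
    def __init__(self, vocab=1000, hidden=256, n_heads=4, n_dense=1, n_moe=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)            # single-GPU stand-in for VocabParallelEmbedding
        make_ffn = lambda: nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.SiLU(), nn.Linear(4 * hidden, hidden))
        self.layers = nn.ModuleList([DecoderLayer(hidden, n_heads, make_ffn()) for _ in range(n_dense + n_moe)])
        self.final_norm = nn.RMSNorm(hidden)                # final RMSNorm before the output logits
        self.lm_head = nn.Linear(hidden, vocab, bias=False)

    def forward(self, tokens):
        x = self.embed(tokens)
        for layer in self.layers:
            x = layer(x)
        return self.lm_head(self.final_norm(x))

logits = DeepseekV3Sketch()(torch.randint(0, 1000, (2, 16)))  # [batch, seq, vocab]
```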
  2. Multi-head Latent Attention (MLA)

The upper right section of the diagram details the computation flow within the MLA mechanism, which operates differently during the Prefill and Decode stages (a minimal cached-attention sketch in code follows the list):

  • Prefill Stage: In this stage, the MLA layer processes all input tokens simultaneously, enabling efficient batched processing. The computation involves key, query, and value projections, rotary positional embeddings (RoPE), and softmax-based attention.
  • Decode Stage: During decoding, MLA operates incrementally, handling one token at a time while leveraging cached key-value pairs from previous steps. This stage introduces an optimized mechanism for incremental attention computation to minimize latency.
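
The split between the two stages can be sketched with ordinary single-head attention and a key-value cache. This is only an illustration of the prefill/decode behavior described above; the real MLA additionally caches the compressed latent and the RoPE keys, as detailed in the Roofline Analysis section.

```python
import torch

def prefill(q, k, v):
    # q, k, v: [seq, head_dim]; all prompt tokens are processed in one batched matmul
    scores = (q @ k.T) / k.shape[-1] ** 0.5
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()   # causal mask
    probs = scores.masked_fill(mask, float("-inf")).softmax(dim=-1)
    return probs @ v, (k, v)                                        # output + cache for decode

def decode_step(q_t, k_t, v_t, cache):
    # q_t, k_t, v_t: [1, head_dim]; only the new token is computed,
    # previous keys/values are read back from the cache
    k = torch.cat([cache[0], k_t]); v = torch.cat([cache[1], v_t])
    probs = ((q_t @ k.T) / k.shape[-1] ** 0.5).softmax(dim=-1)
    return probs @ v, (k, v)

d = 64
q, k, v = torch.randn(8, d), torch.randn(8, d), torch.randn(8, d)
out, cache = prefill(q, k, v)                                       # prefill: 8 tokens at once
out_t, cache = decode_step(torch.randn(1, d), torch.randn(1, d), torch.randn(1, d), cache)
```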
  3. Mixture of Experts (MoE)

The lower middle section shows the internal structure of the MoE layer. The model dynamically selects a small number of specialized experts for each token using a gating mechanism:

  • A gating network determines which experts to activate.
  • Each activated expert performs independent feedforward computation.
  • Outputs from the selected experts are combined to form the final MoE output.

This mechanism allows the model to scale to larger parameter counts while keeping computational costs manageable.
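
A minimal top-k routing sketch of this gating mechanism is shown below. The softmax-then-top-k gate and the small expert MLPs are simplifying assumptions for illustration; the actual DeepSeek gate uses its own scoring and load-balancing details, and efficient implementations dispatch tokens in batches rather than looping over experts.

```python
import torch
import torch.nn as nn

class SimpleMoE(nn.Module):
    def __init__(self, hidden, inter, n_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(hidden, n_experts, bias=False)   # gating network
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden, inter), nn.SiLU(), nn.Linear(inter, hidden))
             for _ in range(n_experts)]
        )
        self.top_k = top_k

    def forward(self, x):                                      # x: [tokens, hidden]
        weights, idx = self.gate(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                            # each token visits only its top-k experts
            for e, expert in enumerate(self.experts):
                sel = idx[:, k] == e
                if sel.any():
                    out[sel] += weights[sel, k:k + 1] * expert(x[sel])
        return out                                             # weighted combination of expert outputs

y = SimpleMoE(hidden=32, inter=64)(torch.randn(5, 32))
```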

  4. Feedforward Network (MLP)

The lower right section illustrates the computation flow of the MLP layer, which follows a gated structure with two parallel up-projections, a SiLU activation, and a down-projection, implemented with parallelized matrix multiplications. This design ensures high throughput and efficient memory access.

Model Hyperparameters

Listed below are all hyperparameters of DeepSeek-R1/V3 that we will use in later sections for the roofline calculations; they are also collected into a code snippet after the tables.

Hyperparameters Table

| Hyperparameter | Description | Typical Value |
| --- | --- | --- |
| Bmax | Maximum serving batch size | 8 |
| Lmax | Maximum sequence length | 4096 * 4 |
| Svocab | Vocabulary size | 129280 |
| Hhidden_dim | Dimension of the hidden embedding of a token | 7168 |
| Hinter_dim | Dimension of the hidden embedding inside the MLP | 18432 |

MoE Hyperparameters Table

| Hyperparameter | Description | Typical Value |
| --- | --- | --- |
| Nrouted_experts | Number of routed experts in each MoE layer | 256 |
| Nshared_experts | Number of shared experts in each MoE layer | 1 |
| Nactivated_experts | Number of routed experts activated per token | 8 |

MLA Hyperparameters Table

| Hyperparameter | Description | Typical Value |
| --- | --- | --- |
| Hq_lora_rank | Compressed dimension of the query | 1536 |
| Hkv_lora_rank | Compressed dimension of the key / value | 512 |
| Hqk_nope_head_dim | Per-head dimension of the query / key (non-positional part) | 128 |
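
For convenience, the sketch below collects the values from the three tables into a single Python dictionary, so that the roofline formulas in the next section can be evaluated numerically.

```python
# Hyperparameters of DeepSeek-R1/V3, taken directly from the tables above.
deepseek_v3 = {
    "Bmax": 8,                    # maximum serving batch size
    "Lmax": 4096 * 4,             # maximum sequence length
    "Svocab": 129280,             # vocabulary size
    "Hhidden_dim": 7168,          # hidden dimension of a token embedding
    "Hinter_dim": 18432,          # intermediate dimension inside the MLP
    "Nrouted_experts": 256,       # routed experts per MoE layer
    "Nshared_experts": 1,         # shared experts per MoE layer
    "Nactivated_experts": 8,      # routed experts activated per token
    "Hq_lora_rank": 1536,         # compressed query dimension
    "Hkv_lora_rank": 512,         # compressed key/value dimension
    "Hqk_nope_head_dim": 128,     # per-head non-positional query/key dimension
}
```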

Roofline Analysis

Basic

The table below gives the cost model for the basic operators; B denotes the batch size, and shapes are listed per batch element.

| Operation | Formula | Input Shape | Output Shape | Computation FLOPs | Memory Read | Memory Write |
| --- | --- | --- | --- | --- | --- | --- |
| Matmul | Y = XW | X ∈ [M, K], W ∈ [K, N] | Y ∈ [M, N] | B⋅2MKN | B⋅MK + KN | B⋅MN |
| RoPE | Y = RoPE(X) | X ∈ [M] | Y ∈ [M] | B⋅6M | B⋅2M | B⋅2M |
| LayerNorm | Y = LayerNorm(X) | X ∈ [M] | Y ∈ [M] | B⋅8M + 4 | B⋅M | B⋅M |
| Softmax | Y = Softmax(X) | X ∈ [M] | Y ∈ [M] | B⋅4M | B⋅M | B⋅M |
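
As an example of how this table is used, the snippet below computes the arithmetic intensity of the Matmul row for a square Hhidden_dim × Hhidden_dim projection at batch size Bmax. The 2-byte (FP16/BF16) element width is an assumption, not a value from the table.

```python
def matmul_intensity(B, M, K, N, bytes_per_elem=2):
    flops = B * 2 * M * K * N                        # Computation FLOPs (Matmul row)
    mem = bytes_per_elem * (B * M * K + K * N        # Memory Read
                            + B * M * N)             # Memory Write
    return flops / mem                               # FLOPs per byte moved

# A square Hhidden_dim x Hhidden_dim projection at batch size Bmax = 8:
print(matmul_intensity(B=8, M=1,    K=7168, N=7168))   # decode (M = 1): ~8 FLOPs/byte, typically memory-bound
print(matmul_intensity(B=8, M=4096, K=7168, N=7168))   # prefill (M = 4096): ~3200 FLOPs/byte, compute-bound
```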

MLA (Normal Version)

In the Prefill stage of the Multi-head Latent Attention (MLA) operator:

First, in step (1), the input tensor of shape Hhidden_dim is multiplied by the weight matrix Wkv, resulting in a tensor of shape Hkv_lora_rank + Hqk_rope_head_dim. The part of this output corresponding to Hkv_lora_rank is stored in the "kv cache".

Then, step (2) applies the Rotary Position Embedding (RoPE) operation to the relevant part of the output from step (1) with shape Hqk_rope_head_dim, and the result is stored in the "rope cache".

In step (3), the compressed Hkv_lora_rank portion from step (1) is multiplied by a second key/value weight matrix, generating a tensor with a more complex shape involving Nhead, Hqk_nope_head_dim and Hv_head_dim.

Next, step (4) multiplies the input by the weight matrix Wqa, producing an output of shape Hq_lora_rank, which is further processed in step (5) by multiplying with Wqb to obtain a tensor of shape Nhead × Hqk_head_dim. This tensor is then split into per-head parts, and for each part the RoPE operation is applied in step (6).

After that, in step (7), element-wise multiplication (Mul) is carried out between the RoPE-processed output from step (6) and a corresponding part from the output of step (3). The result of this is used in another element-wise multiplication in step (8).

Step (9) performs an addition operation on the output of step (8). Then, step (10) applies the Softmax function to the output of step (9), producing a probability-like tensor.

In step (11), another element-wise multiplication is performed, and the resulting tensor is reshaped along its head dimension.

Finally, in step (12), the output is multiplied by the weight matrix Wo; the final output tensor has shape Hhidden_dim, matching the batch and sequence-length dimensions of the initial input.
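
As a worked example of applying the Basic table to one of these steps, the snippet below evaluates step (4), the query down-projection from Hhidden_dim to Hq_lora_rank, for a single decode token; the 2-byte element width is an assumption.

```python
# Step (4) per decode token: X [1, Hhidden_dim] x Wqa [Hhidden_dim, Hq_lora_rank]
B, M, K, N = 1, 1, 7168, 1536
flops = B * 2 * M * K * N                        # ~22.0 MFLOPs
mem_bytes = 2 * (B * M * K + K * N + B * M * N)  # ~22.0 MB, dominated by reading Wqa
print(flops / mem_bytes)                         # ~1 FLOP/byte: firmly memory-bound in decode
```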

Below, we summarize the detailed computation process, computation FLOPs, and memory reads/writes of MLA (normal version); we also give the computation intensity for reference.

MoE

The Mixture of Experts (MoE) calculation process starts with the Gate operation (1). It takes an input tensor of shape Hhidden_dim and produces a tensor of shape Nrouted_experts. This operation is a linear projection in which the input is multiplied by the gating weight matrix to generate the gate values.

For the specialized expert part, operations (2) and (3) both perform matrix-vector multiplications: each takes the input of shape Hhidden_dim and multiplies it with a weight matrix to produce an output of shape Hinter_dim. Then, operation (4) applies the SiLU activation function to the output of operation (2), keeping the output shape unchanged. Next, operation (5) performs an element-wise dot-product multiplication between the output of operation (3) and the SiLU-activated output of operation (4). Finally, operation (6) performs another matrix-vector multiplication on the output of operation (5), transforming the shape back to Hhidden_dim.

After processing through the specialized experts, their results, along with the output from the shared expert, are combined using an element-wise addition operation (7) to produce the final output of shape Hhidden_dim.
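
A rough per-token FLOP count for this MoE layer can be sketched from the description above: one gate projection plus three matrix multiplications per active expert (operations (2), (3) and (6)); the SiLU and the element-wise product are negligible and ignored here. The per-expert intermediate dimension is left as a parameter, since the hyperparameter table only lists the dense-MLP value; the 2048 used in the example is an assumption.

```python
def moe_flops_per_token(Hhidden, Hexpert_inter, Nactivated, Nrouted, Nshared=1):
    gate = 2 * Hhidden * Nrouted                        # operation (1): gating projection
    per_expert = 3 * 2 * Hhidden * Hexpert_inter        # operations (2), (3) and (6)
    return gate + (Nactivated + Nshared) * per_expert   # routed + shared experts

# Assuming a per-expert intermediate dimension of 2048:
print(moe_flops_per_token(7168, 2048, Nactivated=8, Nrouted=256))   # ~0.8 GFLOPs per token
```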

Below, we summarize the detailed computation process, computation FLOPs, and memory reads/writes of MoE; we also give the computation intensity for reference.

MLP

The computation process of the MLP (Multi-Layer Perceptron) starts with an input tensor of shape Hhidden_dim.

In step (1), the input tensor is multiplied by the weight matrix W1. This matrix multiplication operation transforms the input into a tensor of shape Hinter_dim.

Simultaneously, in step (2), the same input tensor is multiplied by the weight matrix W3, also resulting in a tensor of shape Hinter_dim.

Next, in step (3), the output from step (1) goes through the SiLU (Sigmoid Linear Unit) activation function. The SiLU function introduces non-linearity while keeping the output shape as Hinter_dim.

After that, in step (4), an element-wise dot-product multiplication (DotMul) is performed between the output of the SiLU activation (step 3) and the output from step (2). The result of this operation still has the shape Hinter_dim.

Finally, in step (5), the output from step (4) is multiplied by the weight matrix W2. This last matrix-vector multiplication converts the tensor back to the shape Hhidden_dim, which is the final output of the MLP operator.
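
The five steps above correspond directly to the gated MLP sketched below; the weight names (w1, w2, w3) follow the description, while the bias-free nn.Linear layers are an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMLP(nn.Module):
    def __init__(self, hidden_dim, inter_dim):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, inter_dim, bias=False)   # step (1): up-projection
        self.w3 = nn.Linear(hidden_dim, inter_dim, bias=False)   # step (2): parallel up-projection
        self.w2 = nn.Linear(inter_dim, hidden_dim, bias=False)   # step (5): down-projection

    def forward(self, x):
        # step (3): SiLU on the W1 branch; step (4): element-wise product with the W3 branch
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Small dimensions for the demo; the model uses Hhidden_dim = 7168 and Hinter_dim = 18432.
y = GatedMLP(hidden_dim=512, inter_dim=1408)(torch.randn(2, 512))
```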

Below, we summarize the detailed computation process, computation FLOPs, and memory reads/writes of MLP; we also give the computation intensity for reference.