🦙 Llama 3.2 3B Architecture Visualization

Accurate tensor shapes and weight matrix dimensions

📊 Model Configuration

| Parameter | Value |
|---|---|
| Architecture | Pre-LayerNorm (RMSNorm) decoder-only Transformer |
| vocab_size | 128,256 |
| hidden_size | 3072 |
| num_hidden_layers | 28 |
| num_attention_heads | 24 (GQA: 8 KV heads) |
| head_dim | 128 |
| intermediate_size | 8192 (SwiGLU MLP) |
| rope_theta | 500,000 |
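The configuration above can be written out as a plain dictionary to show how the projection widths are derived (field names follow the Hugging Face `LlamaConfig` convention, which is an assumption about naming, not part of the table itself):

```python
# Llama 3.2 3B configuration (values from the table above; key names
# follow the Hugging Face `LlamaConfig` convention as an assumption).
config = {
    "vocab_size": 128_256,
    "hidden_size": 3072,
    "num_hidden_layers": 28,
    "num_attention_heads": 24,
    "num_key_value_heads": 8,   # GQA: 24 query heads share 8 KV heads
    "head_dim": 128,
    "intermediate_size": 8192,
    "rope_theta": 500_000.0,
}

# Derived widths that determine the projection shapes listed below.
q_dim = config["num_attention_heads"] * config["head_dim"]     # 24 * 128 = 3072
kv_dim = config["num_key_value_heads"] * config["head_dim"]    # 8 * 128 = 1024
gqa_group = config["num_attention_heads"] // config["num_key_value_heads"]  # 3

print(q_dim, kv_dim, gqa_group)  # → 3072 1024 3
```

Note that `q_dim` equals `hidden_size` here, which is why `q_proj` is square while `k_proj`/`v_proj` are only 1024 rows wide.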

📦 Weight Matrix Dimensions

Per-layer components (self-attention, MLP, LayerNorm) repeat × 28; the embedding, final norm, and LM head appear once at the model level.

| Component | Weight Name | Shape | Description |
|---|---|---|---|
| Embedding | embed_tokens.weight | [128256, 3072] | Vocabulary → hidden |
| Self-attention | self_attn.q_proj.weight | [3072, 3072] | Query: 24 heads × 128 dim |
| | self_attn.k_proj.weight | [1024, 3072] | Key: 8 heads × 128 dim (GQA) |
| | self_attn.v_proj.weight | [1024, 3072] | Value: 8 heads × 128 dim (GQA) |
| | self_attn.o_proj.weight | [3072, 3072] | Output projection |
| MLP (SwiGLU) | mlp.gate_proj.weight | [8192, 3072] | Gate projection |
| | mlp.up_proj.weight | [8192, 3072] | Up projection |
| | mlp.down_proj.weight | [3072, 8192] | Down projection |
| LayerNorm | input_layernorm.weight | [3072] | RMSNorm (no bias) |
| | post_attention_layernorm.weight | [3072] | RMSNorm (no bias) |
| Final norm | model.norm.weight | [3072] | Final RMSNorm |
| LM head | lm_head.weight | [128256, 3072] | Hidden → vocabulary (tied to embed_tokens.weight in Llama 3.2) |
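As a sanity check, the shapes in the table can be summed to recover the advertised parameter count. This sketch assumes the LM head shares its weight with the embedding (`tie_word_embeddings=True`), so it contributes no extra parameters:

```python
# Total parameters implied by the weight shapes above.
# Assumption: lm_head.weight is tied to embed_tokens.weight, so the
# head adds no parameters of its own.
V, H, L, I = 128_256, 3072, 28, 8192
kv_dim = 8 * 128  # GQA key/value projection width

embedding = V * H
per_layer = (
    H * H           # q_proj
    + kv_dim * H    # k_proj
    + kv_dim * H    # v_proj
    + H * H         # o_proj
    + I * H         # gate_proj
    + I * H         # up_proj
    + H * I         # down_proj
    + 2 * H         # input / post-attention RMSNorm weights
)
total = embedding + L * per_layer + H  # + final model.norm

print(f"{total:,}")  # → 3,212,749,824
```

That works out to about 3.21B parameters, consistent with the "3B" model name.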

🎛️ Configuration

e.g., "I love you" = 3 tokens

📋 Legend + Tensor Shape Reference

🔄 RoPE - Rotary Position Embedding
⚡ Attention Score Computation
SiLU - Activation used in the SwiGLU MLP
⊕ Residual Add Node
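The legend's RMSNorm, SiLU/SwiGLU, and residual-add nodes can be sketched together as one MLP sub-block (NumPy, random illustrative weights; the epsilon value is an assumption, not taken from the table above):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    # RMSNorm (no bias): divide by the root-mean-square of the last
    # dimension, then scale by the learned weight vector.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def silu(x):
    # SiLU(x) = x * sigmoid(x), the activation inside SwiGLU.
    return x / (1.0 + np.exp(-x))

H, I = 3072, 8192
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 3, H))            # residual stream [1, 3, 3072]
W_gate = rng.standard_normal((I, H)) * 0.01   # gate_proj.weight [8192, 3072]
W_up   = rng.standard_normal((I, H)) * 0.01   # up_proj.weight   [8192, 3072]
W_down = rng.standard_normal((H, I)) * 0.01   # down_proj.weight [3072, 8192]
w_norm = np.ones(H)                           # post_attention_layernorm.weight

h = rms_norm(x, w_norm)                               # pre-norm (Pre-LayerNorm)
mlp_out = (silu(h @ W_gate.T) * (h @ W_up.T)) @ W_down.T  # SwiGLU
y = x + mlp_out                                        # ⊕ residual add

print(y.shape)  # → (1, 3, 3072)
```

Note the Pre-LayerNorm ordering from the configuration table: the norm is applied to the branch input, while the residual add bypasses it untouched.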