Accurate tensor shapes and weight matrix dimensions
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Architecture | Pre-LayerNorm Transformer | vocab_size | 128,256 |
| hidden_size | 3072 | num_hidden_layers | 28 |
| num_attention_heads | 24 (GQA: 8 KV heads) | head_dim | 128 |
| intermediate_size | 8192 (SwiGLU MLP) | rope_theta | 500,000 |
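These dimensions are mutually consistent. A few lines of arithmetic (plain Python; the variable names are illustrative, not tied to any library) show how the GQA projection widths in the table below follow from the config:

```python
# Derive the attention projection widths from the config above.
hidden_size = 3072
num_attention_heads = 24   # query heads
num_kv_heads = 8           # GQA: shared key/value heads
head_dim = hidden_size // num_attention_heads   # 3072 // 24 = 128

q_width = num_attention_heads * head_dim    # q_proj output: 24 * 128 = 3072
kv_width = num_kv_heads * head_dim          # k_proj/v_proj output: 8 * 128 = 1024
group_size = num_attention_heads // num_kv_heads  # 3 query heads share each KV head

print(head_dim, q_width, kv_width, group_size)  # 128 3072 1024 3
```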
| Component | Weight Name | Shape | Description |
|---|---|---|---|
| Embedding | embed_tokens.weight | [128256, 3072] | Vocabulary → Hidden |
| Self-Attention | self_attn.q_proj.weight | [3072, 3072] | Query: 24 heads × 128 dim |
| | self_attn.k_proj.weight | [1024, 3072] | Key: 8 heads × 128 dim (GQA) |
| | self_attn.v_proj.weight | [1024, 3072] | Value: 8 heads × 128 dim (GQA) |
| | self_attn.o_proj.weight | [3072, 3072] | Output projection |
| MLP (SwiGLU) | mlp.gate_proj.weight | [8192, 3072] | Gate projection |
| | mlp.up_proj.weight | [8192, 3072] | Up projection |
| | mlp.down_proj.weight | [3072, 8192] | Down projection |
| LayerNorm | input_layernorm.weight | [3072] | RMSNorm (no bias) |
| | post_attention_layernorm.weight | [3072] | RMSNorm (no bias) |
| Final Norm | model.norm.weight | [3072] | Final RMSNorm |
| LM Head | lm_head.weight | [128256, 3072] | Hidden → Vocabulary |
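The shapes above follow the usual `[out_features, in_features]` convention of a PyTorch `nn.Linear` weight. As a sanity check, a short script (plain Python, no framework needed) can total the parameters implied by the table; whether `lm_head.weight` is stored separately or tied to `embed_tokens.weight` changes the total, so both variants are computed — which one applies depends on the model's `tie_word_embeddings` setting, which the table does not specify.

```python
# Parameter count implied by the table above: per-layer shapes × 28 layers,
# plus embeddings, final norm, and (optionally untied) LM head.
shapes_per_layer = {
    "q_proj": (3072, 3072),
    "k_proj": (1024, 3072),
    "v_proj": (1024, 3072),
    "o_proj": (3072, 3072),
    "gate_proj": (8192, 3072),
    "up_proj": (8192, 3072),
    "down_proj": (3072, 8192),
    "input_layernorm": (3072,),
    "post_attention_layernorm": (3072,),
}

def numel(shape):
    n = 1
    for d in shape:
        n *= d
    return n

per_layer = sum(numel(s) for s in shapes_per_layer.values())
embed = 128_256 * 3072        # embed_tokens.weight
final_norm = 3072             # model.norm.weight

total_tied = 28 * per_layer + embed + final_norm  # lm_head shares embed_tokens
total_untied = total_tied + embed                 # separate lm_head.weight

print(f"per layer: {per_layer:,}")     # 100,669,440
print(f"tied:      {total_tied:,}")    # 3,212,749,824
print(f"untied:    {total_untied:,}")  # 3,606,752,256
```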