Math Problem Statement

Suppose you are pretraining a BERT model with 8 layers, 768-dim hidden states, 8 attention heads, and a sub-word vocabulary of size 40k. Also, your feed-forward hidden layer is of dimension 3072. What will be the number of parameters of the model? You can ignore the bias terms and any other parameters used for the final loss computation from the final encoder representation. The BERT model can take at most 512 tokens in the input.

Solution

To calculate the number of parameters in a BERT model with the provided configuration, we need to consider the following components:

  1. Embedding Layer:

    • BERT uses three embeddings:
      • Token embeddings: Each token in the vocabulary (size 40k) is mapped to a 768-dimensional vector.
      • Positional embeddings: BERT takes a maximum of 512 tokens as input, so we need 512 positional embeddings, each of size 768.
      • Segment embeddings: These differentiate between the two sentences in a pair. Since there are two segment types, we need 2 segment embeddings, each of size 768.

    The number of parameters in the embedding layer:
    $\text{Token embeddings} = 40000 \times 768 = 30,720,000$
    $\text{Position embeddings} = 512 \times 768 = 393,216$
    $\text{Segment embeddings} = 2 \times 768 = 1,536$
    Total embedding parameters: $30,720,000 + 393,216 + 1,536 = 31,114,752$

  2. Self-Attention Layers: BERT uses multi-head self-attention, where each head projects the hidden state into key, query, and value vectors. Since there are 8 heads, the size of the key, query, and value vectors for each head is $\frac{768}{8} = 96$.

    • Q, K, V projections: Each head projects the 768-dimensional hidden state down to 96 dimensions, so its Q, K, and V matrices together have $3 \times 768 \times 96 = 221,184$ parameters. Summed over the 8 heads, this is equivalent to three full $768 \times 768$ matrices: $\text{Q/K/V projections} = 3 \times 768 \times 768 = 1,769,472$

    • Output projection: After concatenating the results of the attention heads, the combined vector (768 dimensions) is projected back to a 768-dimensional space: $\text{Output projection} = 768 \times 768 = 589,824$

    Total attention parameters per layer: $1,769,472 + 589,824 = 2,359,296$

  3. Feed-forward layers: Each transformer layer has a feed-forward network consisting of two linear transformations with a GELU activation in between. The size of the hidden layer is 3072:
    $\text{First layer} = 768 \times 3072 = 2,359,296$
    $\text{Second layer} = 3072 \times 768 = 2,359,296$
    Total feed-forward parameters per layer: $2,359,296 + 2,359,296 = 4,718,592$

  4. Total parameters per transformer layer: Each transformer layer consists of the self-attention mechanism and the feed-forward layers: $\text{Parameters per layer} = 2,359,296 + 4,718,592 = 7,077,888$

  5. Total parameters across 8 layers: Since the model has 8 transformer layers: $\text{Total for 8 layers} = 7,077,888 \times 8 = 56,623,104$

  6. Total Parameters of the Model: Summing the parameters from the embedding layer and the 8 transformer layers (summarized in closed form below): $\text{Total parameters} = 31,114,752 + 56,623,104 = 87,737,856$
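
For reference, the whole calculation collapses into one closed-form expression (writing $V$ for the vocabulary size, $L$ for the maximum input length, $d$ for the hidden size, $d_{ff}$ for the feed-forward size, and $N$ for the number of layers; these symbols are shorthand introduced here, not notation from the problem):

$$(V + L + 2)\,d + N\left(4d^2 + 2\,d\,d_{ff}\right) = (40000 + 512 + 2)\times 768 + 8\left(4\times 768^2 + 2\times 768\times 3072\right) = 87,737,856$$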

Thus, the total number of parameters in the BERT model is 87,737,856 (about 87.7M). The short Python sketch below reproduces this count step by step.
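
To make the arithmetic easy to re-check, here is a minimal Python sketch that reproduces the calculation above; the variable names are illustrative shorthand for the quantities in the problem, not taken from any library.

```python
# Minimal sketch: plain arithmetic, no ML libraries required.
vocab_size    = 40_000   # sub-word vocabulary
max_positions = 512      # maximum input length
num_segments  = 2        # sentence A / sentence B
hidden        = 768      # hidden-state / embedding dimension
ffn_dim       = 3072     # feed-forward hidden dimension
num_layers    = 8
num_heads     = 8        # changes the per-head size, not the parameter count

# 1. Embeddings: token + position + segment
embedding_params = (vocab_size + max_positions + num_segments) * hidden

# 2. Self-attention per layer: Q/K/V projections (768 -> 96 per head,
#    summed over 8 heads this equals 3 * 768 * 768) plus the output projection.
head_dim = hidden // num_heads                    # 96
qkv_params = 3 * hidden * head_dim * num_heads    # 1,769,472
attention_params = qkv_params + hidden * hidden   # 2,359,296

# 3. Feed-forward per layer: 768 -> 3072 -> 768
ffn_params = hidden * ffn_dim + ffn_dim * hidden  # 4,718,592

per_layer = attention_params + ffn_params         # 7,077,888
total = embedding_params + num_layers * per_layer

print(f"Embeddings:       {embedding_params:,}")  # 31,114,752
print(f"Per layer:        {per_layer:,}")         # 7,077,888
print(f"Total (8 layers): {total:,}")             # 87,737,856
```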

Would you like more details or have any questions?

Related Questions:

  1. How does multi-head attention work in a transformer model?
  2. What is the role of positional embeddings in BERT?
  3. Can you explain the difference between token and segment embeddings in BERT?
  4. How does the size of the hidden state affect the number of parameters in BERT?
  5. Why is the feed-forward hidden layer dimension 3072 in BERT?

Tip:

When calculating model parameters, always carefully distinguish between layer dimensions (e.g., hidden state size, embedding size) and the number of heads or layers, as these significantly impact the parameter count. In particular, splitting the hidden state across attention heads changes only the per-head dimension: the Q/K/V projections still total $3 \times 768 \times 768$ per layer, so multiplying a full $768 \times 768$ projection by the number of heads would double-count.
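
One practical way to sanity-check such a hand count is to instantiate the same configuration in a modeling library and sum tensor sizes programmatically. The sketch below assumes the Hugging Face transformers package (with PyTorch) is installed; it builds the model from the config alone (nothing is downloaded), keeps only 2-D weight matrices (dropping biases and LayerNorm gains), and skips the pooler head, which sits outside the encoder stack.

```python
# Hedged cross-check: assumes `transformers` and `torch` are installed.
from transformers import BertConfig, BertModel

config = BertConfig(
    vocab_size=40_000,
    hidden_size=768,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=3072,
    max_position_embeddings=512,
)
model = BertModel(config)

# 2-D tensors are the embedding and linear weights; 1-D tensors are biases
# and LayerNorm gains, which the problem tells us to ignore.
count = sum(
    p.numel()
    for name, p in model.named_parameters()
    if p.dim() > 1 and "pooler" not in name
)
print(f"{count:,}")  # should print 87,737,856 for this configuration
```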

Math Problem Analysis

Mathematical Concepts

Linear Algebra
Matrix Multiplication
Embeddings
Attention Mechanism

Formulas

Token embeddings = vocabulary_size * embedding_dim
Position embeddings = max_input_length * embedding_dim
Q/K/V + output projections (per layer) = 4 * hidden_dim * hidden_dim
Feed-forward layers (per layer) = 2 * hidden_dim * feedforward_dim

Theorems

Self-Attention Mechanism
Feed-Forward Neural Network

Suitable Grade Level

Undergraduate/Graduate Level