Math Problem Statement

Suppose you are pretraining a BERT model with 8 layers, 768-dimensional hidden states, 8 attention heads, and a sub-word vocabulary of size 40k. The feed-forward hidden layer has dimension 3072, and the model can take at most 512 tokens as input. What will be the number of parameters of the model? You can ignore the bias terms and any other parameters used for the final loss computation on top of the final encoder representation.

Solution
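
A worked count under the stated assumptions (biases ignored; segment/token-type embeddings and LayerNorm parameters, which the problem does not mention, are also left out):

Embeddings
  Token embeddings    = 40,000 * 768 = 30,720,000
  Position embeddings = 512 * 768    = 393,216

One encoder layer
  Q, K, V projections         = 3 * 768 * 768            = 1,769,472
  Attention output projection = 768 * 768                = 589,824
  Feed-forward (up + down)    = 768 * 3072 + 3072 * 768  = 4,718,592
  Per-layer total                                        = 7,077,888

All 8 layers = 8 * 7,077,888 = 56,623,104

Total = 30,720,000 + 393,216 + 56,623,104 = 87,736,320 parameters

The 8 attention heads do not change the count: each head's Q/K/V projection is 768 * 96, and 8 * 768 * 96 = 768 * 768 per projection.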

Math Problem Analysis

Mathematical Concepts

Linear Algebra
Matrix Multiplication
Embeddings
Attention Mechanism

Formulas

Token embeddings = vocabulary_size * embedding_dim
Position embeddings = max_input_length * embedding_dim
Q/K/V projections (per layer) = 3 * hidden_dim * hidden_dim
Attention output projection (per layer) = hidden_dim * hidden_dim
Feed-forward layer (per layer) = 2 * hidden_dim * feedforward_dim
(These formulas are combined in the Python sketch below.)
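
A minimal Python sketch combining these formulas; the function name bert_param_count and its defaults are illustrative, and it ignores biases, segment embeddings, and LayerNorm, matching the assumptions above:

def bert_param_count(vocab_size=40_000, max_len=512, hidden=768, ffn=3072, layers=8):
    # Embedding tables (token + position); biases, segment embeddings, LayerNorm ignored
    token_emb = vocab_size * hidden
    pos_emb = max_len * hidden
    # One encoder layer: Q/K/V projections, attention output projection, FFN up + down
    attention = 3 * hidden * hidden + hidden * hidden
    feed_forward = hidden * ffn + ffn * hidden
    per_layer = attention + feed_forward
    return token_emb + pos_emb + layers * per_layer

print(bert_param_count())  # 87,736,320 under these assumptions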

Key Mechanisms

Self-Attention Mechanism
Feed-Forward Neural Network

Suitable Grade Level

Undergraduate/Graduate Level