Math Problem Statement

Suppose you are pretraining a BERT model with 8 layers, 768-dimensional hidden states, 8 attention heads, and a sub-word vocabulary of size 40k; the feed-forward hidden layer has dimension 3072. What will be the number of parameters of the model? You can ignore the bias terms and any parameters used for the final loss computation from the final encoder representation. The BERT model can take at most 512 tokens as input.

Solution
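
A sketch of the count, assuming (as the problem allows) that biases and final-prediction parameters are ignored, and additionally that LayerNorm gains and segment (token-type) embeddings, which a standard BERT also has, are left out. With 8 heads, each head has dimension 768 / 8 = 96.

Token embeddings: 40,000 × 768 = 30,720,000
Position embeddings: 512 × 768 = 393,216
Self-attention (per layer): 3 × 768 × 96 × 8 + 768 × 768 = 1,769,472 + 589,824 = 2,359,296
Feed-forward (per layer): 768 × 3,072 + 3,072 × 768 = 4,718,592
One encoder layer: 2,359,296 + 4,718,592 = 7,077,888
All 8 layers: 8 × 7,077,888 = 56,623,104

Total: 30,720,000 + 393,216 + 56,623,104 = 87,736,320 ≈ 87.7M parameters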

Math Problem Analysis

Mathematical Concepts

Linear Algebra
Matrix Multiplication
Parameter Calculation

Formulas

Token Embedding Params = Vocabulary Size × Embedding Dimension
Position Embedding Params = Maximum Sequence Length × Embedding Dimension
Attention Params (per layer) = (3 × Input Dimension × Head Dimension × Number of Heads) + (Output Dimension × Input Dimension), where Head Dimension = Input Dimension / Number of Heads
FFN Params (per layer) = (Input Dimension × FFN Hidden Dimension) + (FFN Hidden Dimension × Output Dimension)
Total Params = Token Embedding Params + Position Embedding Params + Number of Layers × (Attention Params + FFN Params)
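
A minimal Python sketch of the same computation under these formulas, using the configuration from the problem statement (all variable names are illustrative):

# Configuration from the problem statement (names are illustrative)
vocab_size = 40_000       # sub-word vocabulary size
hidden_dim = 768          # hidden-state / embedding dimension
num_heads = 8
head_dim = hidden_dim // num_heads  # 768 / 8 = 96
ffn_dim = 3072            # feed-forward hidden layer dimension
num_layers = 8
max_seq_len = 512         # maximum input length

# Embedding tables (biases, LayerNorm, and segment embeddings ignored)
token_emb_params = vocab_size * hidden_dim   # 30,720,000
pos_emb_params = max_seq_len * hidden_dim    # 393,216

# One encoder layer: Q/K/V projections + output projection, then the FFN
attn_params = 3 * hidden_dim * head_dim * num_heads + hidden_dim * hidden_dim  # 2,359,296
ffn_params = hidden_dim * ffn_dim + ffn_dim * hidden_dim                       # 4,718,592

total = token_emb_params + pos_emb_params + num_layers * (attn_params + ffn_params)
print(total)  # 87736320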

Suitable Grade Level

Graduate Level