Math Problem Statement
Suppose you are pretraining a BERT model with 8 layers, 768-dimensional hidden states, 8 attention heads, and a sub-word vocabulary of size 40k. The feed-forward hidden layer has dimension 3072. What is the number of parameters of the model? You can ignore the bias terms and the parameters used for the final loss computation from the final encoder representation. The BERT model can take at most 512 tokens as input.
Solution
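Plugging the given sizes into the formulas listed under Math Problem Analysis gives the count directly. A minimal sketch in Python (assuming learned position embeddings for all 512 positions, and ignoring biases, LayerNorm parameters, segment embeddings, and the output head, per the problem's instructions):

```python
# Parameter count for the specified BERT variant.
d_model  = 768                  # hidden-state dimension
n_layers = 8
n_heads  = 8
d_head   = d_model // n_heads   # 96 per head
vocab    = 40_000
d_ffn    = 3072
max_len  = 512

# Embedding parameters
token_emb    = vocab * d_model     # 40,000 x 768 = 30,720,000
position_emb = max_len * d_model   #    512 x 768 =    393,216

# One encoder layer: Q, K, V projections plus the output projection,
# then the two feed-forward matrices
attn = 3 * d_model * d_head * n_heads + d_model * d_model  # 2,359,296
ffn  = d_model * d_ffn + d_ffn * d_model                   # 4,718,592
per_layer = attn + ffn                                     # 7,077,888

total = token_emb + position_emb + n_layers * per_layer
print(total)  # -> 87736320
```

With 8 encoder layers at 7,077,888 parameters each plus 31,113,216 embedding parameters, the total is 87,736,320, i.e. roughly 87.7M parameters.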
Math Problem Analysis
Mathematical Concepts
Linear Algebra
Matrix Multiplication
Parameter Calculation
Formulas
Token Embedding Params = Vocabulary Size × Embedding Dimension
Position Embedding Params = Sequence Length × Embedding Dimension
Attention Params = (3 × Input Dimension × Head Dimension × Number of Heads) + (Output Dimension × Input Dimension)
FFN Params = (Input Dimension × Hidden Dimension) + (Hidden Dimension × Output Dimension)
Suitable Grade Level
Graduate Level