Math Problem Statement
Suppose you are pretraining a BERT model with 8 layers, 768-dimensional hidden states, 8 attention heads, and a sub-word vocabulary of size 40k. The feed-forward hidden layer has dimension 3072. What is the number of parameters of the model? You may ignore bias terms, as well as any parameters used for the final loss computation on top of the final encoder representations. The BERT model can take at most 512 tokens in the input.
Solution
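A minimal sketch of the count, assuming we include only the token and position embedding matrices plus, per layer, the four attention projection matrices (Q, K, V, and output) and the two feed-forward matrices. Segment embeddings and LayerNorm parameters are ignored here, consistent with the problem's instruction to drop biases and loss-head parameters:

```python
# Parameter count for the stated BERT configuration,
# weight matrices only (no biases, no LayerNorm, no segment
# embeddings, no output head).

vocab_size = 40_000
hidden_dim = 768
max_len = 512
ffn_dim = 3072
num_layers = 8

token_emb = vocab_size * hidden_dim            # 40,000 * 768 = 30,720,000
pos_emb = max_len * hidden_dim                 # 512 * 768 = 393,216

# Per encoder layer: Q, K, V, and output projections are each
# hidden_dim x hidden_dim (the per-head split does not change the
# total), plus two feed-forward matrices (up- and down-projection).
attn_per_layer = 4 * hidden_dim * hidden_dim   # 2,359,296
ffn_per_layer = 2 * hidden_dim * ffn_dim       # 4,718,592

total = token_emb + pos_emb + num_layers * (attn_per_layer + ffn_per_layer)
print(f"{total:,}")  # 87,736,320
```

So the model has 31,113,216 embedding parameters plus 8 × 7,077,888 = 56,623,104 encoder parameters, for a total of 87,736,320 (about 87.7M). Note that the number of attention heads does not affect the count: the 768 dimensions are split across the 8 heads, so the concatenated projections still form 768 × 768 matrices.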
Math Problem Analysis
Mathematical Concepts
Linear Algebra
Matrix Multiplication
Embeddings
Attention Mechanism
Formulas
Token embeddings = vocabulary_size * hidden_dim
Position embeddings = max_input_length * hidden_dim
Attention projections (per layer) = 4 * hidden_dim * hidden_dim (Q, K, V, and output projections)
Feed-forward (per layer) = 2 * hidden_dim * feedforward_dim (up- and down-projection matrices)
Theorems
Self-Attention Mechanism
Feed-Forward Neural Network
Suitable Grade Level
Undergraduate/Graduate Level