Word order often determines the meaning of a sentence. Positional embeddings are how Transformers make use of the position information in a word sequence. This blog covers how the positional embeddings were chosen, why they are needed, and how to implement them.
This is part of a series of blogs I am writing to understand the Transformer architecture from the Attention Is All You Need paper [1]. I will update this section as new blogs are published.
- Demystifying the Attention Logic of Transformers: Unraveling the Intuition and Implementation by visualization
- Transformer positional embeddings (This blog)
- Attention is All You Need: Understanding Transformer Decoder and putting all pieces together.
First, we create feature vectors from the input sentence using word2vec (or another word-embedding model). Let’s assume we are using a word2vec model that converts each word into a 512-dimensional vector.
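Here is a minimal sketch of this step (the tiny vocabulary and the random lookup table standing in for a trained word2vec model are illustrative assumptions, not a real setup):

```python
import numpy as np

d_model = 512  # dimension of each word vector

# Hypothetical stand-in for a trained word2vec model: a random lookup table
# over a tiny vocabulary. A real setup would load pretrained vectors instead.
vocab = {"i": 0, "should": 1, "sleep": 2}
embedding_table = np.random.rand(len(vocab), d_model)

def embed_sentence(sentence):
    """Convert a sentence into a (num_words, d_model) matrix of word vectors."""
    return np.stack([embedding_table[vocab[word]] for word in sentence.lower().split()])

word_vectors = embed_sentence("I should sleep")
print(word_vectors.shape)  # (3, 512)
```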
But we also need to add some information that lets the model know the position of each word. This is the job of positional encodings. Without them, the Transformer would treat the input as an unordered set of words, ignoring the sequential nature of the sentence; positional encodings inject that order information.
Now, one easy way would be to simply assign 1 to the first word, 2 to the second, and so on. But with this approach, the model might see a sentence at inference time that is longer than any it saw during training, i.e., position values it has never encountered. Also, for long sentences the values being added grow very large and can overwhelm the word-embedding values themselves.
We could instead normalize positions to a fixed range: assign 0 to the first word, 1 to the last, and split the range [0, 1] evenly for everything in between. For example, a 3-word sentence would get 0, 0.5, and 1, while a 4-word sentence would get 0, 0.33, 0.66, and 1. The problem is that the step (delta) between consecutive positions is not constant: it is 0.5 in the first example but 0.33 in the second, so the same offset of one word is encoded differently depending on sentence length, as the quick check below shows.
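A quick sketch of that inconsistency, using evenly spaced values in [0, 1]:

```python
import numpy as np

# Evenly spaced position values in [0, 1] for sentences of different lengths.
for n_words in (3, 4):
    positions = np.linspace(0, 1, n_words)
    step = positions[1] - positions[0]
    print(f"{n_words} words -> positions {np.round(positions, 2)}, step {round(step, 2)}")

# 3 words -> positions [0.  0.5 1. ], step 0.5
# 4 words -> positions [0.   0.33 0.67 1.  ], step 0.33
```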
Sinusoidal Positional Encodings
In the Attention Is All You Need paper [1], the authors use sine and cosine (sinusoidal) functions to generate the positional encodings. The functions are defined as below:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
pos is the position of the word in the text. In the example above, when generating the positional encodings, I will have pos = 0, should will have pos = 1, sleep will have pos = 2, and so on. The d_model parameter is the input vector dimension (the dimension of the word-embedding output); for our example it is 512. Finally, i indexes the dimension pairs of the positional encoding: the two equations above fill in dimensions 2i and 2i+1 respectively.
Let’s understand this with a simple example. Assume the input vector dimension is 4 (in our earlier example it was 512). So, for any position pos in the text we need to generate 4 values; the even-indexed ones are generated with the first equation (sine) and the odd-indexed ones with the second (cosine).
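For instance, with pos = 2 and d_model = 4 (concrete numbers chosen just for illustration), the four values work out to:

- index 0 (even, 2i = 0): sin(2 / 10000^0) = sin(2) ≈ 0.909
- index 1 (odd, 2i = 0): cos(2 / 10000^0) = cos(2) ≈ -0.416
- index 2 (even, 2i = 2): sin(2 / 10000^(2/4)) = sin(0.02) ≈ 0.020
- index 3 (odd, 2i = 2): cos(2 / 10000^(2/4)) = cos(0.02) ≈ 1.000

The code below computes exactly these values.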
```python
import numpy as np
import matplotlib.pyplot as plt


def get_sinusoidal_embedding(position, d_model):
    """
    position: Position/index of the word in the text.
    d_model: Input vector dimension.
    """
    embedding = np.zeros(d_model)
    for i in range(d_model):
        if i % 2 == 0:  # even index: sine term (this i plays the role of 2i in the equations)
            embedding[i] = np.sin(position / (10000 ** (i / d_model)))
        else:  # odd index: cosine term ((i - 1) plays the role of 2i in the equations)
            embedding[i] = np.cos(position / (10000 ** ((i - 1) / d_model)))
    return embedding
```
Note: The above code is not generic or optimized; it is just meant to illustrate the concept.
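As a quick sanity check, here is a small usage sketch (it reuses the example sentence from above and the function just defined; the printed values are approximate):

```python
# Positional encodings for the 3 words of the example sentence, with d_model = 4.
for position, word in enumerate(["I", "should", "sleep"]):
    print(word, np.round(get_sinusoidal_embedding(position, d_model=4), 3))

# I      [0.    1.    0.    1.   ]
# should [0.841 0.54  0.01  1.   ]
# sleep  [0.909 -0.416 0.02  1.  ]   <- matches the hand computation above
```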
Now, another question that should come to mind is how this scheme makes the position difference (delta) behave consistently: for any fixed offset k, the encoding of position pos + k can be obtained from the encoding of pos by a transformation that depends only on k, not on pos. We can prove that as below.