
Transformer positional embeddings


Written by sfmantra

November 14, 2023


Word ordering often determines the meaning of a sentence. Positional embeddings are how a transformer makes use of the position information in a word sequence. This blog covers how the positional embeddings are chosen, why they are needed, and how to implement them.

This is part of a series of blogs I am writing to understand the Transformer architecture from the Attention Is All You Need paper [1]. I will update this section as new blogs are published.

  1. Demystifying the Attention Logic of Transformers: Unraveling the Intuition and Implementation by visualization
  2. Transformer positional embeddings (This blog)
  3. Attention is All You Need: Understanding Transformer Decoder and putting all pieces together.

First, we create feature vectors from the input sentence using word2vec (or another embedding method). Let’s assume we are using a word2vec model that converts each word to a 512-dimensional vector.

Create a vectorized representation of the words. In this example we use word2vec, but you can try any other embedding model.
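As a minimal sketch of this step (using a toy random lookup table as a stand-in for a trained word2vec model, so the actual vector values here are meaningless):

import numpy as np

d_model = 512
sentence = ["I", "should", "sleep"]

# Toy stand-in for word2vec: map each word to a random 512-dimensional vector.
# A real setup would load pretrained embeddings instead.
rng = np.random.default_rng(0)
embedding_table = {word: rng.standard_normal(d_model) for word in sentence}

word_vectors = np.stack([embedding_table[word] for word in sentence])
print(word_vectors.shape)  # (3, 512) -- one 512-dimensional vector per word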

But we also need to add some information to let the model know the position of each word. This is what positional encodings provide. With their help, the transformer can process all the words of a sentence in parallel without losing its sequential nature; without them, the words would effectively be treated as an unordered set.

Now, one easy approach would be to just assign 1 to the first word, 2 to the second, and so on. But with this approach, the model might see a sentence during inference that is longer than any it saw during training, so it would encounter position values it never learned. Also, for long sentences the position values become very large, and adding such large numbers to the word embeddings can swamp the actual word information.

Alternatively, we can use a fixed range: assign 0 to the first word and 1 to the last, and split the range [0, 1] evenly for everything in between. For example, for a 3-word sentence we would use 0 for the first word, 0.5 for the second, and 1 for the third; for a 4-word sentence it would be 0, 0.33, 0.66, and 1 respectively. The problem with this is that the step between consecutive positions is not constant: in the first example it was 0.5, but in the second it was 0.33.
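Here is a small sketch (my own illustration) of both naive schemes, showing why the step between neighbouring positions is not constant in the normalized version:

# Naive scheme 1: raw indices -- values grow without bound for long sentences.
def index_positions(n_words):
    return list(range(1, n_words + 1))

# Naive scheme 2: positions normalized into the range [0, 1].
def normalized_positions(n_words):
    return [i / (n_words - 1) for i in range(n_words)]

print(normalized_positions(3))  # [0.0, 0.5, 1.0]                    -> step of 0.5
print(normalized_positions(4))  # [0.0, 0.333..., 0.666..., 1.0]     -> step of 0.333...
# The step between neighbouring words depends on sentence length, so the same
# offset encodes different "distances" in different sentences.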

Sinusoidal Positional Encodings

In the Attention Is All You Need paper [1], the authors used sine and cosine (sinusoidal) functions to generate the positional encodings. The functions are defined below.

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Positional encoding equations used in the paper [1].

pos is the position of the word in the text. In the example above, when generating the positional encodings, “I” has pos=0, “should” has pos=1, “sleep” has pos=2, and so on. The d_model parameter is the input vector dimension (the dimension of the word vectors produced in the previous step); for our example it is 512. i indexes the dimensions of the positional encoding vector.

Let’s understand this with a simple example. Assume the input vector dimension is 4 (in our earlier example it was 512). So, for any position pos in the text we need to generate 4 values; the even-indexed ones come from the first (sine) equation and the odd-indexed ones from the second (cosine) equation.
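For instance (my own worked example), with d_model = 4 the four values for the word at pos = 1 are:

PE(1, 0) = sin(1 / 10000^(0/4)) = sin(1)
PE(1, 1) = cos(1 / 10000^(0/4)) = cos(1)
PE(1, 2) = sin(1 / 10000^(2/4)) = sin(0.01)
PE(1, 3) = cos(1 / 10000^(2/4)) = cos(0.01)

Indices 0 and 2 (even) use the sine equation, indices 1 and 3 (odd) use the cosine equation, and each sine/cosine pair shares the same denominator.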

Positional encodings computation. In the image I have shown the input vector size as 4, so we have 4 values for each word. In the original example, since the input vector dimension is 512, the positional encoding length would also be 512.

import numpy as np
import matplotlib.pyplot as plt


def get_sinusoidal_embedding(position, d_model):
    """
    position: position/index of the word in the text.
    d_model: input vector dimension.
    """
    embedding = np.zeros(d_model)
    for i in range(d_model):
        if i % 2 == 0:  # even indices: sine equation (here i plays the role of 2i)
            embedding[i] = np.sin(position / (10000 ** (i / d_model)))
        else:  # odd indices: cosine equation (i - 1 plays the role of 2i)
            embedding[i] = np.cos(position / (10000 ** ((i - 1) / d_model)))
    return embedding

Note: The above code is not generic or optimized; it’s just to illustrate the concept.
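For example, for the 4-dimensional case discussed above, the function reproduces the values worked out by hand (rounded to 4 decimals):

print(np.round(get_sinusoidal_embedding(1, 4), 4))
# approximately [0.8415, 0.5403, 0.01, 1.0]
# i=0: sin(1 / 10000^(0/4)) = sin(1)    ~ 0.8415
# i=1: cos(1 / 10000^(0/4)) = cos(1)    ~ 0.5403
# i=2: sin(1 / 10000^(2/4)) = sin(0.01) ~ 0.0100
# i=3: cos(1 / 10000^(2/4)) = cos(0.01) ~ 1.0000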

Now, another question that should come to mind is how this scheme keeps the relationship between positions consistent. The key property is that for any fixed offset k, PE(pos + k) can be obtained from PE(pos) by a linear transformation that depends only on k, which is what lets the model reason about relative positions. We can see this as below.
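A sketch of the argument (my own write-up of the standard property, using the function defined above for a numerical check): for a fixed offset k, each sine/cosine pair of PE(pos + k) is a rotation of the corresponding pair of PE(pos), and the rotation depends only on k, not on pos. This follows directly from the angle-addition identities sin(a + b) = sin(a)cos(b) + cos(a)sin(b) and cos(a + b) = cos(a)cos(b) - sin(a)sin(b). In code:

d_model, pos, k = 4, 7, 3

pe_pos   = get_sinusoidal_embedding(pos, d_model)
pe_shift = get_sinusoidal_embedding(pos + k, d_model)

for i in range(0, d_model, 2):
    w = 1.0 / (10000 ** (i / d_model))  # frequency shared by this sin/cos pair
    rotation = np.array([[ np.cos(w * k), np.sin(w * k)],
                         [-np.sin(w * k), np.cos(w * k)]])
    # The pair at indices (i, i+1) of PE(pos + k) is the rotation applied to the
    # same pair of PE(pos); the rotation matrix depends only on the offset k.
    assert np.allclose(rotation @ pe_pos[i:i + 2], pe_shift[i:i + 2])

Because this linear relationship holds for every offset k, the model can learn to attend to relative positions instead of only absolute ones.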
