Word order often determines the meaning of a sentence. Positional embeddings are how Transformers make use of the position information in a word sequence. This blog covers how the positional embeddings were chosen, why they are needed, and how to implement them.
This is part of a series of blogs I am writing to understand the Transformer architecture from the Attention Is All You Need paper [1]. I will update this section as new blogs are published.
- Demystifying the Attention Logic of Transformers: Unraveling the Intuition and Implementation by visualization
- Transformer positional embeddings (This blog)
- Attention is All You Need: Understanding Transformer Decoder and putting all pieces together.
First, we create feature vectors from the input sentence using word2vec (or another word-embedding model). Let’s assume we are using a word2vec model that converts each word into a 512-dimensional vector.
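Here is a minimal sketch of this step (the tiny vocabulary and the random lookup table standing in for a trained word2vec model are illustrative assumptions, not a real setup):

```python
import numpy as np

d_model = 512  # dimension of each word vector

# Hypothetical stand-in for a trained word2vec model: a random lookup table
# over a tiny vocabulary. A real setup would load pretrained vectors instead.
vocab = {"i": 0, "should": 1, "sleep": 2}
embedding_table = np.random.rand(len(vocab), d_model)

def embed_sentence(sentence):
    """Convert a sentence into a (num_words, d_model) matrix of word vectors."""
    return np.stack([embedding_table[vocab[word]] for word in sentence.lower().split()])

word_vectors = embed_sentence("I should sleep")
print(word_vectors.shape)  # (3, 512)
```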
But we also need to add some information that lets the model know the position of each word. This is the job of positional encodings. Without them, the Transformer would treat the input as an unordered set of words, ignoring the sequential nature of the sentence; positional encodings inject that order information.
Now, one easy way would be to simply assign 1 to the first word, 2 to the second, and so on. But with this approach, the model might see a sentence at inference time that is longer than any it saw during training, i.e., position values it has never encountered. Also, for long sentences the values being added grow very large and can overwhelm the word-embedding values themselves.
We could instead normalize positions to a fixed range: assign 0 to the first word, 1 to the last, and split the range [0, 1] evenly for everything in between. For example, a 3-word sentence would get 0, 0.5, and 1, while a 4-word sentence would get 0, 0.33, 0.66, and 1. The problem is that the step (delta) between consecutive positions is not constant: it is 0.5 in the first example but 0.33 in the second, so the same offset of one word is encoded differently depending on sentence length, as the quick check below shows.
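A quick sketch of that inconsistency, using evenly spaced values in [0, 1]:

```python
import numpy as np

# Evenly spaced position values in [0, 1] for sentences of different lengths.
for n_words in (3, 4):
    positions = np.linspace(0, 1, n_words)
    step = positions[1] - positions[0]
    print(f"{n_words} words -> positions {np.round(positions, 2)}, step {round(step, 2)}")

# 3 words -> positions [0.  0.5 1. ], step 0.5
# 4 words -> positions [0.   0.33 0.67 1.  ], step 0.33
```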
Sinusoidal Positional Encodings
In the Attention Is All You Need paper [1], the authors use sine and cosine (sinusoidal) functions to generate the positional encodings. The functions are defined as below:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
pos is the position of the word in the text. In the example above, when generating the positional encodings, I will have pos = 0, should will have pos = 1, sleep will have pos = 2, and so on. The d_model parameter is the input vector dimension (the dimension of the word-embedding output); for our example it is 512. Finally, i indexes the dimension pairs of the positional encoding: the two equations above fill in dimensions 2i and 2i+1 respectively.
Let’s understand this with a simple example. Assume the input vector dimension is 4 (in our earlier example it was 512). So, for any position pos in the text we need to generate 4 values; the even-indexed ones are generated with the first equation (sine) and the odd-indexed ones with the second (cosine).
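For instance, with pos = 2 and d_model = 4 (concrete numbers chosen just for illustration), the four values work out to:

- index 0 (even, 2i = 0): sin(2 / 10000^0) = sin(2) ≈ 0.909
- index 1 (odd, 2i = 0): cos(2 / 10000^0) = cos(2) ≈ -0.416
- index 2 (even, 2i = 2): sin(2 / 10000^(2/4)) = sin(0.02) ≈ 0.020
- index 3 (odd, 2i = 2): cos(2 / 10000^(2/4)) = cos(0.02) ≈ 1.000

The code below computes exactly these values.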
```python
import numpy as np
import matplotlib.pyplot as plt


def get_sinusoidal_embedding(position, d_model):
    """
    position: Position/index of the word in the text.
    d_model: Input vector dimension.
    """
    embedding = np.zeros(d_model)
    for i in range(d_model):
        if i % 2 == 0:  # even index: sine term (this i plays the role of 2i in the equations)
            embedding[i] = np.sin(position / (10000 ** (i / d_model)))
        else:  # odd index: cosine term ((i - 1) plays the role of 2i in the equations)
            embedding[i] = np.cos(position / (10000 ** ((i - 1) / d_model)))
    return embedding
```
Note: The above code is not generic or optimized; it is just meant to illustrate the concept.
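As a quick sanity check, here is a small usage sketch (it reuses the example sentence from above and the function just defined; the printed values are approximate):

```python
# Positional encodings for the 3 words of the example sentence, with d_model = 4.
for position, word in enumerate(["I", "should", "sleep"]):
    print(word, np.round(get_sinusoidal_embedding(position, d_model=4), 3))

# I      [0.    1.    0.    1.   ]
# should [0.841 0.54  0.01  1.   ]
# sleep  [0.909 -0.416 0.02  1.  ]   <- matches the hand computation above
```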
Now, another question that should come to mind is how this scheme makes the position difference (delta) behave consistently: for any fixed offset k, the encoding of position pos + k can be obtained from the encoding of pos by a transformation that depends only on k, not on pos. We can prove that as below.