Behind the Magic: How Tensors Drive Transformers

By AIMaharshiBhrugu Bhatt

April 26, 2025

Transformers have changed the way artificial intelligence works, especially in understanding language and learning from data. At the core of these models are tensors (a generalized form of mathematical matrices that help process information). As data moves through the different parts of a Transformer, these tensors undergo different transformations that help the model make sense of things like sentences or images. Learning how tensors work inside Transformers can help you understand how today's smartest AI systems actually work and think.

What This Article Covers and What It Doesn’t

✅ This Article IS About:

The flow of tensors from input to output inside a Transformer model.
Ensuring dimensional coherence throughout the computational process.
The step-by-step transformations that tensors undergo in the various Transformer layers.

❌ This Article IS NOT About:

A general introduction to Transformers or deep learning.
Detailed architecture of Transformer models.
Training process or hyper-parameter tuning of Transformers.

How Tensors Act Inside Transformers

A Transformer consists of two main components:

Encoder: Processes input data, capturing contextual relationships to create meaningful representations.
Decoder: Uses these representations to generate coherent output, predicting each element sequentially.

Tensors are the fundamental data structures that pass through these components, undergoing multiple transformations that ensure dimensional coherence and correct information flow.

Image From Research Paper: Transformer standard architecture

Input Embedding Layer

Before entering the Transformer, raw input tokens (words, subwords, or characters) are converted into dense vector representations through the embedding layer. This layer functions as a lookup table that maps each token to a vector, capturing semantic relationships with other words.

Image by author: Tensors passing through Embedding layer

For a batch of 5 sentences, each with a sequence length of 12 tokens, and an embedding dimension of 768, the tensor shape is:

Tensor shape: [batch_size, seq_len, embedding_dim] → [5, 12, 768]
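To make the lookup-table idea concrete, here is a minimal NumPy sketch. The vocabulary size and the random values are assumptions chosen purely for illustration; a real model learns the table during training.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 1000  # hypothetical vocabulary size, not given in the article
batch_size, seq_len, embedding_dim = 5, 12, 768

# The embedding table holds one row (vector) per token id.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# Integer token ids for 5 sentences of 12 tokens each.
token_ids = rng.integers(0, vocab_size, size=(batch_size, seq_len))

# Indexing with an integer array looks up every token's vector at once.
embedded = embedding_table[token_ids]
print(embedded.shape)  # (5, 12, 768)
```

The integer tensor [5, 12] gains a trailing axis of size 768, which is the only place in the pipeline where the rank of the tensor changes.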

After embedding, positional encoding is added, ensuring that order information is preserved without changing the tensor shape.
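The article does not say which positional encoding is used, so the sketch below assumes the fixed sinusoidal scheme from the original Transformer paper. The function name is hypothetical. The point to notice is that the encoding has shape [seq_len, embedding_dim] and is broadcast over the batch axis, so the result keeps the shape [5, 12, 768].

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, dim):
    """Fixed sine/cosine encoding; assumes `dim` is even."""
    positions = np.arange(seq_len)[:, None]      # [seq_len, 1]
    even_dims = np.arange(0, dim, 2)[None, :]    # [1, dim / 2]
    angles = positions / np.power(10000.0, even_dims / dim)
    encoding = np.zeros((seq_len, dim))
    encoding[:, 0::2] = np.sin(angles)           # even feature indices
    encoding[:, 1::2] = np.cos(angles)           # odd feature indices
    return encoding

# Stand-in for the embedding layer output described above.
embedded = np.zeros((5, 12, 768))

pos = sinusoidal_positional_encoding(12, 768)    # [12, 768]
with_position = embedded + pos                   # broadcast over the batch axis
print(with_position.shape)  # (5, 12, 768)
```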

Modified Image from Research Paper: Situation of the workflow

Multi-Head Attention Mechanism

One of the most important components of the Transformer is the Multi-Head Attention (MHA) mechanism. It operates on three matrices derived from the input embeddings:

Query (Q)
Key (K)
Value (V)

These matrices are generated using learnable weight matrices:

Wq, Wk, Wv of shape [embedding_dim, d_model] (e.g., [768, 512]).
The resulting Q, K, V matrices have dimensions [batch_size, seq_len, d_model].
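The projection itself is a single matrix multiplication per matrix. The sketch below uses random weights as a stand-in for learned ones (an assumption for illustration) and shows how the last axis changes from 768 to 512 while the batch and sequence axes are untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, embedding_dim, d_model = 5, 12, 768, 512

# Stand-in for the embedded input (plus positional encoding).
x = rng.normal(size=(batch_size, seq_len, embedding_dim))

# Learnable projection weights; random values are used here only for illustration.
W_q = rng.normal(size=(embedding_dim, d_model)) * 0.02
W_k = rng.normal(size=(embedding_dim, d_model)) * 0.02
W_v = rng.normal(size=(embedding_dim, d_model)) * 0.02

# [5, 12, 768] @ [768, 512] -> [5, 12, 512]; the matmul is applied per batch element.
Q = x @ W_q
K = x @ W_k
V = x @ W_v
print(Q.shape, K.shape, V.shape)  # (5, 12, 512) (5, 12, 512) (5, 12, 512)
```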

Image by author: Table showing the shapes/dimensions of Embedding, Q, K, V tensors

Splitting Q, K, V into Multiple Heads

For effective parallelization and improved learning, MHA splits Q, K, and V into multiple heads. Suppose we have 8 attention heads:

Each head operates on a subspace of dimension d_model / head_count (512 / 8 = 64 here).

Image by author: Multihead Attention

The reshaped tensor dimensions are [batch_size, seq_len, head_count, d_model / head_count].
Example: [5, 12, 8, 64] → rearranged to [5, 8, 12, 64] so that each head receives its own [12, 64] slice covering the full sequence.
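The reshape and rearrangement described above can be sketched in NumPy as follows. The helper name split_heads is hypothetical; the input is filled with consecutive numbers only so that it is easy to check which features land in which head.

```python
import numpy as np

def split_heads(t, head_count):
    """[batch, seq_len, d_model] -> [batch, head_count, seq_len, d_model / head_count]."""
    batch_size, seq_len, d_model = t.shape
    head_dim = d_model // head_count
    # Cut the feature axis into head_count chunks of head_dim features each.
    t = t.reshape(batch_size, seq_len, head_count, head_dim)
    # Move the head axis in front of the sequence axis.
    return t.transpose(0, 2, 1, 3)

Q = np.arange(5 * 12 * 512, dtype=np.float64).reshape(5, 12, 512)
Q_heads = split_heads(Q, head_count=8)
print(Q_heads.shape)  # (5, 8, 12, 64)
```

Head 1 of the first sentence, for example, holds features 64 to 127 of every token: Q_heads[0, 1] equals Q[0, :, 64:128].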

Image by author: Reshaping the tensors

So each head gets its own share of Qi, Ki, Vi.

Image by author: Each Qi, Ki, Vi sent to a different head

Attention Calculation

Each head computes attention using the scaled dot-product formula:

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V, where d_k = d_model / head_count is the per-head dimension.

Once attention is computed for all heads, the outputs are concatenated and passed through a linear transformation, restoring the initial tensor shape.
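As a rough illustration of this step, the sketch below runs scaled dot-product attention for all 8 heads at once, then concatenates the heads and applies an output projection. The name W_o and its shape [512, 768] are assumptions, chosen so that the result returns to [batch_size, seq_len, embedding_dim]; the inputs are random stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch_size, head_count, seq_len, head_dim = 5, 8, 12, 64
d_model, embedding_dim = head_count * head_dim, 768

# Per-head Q, K, V, already split as [batch, head, seq_len, head_dim].
Q = rng.normal(size=(batch_size, head_count, seq_len, head_dim))
K = rng.normal(size=(batch_size, head_count, seq_len, head_dim))
V = rng.normal(size=(batch_size, head_count, seq_len, head_dim))

# Scaled dot-product attention for every head in parallel.
scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(head_dim)  # [5, 8, 12, 12]
weights = softmax(scores, axis=-1)                         # each row sums to 1
per_head = weights @ V                                     # [5, 8, 12, 64]

# Concatenate heads: [5, 8, 12, 64] -> [5, 12, 8, 64] -> [5, 12, 512].
concat = per_head.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)

# Output projection (hypothetical W_o) back to the embedding dimension.
W_o = rng.normal(size=(d_model, embedding_dim)) * 0.02
output = concat @ W_o
print(scores.shape, concat.shape, output.shape)  # (5, 8, 12, 12) (5, 12, 512) (5, 12, 768)
```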

Image by author: Concatenating the output of all heads

Residual Connection and Normalization

After the multi-head attention mechanism, a residual connection is added, followed by layer normalization:

Residual connection: Output = Embedding Tensor + Multi-Head Attention Output
Normalization: (Output − μ) / σ to stabilize training, where μ and σ are the mean and standard deviation over each token's features
Tensor shape remains [batch_size, seq_len, embedding_dim]
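A minimal sketch of these two steps is shown below, with random tensors standing in for the embedding and the attention output. It omits the learnable scale and shift parameters that layer normalization normally carries, and the small eps term is the usual guard against division by zero; both simplifications are assumptions made for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector (the last axis) independently.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
embedding = rng.normal(size=(5, 12, 768))    # stand-in for the embedding tensor
mha_output = rng.normal(size=(5, 12, 768))   # stand-in for the attention output

residual = embedding + mha_output            # element-wise add, shapes must match
normalized = layer_norm(residual)
print(normalized.shape)  # (5, 12, 768)
```

The residual add is the reason the attention output has to be projected back to embedding_dim: an element-wise sum only works when both tensors have exactly the same shape.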

Feed-Forward Network (FFN)

In the decoder, Masked Multi-Head Attention ensures that each token attends only to earlier tokens, preventing leakage of future information.

Modified Image From Research Paper: Masked Multi Head Attention

This is achieved using a lower triangular mask of shape [seq_len, seq_len] with -inf values in the upper triangle. The mask is added to the attention scores before the Softmax, so the Softmax function assigns zero weight to future positions.
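Here is a small sketch of the masking step, using a sequence length of 4 and all-zero scores (both chosen only to keep the result easy to read). With equal scores, each token spreads its attention evenly over itself and the tokens before it.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)  # exp(-inf) evaluates to exactly 0
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4

# True on and below the diagonal: positions a token is allowed to attend to.
allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# 0 where attention is allowed, -inf in the upper triangle (future positions).
mask = np.where(allowed, 0.0, -np.inf)

scores = np.zeros((seq_len, seq_len))        # stand-in for Q @ K^T / sqrt(d_k)
weights = softmax(scores + mask, axis=-1)

print(weights[0])  # token 0 attends only to itself: [1. 0. 0. 0.]
print(weights[1])  # token 1 splits attention over tokens 0 and 1: [0.5 0.5 0. 0.]
```

In a full model the [seq_len, seq_len] mask is broadcast over the batch and head axes of the score tensor.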

Cross-Attention in Decoding

Since the decoder does not see the input sentence directly, it uses cross-attention to refine its predictions. Here:

The decoder generates queries (Qd) from its input ([batch_size, target_seq_len, embedding_dim]).
The encoder output serves as keys (Ke) and values (Ve).
The decoder computes attention between Qd and Ke, extracting relevant context from the encoder's output.
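The shape bookkeeping in cross-attention differs from self-attention because the queries and keys come from sequences of different lengths. The sketch below uses a single head for brevity, a hypothetical target length of 7, and random weights; all of these are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch_size, embedding_dim, d_model = 5, 768, 512
source_seq_len, target_seq_len = 12, 7   # target length is a hypothetical value

encoder_output = rng.normal(size=(batch_size, source_seq_len, embedding_dim))
decoder_input = rng.normal(size=(batch_size, target_seq_len, embedding_dim))

W_q = rng.normal(size=(embedding_dim, d_model)) * 0.02
W_k = rng.normal(size=(embedding_dim, d_model)) * 0.02
W_v = rng.normal(size=(embedding_dim, d_model)) * 0.02

Q_d = decoder_input @ W_q     # [5, 7, 512]  queries from the decoder
K_e = encoder_output @ W_k    # [5, 12, 512] keys from the encoder
V_e = encoder_output @ W_v    # [5, 12, 512] values from the encoder

# Each target position scores every source position.
scores = Q_d @ K_e.transpose(0, 2, 1) / np.sqrt(d_model)  # [5, 7, 12]
context = softmax(scores, axis=-1) @ V_e                    # [5, 7, 512]
print(scores.shape, context.shape)  # (5, 7, 12) (5, 7, 512)
```

The output follows the decoder's sequence length (7), while the attention weights span the encoder's sequence length (12).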

Modified Image From Research Paper: Cross Head Attention

Conclusion

Transformers use tensors to help them learn and make smart decisions. As the data moves through the network, these tensors go through different steps: being turned into numbers the model can understand (embedding), focusing on important parts (attention), staying balanced (normalization), and being passed through layers that learn patterns (feed-forward). These changes keep the data in the right shape the whole time. By understanding how tensors move and change, we can get a better idea of how AI models work and how they can understand and create human-like language.
