
Did you mean to paste a different image? That diagram shows a much older design than the transformer introduced in the 2017 paper. It doesn’t include token embeddings, doesn’t show multi-head attention, the part that looks like the attention mechanism doesn’t do self-attention and is missing the query/key/value weight matrices, and the part that looks like the fully-connected layer doesn’t work the way it’s depicted and doesn’t hook into self-attention in that way. Position embeddings and the way the blocks are stacked are also absent.

The FT article at least describes most of those aspects more accurately, although I was disappointed to see they got the attention mask wrong.




Ah, I didn’t notice the picture came from Jürgen Schmidhuber. I understand his arguments, and his accomplishments are significant, but his 90s designs were not transformers and lacked substantial elements that make transformers so efficient to train. He does have a bit of a reputation for claiming that many recent discoveries should be attributed to, or give credit to, his early designs, which, while not completely unfounded, mostly stretches the truth. Schmidhuber’s 2021 paper describes a different design, which, while interesting, is not how the GPT family (or Llama 2, etc.) was trained.

The transformer absolutely uses many things that were initially suggested in previous papers, but its specific implementation and combination is what makes it work well. Take the query/key/value system: if the fully-connected layer in that figure is supposed to be some combination of the key and value weight matrices, the dimensionality is off. The embedding typically has the same vector size as the value output (well, the combined size of the values across attention heads, but the image doesn’t have attention heads), so that each transformer block has the same input structure. The query weight matrix is missing, and while the dotted lines are not explained in the image, the way the weights are optimized doesn’t seem to match what is shown.
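
For reference, here is roughly how those dimensions line up in an actual transformer block. This is a rough PyTorch sketch with illustrative sizes, not anyone's production code:

    import torch
    import torch.nn as nn

    d_model, n_heads = 512, 8
    d_head = d_model // n_heads                 # per-head key/value size

    W_q = nn.Linear(d_model, n_heads * d_head)  # query projection (the one missing in the figure)
    W_k = nn.Linear(d_model, n_heads * d_head)  # key projection
    W_v = nn.Linear(d_model, n_heads * d_head)  # value projection
    W_o = nn.Linear(n_heads * d_head, d_model)  # output projection back to d_model

    x = torch.randn(1, 16, d_model)             # (batch, seq_len, d_model)
    q, k, v = W_q(x), W_k(x), W_v(x)            # each (batch, seq_len, n_heads * d_head)
    # Because n_heads * d_head == d_model, every block's output has the same
    # shape as its input, which is what lets the blocks be stacked directly.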


The actual title "Why the Original Transformer Figure Is Wrong, and Some Other Interesting Historical Tidbits About LLMs" is way more representative of what this post is about...

As to the figure being wrong, it's kind of a nit-pick: "While the original transformer figure above (from Attention Is All You Need, https://arxiv.org/abs/1706.03762) is a helpful summary of the original encoder-decoder architecture, there is a slight discrepancy in this figure.

For instance, it places the layer normalization between the residual blocks, which doesn't match the official (updated) code implementation accompanying the original transformer paper. The variant shown in the Attention Is All You Need figure is known as the Post-LN Transformer."
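
For readers who haven't seen the Post-LN/Pre-LN distinction, here's a rough sketch of the two orderings (illustrative Python, with attn/ffn/norm standing in for the sublayers; this is not the paper's actual code):

    def post_ln_block(x, attn, ffn, norm1, norm2):
        # Post-LN (as drawn in the original figure): normalize AFTER the residual add
        x = norm1(x + attn(x))
        x = norm2(x + ffn(x))
        return x

    def pre_ln_block(x, attn, ffn, norm1, norm2):
        # Pre-LN (as in the updated reference code and most modern LLMs):
        # normalize each sublayer's input and keep the residual path untouched
        x = x + attn(norm1(x))
        x = x + ffn(norm2(x))
        return x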


I have to agree. The article summary says

> Transformer block: Guesses the next word. It is formed by an attention block and a feedforward block.

But the diagram shows transformer blocks chained in sequence. So the next transformer block in the sequence would only receive a single word as the input? Does not make sense.


> Today’s large language models all adopt the Multi-head self-attention structure proposed in Google’s paper “Attention is all you need”. This is what people later call the Transformer structure.

Uhh... From the abstract:

> We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. —Vaswani, et al. 2017


I have the opposite: my most hated illustration.

It's the standard diagram of how a Transformer language model works (https://www.researchgate.net/figure/Transformer-Language-Mod...). When I tried to figure out transformers, I saw it in every single paper, and it hardly helped at all. I think I finally got a good understanding only when I looked at a few implementations.


Nah, the paper explicitly states that their system is neither recurrent nor convolutional:

> To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using RNNs or convolution.


On the one hand, this looks really useful.

On the other hand:

> There are various forms of attention / self-attention, Transformer (Vaswani et al., 2017) relies on the scaled dot-product attention: given a query matrix Q, a key matrix K and a value matrix V, the output is a weighted sum of the value vectors, where the weight assigned to each value slot is determined by the dot-product of the query with the corresponding key

There HAS to be a better way of communicating this stuff. I'm honestly not even sure where to start decoding and explaining that paragraph.

We really need someone with the explanatory skills of https://jvns.ca/ to start helping people understand this space.
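

For what it's worth, that paragraph boils down to a few lines of code, which I find easier to parse than the prose. A rough sketch (PyTorch; the shapes are arbitrary, not from any particular model):

    import math
    import torch

    def scaled_dot_product_attention(Q, K, V):
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # query-key dot products
        weights = torch.softmax(scores, dim=-1)            # one weight per value slot
        return weights @ V                                 # weighted sum of the values

    Q = torch.randn(16, 64)  # 16 queries of size 64
    K = torch.randn(16, 64)  # 16 keys
    V = torch.randn(16, 64)  # 16 values
    out = scaled_dot_product_attention(Q, K, V)            # (16, 64)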


GPT included a picture of the variation of the transformer model that they made.

GPT2 outlined the changes they made to the model in acceptably moderate detail.

GPT3 references another paper saying "we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer" with no detail added on the changes they made.

How are you to reproduce these results at all? You could attempt to include the changes since they reference the Sparse Transformer paper, but you could easily do it in a different way, and there would be no way to verify the results they gave, due to the differences in implementation.

A bit disappointing.


They cite the paper where the architecture was introduced. If you go to that paper, you'll see that it mostly consists of a very detailed and careful comparison with Transformer-XL.

In the new paper, they plug their memory system into vanilla BERT. This makes the resulting model essentially nothing like Transformer-XL, which was a strictly decoder-only generative language model.


Unfortunately, they didn't give many details in the paper. It's frustrating, to say the least. Yay reproducibility. They say

> We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.

In the referenced paper (Sparse Transformer) they showed a bunch of different sparsity patterns, and I believe they're referring to either their banded block-diagonal sparsity or a true banded diagonal pattern (local windows like BigBird and some other papers). Unfortunately, that paper was also light on details, and the repo they open sourced alongside it is inscrutable.
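
To make "locally banded" concrete, here is a rough sketch of such a mask. This is my own illustration with an arbitrary window size, not OpenAI's code:

    import torch

    def local_banded_causal_mask(seq_len, window):
        i = torch.arange(seq_len).unsqueeze(1)  # query positions
        j = torch.arange(seq_len).unsqueeze(0)  # key positions
        causal = j <= i                         # no attending to future tokens
        local = (i - j) < window                # stay within the band
        return causal & local                   # True where attention is allowed

    mask = local_banded_causal_mask(seq_len=8, window=3)
    # Per the GPT-3 description, layers apparently alternate between a dense
    # (fully causal) mask and a banded one like this; the exact pattern isn't given.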


> we implemented customized int8 kernels for matrix multiplications and attention

I would be curious how this differs from [1] which is supported in Huggingface’s transformers library.

[1] https://arxiv.org/abs/2208.07339


Yeah, I thought the same -- it struck me at first blush as some kind of super simple architecture that didn't use transformers, and then in the diagram I saw they used BERT to produce the embeddings!

They present it as an article about transformers in general, not ones using Flash Attention. Anyway, maybe they're presenting the per-token memory requirement instead of the requirement for the entire sequence at once.

They're using a CNN for the decoder, not a transformer.

Nice. I guess I'm 8 days behind on the SOTA...

But I'd note that it is built on top of a CNN base (ResNet or RetinaNet) and that the Attention-only system performed slightly worse than the one including the CNN layers.

Also, this isn't really a Transformer architecture, even though it uses Attention.

But maybe this is too much nitpicking? I agree that Attention is a useful primitive - my point is that the Transformer architecture is too specific.

(Also, this is a really nice paper in that it lays out the hyperparameters and training schedules they used. And that Appendix is amazing!)


The same transformer diagram from the original paper, replicated everywhere. Nobody got time for redrawing.

BTW, take a look at the "sentence transformers" library, a nice interface on top of Hugging Face for this kind of operation (reusing, fine-tuning).

https://www.sbert.net/
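
Quick usage sketch (the model name below is just a commonly used example, pick whatever fits your task):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    sentences = ["The transformer diagram is everywhere.",
                 "Nobody redraws the figure from the original paper."]
    embeddings = model.encode(sentences)  # numpy array, one vector per sentence
    print(embeddings.shape)               # (2, embedding_dim)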


Thanks for pointing this out. That was my mistake – my brain must have swapped out "different transformer architectures" with "different model architectures".

I just updated the guide: https://github.com/brexhq/prompt-engineering/commit/3a3ac17a...


> Another thing I simply did not get from the original Transformer paper is that the learning of self-attention happens in the linear layers.

You can replace the KVQ kernels with any parametric computation that allows you to pass gradients through, and you will have learning.

I think some newer language model architectures use residual blocks here, and some vision transformers use FeedForward networks for the kernels.
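
As a rough illustration of that point (my own sketch, with arbitrary sizes): the Q/K/V projections are just differentiable parametric maps, so swapping the usual linear layers for a small MLP still gives you something that trains:

    import torch
    import torch.nn as nn

    def make_kernel(d_model, kind="linear"):
        if kind == "linear":
            return nn.Linear(d_model, d_model)
        # Any differentiable parametric computation is fair game:
        return nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                             nn.Linear(d_model, d_model))

    d_model = 64
    W_q, W_k, W_v = (make_kernel(d_model, kind="mlp") for _ in range(3))
    x = torch.randn(10, d_model)
    attn = torch.softmax(W_q(x) @ W_k(x).T / d_model**0.5, dim=-1) @ W_v(x)
    attn.sum().backward()  # gradients flow into whatever parameters the kernels have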


> I have zero AI/ML knowledge

This may make it difficult to explain, and I already see many incorrect explanations here and even more lazy ones (why post the first Google result? You're just adding noise).

> Steve Yegge on Medium thinks that the team behind Transformers deserves a Nobel

First, Yegge needs to be able to tell me what Attention and Transformers are. More importantly, he needs to tell me who invented them.

That actually gets to the important point, and to why there are so many bad answers here and elsewhere: you're missing a lot of context, and the definitions themselves are murky. This is also what makes it difficult to ELI5. I'll try, then point you to resources that give an actually good answer.

== Bad Answer (ELI5) ==

A transformer is an algorithm that considers the relationships between all parts of a piece of data. It does this through 4 mechanisms, arranged in two parts. The first part is composed of a normalization block and an attention block. The normalization block scales the data and ensures that it doesn't get too large. Then the attention mechanism takes all the data handed to it and considers how each part relates to every other part. This is called "self-attention" when we only consider one input, and "cross-attention" when we compare multiple inputs. Both of these build relationships in a way that is similar to creating a lookup table.

The second part is also composed of a normalization block, this time followed by a linear layer. The linear layer reprocesses all the relationships that were just computed and gives them context. But we haven't stated the 4th mechanism! This is the residual or "skip" connection. It allows the data to pass right on by each of the above parts without being processed, and this little side path is key to getting things to train efficiently.
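
If code reads more easily than prose, here is the same ELI5 block as a rough PyTorch sketch (sizes and names are illustrative only, not any particular model):

    import torch
    import torch.nn as nn

    class TinyTransformerBlock(nn.Module):
        def __init__(self, d_model=64, n_heads=4):
            super().__init__()
            self.norm1 = nn.LayerNorm(d_model)                   # mechanism 1: normalization
            self.attn = nn.MultiheadAttention(d_model, n_heads,  # mechanism 2: (self-)attention
                                              batch_first=True)
            self.norm2 = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(                            # mechanism 3: linear layers
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            h = self.norm1(x)
            x = x + self.attn(h, h, h)[0]    # mechanism 4: the residual / "skip" path
            x = x + self.ffn(self.norm2(x))  # second part, with its own skip path
            return x

    block = TinyTransformerBlock()
    out = block(torch.randn(1, 10, 64))      # (batch, sequence, features) in and out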

Now that doesn't really do the work justice or give a good explanation of why or how things actually work. ELI5 isn't a good way to understand things for practical use, but it is an okay place to start and to learn abstract concepts. For the next level up I suggest Training Compact Transformers[0]. It gives some illustrations and code to help you follow along; it is focused on vision transformers, but it is all the same. The next level is Karpathy's video on GPT[1], where you build transformers and he goes into a bit more depth. Both of these are good for novices and people with little mathematical background. For more lore, and for understanding why we got here and where the confusion over the definition of attention comes from, I suggest Lilian Weng's blog[2] (everything she does is gold). For a lecture with more depth I suggest Pascal Poupart's class. Lecture 19[3] is the one on attention and transformers, but you need to watch at least Lecture 18 first, and if you actually have no ML experience or knowledge then you should probably start from the beginning.

The truth is that not everything can be explained in simple terms, at least not if one wants an adequate understanding. That misquotation of Einstein (probably originating with Nelson) is far from accurate, and I wouldn't expect someone who introduced a highly abstract concept through complex mathematics (to such a degree that physicists argued he was a mathematician) to say something so silly. A lot is lost when distilling a concept, and neither the listener nor the speaker should fool themselves into believing the distillation makes them knowledgeable (armchair expertise is a frustrating feature of the internet and has gotten our society into a lot of trouble).

[0] https://medium.com/pytorch/training-compact-transformers-fro...

[1] https://www.youtube.com/watch?v=kCc8FmEb1nY

[2] https://lilianweng.github.io/posts/2018-06-24-attention/

[3] https://www.youtube.com/watch?v=OyFJWRnt_AY
