
For those who are interested in how the Transformer is taking over, please read this thread by Karpathy where he talks about a consolidation in ML:

https://twitter.com/karpathy/status/1468370605229547522

And of course one of the early classic papers in the field, as a bonus:

https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd05...

(The paper is mentioned in the article)





> There's transparent evidence that transformer-based LLMs existed before ChatGPT or Llama.

Transformers were first published in "Attention Is All You Need" by Google in 2017: https://arxiv.org/abs/1706.03762

> There was originally GPT-2, which was not a public release but did get reverse engineered into GPT-Neo and GPT-J (among others).

GPT-2 had a proper release in 2019 with the code, structure, checkpoints, etc. - https://openai.com/research/gpt-2-1-5b-release

> Besides that, there was BERT and T5 from Google, Longformers, XLNet, and dozens of other LLMs that predate the modern ones.

Yes, all the big players had various research models. We simply crossed the line into large-scale usefulness just recently (outside of classification/summarization), and like you said, OpenAI was the first to really make it accessible to the public.


Fascinating work, very promising.

Can you summarise how the model in your paper differs from this one?

https://github.com/huggingface/transformers/issues/27011


> Today’s large language models all adopt the Multi-head self-attention structure proposed in Google’s paper “Attention is all you need”. This is what people later call the Transformer structure.

Uhh... From the abstract:

> We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. —Vaswani, et al. 2017
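
For concreteness, here is a minimal numpy sketch of what "multi-head self-attention" means. This is illustrative only: random toy weights, no mask, no layer norm, residuals or feed-forward block, and not the paper's reference code.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_self_attention(x, Wq, Wk, Wv, Wo, n_heads):
        # x: (seq_len, d_model); the W matrices are (d_model, d_model)
        seq_len, d_model = x.shape
        d_head = d_model // n_heads
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        # split the model dimension into heads: (n_heads, seq_len, d_head)
        split = lambda t: t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)     # (n_heads, seq, seq)
        out = softmax(scores) @ v                               # attention-weighted values
        out = out.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads
        return out @ Wo

    # toy usage with random weights
    rng = np.random.default_rng(0)
    d_model, n_heads, seq_len = 64, 4, 10
    x = rng.normal(size=(seq_len, d_model))
    Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
    y = multi_head_self_attention(x, Wq, Wk, Wv, Wo, n_heads)   # (10, 64)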


> On a side note it feels like each command takes longer to process than the previous - almost like it is re-doing everything for each command (and that is how it keeps state).

That's probably because it is indeed redoing everything, most likely to keep the implementation simple: they just append the new input and re-run the whole network.

The typical data dependency structure in a transformer architecture is the following:

    output_t0   output_t1   output_t2   output_t3  | output_t4
    feat_L4_t0  feat_L4_t1  feat_L4_t2  feat_L4_t3 | feat_L4_t4
    feat_L3_t0  feat_L3_t1  feat_L3_t2  feat_L3_t3 | feat_L3_t4
    feat_L2_t0  feat_L2_t1  feat_L2_t2  feat_L2_t3 | feat_L2_t4
    feat_L1_t0  feat_L1_t1  feat_L1_t2  feat_L1_t3 | feat_L1_t4
    input_t0    input_t1    input_t2    input_t3   | input_t4

The features feat_Li_tj at layer Li and time tj depend only on the features of layer L(i-1) at times t <= tj.

If you append some new input at the next time t4 and recompute everything from scratch, none of the feature values for times before t4 change.

To compute the features and output at time t4 you need all the values of the previous times for all layers.

The alternative to recomputing would be to preserve the previously generated features and incrementally build the last chunk by stitching it onto them. If you have your AI assistant running locally that's something you can do, but when you are serving plenty of different sessions you will quickly run out of memory.
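
To make the recompute-vs-cache point concrete, here's a toy numpy sketch of a single causal self-attention layer (no projections, no extra layers, purely illustrative): recomputing from scratch leaves the prefix values unchanged, and caching the previous keys/values reproduces the new token's output exactly.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def causal_attention(q, k, v):
        # token i may only attend to positions <= i (lower-triangular mask)
        t, d = q.shape
        scores = q @ k.T / np.sqrt(d)
        scores = np.where(np.tril(np.ones((t, t), dtype=bool)), scores, -np.inf)
        return softmax(scores) @ v

    rng = np.random.default_rng(0)
    d = 16
    seq = rng.normal(size=(5, d))                # 4 old tokens plus 1 new token
    full = causal_attention(seq, seq, seq)       # the "recompute everything" path

    # cached path: keep k/v for the first 4 tokens, only run the new query
    k_cache, v_cache = seq[:4], seq[:4]
    q_new = seq[4:5]
    k_all = np.concatenate([k_cache, q_new])     # append the new key
    v_all = np.concatenate([v_cache, q_new])     # append the new value
    incremental = softmax(q_new @ k_all.T / np.sqrt(d)) @ v_all

    # prefix features are unchanged, and the cached path matches the full recompute
    assert np.allclose(causal_attention(seq[:4], seq[:4], seq[:4]), full[:4])
    assert np.allclose(incremental[0], full[4])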

With simple transformers, the time horizon used to be limited because attention scales quadratically in compute, but they are probably using an attention variant that scales in O(n*log(n)), something like the Reformer, which allows them to handle very long sequences cheaply and probably explains the boost in performance compared to previous GPTs.
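
Which attention variant they actually use is of course a guess on my part; the scaling argument itself is just arithmetic:

    import math

    for n in (1_000, 10_000, 100_000):
        full = n * n               # pairwise scores in vanilla attention
        sub = n * math.log2(n)     # an O(n log n) scheme such as Reformer's LSH attention
        print(f"n={n:>7,}  n^2={full:.1e}  n*log2(n)={sub:.1e}  ratio={full/sub:,.0f}x")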


This is a great article. The author breaks down key advancements in AI, from Nvidia's release of CUDA in 2006 and deep learning with CNNs in 2012 to recent language translation models and their shortcomings.

The coverage of transformers is particularly well-done, including their novel approach to language translation and their generality.

Quanta Magazine has covered transformers as well [1,2].

[1] Will Transformers Take Over Artificial Intelligence? - https://www.quantamagazine.org/will-transformers-take-over-a...

[2] How Transformers Seem to Mimic Parts of the Brain - https://www.quantamagazine.org/how-ai-transformers-mimic-par...


A few more interesting papers not mentioned in the article:

"Faith and Fate: Limits of Transformers on Compositionality"

https://arxiv.org/abs/2305.18654

"Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks":

https://arxiv.org/abs/2311.09247

"Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve"

https://arxiv.org/abs/2309.13638

"Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models"

https://arxiv.org/abs/2311.00871


>> But Transformers have one core problem. In a transformer, every token can look back at every previous token when making predictions.

> Lately I've been wondering... is this a problem, or a strength?

Exactly. There are a lot of use cases where perfect recall is important, and the earlier data may be more or less incompressible, such as when an LLM is working on a large table of data.

Maybe we'll end up with different architectures being used for different applications. E.g. simple chat may be OK with an RNN type architecture.

I've also seen people combine Mamba and Transformer layers. Maybe that's a good tradeoff for some other applications.
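
As a purely illustrative sketch of that kind of hybrid stack (toy layers made up for the example, not actual Mamba or Transformer blocks): the interleaving itself is simple, and the tradeoff is visible in the state each layer carries. The attention layers keep the whole token history, while the recurrent/SSM-style layers keep a fixed-size state no matter how long the sequence gets.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 32

    class ToyAttentionLayer:
        """Keeps every past token: perfect recall, memory grows with sequence length."""
        def __init__(self):
            self.history = []
        def step(self, x):
            self.history.append(x)
            past = np.stack(self.history)                    # (t, d)
            w = np.exp(past @ x / np.sqrt(d))                # weights against all past tokens
            return (w[:, None] * past).sum(axis=0) / w.sum()

    class ToyRecurrentLayer:
        """RNN/SSM-flavoured: a fixed-size state, regardless of sequence length."""
        def __init__(self):
            self.h = np.zeros(d)
            self.A = rng.normal(size=(d, d)) * 0.05
        def step(self, x):
            self.h = np.tanh(self.A @ self.h + x)
            return self.h

    # interleave the two kinds of layers, as the hybrid models do
    stack = [ToyAttentionLayer(), ToyRecurrentLayer(),
             ToyAttentionLayer(), ToyRecurrentLayer()]

    for token in rng.normal(size=(6, d)):                    # feed a short toy sequence
        h = token
        for layer in stack:
            h = layer.step(h)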


I'm not sure that there's anything truly new in this article. Transformer-based models are very compute- and parameter-heavy, yes - but that's because they're optimized for generality and easily parallelizable training, even at a very large scale. The ML community has always been aware of this as an unaddressed issue. Once compute cost for both training and inference becomes a relevant metric, there's lots of things you can do as far as model architecture goes to make smaller and leaner varieties more applicable.

I am a researcher on the AI/systems side and I wanted to chime in. Transformers are amazing for language and have broken SOTA in many areas (at the start of the year, some people may have wondered if CNNs are dead [they are not, as I see it]). The issue with Transformer models is the insane amount of data they need. There is some amazing progress on using unsupervised methods to help, but that just saves you on data costs. You still need an insane amount of GPU horsepower to train these things. I think this will be a bottleneck to progress. The average university researcher (unless from a tier 1 school with large funding/donors) is going to pretty much get locked out. That basically leaves the 5-6 key corporate labs to take things forward on the transformer front.

RL, which I think this particular story is about, is an odd duck. I have papers on this and I personally have mixed feelings. I am a very applications/solutions-oriented researcher and I am a bit skeptical about how pragmatic the state of the field is (e.g. reward function specification). The argument made by the OpenAI founder, that RL is not amenable to taking advantage of large datasets, is a pretty valid point.

Finally, you raise interesting points on running multiple complex DNNs. Have you tried hooking things up to ROS and using that as a scaffolding? (I'm not a robotics guy, I just dabble in that as a hobby, so I'm curious what the solutions are.) Google has something called MediaPipe, which is intriguing but maybe not what you need. I've seen some NVIDIA frameworks but they basically do pub-sub in a sub-optimal way. Curious what your thoughts are on what makes existing solutions insufficient (I feel they are too!)


I've been reading this paper with pseudocode for various transformers and finding it helpful: https://arxiv.org/abs/2207.09238

"This document aims to be a self-contained, mathematically precise overview of transformer architectures and algorithms (not results). It covers what transformers are, how they are trained, what they are used for, their key architectural components, and a preview of the most prominent models."


TIL LSTMs are out, Transformers are in.

Here are a few resources that might help get you started:

1. A thread explaining the internal working of transformers: https://twitter.com/hippopedoid/status/1641432291149848576?s...

2. Paper by DeepMind which provides pseudo-code for important algorithms for Transformer models: https://arxiv.org/pdf/2207.09238.pdf

3. Another thread specifically on large language models: https://twitter.com/cwolferesearch/status/164044611134855577...

Once again these are not courses per se, but do provide intuitive explanations for how transformers work. There is also the nanoGPT series of videos by Karpathy on youtube. First video here: https://www.youtube.com/watch?v=kCc8FmEb1nY


The author has more experience in the field than me, so I gotta defer to him for the most part, and while I generally agree with the post, I disagree strongly with one point. The author frames this new era of AI as being about transformer models (and diffusion models), but transformers have been around for a while and were useful well before GPT-3, the model the author claims as the starting point for this new era. BERT, a transformer model released in 2018, is very useful and showed the promise of transformers before GPT-3.

Edit: Going back through the post, the author’s slide has Transformers labeled as 2017, so he is aware of the history; he’s just emphasizing that GPT-3 was the first transformer model that he thinks had something interesting and related to the current AI explosion. I think BERT-style models would be worth a mention in the post as the first transformer models found to be widely useful.


(DID NOT READ THE ENTIRE PAPER, only the abstract, the definition of the language and some of the experiments)

Not sure how useful this is in the larger context of transformers. Transformers (and deep networks in general) are often used when the logic needed to solve a problem is largely unknown. For example: how do you write a RASP program that identifies names in a document?

They do have some simple RASP examples in the paper of things that a transformer model can accomplish (Symbolic Reasoning in Transformers) but, again, this is usually something that the model can do as a result of the task it was originally trained for, not a task in and of itself.


Well, the transformer was invented at Google, and language models were decades old. But scaling it up had not been done before, and preparing a dataset of that size and babysitting the model so its loss doesn't explode during training were all innovations that required lots of work and money. Now we can just copy the same formula without redoing all the steps.

It seems like this field is at about the same stage of progress as image recognition was in the 90s, when researchers were trying to get a handle on MNIST-type tasks.

I wonder how much the language embeddings learned by the transformer are reflected in the actual physical structure of the brain? Could it be that the transformer is making the same sort of representations as those in the brain, or is it learning entirely new representations? My guess is that it's doing something quite different from what the brain is doing, although I wouldn't rule out some sort of convergence. Either way, this is a fascinating branch of research both for AI and the cognitive sciences.


This is not the consensus among ML researchers. Transformers are showing strong generalisation[1] and their performance continues to surprise us as they scale[2].

The Socratic paper is not about “higher intelligence”; it’s about demonstrating useful behaviour purely by connecting several large models via language.

[1] https://arxiv.org/abs/2201.02177

[2] https://arxiv.org/abs/2204.02311


This! The best resource I've found to explain transformers, that made them clear to me. I wish all deep learning papers were written like this, using pseudocode.
