There is an annotated version of the paper; it also links to a GitHub repo.
Google "The Annotated Transformer". It's hosted at nlp.seas.harvard.edu.
For those in the field, the progress of model performance feels just as rapid. Language models today are incredibly powerful compared to language models in the early 2010s.
I think the advent of retrieval models (retrieval transformers) will continue this compute trend in a more efficient manner. They allow much of the knowledge to be offloaded into an index, so the model's compute can be focused rather than spent on memorizing facts in the weights.
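As a rough sketch of the retrieval idea (not the actual RETRO pipeline; the hash-based embedding below is just a stand-in for a learned encoder), the knowledge lives in an external index and is looked up per query:

```python
# Toy sketch of retrieval-augmented generation: knowledge sits in an external
# index, and the language model only conditions on what is retrieved.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Placeholder "encoder": hash-based bag-of-words vector, just for the demo.
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Knowledge base: text chunks indexed by their embeddings.
chunks = [
    "The Transformer uses multi-head self-attention.",
    "RNNs process tokens sequentially and struggle with long contexts.",
    "RETRO retrieves nearest-neighbour chunks from a large database.",
]
index = np.stack([embed(c) for c in chunks])

def retrieve(query: str, k: int = 2) -> list[str]:
    # Nearest neighbours by cosine similarity (embeddings are normalized).
    scores = index @ embed(query)
    return [chunks[i] for i in np.argsort(-scores)[:k]]

query = "How does attention work in the Transformer?"
prompt = "\n".join(retrieve(query)) + "\n\nQuestion: " + query
print(prompt)  # a language model would now condition on the retrieved context
```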
Is it, though? Codex was trained mostly on Python code, and given the cutoff date, it has all the relevant ML bits. And if you ask it about things like transformers, self-attention, etc., it's clear that papers on those topics were part of its training set as well. The question is whether it can reason well enough about them to produce meaningful improvements.
FWIW, if you ask it how transformers could be improved, it basically gives a recap of various relevant papers that it knows: Longformer, BigBird, etc. Amazingly, all the papers cited actually exist and have the stated titles, authors, and dates. But then if you ask it for some ideas of its own:
> Can you think of other novel improvement ideas that haven't been investigated yet, to the best of your knowledge? What makes the most sense based on your understanding of the subject matter.
As an AI language model, my knowledge is derived from the text data I have been trained on, which includes research papers and articles published up to September 2021. While it's difficult to guarantee the novelty of any idea, here are a few suggestions for potential improvements that may not have been thoroughly investigated yet:
Context-Adaptive Layers: Instead of using the same set of layers for all input sequences, consider using context-adaptive layers that are dynamically selected or generated based on the input's characteristics. This could allow the model to adapt its processing strategy to different input types, such as formal vs. informal text, or narrative vs. expository text, which might improve its performance on diverse tasks.
Multi-modal Fusion: Develop techniques to better fuse information from multiple modalities, such as text, images, and audio, within the Transformer architecture. Improved multi-modal fusion could lead to enhanced performance on tasks that require understanding and generating content across different modalities, such as visual question answering or image captioning.
Meta-Learning for Architecture Search: Employ meta-learning techniques to search for the best Transformer architecture for a specific task or domain. By learning to optimize the architecture itself, the model might be able to discover more effective configurations or components that have not been considered before.
Transformer Modules for Specific Tasks: Design specialized Transformer modules tailored for specific NLP tasks, such as syntactic parsing, named entity recognition, or sentiment analysis. These task-specific modules could be combined with the generic Transformer architecture to create a more powerful and adaptable model.
Hierarchical Transformers: Explore hierarchical architectures where multiple layers of Transformers work together to process information at different levels of abstraction. This could help the model capture both low-level patterns, such as local dependencies, and high-level patterns, such as long-term dependencies and semantic relationships, more effectively.
Enhanced Tokenization Strategies: Investigate novel tokenization strategies that better capture the structure and meaning of the input text. This could involve using character-level, subword-level, or even syntax-based tokenization to better represent the input for the Transformer model.
Regularization Techniques: Develop novel regularization techniques specifically tailored for the Transformer architecture. This could help improve generalization and prevent overfitting, especially when training large-scale models with billions of parameters.
Most of the papers you list are about the model itself: there is the original Transformer paper, and most of the others are variations of the Transformer.
I think to get into the field and get a good overview, you should also look a bit beyond the Transformer. E.g. RNNs/LSTMs are still a must-learn, even though Transformers might be better at many tasks. And then all those memory-augmented models, e.g. the Neural Turing Machine and its follow-ups, are important too.
It also helps to know the different architectures, such as pure language models (GPT) and attention-based encoder-decoders (e.g. the original Transformer), but then also CTC, hybrid HMM-NN, and transducers (RNN-T); a minimal CTC example is sketched below.
Diffusion models are another recent and rather different kind of model.
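Since CTC comes up much less often than the standard per-token cross-entropy setup, here is a minimal sketch of a CTC training step, assuming PyTorch and toy tensor shapes (not tied to any particular acoustic model):

```python
# Minimal CTC training step. CTC aligns an unsegmented label sequence to a
# longer frame sequence, which is a quite different setup from per-token
# cross entropy. Shapes below are toy values for illustration.
import torch
import torch.nn.functional as F

T, N, C = 50, 4, 28           # input frames, batch size, classes (incl. blank=0)
S = 12                        # max target length

log_probs = F.log_softmax(torch.randn(T, N, C), dim=-1)   # (T, N, C), as CTCLoss expects
targets = torch.randint(1, C, (N, S), dtype=torch.long)   # labels; 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(5, S + 1, (N,), dtype=torch.long)

ctc = torch.nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```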
But what really comes up short in this list are papers on the training aspect. Most of the papers you list do supervised training with a cross-entropy loss. However, there are many other approaches:
You have CLIP in here, specifically to combine text and image modalities.
There is the whole field of unsupervised or self-supervised training methods. Language model training (next-token prediction) is one example, but there are others; a minimal sketch of the next-token objective follows below.
And then there is the big field of reinforcement learning, which is probably also quite relevant for AGI.
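To make the self-supervised objective concrete, here is a minimal sketch of next-token prediction as shifted cross entropy, assuming PyTorch; the toy model stands in for a real Transformer decoder:

```python
# Next-token prediction: the "labels" are just the input shifted by one,
# so no manual annotation is needed (self-supervised). Toy model for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim, batch, seq = 1000, 64, 8, 32
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))  # stand-in for a Transformer decoder

tokens = torch.randint(0, vocab, (batch, seq))
logits = model(tokens[:, :-1])            # predict from all but the last token
targets = tokens[:, 1:]                   # ... the next token at each position
loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()                           # same cross entropy, but labels come for free
```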
(DID NOT READ THE ENTIRE PAPER, only the abstract, the definition of the language and some of the experiments)
Not sure how useful this is in the larger context of transformers. Transformers (and deep networks in general) are often used when the logic needed to solve a problem is largely unknown. Example -- how do you write a RASP program that identifies names in a document?
They do have some simple RASP examples in the paper of things a transformer model can accomplish (Symbolic Reasoning in Transformers), but, again, that is usually something the model can do as a result of the task it was originally trained for, not a task in and of itself.
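To make the contrast concrete: identifying names is normally done with a learned model rather than an explicit program. A minimal sketch, assuming the Hugging Face transformers package is installed (the default NER pipeline model is used purely for illustration):

```python
# Identifying names in a document is the kind of task where the "program"
# lives in learned weights, not in explicit rules one could write by hand.
from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")
text = "Ada Lovelace worked with Charles Babbage on the Analytical Engine."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```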