
Under their language modeling section, they include the Transformer and have listed some papers from 2019, so it seems to be reasonably up to date.



Worryingly, I am not sure the people working on this really understand what a Transformer is.

Quote from them:

“There is still active research in non-transformer based language models though, such as Amazon’s AlexaTM 20B which outperforms GPT-3”

Quote from said paper:

“For AlexaTM 20B, we used the standard Transformer model architecture”

(It's just an encoder-decoder Transformer.)
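
For reference, the operation at the heart of the "standard Transformer model architecture" is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, from the 2017 paper. Below is a minimal pure-Python sketch of that one formula. The function names and the list-of-lists layout are my own choices for illustration, and it leaves out multi-head projections, masking, and everything else a real encoder-decoder needs.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors.

    queries: list of d_k-dimensional vectors
    keys:    list of d_k-dimensional vectors
    values:  list of d_v-dimensional vectors, one per key
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(d_v)
        ]
        outputs.append(out)
    return outputs
```

When every key is identical the weights are uniform, so the output is just the mean of the value vectors. In an encoder-decoder model the same function is used three ways: encoder self-attention, masked decoder self-attention, and decoder-to-encoder cross-attention.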


There is an annotated version of the paper, which also links to a GitHub repo. Google “The Annotated Transformer”. It's on nlp.seas.harvard.edu.

Transformers have been built for translation since the very first iteration, the 2017 paper.

"Transformers for Natural Language Processing" by Denis Rothman on Packt is a good one.

Make sure to get the 2nd edition for the coverage of newer models, particularly the ones from Hugging Face.

Everything I've read from Denis is really good. He's great at explaining complex topics in ways that are straightforward and digestible.


For those in the field, the progress of model performance feels just as rapid. Language models today are incredibly powerful compared to language models in the early 2010s.

I think the advent of retrieval models (retrieval transformers) will continue this compute trend in a more efficient manner. They let the compute be focused on indexing the knowledge.


The transformer model was introduced in the 2017 paper “Attention is all you need”, which applied it to machine translation of human language.

Paper:

CTRL: A Conditional Transformer Language Model for Controllable Generation (Largest Publicly Available Language Model)

https://einstein.ai/presentations/ctrl.pdf

Link to model release:

https://www.github.com/salesforce/ctrl


Is it, though? Codex was trained on Python code mostly, and given the cutoff date, it has all the relevant ML bits. And if you ask it about things like transformers, self-attention etc, it's clear that papers on that were a part of its training set, as well. The question is whether it can reason well enough about it to produce meaningful improvements.

FWIW if you ask it how transformers could be improved, it basically gives a recap of various relevant papers that it knows: Longformer, BigBird etc. Amazingly, all the papers cited actually exist and have the specified titles, names, and dates. But then if you ask it for some ideas of its own:

> Can you think of other novel improvement ideas that haven't been investigated yet, to the best of your knowledge? What makes the most sense based on your understanding of the subject matter.

As an AI language model, my knowledge is derived from the text data I have been trained on, which includes research papers and articles published up to September 2021. While it's difficult to guarantee the novelty of any idea, here are a few suggestions for potential improvements that may not have been thoroughly investigated yet:

Context-Adaptive Layers: Instead of using the same set of layers for all input sequences, consider using context-adaptive layers that are dynamically selected or generated based on the input's characteristics. This could allow the model to adapt its processing strategy to different input types, such as formal vs. informal text, or narrative vs. expository text, which might improve its performance on diverse tasks.

Multi-modal Fusion: Develop techniques to better fuse information from multiple modalities, such as text, images, and audio, within the Transformer architecture. Improved multi-modal fusion could lead to enhanced performance on tasks that require understanding and generating content across different modalities, such as visual question answering or image captioning.

Meta-Learning for Architecture Search: Employ meta-learning techniques to search for the best Transformer architecture for a specific task or domain. By learning to optimize the architecture itself, the model might be able to discover more effective configurations or components that have not been considered before.

Transformer Modules for Specific Tasks: Design specialized Transformer modules tailored for specific NLP tasks, such as syntactic parsing, named entity recognition, or sentiment analysis. These task-specific modules could be combined with the generic Transformer architecture to create a more powerful and adaptable model.

Hierarchical Transformers: Explore hierarchical architectures where multiple layers of Transformers work together to process information at different levels of abstraction. This could help the model capture both low-level patterns, such as local dependencies, and high-level patterns, such as long-term dependencies and semantic relationships, more effectively.

Enhanced Tokenization Strategies: Investigate novel tokenization strategies that better capture the structure and meaning of the input text. This could involve using character-level, subword-level, or even syntax-based tokenization to better represent the input for the Transformer model.

Regularization Techniques: Develop novel regularization techniques specifically tailored for the Transformer architecture. This could help improve generalization and prevent overfitting, especially when training large-scale models with billions of parameters.


Most of these papers you list are about the model, and there is the original Transformer paper, and most of the others are some variations of the Transformer.

I think to get into the field, to get a good overview, you should also look a bit beyond the Transformer. E.g. RNNs/LSTMs are still a must learn, even though Transformers might be better in many tasks. And then all those memory-augmented models, e.g. Neural Turing Machine and follow-ups, are important too.

It also helps to know different architectures, such as just language models (GPT), attention-based encoder-decoder (e.g. original Transformer), but then also CTC, hybrid HMM-NN, transducers (RNN-T).

Diffusion models are another recent, different kind of model.

But then, what really comes up short in this list are papers on the training aspect. Most of the papers you list do supervised training, using cross entropy loss. However, there are many others:

You have CLIP in here, specifically to combine text and image modalities.

There is the whole field of unsupervised or self-supervised training methods. Language model training (next label prediction) is one example, but there are others.
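
To make the "next label prediction" objective concrete, here is a minimal sketch: the loss is the average cross entropy of predicting each token from the tokens before it, so the text supplies its own labels. The `predict` callback and the uniform toy model are hypothetical stand-ins for illustration. A real language model would return learned probabilities and would be trained by gradient descent on this quantity.

```python
import math

def next_token_loss(token_ids, predict):
    """Average cross entropy of predicting each token from its prefix.

    token_ids: list of integer token ids (length >= 2)
    predict:   function mapping a prefix (list of ids) to a list of
               probabilities over the vocabulary (hypothetical interface)
    """
    total = 0.0
    for t in range(1, len(token_ids)):
        probs = predict(token_ids[:t])
        # The "label" is simply the token that actually comes next.
        total += -math.log(probs[token_ids[t]])
    return total / (len(token_ids) - 1)

def uniform_model(prefix, vocab_size=4):
    # Toy model that ignores the prefix and guesses uniformly.
    return [1.0 / vocab_size] * vocab_size
```

For a model that guesses uniformly over a vocabulary of size V, this loss is log(V). Anything a trained model achieves below that reflects structure it has picked up from the unlabeled text.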

And then there is the big field of reinforcement learning, which is probably also quite relevant for AGI.


For those who are interested in how the Transformer is taking over, please read this thread by Karpathy where he talks about a consolidation in ML:

https://twitter.com/karpathy/status/1468370605229547522

And of course one of the early classic papers in the field, as a bonus:

https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd05...

(The paper is mentioned in the article)


The ULMFiT, ELMo, and OpenAI Transformer papers are all quite readable and linked from the article. Sebastian and I also wrote an introduction to ULMFiT here: http://nlp.fast.ai/classification/2018/05/15/introducting-ul...

In their Transformer section they have implementations of:

   - kNN-LM: Generalization through Memorization
   - Feedback Transformer
   - Switch Transformer

These are all from recent, highly interesting papers.
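
As a rough illustration of the first item, the kNN-LM idea is to interpolate the model's next-token distribution with one built from nearest neighbors in a datastore of (context vector, next token) pairs: p = lambda * p_kNN + (1 - lambda) * p_LM. The sketch below is my own toy version in pure Python, not the implementation from the collection being discussed. The function names, squared-L2 distance, `k`, and `lam` values are assumptions chosen for illustration.

```python
import math

def sq_dist(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_distribution(query, datastore, vocab_size, k=2):
    """Next-token distribution from the k nearest stored contexts.

    datastore: list of (key_vector, next_token_id) pairs
    """
    nearest = sorted(datastore, key=lambda item: sq_dist(query, item[0]))[:k]
    dists = [sq_dist(query, key) for key, _ in nearest]
    d_min = min(dists)
    # Softmax over negative distances, shifted by d_min for stability.
    weights = [math.exp(-(d - d_min)) for d in dists]
    total = sum(weights)
    probs = [0.0] * vocab_size
    for w, (_, token_id) in zip(weights, nearest):
        # Neighbors that share a next token pool their probability mass.
        probs[token_id] += w / total
    return probs

def interpolate(p_knn, p_lm, lam=0.25):
    # Mix the retrieval distribution with the model's own distribution.
    return [lam * a + (1.0 - lam) * b for a, b in zip(p_knn, p_lm)]
```

The appeal is that the datastore can be extended without retraining the model, which is where "generalization through memorization" comes from.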

Is it weird they mentioned these examples and not OpenAI, Anthropic, Gemini, etc.?

> Transformer-based language models — which include BERT, RoBERTa, and IBM’s Slate and Granite family of models

Why would they not mention the most popular transformer based language models?


(DID NOT READ THE ENTIRE PAPER, only the abstract, the definition of the language and some of the experiments)

Not sure how useful this is in the larger context of transformers. Transformers (and deep networks in general) are often used when the logic to be used in solving a problem is largely unknown. Example -- How do you write a RASP program that identifies names in a document?

They do have some simple RASP examples in the paper of things that a transformer model can accomplish (Symbolic Reasoning in Transformers) but, again, this is usually something that the model can do as a result of the task it was originally trained for, not a task in and of itself.
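
For a feel of the kind of task that does fit this style, here is a small Python imitation of two RASP-like primitives, `select` (build an attention pattern from a predicate on keys and queries) and `aggregate` (average the selected values), used to reverse a sequence. This is a sketch written from my reading of the paper, not the RASP language or its reference implementation, and the exact signatures are my assumption.

```python
def select(keys, queries, predicate):
    # Row i marks which key positions query position i attends to.
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values):
    """Combine the selected values at each query position.

    A single selected value is passed through unchanged (so it works on
    non-numeric tokens); several numeric values are averaged.
    """
    out = []
    for row in selector:
        chosen = [v for picked, v in zip(row, values) if picked]
        if not chosen:
            out.append(None)
        elif len(chosen) == 1:
            out.append(chosen[0])
        else:
            out.append(sum(chosen) / len(chosen))
    return out

def reverse(tokens):
    n = len(tokens)
    indices = list(range(n))
    # Position i asks for the key whose index equals n - 1 - i.
    flip = select(indices, [n - 1 - i for i in indices], lambda k, q: k == q)
    return aggregate(flip, tokens)
```

The logic here is fully known in advance, which is exactly the point above: it is much less obvious what the predicate would be for something like finding names in a document.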


A few more interesting papers not mentioned in the article:

"Faith and Fate: Limits of Transformers on Compositionality"

https://arxiv.org/abs/2305.18654

"Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks":

https://arxiv.org/abs/2311.09247

"Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve"

https://arxiv.org/abs/2309.13638

"Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models"

https://arxiv.org/abs/2311.00871


Likewise, Alexander Rush (http://nlp.seas.harvard.edu/rush.html) at HarvardNLP provides an excellent web page, "The Annotated Transformer" (http://nlp.seas.harvard.edu/2018/04/03/attention.html), which basically provides a line by line discussion of the code and how it relates to the Transformer model.

* Code for The Annotated Transformer blog post (GitHub): https://github.com/harvardnlp/annotated-transformer

Rush also presents a workshop paper on this model (http://aclweb.org/anthology/W18-2509).

Of course all of that is in reference to the original Google Brain/Research paper, "Attention Is All You Need"

* arXiv landing page: https://arxiv.org/abs/1706.03762

* PDF: https://arxiv.org/pdf/1706.03762.pdf


"The idea is nearly 30 years old and has been used for large language models before, such as Google's Switch Transformer."

Innovation! :)


In 2022, most people doing NLP use transformers from Hugging Face. The tokenizer is written in Rust and used transparently from Python.

Also, transformer-based language models were already being used by spaCy, right?