There is an annotated version of the paper; it also links to a GitHub repo.
Google "The Annotated Transformer". It's hosted at nlp.seas.harvard.edu.
For those in the field, the progress of model performance feels just as rapid. Language models today are incredibly powerful compared to language models in the early 2010s.
I think the advent of retrieval models (retrieval transformers) will continue this compute trend in a more efficient manner. They allow much of the knowledge to be offloaded into an index, so the model's compute can be focused rather than spent on memorizing facts in the weights.
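As a rough sketch of the retrieval idea (not the actual RETRO pipeline; the hash-based embedding below is just a stand-in for a learned encoder), the knowledge lives in an external index and is looked up per query:

```python
# Toy sketch of retrieval-augmented generation: knowledge sits in an external
# index, and the language model only conditions on what is retrieved.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Placeholder "encoder": hash-based bag-of-words vector, just for the demo.
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Knowledge base: text chunks indexed by their embeddings.
chunks = [
    "The Transformer uses multi-head self-attention.",
    "RNNs process tokens sequentially and struggle with long contexts.",
    "RETRO retrieves nearest-neighbour chunks from a large database.",
]
index = np.stack([embed(c) for c in chunks])

def retrieve(query: str, k: int = 2) -> list[str]:
    # Nearest neighbours by cosine similarity (embeddings are normalized).
    scores = index @ embed(query)
    return [chunks[i] for i in np.argsort(-scores)[:k]]

query = "How does attention work in the Transformer?"
prompt = "\n".join(retrieve(query)) + "\n\nQuestion: " + query
print(prompt)  # a language model would now condition on the retrieved context
```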
Is it, though? Codex was trained mostly on Python code, and given the cutoff date, it has all the relevant ML bits. And if you ask it about things like transformers, self-attention, etc., it's clear that papers on those topics were part of its training set as well. The question is whether it can reason well enough about them to produce meaningful improvements.
FWIW, if you ask it how transformers could be improved, it basically gives a recap of various relevant papers that it knows: Longformer, BigBird, etc. Amazingly, all the papers cited actually exist and have the stated titles, authors, and dates. But then if you ask it for some ideas of its own:
> Can you think of other novel improvement ideas that haven't been investigated yet, to the best of your knowledge? What makes the most sense based on your understanding of the subject matter.
As an AI language model, my knowledge is derived from the text data I have been trained on, which includes research papers and articles published up to September 2021. While it's difficult to guarantee the novelty of any idea, here are a few suggestions for potential improvements that may not have been thoroughly investigated yet:
Context-Adaptive Layers: Instead of using the same set of layers for all input sequences, consider using context-adaptive layers that are dynamically selected or generated based on the input's characteristics. This could allow the model to adapt its processing strategy to different input types, such as formal vs. informal text, or narrative vs. expository text, which might improve its performance on diverse tasks.
Multi-modal Fusion: Develop techniques to better fuse information from multiple modalities, such as text, images, and audio, within the Transformer architecture. Improved multi-modal fusion could lead to enhanced performance on tasks that require understanding and generating content across different modalities, such as visual question answering or image captioning.
Meta-Learning for Architecture Search: Employ meta-learning techniques to search for the best Transformer architecture for a specific task or domain. By learning to optimize the architecture itself, the model might be able to discover more effective configurations or components that have not been considered before.
Transformer Modules for Specific Tasks: Design specialized Transformer modules tailored for specific NLP tasks, such as syntactic parsing, named entity recognition, or sentiment analysis. These task-specific modules could be combined with the generic Transformer architecture to create a more powerful and adaptable model.
Hierarchical Transformers: Explore hierarchical architectures where multiple layers of Transformers work together to process information at different levels of abstraction. This could help the model capture both low-level patterns, such as local dependencies, and high-level patterns, such as long-term dependencies and semantic relationships, more effectively.
Enhanced Tokenization Strategies: Investigate novel tokenization strategies that better capture the structure and meaning of the input text. This could involve using character-level, subword-level, or even syntax-based tokenization to better represent the input for the Transformer model.
Regularization Techniques: Develop novel regularization techniques specifically tailored for the Transformer architecture. This could help improve generalization and prevent overfitting, especially when training large-scale models with billions of parameters.
Most of the papers you list are about the model itself: there is the original Transformer paper, and most of the others are variations of the Transformer.
I think to get into the field and get a good overview, you should also look a bit beyond the Transformer. E.g. RNNs/LSTMs are still a must-learn, even though Transformers might be better at many tasks. And then all those memory-augmented models, e.g. the Neural Turing Machine and its follow-ups, are important too.
It also helps to know the different architectures, such as pure language models (GPT) and attention-based encoder-decoders (e.g. the original Transformer), but then also CTC, hybrid HMM-NN, and transducers (RNN-T); a minimal CTC example is sketched below.
Diffusion models are another recent and rather different kind of model.
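Since CTC comes up much less often than the standard per-token cross-entropy setup, here is a minimal sketch of a CTC training step, assuming PyTorch and toy tensor shapes (not tied to any particular acoustic model):

```python
# Minimal CTC training step. CTC aligns an unsegmented label sequence to a
# longer frame sequence, which is a quite different setup from per-token
# cross entropy. Shapes below are toy values for illustration.
import torch
import torch.nn.functional as F

T, N, C = 50, 4, 28           # input frames, batch size, classes (incl. blank=0)
S = 12                        # max target length

log_probs = F.log_softmax(torch.randn(T, N, C), dim=-1)   # (T, N, C), as CTCLoss expects
targets = torch.randint(1, C, (N, S), dtype=torch.long)   # labels; 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(5, S + 1, (N,), dtype=torch.long)

ctc = torch.nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```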
But what really comes up short in this list are papers on the training aspect. Most of the papers you list do supervised training with a cross-entropy loss. However, there are many other approaches:
You have CLIP in here, specifically to combine text and image modalities.
There is the whole field of unsupervised or self-supervised training methods. Language model training (next-token prediction) is one example, but there are others; a minimal sketch of the next-token objective follows below.
And then there is the big field of reinforcement learning, which is probably also quite relevant for AGI.
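To make the self-supervised objective concrete, here is a minimal sketch of next-token prediction as shifted cross entropy, assuming PyTorch; the toy model stands in for a real Transformer decoder:

```python
# Next-token prediction: the "labels" are just the input shifted by one,
# so no manual annotation is needed (self-supervised). Toy model for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim, batch, seq = 1000, 64, 8, 32
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))  # stand-in for a Transformer decoder

tokens = torch.randint(0, vocab, (batch, seq))
logits = model(tokens[:, :-1])            # predict from all but the last token
targets = tokens[:, 1:]                   # ... the next token at each position
loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()                           # same cross entropy, but labels come for free
```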
(DID NOT READ THE ENTIRE PAPER, only the abstract, the definition of the language and some of the experiments)
Not sure how useful this is in the larger context of transformers. Transformers (and deep networks in general) are often used when the logic needed to solve a problem is largely unknown. Example -- how do you write a RASP program that identifies names in a document?
They do have some simple RASP examples in the paper of things a transformer model can accomplish (Symbolic Reasoning in Transformers), but, again, that is usually something the model can do as a result of the task it was originally trained for, not a task in and of itself.
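To make the contrast concrete: identifying names is normally done with a learned model rather than an explicit program. A minimal sketch, assuming the Hugging Face transformers package is installed (the default NER pipeline model is used purely for illustration):

```python
# Identifying names in a document is the kind of task where the "program"
# lives in learned weights, not in explicit rules one could write by hand.
from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")
text = "Ada Lovelace worked with Charles Babbage on the Analytical Engine."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```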