
I see releases like this so often these days.

I am early in my journey but I’m stumbling on the basic structure of these models.

Is this structurally a vanilla transformer (or encoder/decoder) with tweaks to the tokenizer, the loss function, the hyperparameters, and the method of training?

Is whatever this is representative of most of the publicized releases? For instance the recent Orca 2 paper didn’t seem to have any “structural” changes. Is there a better term for these distinctions?

I don’t mean to downplay the importance of those changes, I am merely trying to understand in a very broad sense what changes have what impacts.




They cite the paper where the architecture was introduced. If you go to that paper, you'll see that it mostly consists of a very detailed and careful comparison with Transformer-XL.

In the new paper, they plug their memory system into vanilla BERT. This makes the resulting model essentially nothing like Transformer-XL, which was a strictly decoder-only generative language model.


Yeah, I think the (worrying) confusion is that Amazon calls it a seq2seq model, which was the name of a SOTA RNN from Google a while back.

Ofc now, seq2seq just means what you said (an encoder/decoder model, which is actually what a “truly vanilla” transformer would be anyway).
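For anyone following along, that "truly vanilla" encoder/decoder shape is more or less what PyTorch ships as nn.Transformer. A rough sketch (sizes are purely illustrative, and this omits the embeddings and output head):

    import torch
    import torch.nn as nn

    # nn.Transformer is essentially the original "Attention Is All You Need"
    # encoder/decoder stack (embeddings and the output projection are up to you).
    model = nn.Transformer(d_model=512, nhead=8,
                           num_encoder_layers=6, num_decoder_layers=6)

    src = torch.rand(10, 32, 512)  # (source length, batch, d_model)
    tgt = torch.rand(20, 32, 512)  # (target length, batch, d_model)
    out = model(src, tgt)          # -> (20, 32, 512)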

The fact that any serious researcher thinks any other serious researchers are using models without self attention is the real red flag here.

No one is trying to use other models anymore because they do not scale. There’s enough variety within transformers that you could argue we need a new level of taxonomy, but transformers are basically it for now.


I’m not deeply familiar with all these papers, but two things stand out to me:

The model architectures are different, and in the very latest paper they scale these not-transformer models to a sequence length of 64k, whereas the paper you linked only considers up to 8k.


Worryingly, I am not sure the people working on this really understand what a Transformer is.

Quote from them:

“There is still active research in non-transformer based language models though, such as Amazon’s AlexaTM 20B which outperforms GPT-3”

Quote from said paper

“For AlexaTM 20B, we used the standard Transformer model architecture”

(It's just an encoder-decoder transformer.)



My understanding is they are all still transformers. The tweaks are more about quantization, generalizing over data more efficiently (so fewer parameters are required), and improving the training data/process itself.

Otherwise I'd like to know specifically what's better/improved between the models themselves.


Did you mean to paste a different image? That diagram shows a much older design than the transformer introduced in the 2017 paper. It doesn’t include token embeddings, doesn’t show multi-head attention, the part that looks like the attention mechanism doesn’t do self-attention and is missing the query/key/value weight matrices, and the part that looks like the fully-connected layer is neither computed in the way depicted nor hooked into self-attention in that way. Position embeddings and the way the blocks are stacked are also absent.
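For contrast, here is a toy sketch of the pieces a faithful diagram of the 2017 architecture would have to show: token embeddings, position embeddings, multi-head self-attention, a feed-forward layer, and stacked blocks. Names and sizes are made up for illustration, and dropout/masking are omitted:

    import torch
    import torch.nn as nn

    class MiniBlock(nn.Module):
        """One pre-norm block: multi-head self-attention + feed-forward."""
        def __init__(self, d_model=256, n_heads=4):
            super().__init__()
            self.norm1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(d_model)
            self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                    nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            h = self.norm1(x)
            attn_out, _ = self.attn(h, h, h)   # self-attention: Q, K, V all come from x
            x = x + attn_out
            return x + self.ff(self.norm2(x))

    class MiniTransformer(nn.Module):
        def __init__(self, vocab=1000, max_len=128, d_model=256, n_layers=2):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab, d_model)     # token embeddings
            self.pos_emb = nn.Embedding(max_len, d_model)   # position embeddings
            self.blocks = nn.ModuleList(MiniBlock(d_model) for _ in range(n_layers))

        def forward(self, ids):                             # ids: (batch, seq)
            pos = torch.arange(ids.size(1), device=ids.device)
            x = self.tok_emb(ids) + self.pos_emb(pos)
            for block in self.blocks:                       # stacked blocks
                x = block(x)
            return x

    ids = torch.randint(0, 1000, (1, 16))
    print(MiniTransformer()(ids).shape)    # -> torch.Size([1, 16, 256])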

The FT article at least describes most of those aspects more accurately, although I was disappointed to see they got the attention mask wrong.


Fascinating work, very promising.

Can you summarise how the model in your paper differs from this one ?

https://github.com/huggingface/transformers/issues/27011


Under their language modeling section, they include the Transformer and have listed some papers from 2019, so it seems to be reasonably up to date.

> Later decoder-only Transformer was shown to achieve great performance in language modeling tasks, like in GPT and BERT.

Actually, BERT is an encoder-only architecture, not decoder-only. Aside from trying to solve the same problem, GPT and BERT are quite different. This kind of confusion about now "classic" transformer models makes me kind of dubious that the more recent and exotic ones are described very accurately...

(Clicking on the link with more details on BERT doesn't actually dispel much of the confusion; it stresses the fact that, unlike GPT, it's bidirectional, and indeed "bidirectional" is the "B" in BERT, but that's quite a disingenuous choice of terms itself: it's not "bidirectional" as in Bi-LSTMs, which go left-to-right and right-to-left separately; it attends to the whole sequence at once, and that was the real innovation of BERT.)
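In practice the difference is mostly the attention mask, not the block structure. A toy sketch (just the masks, no model):

    import torch

    seq_len = 5

    # GPT-style (causal): position i may only attend to positions <= i
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    # BERT-style ("bidirectional"): every position attends to the whole
    # sequence at once, in a single pass -- not two separate directions
    bidirectional = torch.ones(seq_len, seq_len, dtype=torch.bool)

    print(causal.int())
    print(bidirectional.int())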

Scrolling down, the Transformer-XL section starts talking about segments; from the context I _think_ it means that the input text is split into segments that are dealt with separately to cut down on the O(N^2) cost of the transformer, but I would have expected this kind of information to be spelled out in a survey article.

IMHO, review articles are really great and useful, because they let you cut through the BS that every paper has to add to get published, unify notations, and summarize the main points clearly. This article does a commendable job on the second point and, partly, on the first, but sadly lacks the third. Given the enormous task it certainly was to compile this list, it would probably have profited from treating fewer models but putting things a bit more into perspective...


> There's transparent evidence that transformer-based LLMs existed before ChatGPT or Llama.

Transformers were first described in "Attention Is All You Need" (2017) by Google: https://arxiv.org/abs/1706.03762

> There was originally GPT-2, which was not a public release but did get reverse engineered into GPT-Neo and GPT-J (among others).

GPT-2 had a proper release in 2019 with the code, structure, checkpoints, etc. - https://openai.com/research/gpt-2-1-5b-release

> Besides that, there was BERT and T5 from Google, Longformers, XLNet, and dozens of other LLMs that predate the modern ones.

Yes, all the big players had various research models. We simply crossed the line into large-scale usefulness just recently (outside of classification/summarization), and like you said, OpenAI was the first to really make it accessible to the public.


The original transformer is an encoder-decoder model, and the decoder half is what led to the first GPT model. Except that in the original proposal you feed the encoder states into the decoder's cross-attention module, it is basically the same as a decoder-only model. I would argue the decoder-only model is even simpler in that regard.

When it comes to the core attention mechanism, it is surprisingly stable compared to other techniques in neural networks. There is the QKV projection, then dot-product attention, then two layers of FFN. Arguably the most influential change regarding attention itself is multi-query/grouped attention, but that is still, imo, a reasonably small change.
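To make that concrete, here is a single-head sketch of exactly those three steps, with the multi-head split, residuals, and normalization stripped out for brevity:

    import torch

    def attention_block(x, Wq, Wk, Wv, W1, W2):
        # QKV projection
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        # scaled dot-product attention
        scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1) @ v
        # two-layer feed-forward
        return torch.relu(attn @ W1) @ W2

    d, d_ff, seq = 64, 256, 10
    x = torch.randn(seq, d)
    Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
    W1, W2 = torch.randn(d, d_ff), torch.randn(d_ff, d)
    out = attention_block(x, Wq, Wk, Wv, W1, W2)   # -> (10, 64)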

If you look back at convolutional NNs, their shapes and operators changed every six months back in the day.

At the same time, the original transformer is still a useful architecture today, even in production; some BERT models must still be hanging around.

Not that I am saying it didn’t change at all, but the core stays very much stable across countless revisions. If you read the original transformer paper, you already understand 80% of what the Llama model does; the same can’t be said for other architectures, which is what I meant.


The reason it's not discussed much is that what goes on downstream of tokenization is extremely opaque. It's lots of layers of the transformer network, so the overall structure is documented, but what exactly those numbers mean is hard to figure out.

There's an article here where the structure of an image generation network is explored a bit:

https://openai.com/research/sparse-transformer

They have a visualization of what the different layers are paying attention to.
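If you want to poke at the attention weights yourself, the Hugging Face transformers library will hand them to you; extracting them is easy, interpreting them is the hard part. A rough sketch:

    from transformers import AutoModel, AutoTokenizer

    # gpt2 is just a convenient small example; any standard model works.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2")

    inputs = tokenizer("What goes on downstream of tokenization", return_tensors="pt")
    outputs = model(**inputs, output_attentions=True)

    # One tensor per layer, each (batch, heads, seq_len, seq_len);
    # these are the weights people plot to see what each head attends to.
    print(len(outputs.attentions), outputs.attentions[0].shape)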

There are also some good explanations of transformers elsewhere online. This one is old but I found it helpful:

http://jalammar.github.io/illustrated-transformer/


Yeah, I thought the same -- it struck me at first blush as if it were some kind of super simple architecture that didn't use transformers, and then in the diagram I saw they used BERT to produce the embeddings!

Hi everyone! Creator of Transformers.js here :) ...

Thanks so much to everyone for sharing! It's awesome to see the positive feedback from the community. As you'll see from the demo, everything runs inside the browser!

As of 2023/03/16, the library supports BERT, ALBERT, DistilBERT, T5, T5v1.1, FLAN-T5, GPT2, BART, CodeGen, Whisper, CLIP, Vision Transformer, and VisionEncoderDecoder models, for a variety of tasks including: masked language modelling, text classification, text-to-text generation, translation, summarization, question answering, text generation, automatic speech recognition, image classification, zero-shot image classification, and image-to-text. Of course, we plan to add many more models and tasks in the near future!

Try out some of the other models/tasks from the "Task" dropdown (like the code-completion or speech-to-text demos).

---

To respond to some comments about poor translation/generation quality, many of the models are actually quite old (e.g., T5 is from 2020)... and if you run the same prompt through the PyTorch version of the model, you will get similar outputs. The purpose of the library/project is to bring these models to the browser; we didn't train the models, so, poor quality can (mostly) be blamed on the original model.

Also, be sure to play around with the generation parameters... as with many LLMs, generation parameters matter a lot.
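For the curious, the same knobs exist on the PyTorch side; here is a quick sketch with the Python transformers library (the Transformers.js options are broadly analogous, but this is just an illustration, not the library's own docs):

    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    # The same prompt can come out very differently depending on these knobs.
    out = generator(
        "Once upon a time",
        max_new_tokens=40,
        do_sample=True,        # sample instead of greedy decoding
        temperature=0.7,       # lower = more deterministic
        top_p=0.9,             # nucleus sampling cutoff
        num_return_sequences=2,
    )
    for o in out:
        print(o["generated_text"])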

---

If you want to keep up-to-date with the development, check us out on twitter: https://twitter.com/xenovacom :)


Unfortunately, they didn't give many details in the paper. It's frustrating, to say the least. Yay reproducibility. They say

> We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.

In the referenced paper (the Sparse Transformer) they showed a bunch of different sparsity patterns, and I believe they're referring to either their banded block-diagonal sparsity or a true banded diagonal pattern (local windows, like BigBird and some other papers). Unfortunately, that paper was also light on details, and the repo they open-sourced alongside it is inscrutable.
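For what it's worth, a "locally banded" pattern is usually something like the mask below, where each position only attends to a fixed window of previous positions. Whether this is exactly what GPT-3 alternates with dense layers is precisely the detail the paper leaves out, so treat it as a guess:

    import torch

    def local_band_mask(seq_len, window):
        """Causal banded mask: position i attends to j where i-window < j <= i."""
        i = torch.arange(seq_len).unsqueeze(1)
        j = torch.arange(seq_len).unsqueeze(0)
        return (j <= i) & (i - j < window)

    print(local_band_mask(8, 3).int())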


For those who are interested in how the Transformer is taking over, please read this thread by Karpathy where he talks about a consolidation in ML:

https://twitter.com/karpathy/status/1468370605229547522

And of course one of the early classic papers in the field, as a bonus:

https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd05...

(The paper is mentioned in the article)


> Do you rearchitect the code upon new research breakthroughs?

Great question. So we now have implementations of the Transformer in both the PyTorch and TensorFlow versions of the library. It did require some rearchitecting, particularly for inference, but at heart the model is basically a sequence-to-sequence model. I wrote a blog post describing the process of implementing it here: http://nlp.seas.harvard.edu/2018/04/03/attention.html .

> Why use this over the Transformer model in tensor2tensor?

OpenNMT is a bit more accessible than tensor2tensor, which is a very powerful library but requires buying into a heavyweight framework. For instance, OpenNMT uses plain text files, whereas tensor2tensor manages the entire data pipeline. I also personally find TensorFlow difficult for developing new research code.


> Phi-1-5 is a Transformer with 1.3 billion parameters. It was trained using the same data sources as Phi-1, augmented with a new data source that consists of various NLP synthetic texts. When assessed against benchmarks testing common sense, language understanding, and logical reasoning, Phi-1.5 demonstrates a nearly state-of-the-art performance among models with fewer than 10 billion parameters. Phi-1.5 can write poems, draft emails, create stories, summarize texts, write Python code (such as downloading a Hugging Face transformer model), etc.

> Phi-2 is a Transformer with 2.7 billion parameters that shows dramatic improvement in reasoning capabilities and safety measures compared to Phi-1-5, however it remains relatively small compared to other transformers in the industry. With the right fine-tuning and customization, these SLMs are incredibly powerful tools for applications both on the cloud and on the edge.

