
Can anyone provide details about the training of the model? What data is it based on? Common Crawl? (Despite being a French company, they also seem to focus mostly on English-language tasks.) Where was it trained and with how many resources? They mention Leonardo. I was in an interesting meeting at the German Research Ministry last week where people were complaining that the EuroHPC resources are currently not sufficient to train decent LLMs. I guess in the end these folks also went to CoreWeave in the US.



Following the links, it might be using the Marian NMT framework under the hood? I'm wondering where specifically the model weights are coming from, and it would be interesting to read more about the training process[0].

I'm excited to see a more open translation system that seems to perform pretty well and runs locally.

It definitely needs support for more languages, and based on other models I've seen recently, I have to wonder whether building dedicated models for each language pair is still the best choice. I believe SeamlessM4T uses a single model (available in different sizes), and Whisper also uses a single model for multiple language pairs (although it was only trained to translate into English). Similarly, virtually all of the LLMs are multilingual. It seems like a single model can learn shared structure that applies across languages, reducing the amount of training data needed for each additional language (or increasing accuracy with the same amount of data), but I admit I could be wrong; this has just been my perception of how things are developing.
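
To make the single-model approach concrete, here is a minimal sketch using the Hugging Face transformers pipeline with an NLLB checkpoint; the specific checkpoint and language codes are just my assumptions for illustration, not anything from the Bergamot/Firefox Translations project:

    # One multilingual checkpoint covering many language pairs, in contrast to
    # Bergamot-style dedicated per-pair models.
    # Assumes: pip install transformers sentencepiece torch
    from transformers import pipeline

    # NLLB uses FLORES-200 language codes such as "eng_Latn" and "fra_Latn".
    translator = pipeline(
        "translation",
        model="facebook/nllb-200-distilled-600M",
        src_lang="eng_Latn",
        tgt_lang="fra_Latn",
    )

    result = translator("The weather is nice today.", max_length=64)
    print(result[0]["translation_text"])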

[0]: Some training info here, it looks like: https://github.com/browsermt/students/tree/master/train-stud...


They also provide a little detail about the models used:

> The base LLM models are based on either BART or DeBERTa (which are open source and hosted on Hugging Face), with heavy retraining based on our own data from search results
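
For anyone curious what "heavy retraining based on our own data" might look like mechanically, here is a rough Hugging Face sketch of fine-tuning a BART base model on custom pairs; the dataset fields and hyperparameters are invented for illustration and are not their actual pipeline:

    # Fine-tuning a pretrained BART checkpoint on custom (source, target) pairs.
    # Assumes: pip install transformers datasets torch
    from datasets import Dataset
    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              DataCollatorForSeq2Seq, Seq2SeqTrainer,
                              Seq2SeqTrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

    # Hypothetical pairs derived from search results.
    data = Dataset.from_dict({
        "source": ["what is exponential smoothing"],
        "target": ["Exponential smoothing is a forecasting technique that ..."],
    })

    def tokenize(batch):
        enc = tokenizer(batch["source"], truncation=True, max_length=512)
        enc["labels"] = tokenizer(text_target=batch["target"],
                                  truncation=True, max_length=128)["input_ids"]
        return enc

    train_ds = data.map(tokenize, batched=True, remove_columns=data.column_names)

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(output_dir="bart-retrained",
                                      per_device_train_batch_size=8,
                                      num_train_epochs=3),
        train_dataset=train_ds,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()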


I can't seem to find pre-trained models for other languages (e.g. French). How long did your training take for the English ones? Do you think it makes sense to start from the English one? Thank you!

Yes, the extension is being developed here [1], and the engine [2] and its WASM wrapper [3] here.

The models and training pipeline are in the URLs you posted, and the evaluation of those models is hosted here [4].

[1] https://github.com/mozilla/firefox-translations [2] https://github.com/browsermt/bergamot-translator [3] https://github.com/browsermt/bergamot-translator/tree/main/w... [4] https://github.com/mozilla/firefox-translations-models/tree/...


Just curious what kinds of models are used in your project?

One paper (https://arxiv.org/abs/2306.08891) seems to suggest that a trained specialized model can outperform an LLM on some tasks.


We (Stability AI) trained it on our TPUs with input from the OpenLM team as an OpenLLaMA collaboration.

The 20B model is 780B tokens in; lots of learnings, so we can optimise future runs.

Hopefully these will be useful bases for continued research; we will have some SFT/RLHF variants in due course from our Carper AI lab.


The LLM ouroboros starts with models being used to create training data for models.

Thanks for the feedback!

1. The largest model that we have tested is Llama2 13B. For the first phase, we focussed on fine-tuning LLMs in the 1B-13B range. For our next phase, we will focus on roughly the 13B-45B range -- for this we will have to incorporate distributed training techniques (see the sketch after this list).

2. Following the incorporation of distributed training techniques, we will be able to run MoE-based models, such as Mixtral.
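
For context on the distributed part mentioned in point 1, here is roughly what wrapping a larger model in PyTorch FSDP for multi-GPU fine-tuning looks like; this is one possible technique sketched under my own assumptions (model name and hyperparameters are placeholders), not our actual plan or configuration:

    # Minimal FSDP sketch for fine-tuning a model too large for a single GPU.
    # Launch with: torchrun --nproc_per_node=8 train_fsdp.py
    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from transformers import AutoModelForCausalLM, AutoTokenizer

    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    name = "meta-llama/Llama-2-13b-hf"  # placeholder checkpoint
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
    # A real run would add an auto_wrap_policy so each transformer block is
    # sharded separately; this wraps the whole model as one FSDP unit.
    model = FSDP(model.cuda(), use_orig_params=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # One illustrative step on a dummy batch; a real run iterates a DataLoader.
    tokenizer = AutoTokenizer.from_pretrained(name)
    batch = tokenizer("Hello world", return_tensors="pt").to(local_rank)
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()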


Basically side-by-side comparisons with GLM-130B (a bilingual Chinese/English model that is not instruction-tuned), ChatGPT (3.5), Google Translate, DeepL Translate, and the SOTA NLLB-1.3B distilled model, along with the human translations. The LLMs (GLM/ChatGPT) win by a lot.

There are many MoE architectures and I suppose we don’t know for sure which OpenAI is using. The “selection” of the right mix of models is something that a network learns and it’s not a complex process. Certainly no more complex than training an LLM.
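
To make that concrete, here is a toy top-2 gating layer in PyTorch; it's a generic sketch of the standard MoE routing idea, not a claim about whatever OpenAI actually uses:

    # Toy mixture-of-experts layer: a learned gate picks the top-2 experts per token.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyMoE(nn.Module):
        def __init__(self, d_model=64, n_experts=8, top_k=2):
            super().__init__()
            self.gate = nn.Linear(d_model, n_experts)  # the learned "selector"
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts)
            )
            self.top_k = top_k

        def forward(self, x):                              # x: (tokens, d_model)
            scores = self.gate(x)                          # (tokens, n_experts)
            weights, idx = scores.topk(self.top_k, dim=-1) # route each token
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e in range(len(self.experts)):
                    mask = idx[:, k] == e
                    if mask.any():
                        out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
            return out

    moe = TinyMoE()
    print(moe(torch.randn(5, 64)).shape)  # torch.Size([5, 64])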

I got their model running with the included training data and it's quite impressive. Some of my more extreme, poor-quality examples don't produce reasonable results, but many do.

Well spotted - this is where I first created the algorithm that became ULMFiT! I wanted to show an example of transfer learning outside of computer vision for the course but couldn't find anything compelling. I was pretty sure a language model would work really well in NLP, so I tried it out, and was totally shocked when the very first model beat the previous state of the art!

Sebastian (the author of this article) saw the lesson and was kind enough to run lots of experiments to test the approach more carefully, and did a great job of writing up the results in a paper, which was then accepted at ACL.


Agreed, and as I work for Seznam, I especially agree on the ability to train models properly to satisfy local users. Not to mention that the user can choose what the language of the results should be.

Nice! Do you mind elaborating on how you went about training your models, where you generate the outputs (web page? spreadsheet?), and the difficulties you encountered or the tedious/repetitive parts?

From how they set up the training, I think this is a nontrivial task. Also, from a casual read-through, it looks like it is generally focused on arXiv papers.

To their credit, the authors included the models used and the metrics they used to validate their model. They also have detailed notes on the training architecture which, at a quick glance, doesn't look easy to replicate unless you can borrow some GPUs in the cloud.


Today Databricks announced [0] a 6B-parameter model from EleutherAI fine-tuned on the Alpaca dataset. According to their CEO [1], training took 3 hours and cost $30. They didn't release any details on how it was trained, but it was likely done with LoRA.

[0] https://www.databricks.com/blog/2023/03/24/hello-dolly-democ... [1] https://twitter.com/alighodsi/status/1639251347777388544
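
For reference, a LoRA fine-tune of a 6B EleutherAI model with the peft library looks roughly like the sketch below; the checkpoint name and hyperparameters are my guesses, not Databricks' actual recipe:

    # Sketch of LoRA fine-tuning: freeze the base model and train small low-rank
    # adapter matrices, which is how a 6B model can be tuned in a few GPU-hours.
    # Assumes: pip install transformers peft torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")

    lora_cfg = LoraConfig(
        r=8,                                  # rank of the adapter matrices
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()  # typically well under 1% of the 6B weights
    # ...then train as usual on the Alpaca-style instruction data.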


Hmmm. Not sure why they use the M3 data when there is already M4, where a deep learning model won. I know because I reimplemented it as a toy version in Python here: https://github.com/leanderloew/ES-RNN-Pytorch

It was actually very cool because the model was a blend of exponential smoothing and deep learning.
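
The core trick can be sketched in a few lines: a learnable exponential-smoothing level normalizes each series, and an RNN forecasts the normalized values. A toy version in the spirit of that hybrid (heavily simplified, no seasonality, all details invented for illustration):

    # Toy ES-RNN-style hybrid: a learnable exponential-smoothing level normalizes
    # the series, and a small GRU forecasts the de-levelled values.
    import torch
    import torch.nn as nn

    class ToyESRNN(nn.Module):
        def __init__(self, hidden=32):
            super().__init__()
            self.alpha_logit = nn.Parameter(torch.tensor(0.0))  # smoothing coefficient
            self.rnn = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, y):                      # y: (batch, time)
            alpha = torch.sigmoid(self.alpha_logit)
            level = y[:, 0]
            levels = [level]
            for t in range(1, y.size(1)):          # classic exponential smoothing
                level = alpha * y[:, t] + (1 - alpha) * level
                levels.append(level)
            levels = torch.stack(levels, dim=1)
            normed = (y / levels).unsqueeze(-1)    # de-levelled inputs for the RNN
            out, _ = self.rnn(normed)
            ratio = self.head(out[:, -1])          # predicted next value / last level
            return ratio.squeeze(-1) * levels[:, -1]

    model = ToyESRNN()
    series = torch.rand(4, 24) + 1.0               # 4 positive series, 24 time steps
    print(model(series).shape)                     # torch.Size([4])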


It's interesting that the 6B model outperforms the 33B model. I wonder if that means the 33B model needs more training data? It was pretrained on ~1 million C programs, compared to DeepSeek-Coder, which was trained on 2 trillion tokens, a few orders of magnitude more data.

I'm also curious about how this compares to non-LLM solutions.


Hey mix, this looks really interesting. How did you train the model?

I've got a corpus of certificates from a government compliance program that I need to build into a similar model.

