
Can anyone provide details about the training of the model? What data is it based on? Common Crawl? (Despite being a French company, they also seem to focus mostly on English-language tasks.) Where was it trained and with how many resources? They mention Leonardo. I was in an interesting meeting at the German Research Ministry last week where people were complaining that the EuroHPC resources are currently not sufficient to train decent LLMs. I guess in the end these folks also went to CoreWeave in the US.



Following the links, it might be using the Marian NMT framework under the hood? I'm wondering where specifically the model weights are coming from, and it would be interesting to read more about the training process[0].

I'm excited to see a more open translation system that seems to perform pretty well and runs locally.

It definitely needs support for more languages, and based on other models I've seen recently, I have to wonder whether building dedicated models for each language pair is still the best choice. I believe SeamlessM4T uses a single model (available in different sizes), and Whisper also uses a single model for multiple language pairs (although it was only trained to translate into English). Similarly, virtually all of the LLMs are multilingual. It seems like a single model can learn shared structure that applies across languages, reducing the amount of training data needed for each additional language (or increasing accuracy with the same amount of data), but I admit I could be wrong; this has just been my perception of how things are developing.
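
To make the single-model approach concrete, here is a minimal sketch using the Hugging Face transformers pipeline with an NLLB checkpoint; the specific checkpoint and language codes are just my assumptions for illustration, not anything from the Bergamot/Firefox Translations project:

    # One multilingual checkpoint covering many language pairs, in contrast to
    # Bergamot-style dedicated per-pair models.
    # Assumes: pip install transformers sentencepiece torch
    from transformers import pipeline

    # NLLB uses FLORES-200 language codes such as "eng_Latn" and "fra_Latn".
    translator = pipeline(
        "translation",
        model="facebook/nllb-200-distilled-600M",
        src_lang="eng_Latn",
        tgt_lang="fra_Latn",
    )

    result = translator("The weather is nice today.", max_length=64)
    print(result[0]["translation_text"])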

[0]: Some training info here, it looks like: https://github.com/browsermt/students/tree/master/train-stud...


They also provide a little detail about the models used:

> The base LLM models are based on either BART or DeBERTa (which are open source and hosted on Hugging Face), with heavy retraining based on our own data from search results
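
For anyone curious what "heavy retraining based on our own data" might look like mechanically, here is a rough Hugging Face sketch of fine-tuning a BART base model on custom pairs; the dataset fields and hyperparameters are invented for illustration and are not their actual pipeline:

    # Fine-tuning a pretrained BART checkpoint on custom (source, target) pairs.
    # Assumes: pip install transformers datasets torch
    from datasets import Dataset
    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              DataCollatorForSeq2Seq, Seq2SeqTrainer,
                              Seq2SeqTrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

    # Hypothetical pairs derived from search results.
    data = Dataset.from_dict({
        "source": ["what is exponential smoothing"],
        "target": ["Exponential smoothing is a forecasting technique that ..."],
    })

    def tokenize(batch):
        enc = tokenizer(batch["source"], truncation=True, max_length=512)
        enc["labels"] = tokenizer(text_target=batch["target"],
                                  truncation=True, max_length=128)["input_ids"]
        return enc

    train_ds = data.map(tokenize, batched=True, remove_columns=data.column_names)

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(output_dir="bart-retrained",
                                      per_device_train_batch_size=8,
                                      num_train_epochs=3),
        train_dataset=train_ds,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()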


I can't seem to find pre-trained models for other languages (e.g. French). How long did your training take for the English ones? Do you think it makes sense to start from the English one? Thank you!

Yes, the extension is being developed here [1], and the engine [2] and its WASM wrapper [3] here.

The models and training pipeline are in the URLs you posted, and the evaluation of those models is hosted here [4].

[1] https://github.com/mozilla/firefox-translations [2] https://github.com/browsermt/bergamot-translator [3] https://github.com/browsermt/bergamot-translator/tree/main/w... [4] https://github.com/mozilla/firefox-translations-models/tree/...


Just curious what kinds of models are used in your project?

One paper (https://arxiv.org/abs/2306.08891) seems to suggest that a trained specialized model can outperform an LLM on some tasks.


We (Stability AI) trained it on our TPUs with input from the OpenLM team as an OpenLLaMA collaboration.

The 20B model is 780B tokens in; lots of learnings, so we can optimise future runs.

Hopefully these will be useful bases for continued research; we will have some SFT/RLHF variants in due course from our Carper AI lab.


The LLM ouroboros starts with models being used to create training data for models.

Thanks for the feedback!

1. The largest model that we have tested is Llama2 13B. For the first phase, we focussed on fine-tuning LLMs in the 1B-13B range. For our next phase, we will focus on roughly the 13B-45B range -- for this we will have to incorporate distributed training techniques (see the sketch after this list).

2. Following the incorporation of distributed training techniques, we will be able to run MoE-based models, such as Mixtral.
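
For context on the distributed part mentioned in point 1, here is roughly what wrapping a larger model in PyTorch FSDP for multi-GPU fine-tuning looks like; this is one possible technique sketched under my own assumptions (model name and hyperparameters are placeholders), not our actual plan or configuration:

    # Minimal FSDP sketch for fine-tuning a model too large for a single GPU.
    # Launch with: torchrun --nproc_per_node=8 train_fsdp.py
    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from transformers import AutoModelForCausalLM, AutoTokenizer

    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    name = "meta-llama/Llama-2-13b-hf"  # placeholder checkpoint
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
    # A real run would add an auto_wrap_policy so each transformer block is
    # sharded separately; this wraps the whole model as one FSDP unit.
    model = FSDP(model.cuda(), use_orig_params=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # One illustrative step on a dummy batch; a real run iterates a DataLoader.
    tokenizer = AutoTokenizer.from_pretrained(name)
    batch = tokenizer("Hello world", return_tensors="pt").to(local_rank)
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()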


Basically side-by-side comparisons with GLM-130B (a bilingual Chinese/English model that is not instruction-tuned), ChatGPT (3.5), Google Translate, DeepL Translate, and the SOTA NLLB-1.3B distilled model, along with the human translations. The LLMs (GLM/ChatGPT) win by a lot.

There are many MoE architectures and I suppose we don’t know for sure which OpenAI is using. The “selection” of the right mix of models is something that a network learns and it’s not a complex process. Certainly no more complex than training an LLM.
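
To make that concrete, here is a toy top-2 gating layer in PyTorch; it's a generic sketch of the standard MoE routing idea, not a claim about whatever OpenAI actually uses:

    # Toy mixture-of-experts layer: a learned gate picks the top-2 experts per token.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyMoE(nn.Module):
        def __init__(self, d_model=64, n_experts=8, top_k=2):
            super().__init__()
            self.gate = nn.Linear(d_model, n_experts)  # the learned "selector"
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts)
            )
            self.top_k = top_k

        def forward(self, x):                              # x: (tokens, d_model)
            scores = self.gate(x)                          # (tokens, n_experts)
            weights, idx = scores.topk(self.top_k, dim=-1) # route each token
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e in range(len(self.experts)):
                    mask = idx[:, k] == e
                    if mask.any():
                        out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
            return out

    moe = TinyMoE()
    print(moe(torch.randn(5, 64)).shape)  # torch.Size([5, 64])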

I got their model running with the included training data and it's quite impressive. Some of my more extreme, poor-quality examples don't produce reasonable results, but many do.

Well spotted - this is where I first created the algorithm that became ULMFiT! I wanted to show an example of transfer learning outside of computer vision for the course but couldn't find anything compelling. I was pretty sure a language model would work really well in NLP, so I tried it out, and was totally shocked when the very first model beat the previous state of the art!

Sebastian (the author of this article) saw the lesson and was kind enough to run lots of experiments to test the approach more carefully, and did a great job of writing up the results in a paper, which was then accepted at ACL.


Agreed, and as I work for Seznam, I especially agree on the ability to train models properly to satisfy local users. Not to mention that the user can choose what the language of the results should be.

Nice! Do you mind elaborating on how you went about training your models, where you generate the outputs (web page? spreadsheet?), and the difficulties you encountered or the tedious/repetitive parts?

From how they set up the training, I think this is a nontrivial task. Also, from a casual read-through, it looks like it is generally focused on arXiv papers.

To their credit, the authors included the models used and the metrics they used to validate their model. They also have detailed notes on the training architecture which, at a quick glance, doesn't look easy to replicate unless you can borrow some GPUs in the cloud.


Today Databricks announced [0] a 6B-parameter model from EleutherAI fine-tuned on the Alpaca dataset. According to their CEO [1], training took 3 hours and cost $30. They didn't release any details on how it was trained, but it was likely done with LoRA.

[0] https://www.databricks.com/blog/2023/03/24/hello-dolly-democ... [1] https://twitter.com/alighodsi/status/1639251347777388544
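
For reference, a LoRA fine-tune of a 6B EleutherAI model with the peft library looks roughly like the sketch below; the checkpoint name and hyperparameters are my guesses, not Databricks' actual recipe:

    # Sketch of LoRA fine-tuning: freeze the base model and train small low-rank
    # adapter matrices, which is how a 6B model can be tuned in a few GPU-hours.
    # Assumes: pip install transformers peft torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")

    lora_cfg = LoraConfig(
        r=8,                                  # rank of the adapter matrices
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()  # typically well under 1% of the 6B weights
    # ...then train as usual on the Alpaca-style instruction data.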


Hmmm. Not sure why they use the M3 data when there is already M4, where a deep learning model won. I know because I reimplemented it as a toy version in Python here: https://github.com/leanderloew/ES-RNN-Pytorch

It was actually very cool because the model was a blend of exponential smoothing and deep learning.
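
The core trick can be sketched in a few lines: a learnable exponential-smoothing level normalizes each series, and an RNN forecasts the normalized values. A toy version in the spirit of that hybrid (heavily simplified, no seasonality, all details invented for illustration):

    # Toy ES-RNN-style hybrid: a learnable exponential-smoothing level normalizes
    # the series, and a small GRU forecasts the de-levelled values.
    import torch
    import torch.nn as nn

    class ToyESRNN(nn.Module):
        def __init__(self, hidden=32):
            super().__init__()
            self.alpha_logit = nn.Parameter(torch.tensor(0.0))  # smoothing coefficient
            self.rnn = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, y):                      # y: (batch, time)
            alpha = torch.sigmoid(self.alpha_logit)
            level = y[:, 0]
            levels = [level]
            for t in range(1, y.size(1)):          # classic exponential smoothing
                level = alpha * y[:, t] + (1 - alpha) * level
                levels.append(level)
            levels = torch.stack(levels, dim=1)
            normed = (y / levels).unsqueeze(-1)    # de-levelled inputs for the RNN
            out, _ = self.rnn(normed)
            ratio = self.head(out[:, -1])          # predicted next value / last level
            return ratio.squeeze(-1) * levels[:, -1]

    model = ToyESRNN()
    series = torch.rand(4, 24) + 1.0               # 4 positive series, 24 time steps
    print(model(series).shape)                     # torch.Size([4])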


It's interesting that the 6B model outperforms the 33B model. I wonder if that means the 33B model needs more training data? It was pretrained on ~1 million C programs, compared to DeepSeek-Coder, which was trained on 2 trillion tokens, a few orders of magnitude more data.

I'm also curious about how this compares to non-LLM solutions.


Hey mix, this looks really interesting. How did you train the model?

I've got a corpus of certificates from a government compliance program that I need to build into a similar model.

