Dunno, I couldn’t help but exfiltrate the model from Azure first thing in the morning (ugh) and I’m staring at “data/model/modeling_mixformer_sequential.py”
It looks like standard OOP multi-head attention, except it supports both self- and cross-attention rather than self-attention only (also standard; avoiding cross-attention is just a personal preference, since the two are equivalent if you know what you're doing).
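To make the "equivalent" remark concrete, here's a minimal single-head numpy sketch (my own illustration, not the actual `modeling_mixformer_sequential.py` code): self-attention is just cross-attention where the query source and the key/value source are the same tensor.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_src, kv_src, wq, wk, wv):
    """Single-head scaled dot-product attention.

    q_src:  (T_q, d)  sequence the queries are projected from
    kv_src: (T_kv, d) sequence the keys/values are projected from
    """
    q, k, v = q_src @ wq, kv_src @ wk, kv_src @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def self_attention(x, wq, wk, wv):
    # "Self" attention is cross-attention with both sources equal.
    return cross_attention(x, x, wq, wk, wv)

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((5, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
assert np.allclose(self_attention(x, wq, wk, wv),
                   cross_attention(x, x, wq, wk, wv))
```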
Aside from the MixFormer bits from arXiv:2203.11082, the ParallelBlock looks like just MHA, an MLP, and a residual, and the rotary embedding from arXiv:2104.09864 makes up a big chunk of the file. Otherwise it's just the einsum you saw already.
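For anyone who hasn't read those two papers, here's a rough numpy sketch of the two pieces, heavily simplified by me (real implementations differ in the dimension-pairing convention, caching, and head splitting; this is not the file's actual code):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary position embedding (arXiv:2104.09864) on x of shape (seq_len, dim).

    Each pair of dimensions (here paired as i and i + dim//2) is rotated
    by an angle proportional to the token's position.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)           # one frequency per pair
    angles = np.arange(seq_len)[:, None] * freqs[None]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation of each (x1, x2) pair.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def parallel_block(x, attn, mlp):
    """Parallel layout: attention and MLP both read the same input and
    are summed into the residual, instead of the usual sequential
    x -> x + attn(x) -> (...) + mlp(...)."""
    return x + attn(x) + mlp(x)
```

Since RoPE is a pure rotation, it preserves each token's vector norm and leaves position 0 untouched, which is an easy sanity check to run against any implementation.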
The main reasons this model is more capable are: A) a high-quality text dataset that many brilliant humans worked very hard to produce and won't get paid for, and B) Microsoft broke the [almost surely felonious](https://www.justice.gov/atr/page/file/1091651/download) terms of use of Microsoft OpenAI by using "synthetic" data, a.k.a. "Microsoft OpenAI output no one else is allowed to use but Microsoft" (obligatory cartel-evidence link: https://i.postimg.cc/MGqPvPz5/cartel-microsoft-microsoft-ope...), to develop their model: they used GPT-3.5-Turbo to generate many responses and used GPT-4 to outsource the hard part of AI (quality evaluation) 100,000 times. Also, they have a lot more GPUs than we do.
If Microsoft OpenAI hadn't already been bought and paid for by Microsoft, breaking the [almost surely felonious](https://www.justice.gov/atr/page/file/1091651/download) Microsoft OpenAI terms of use by using its output to develop a Microsoft model would deserve a[n almost surely unwinnable](https://en.wikipedia.org/wiki/Illegal_agreement?wprov=sfti1) lawsuit from OpenAI. But I doubt Microsoft OpenAI will sue Microsoft; that would bite the hand that feeds it.
It's a risky maneuver that leaks GPT-4's most precious and unique capability (being heavily, manually fine-tuned on human evaluation of text), and it just goes to show how much you can get away with if you have enough money and power…
There sure is a lot of entitlement in those comments. The whole point of GGML is that it's a framework you can use to build a model. If people want one, they can use the framework to make one, but they would rather just complain.
On a related note, what is the architecture of phi? I wonder if there are any big impediments to implementing it in ggml. I find it telling that the readme gives us a big lecture on societal implications and other CYA stuff, but no up-front summary of what the model actually is beyond "transformer based". Personally, I care much more about that than some ridiculous stuff about bias.
https://github.com/ggerganov/llama.cpp/issues/3146