Read: We're forcing someone's hand in acquiring us.
Groq is still under a 30-requests-per-minute rate limit, which drops to 10 requests per minute for sustained all-day usage.
Billing has been "coming soon" this whole time, and while they've built out hype-enabling features like function calling, somehow they can't set up a Stripe webhook to collect money for realistic rate limits.
They couldn't scream "we can't service the tiniest bit of our demand" any louder at this point.
_
Edit: For anyone looking for fast inference without the smoke and mirrors, I've been using Fireworks.ai in production and it's great. At 200-300 tok/s it's closer to Groq than it is to OpenAI and co.
And as a bonus they support PEFT with serverless pricing.
I run an AI storytelling site and an AI ideation platform.
The storytelling site alone averaged 27k requests a day this week, about double what their current request limit allows, and it's honestly not even that popular a site.
You can't run much more than a toy project on their current rate limits.
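To put numbers on why those limits are toy-project-only, here's a quick back-of-envelope using the figures above:

    # Back-of-envelope: sustained rate limit vs. actual traffic (numbers from above)
    sustained_rpm = 10                       # all-day rate limit, requests per minute
    daily_cap = sustained_rpm * 60 * 24      # = 14,400 requests/day
    actual_daily = 27_000                    # the storytelling site's average this week
    print(daily_cap, round(actual_daily / daily_cap, 2))  # 14400, ~1.88x over the cap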
Note that SRAM density doesn't scale at the same rate as logic density, and Groq's "secret sauce" is putting a ton of SRAM on their chips. Their stuff won't necessarily see the full benefits of switching to denser nodes if the bottleneck is how much SRAM they can pack onto each chip.
They're giving the lie to the claim that you need bleeding-edge hardware for performance.
Five-year-old silicon (14 nm!!) and no HBM.
Their secret sauce seems to be an ahead-of-time compiler that statically lays out the entire computation, enabling zero contention at runtime. Basically, they stamp out all non-determinism.
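To illustrate what "statically lays out the entire computation" could mean in practice, here's a minimal, purely illustrative sketch (not Groq's actual toolchain or ISA): the compiler fixes every (cycle, unit, op) up front, and the hardware just replays the plan, so there's nothing left to contend over at runtime.

    # Illustrative only: a compile-time schedule that removes runtime non-determinism.
    schedule = [
        (0, "mem0", "load   A      -> sram0"),
        (0, "mem1", "load   B      -> sram1"),
        (1, "mxu0", "matmul sram0, sram1 -> sram2"),
        (2, "vec0", "relu   sram2  -> sram2"),
        (3, "mem0", "store  sram2  -> out"),
    ]

    def run(plan):
        # Deterministic replay: no queues, no arbitration, no cache misses to hide.
        for cycle, unit, op in sorted(plan):
            print(f"cycle {cycle} | {unit} | {op}")

    run(schedule)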
Cost-effective in what sense? Groq doesn't achieve high efficiency, only low latency, and that's not done in a cost-effective way. Compare SambaNova achieving the same performance with 8 chips instead of 568, and at higher precision.
Most important, even ignoring latency, is throughput (tokens) per $$$. And according to their own benchmark [1] (famous last words :)), they're quite cost efficient.
From skimming the link above, it seems like they accepted it's extremely difficult (maybe impossible) to generate high ILP from a VLIW compiler on complex hardware (what Itanium tried to do).
So they attacked the "complex hardware" part and simplified the hardware, mostly by eliminating memory-layer non-determinism / using time-sync'd global memory instructions as part of the ISA(?).
This apparently reduced the difficulty of the compiler problem to something manageable (but no doubt still "fun")... and voila, performance.
The problem set Groq is restricted to (known-size tensor manipulation) lends itself to much easier solutions than full-blown ILP. The general problem of compiling arbitrary Turing-complete algorithms to the arch is NP-hard. Tensor manipulation... that's a different story.
It's not really a lie though. They require 20x more chips (storing all the weights in SRAM instead of HBM is expensive!) than Nvidia GPUs for a ~2x speed increase. Overall the power cost is higher for Groq than for GPUs.
I wonder if the use of eDRAM (https://en.wikipedia.org/wiki/EDRAM), which essentially embeds DRAM into a chip made on a logic process, would be a good idea here.
eDRAM is essentially a tradeoff between SRAM and DRAM, offering much greater density at the cost of somewhat worse throughput and latency.
There were a couple of POWER CPUs that used eDRAM as L3 cache, but it seems to have fallen out of favor.
So unless there are new Groq datacenters coming, this is only interesting for North American users. Otherwise H100-based latency-optimized solutions would be faster - in particular for time-to-first-token-sensitive applications.
I said "and extremely low latency" because I know they are different. Groq's TTFT is still consistently competitive with any other provider, and lower than most of them. Here's some benchmarks: https://github.com/ray-project/llmperf-leaderboard#70b-model...
Man, I want to appreciate a nice new hardware approach, but they say such BS that it is hard to read about them:
> “There might need to be a new term, because by the end of next year we’re going to deploy enough LPUs that compute-wise, it’s going to be the equivalent of all the hyperscalers combined,” he said. “We already have a non-trivial portion of that.”
Really? Does anyone seriously believe they are going to be the equivalent of all hyperscalers in compute next year? (Where Meta alone is at 1 million H100 equivalents.) In the same article where they say it's too hard for them to sell chips? And when they literally don't have a setup to even accept a credit card today?
Interesting, I guess that is why I never got a response back from them about buying their stuff.
My guess is that they realized that just selling hardware is a lot harder than running it themselves. Deploying this level of compute is non-trivial, with very high rates of failure, as well as huge supply chain issues. If you have to sell the hardware and support people buying it, that is a world of trouble.
> no-one wants to take the risk of buying a whole bunch of hardware
I do!
Nobody has stated it yet, but this is probably great news for Tenstorrent.
Disclosure: building a cloud compute provider starting with AMD MI300x, and eventually any other high end hardware that our customers are asking for.
That's interesting; it's something I've been really wanting to get into as well, but where I am there is literally no venture capital to raise for this at the moment. I'd be interested to know more and/or bounce some ideas around though.
Extremely capital intensive, but also requires relationships in the industry at many levels. Luckily, I happen to have both and they are crazy enough to put their trust in me. I feel very grateful for that.
I've got about a dozen people signed up. Just working through some hardware issues right now (see above about high rate of failures), and hope to have this resolved next week, so that I can get people onto them and doing their testing.
We got them, then gave them to a customer to use for a week, got them back and now we are having some hardware issues that we are in the process of sorting out. I literally haven't had more than a few hours time on them yet.
They _should_ work fine for both training and inference, but since nobody has done much in the way of public in-depth benchmarks yet... I was hoping to get people to do it for us in order to stay as unbiased as possible.
noticing now: strange that my previous comment was downvoted. Would be nice to understand what someone didn't like about what I said!
The thing that nobody talks about is that there is a high rate of failures on this high end equipment. I've heard as high as 20%, in the first month. I'm not even talking about AMD here.
If anyone thinks they can just buy some accelerators and throw them into a rack and expect them to work flawlessly... they've got some hard lessons to learn.
This will be less of an issue as we grow, since we will have plenty of stock to pull from, but it is a real bummer as we are starting as a proof of concept first. We started working on this business last August, before anyone knew whether or not AMD would even change course on AI.
The good news is that we onboarded a customer the day that we announced our availability, we passed that PoC challenge with flying colors and closed significant additional funding immediately after that. Onwards and upwards, just have to roll with the punches.
Ooooh yeah, from experience, even the usual (thorough) suspects who over-over-over-spec cooling and power (as anyone here who has heard an HPE Apollo 6500 at full fan speed can attest) have a hard time getting all the interconnects, PCIe, and firmware stuff up to snuff and running 24/7. Once it's set up it's amazing, but the bringup can be rocky.
And I'm not even talking 100/400G networking, wonderful wonderful hardware, good luck debugging and getting all the RoCE/RDMA/GPUDirect/StorageDirect/NCCL pieces working (already a bit of a pain on Nvidia, with a large installed base...).
Either you want to learn all this stuff (for reasons) or you're dumping a lot of money on fast-evolving tech.
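When I've had to debug that stack, the first thing I reach for is a bare-bones collective test before touching any real workload. A hedged sketch (assumes PyTorch with the NCCL backend and a launcher like torchrun; the filename is made up):

    # nccl_smoke_test.py (hypothetical filename) - run with e.g.
    #   torchrun --nnodes=2 --nproc_per_node=8 nccl_smoke_test.py
    # If this hangs or errors, the RDMA/NCCL fabric is the problem, not your model code.
    # Setting NCCL_DEBUG=INFO in the environment gives a lot more detail.
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # sums the tensor across every rank
    print(f"rank {rank}: all_reduce -> {x.item()} (expect world_size)")

    dist.destroy_process_group()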
A lot. It's the same reason Amazon doesn't sell servers and instead gives you access to a single instance that everyone pretends is the same but in reality is massively transient.
We give full bare-metal access. If you just want one GPU, we give you a VM with PCIe passthrough. If you take a whole box, we can give you BMC access and access to the PDU itself, to hard-reboot things remotely. It is as if you own the whole machine yourself.
Good question. Not much. If anything, what I'm doing is even harder because I will have multiple sources for the hardware. I have to deal with all of the hardware and data center issues, as well as the customers who rely on us to provide them access.
> What is the difference between this and having to sell the cloud access and supporting the people who buy a subscription
Margins.
Pricing for cloud compute is much higher and servicing and management for the provider is much cheaper.
If I sold hardware directly, then I'm often on the hook for support contracts, which can get pricey with hardware and distract from shipping future-facing product features, as customers who purchase directly have longer upgrade windows due to logistical overhead.
The pricing isn't necessarily much higher. The pricing is set up to amortize the cost of running things over a period of time. If you're only going to use it for a short time, you pay more than someone who commits to a 3 year contract.
It also isn't just the hardware capex, it is everything involved under the covers. Market pricing also factors into that as well. This is something I've struggled with myself quite a bit. I know all of my costs and what I'd like to charge, but because my offering is so brand new, until my competitors announced their pricing, I wasn't sure what the market would tolerate.
Selling hardware directly is hard for exactly what you state though. Service contracts are a pain in the butt. All of this latest AI hardware has high rates of failure too. Until recently, with AMD coming to market with a great offering, the only thing people wanted in any sort of quantity was Nvidia products. Groq probably realized that people buying 1-2 cards at a time wasn't going to be profitable.
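To illustrate the amortization point a few comments up with purely made-up numbers (none of these are real prices or costs):

    # Hypothetical commit-vs-on-demand math; every number here is invented.
    capex = 250_000            # server + networking share, $
    opex_per_year = 40_000     # power, space, support share, $/year
    years = 3
    utilization = 0.7          # fraction of hours actually rented out

    billable_hours = years * 365 * 24 * utilization
    breakeven = (capex + opex_per_year * years) / billable_hours
    print(f"break-even is roughly ${breakeven:.0f}/hr on a {years}-year commit")
    # Short-term/on-demand has to price in idle time and demand risk on top of this.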
> Groq probably realized that people buying 1-2 cards at a time wasn't going to be profitable
Yep! And it would have bogged them down by slowing R&D cycles and even order fulfillment, as they obviously are not placing orders the same size as Nvidia or AMD.
I wonder what this holds for SambaNova and other similar companies as well.
> What is the difference between this and having to sell the cloud access and supporting the people who buy a subscription
Knowledge/training.
If you're shipping a brand new hardware arch, exposed as raw hardware, then you're on the hook for training everyone in the world and fixing all their weird edge case uses.
I.e. are you willing to invest in Intel/AMD/Nvidia-scale QA and support?
If you're exposing a PaaS (or even IaaS), then you have some levers you can tweak / mask behind the scenes, so only your team need be experts at low-level operations.
For a fast-paced company, the latter model makes a lot more sense, at least until hardware+software stabilizes.
> I.e. are you willing to invest in Intel/AMD/Nvidia-scale QA and support?
At least in our experience, the first line of support is from the chassis vendors. You don't go to a store and buy MI300x. You buy them from someone like SMCI/Dell, who provides the support. Of course, behind the scenes, they might be talking to AMD. Even those chassis companies often have other providers of their gear (like Exxact) as another line of defense as well.
In the case of Groq, it would have been death by 1000 cuts to have to support end users directly, especially if they are selling small quantities. It is much easier to just build data centers full of gear, maintain it yourself and then just rent the time on the hardware.
To be totally honest, I have no idea. When I first learned about them, I reached out to the CEO privately on LinkedIn, he asked what I was up to, I told him probably more than I should have (I come from an open source and transparent background), then he stopped talking to me entirely.
Since then, one of the co-founders blocked me on Twitter for pointing out that despite their claims, they were not the first to put MI300x into production. Neither were we; ElioVP gets that trophy, then Lamini, then GigaIO, making us 4th and them 5th. I could go on and on with weird stuff I've seen them do, but it just isn't productive here.
Anyway, I think we have some overlap since we both are one of the few startups on the planet that actually has MI300x. But beyond that, I believe strongly that this space is large enough for multiple players and I don't see a need to be weird with each other. Apparently, I'm not on the same page though.
Thank you. I've been on HN since 2009. The most successful people I've seen here, are the ones that are transparent, honest and ethical.
I'm not trying to point fingers, I'm just focused on building a sustainable business and listening to my customers needs. The only way I can do that is by communicating with everyone around me as clearly and openly as I can. All our customers will know exactly where they stand, at all times.
I post a lot of open information on r/AMD_Stock and the feedback that I've gotten there has been exceptional. People are excited to see if AMD can claw back a bit of the market. For the safety and success of AI, we don't need team blue vs. team red, we need everyone to work towards having as many options as possible.
This is one way that I think we are going to differentiate ourselves. We won't just have MI300x, we will have every best-of-the-best chunk of hardware that we can get our hands on. No longer will super computers be tied up behind govt/edu grants. We want to democratize it. It has long been a goal of mine to build a super computer, and here is my chance. I'm excited.
One thing that sets us apart is that my co-founder and I have a ton of experience deploying, managing and optimizing 150,000 AMD GPUs and 20PB+ of storage. We did it ourselves, all through covid and all of the supply chain issues. I'm not sure many others have done that and this is something that we are well versed at doing.
I'm also seeing my competitors hiring a ton, while we are staying lean and mean with a very small team. I'd rather automate everything we deploy and focus all of our investors money on buying compute. We also have a pool of previous people we can hire from, which I think is quite an advantage over blanket hiring.
Just wanted to say that I like how you see things.
Also, if you need help with infrastructure automation and are interested in using Nix for increased reproducibility, get in touch. I know you said you want to keep lean and already have a pipeline of potential candidates, but just in case.
Hi Lucian, thank you for stepping up to the plate. Recognized and appreciated. We aren't quite in the hiring phase today, still bootstrapping and focused on growing with customer demand, but I have absolutely added you to my list for the future. Cheers!
The hardware is actually pretty amazing. 192GB (or 1.5TB in a chassis), is a game changer.
I'll let you know once I get my hands on them again. There really isn't enough public information about them at all. So far, my friends at ElioVP [0] have published a blog post. Still with not enough detail for my taste, but I'm pretty sure he is limited by what he can talk about. Luckily, I am not.
I mention in another comment below that my current goal is to get a bunch of people to perform testing on them and then publish blog posts along with open source code. This way, we can start a repository of CI/CD tests to see how things improve with time. ROCm 6.1 is rumored to be quite an improvement.
We haven't spec'd to buy A's quite yet as you're actually the first person I've heard even suggest them. If you're truly interested, hit me up personally.
By default, we are putting dual 9754's in the chassis, along with 3TB ram and 155TB nvme. A pretty beefy box. However, if you want to work with us, we can customize this to whatever customers need.
Effectively, we are the capex/opex for something that requires a lot of upfront funding and want to work with businesses that would rather focus on the software side of things.
Was mostly just checking to see if someone had already tested GPGPU, though I know some HPC labs like the MI300A. While I am starting a business, I'm not at the point of shipping software just yet (I wish!). Will definitely keep you in mind for if/when we get to AMD -- it's something I'd want, though that depends on achieving any modicum of success, haha.
It’s basically their minimum cluster size for a reasonable model requires 8ish racks of compute.
Semi analysis did some cost estimates, and I did some but you’re likely paying somewhere in the 12 million dollar range for the equipment to serve a single query using llama-70b. Compare that to a couple of gpus, and it’s easy to see why they are struggling to sell hardware, they can’t scale down.
Since they didn’t use hbm, you need to stich enough cards together to get the memory to hold your model. It takes a lot of 256mb cards to get to 64gb, and there isn’t a good way to try the tech out since a single rack really can’t serve an LLM.
The cloud provider path sounds riskier since that’s two capital intensive businesses, chip design and production and running a cloud service provider.
This is some fantastic insight. That would also inflate the opex (power/space/staffing) needs as well as people need to consider all the networking gear to hook this stuff together. 400G nic/switches/cables aren't cheap and in some cases, very hard to obtain in any sort of quantity.
It does seem like an odd move in that case. I liken this to a company like Bitmain. Why sell the miners when you could just run them yourselves? Well, fact is that they do both. But in this case, Groq is turning off the sales. Who knows, maybe it just ends up being a temporary thing until they can sort all of the pieces out.
> If customers come with requests for high volumes of chips for very large installations, Groq will instead propose partnering on data center deployment. Ross said that Groq has “signed a deal” with Saudi state-owned oil company Aramco, though he declined to give further details, saying only that the deal involved “a very large deployment of [Groq] LPUs.”
If you read on, Groq said they would only sell hardware to US companies and outside companies would get cloud services, not the LPUs. I think the US government told them to keep the LPUs in-house since they could be the secret sauce for scale.
I'm not questioning the deployment strategy, I'm wondering why Saudi Aramco wants to access so much compute power that is highly specialized(?) for generative AI workloads. Or is it more general than that?
> I'm wondering why Saudi Aramco wants to access so much compute power that is highly specialized(?) for generative AI workloads. Or is it more general than that?
For vanity reasons and because AI is the future (not every company acts that rationally for huge buying decisions).
Maybe Saudi doesn’t want to rely on OpenAI and other APIs and wants to run a fine-tuned Mixtral model on the cloud or their hardware. International companies will probably opt for an open source model since the data is sensitive and OpenAI could pass that to intelligence.
What's the connection there? The second link says it's not about materials discovery at all but rather their GigaPOWERS model, which is a physics simulation used to optimize CO2 injection into their fields (i.e. optimizing recovery). POWERS is old; it has been in development for decades already. Given that, it seems they don't plan to use Groq for LLMs but simply for its parallel computation abilities. I wonder to what extent this deal - if it goes through - will seriously drain Groq, actually, as POWERS would not be like the code Groq was designed to run, and so much of their performance comes from the way they tightly optimize for very specific calculations.
Groq's system was designed to run arbitrary high performance numerical workloads. In the past it has been used for a variety of scientific computation tasks, including nuclear fusion and drug discovery.
It's an assumption from my side, that they will utilize GenAI in material discovery for Aramco or SABIC[0]. Even if Groq won't fit that use-case a couple of billions with another hardware vendor is nothing if this paid off.
Partly supply-chain security I imagine. If generative AI does indeed become the next big thing much nicer to have a giant pile of hardware physically in your country than buying a drip feed from a foreign company.
Sounds like they're looking to get bought up to me. I'm sure they could monetize their current hardware, and build to sell just like other niche hardware vendors. Anyone remember the hype around big "cloud" storage boxes 10 years back?
That sucks. I wanted to save up for a couple years and get some hardware for home, but I guess the "AI" space moves so fast you barely get a couple months
I'll look into it, though seeing "contact us" always makes me think they're not going to sell a single unit to a home user. (With that said, Groq probably wouldn't either. You can technically buy LPUs for 20k each, without an expectation of support, but it takes tens of them to run Mixtral.)
Most of the low-level pieces are in Rust, the TUI is written in Python and most of the remaining pieces are getting lowered down to the Rust libraries over time.
(It was all Python up until ~6 months ago)
EDIT: Oh, and you can buy the Grayskull cards online now, without contacting anyone.
I figured I'd add a bit of clarification here. Tenstorrent has a couple SW stacks. The lowest level programming (Metalium) model is written in C++ and goes down to the register level. Higher level entrypoints (ttlib, ttnn) are written on top of that in python. I think it's the smi tooling and other systems software that might be written in rust.
You would need ~250 Groq cards to run a 7B model since their system doesn't scale down. So if you want to buy their hardware, you need a few million dollars.
Their hardware was never for people at home, but for cloud providers.
That doesn't sound right. Their public demo ran on 568 LPUs because they had Mixtral-8x7B and LLaMA-70B (45B and 70B respectively). IIRC their cards each have slightly over 200MB of SRAM so this almost exactly checks out.
A 7B model would then be able to run on about 60 LPUs. Even at $20,000 per card that would be only $1.2 million and I highly doubt the cost is actually that high, that's just what DigiKey says the cost of an LPU is, if you're trying to buy just one :)
Custom state-of-the-art silicon is ridiculously expensive.
For a minimum 100 wafers = 100k chips, Groq may be paying $1k/chip purely in amortizing design costs.
Chip design (software + engineer time) and fabrication setup (lithography masks) grow exponentially [1][2] with smaller nodes, e.g., maybe $100M for Groq's current 14nm chips to ~$500M for their planned 4nm tapeout. Once you reach mass production (>>1000 wafers, which have ~150 large chips each), wafers are $4k-$16k each. (these same issues still exist on older slower nodes, albeit not as bad)
This could be reduced somewhat if chip design software were cheaper, but probably >20% of this cost is fundamental due to manufacturing difficulty.
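Plugging those figures back in (very rough, using only the numbers quoted above):

    # Amortized NRE vs. per-chip wafer cost, using the figures above.
    design_cost = 100e6                 # ~$100M quoted for a 14nm-class design
    chips_amortized = 100_000           # the ~100k-chip volume assumed above
    wafer_cost_low, wafer_cost_high = 4_000, 16_000
    chips_per_wafer = 150               # "large chips"

    nre_per_chip = design_cost / chips_amortized                   # ~$1,000/chip
    fab_per_chip = (wafer_cost_low / chips_per_wafer,
                    wafer_cost_high / chips_per_wafer)              # ~$27-$107/chip
    print(nre_per_chip, fab_per_chip)

At that volume the design cost dwarfs the actual silicon cost per chip.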
> Custom state-of-the-art silicon is ridiculously expensive.
Think about the amount of money being dumped into "AI" at this point. If you've got the technology and people to make stuff faster/better/cheaper, finding investors to dump money into your chip making business is probably not as hard as it was 2 years ago.
Groq is making this change for reasons other than the expense of taping out chips.
The smoke and mirrors around Groq are finally clearing. Truth is that their system is insanely expensive to maintain: hundreds (>500 IIRC) of chips to get wild tokens/s, but the power and maintenance expense is crazy high for that number of chips. The TCO just isn't worth it.
You don't know that. For one thing, their silicon costs are going to be relatively cheap. It's an old, reliable 14nm process, and compared to even Google's TPU this is a relatively simple chip. For another, they _could_ be putting all that silicon to good use, and by all indications they are. Because there's far less local memory movement, and weights are distributed throughout the system, even this 14nm system could be energy efficient. Nine-tenths of the power in a conventional system does not go towards compute; it's wasted moving data back and forth. This is especially bad in transformers, which, because of their size, largely defeat the memory hierarchies the architects worked so hard to perfect. IOW, all your caches are useless and you're unnecessarily wasting 90% of your energy while also getting worse latency and worse throughput (due to memory bus bandwidth constraints). Oops. These folks seem to be offering something that nobody else does: a feasible, proven way to get out of jail free. I wish them all the success they can get, because all the other currently available architectures are largely unsuitable for high-throughput transformer inference, and they work in spite of, instead of because of, their design.
Peak H100 power consumption is 700W. Average power consumption of the groq card (from their own website) is 240W. With 576 chips it just doesn’t look good. How much is that millisecond perf gain worth it to end users?
That said I think their arch is super interesting. I just think that demo was way too hype when the actual system is pretty impractical.
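For scale, multiplying those published figures out (nameplate numbers only, not an equal-throughput comparison):

    # Nameplate power, multiplied out; not apples-to-apples on throughput.
    groq_chips, groq_avg_watts = 576, 240   # demo system size, avg W per card (Groq's site)
    h100_count, h100_peak_watts = 8, 700    # a typical 8-GPU server, peak W per GPU

    print(groq_chips * groq_avg_watts / 1000, "kW for the Groq system")  # ~138 kW
    print(h100_count * h100_peak_watts / 1000, "kW for an 8x H100 box")  # ~5.6 kW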
So? They aren't performing the same computation. You can't compare the two. What you can compare is power draw at an equivalent tokens/sec on the same model. But you don't have that number.
I'm just estimating here from public numbers. My point is that the power consumption could be 100W on that workload and the Groq chip could be $1k, both ridiculously optimistic, and the whole system would still be crazy expensive. H100s will not have latency as fast, but in terms of concurrent users and TCO, I really don't think Groq will be worth it. You could probably get the same concurrent users and throughput with something like 8 H100s. Latency won't be as good, but the price could be much lower.
I don't understand why the comments are trash-talking Groq. They are the fastest LLM inference provider by a big margin. Why would they sell their hardware to any other company for any price? Keep it all for themselves and take over the market. 95% of my LLM requests go to Groq these days. The other 5% to Claude Opus. Latency is king.
What open source model are you using when you hit groq?
I just benchmarked some perf for some of my larger context window queries last week and groq's API took 1.6 seconds versus 1.8 to 2.2 for OpenAI GPT-3.5-turbo. So, it wasn't much faster. I almost emailed their support to see if I was doing something wrong. Would love to hear any details about your workload or the complexity of your queries.
Google has used LMs in search for years (just not trendy LLMs), and search is famously optimized to the millisecond. Visa uses LMs to perform fraud detection every time someone makes a transaction, which is also quite latency sensitive. I'm guessing "informed folks" aren't so informed about the broader market.
OpenAI and Anthropic's APIs are obviously not latency-driven. Same with comparable LLM API resellers like Azure. Most people are likely not expecting tight latency SLOs there. That said, chat experiences (esp. voice ones) would probably be even more valuable if they could react in "human time" instead of with few seconds delay.
Integrating specialized hardware that can shave inference to fractions of a second seems like something that could be useful in a variety of latency-sensitive opportunities. Especially if this allows larger language models to be used where traditionally they were too slow.
Reducing latency doesn't automatically translate to winning the market or even increased revenue. There are tons of other variables such as functionality, marketing, back-office sales deals and partnerships. Lots of times, users can't even tell which service is objectively better (even though you and I have the know how and tools to measure and better know reality).
Unfortunately the technical angle is only one piece of the puzzle.
I have lots of questions about how important latency is, since you may be replacing many minutes or hours of a person's time with what is undoubtedly a quicker response by any measure. This seems like a knee-jerk reaction assuming latency is as important as it's been with advertising.
I'm not convinced latency matters as much as Groq's material tries to claim it does.
When it's already faster than I can absorb the response, which for me as an organic brain includes the normal token generation rate of the free tier of ChatGPT.
If I was using them to process far more text, e.g. summarise long documents, or if I was using it as an inline editing assistant, then I'd care more about the speed.
Name one use case where there is a difference between latency of 200 t/s (fireworks.ai mixtral model) and 500 t/s (groq mixtral)? Not throughput and not time to first token, but latency.
Groq model shines at latency, not at the other two.
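One way to frame it: end-to-end response time is roughly TTFT plus output tokens divided by generation rate, so the 200 vs 500 t/s gap only shows up for long outputs or chained calls. Illustrative numbers only (the 0.3s TTFT is a placeholder):

    # Rough model: total response time = TTFT + output_tokens / tokens_per_second
    def response_time(ttft_s, out_tokens, tps):
        return ttft_s + out_tokens / tps

    for out_tokens in (50, 500, 5000):
        t_slow = response_time(0.3, out_tokens, 200)   # fireworks-class speed
        t_fast = response_time(0.3, out_tokens, 500)   # groq-class speed
        print(f"{out_tokens:>5} tokens: {t_slow:.2f}s vs {t_fast:.2f}s")
    # For a multi-step agent chain, multiply the per-call gap by the number of steps.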
For example, if you're a game company and you want to use LLMs so your players can converse with nonplayer characters in natural language, replacing a multiple-choice conversation tree - you'd want that to be low latency, and you'd want it to be cheap.
But are people really going to do this? The cost here seems prohibitive unless you're doing a subscription type game (and even then I'm not sure). And the kinds of games that benefit from open ended dialogue attract players who just want to pay an upfront cost and have an adventure.
(All of a sudden I'm having nightmares of getting billed for the conversations I have in the single-player game I happen to be enjoying...)
If there is a future with this idea, it's gotta be just shipping the LLM with the game, right?
That's an open-source Mistral ML model implementation which runs on GPUs (all of them, not just Nvidia), takes 4.5GB on disk, uses under 6GB of VRAM, and is optimized for the interactive single-user use case. Probably fast enough for that application.
You wouldn’t want in-game dialogues with the original model though. Game developers would need to finetune, retrain and/or do something else with these weights and/or my implementation.
FWIW, for other confused people trying to extract something from that video: it looks like this game [1] is using this stuff. Based solely on the reviews and the gameplay videos (while definitely acknowledging its technically in-development status), it kinda looks like long-term profitability is the least of their concerns here...
EDIT: Watching the videos, I am more and more confused by why this is even desirable. The complexity of dialogue in a game, it seems, needs to match the complexity of the more general possibilities and actions you can undertake in the game itself. Without that, it all just feels like you are in a little chatbot sandbox within the game, even if the dialogue is perfectly "in character." It all seems to feel less immersive with the LLMs.
Absolutely on the mark with this comment. LLMs aren't magical end-goal technology. We have a while to go it seems before they've settled into all the use-cases and we've established what does and doesn't work.
It would probably look like an InfiniteCraft-style model, where conversation possibilities are saved, and new dialogue (edge nodes) is computed as needed.
Small, bounded conversations, with problematic lines trimmed over time, striking a balance between possibility and self-contradiction.
I could see it working really well in a Mass Effect-type game.
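A minimal sketch of that caching idea (purely illustrative; generate_reply is a stand-in for whatever model call and safety filtering a game would actually do):

    # InfiniteCraft-style dialogue cache: each edge is generated once, then replayed
    # for every later player who takes the same conversational step.
    dialogue_cache = {}   # (npc_state, normalized player line) -> npc reply

    def generate_reply(npc_state, player_line):
        # Placeholder for the real LLM call plus trimming of problematic lines.
        return f"[{npc_state}] responds to '{player_line}'"

    def npc_reply(npc_state, player_line):
        key = (npc_state, player_line.strip().lower())
        if key not in dialogue_cache:
            dialogue_cache[key] = generate_reply(npc_state, player_line)
        return dialogue_cache[key]

    print(npc_reply("garrus_loyal", "Tell me about Omega"))   # computed once
    print(npc_reply("garrus_loyal", "tell me about omega"))   # served from the cache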
Google won search in large part because of their latency. I stopped using local models because of latency. I switched from OpenAI to VertexAI because of latency (and availability).
Research suggests most answers and use cases do not require the largest, most sophisticated models. When you start building more complex systems, the overall time increases from chaining, and you can pick different models for different points.
It's not a lot faster for input, but it is something like 10x faster for output (Mixtral vs GPT-3.5). This could enable completely new modes of interaction with LLMs, e.g. agents.
This business model is bound to get attacked and suffer a painful exit soon. Here's why:
First, the system-of-chips architectures that everyone is talking about will solve for increasing the overall SRAM available, keeping more model state in super-fast memory and avoiding trips to slow memory.
Secondly, anyone serious about their data (enterprises) won't be okay with making API calls to Groq. Anyone serious about their data who also has a lot of scale (consumer internet) won't be okay with making expensive API calls to Groq at scale either.
Their cloud is attractive only if I can use their API for experimenting with toy apps, to continue developing in this direction while the rest of the major industry players' system-of-chips architectures catch up and solve the SRAM-size and manufacturing-process bottlenecks; once that's solved, I get more powerful compute for cheaper $$ to deploy on-prem.
So, this cloud strategy is short-lived. I see another pivot on the horizon.
I think one major challenge they'll face is that their architecture is incredibly fast at running the ~10-100B parameter open-source models, but starts hitting scaling issues with state-of-the-art models. They need 10k+ chips for a GPT-4-class model, but their optical interconnect only supports a few hundred chips.
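A rough way to see where that wall sits, assuming weights live entirely in on-chip SRAM at 2 bytes per parameter and ~230 MB per chip (the figures discussed elsewhere in the thread; the larger model sizes are hypothetical):

    # Chips needed just to hold the weights, vs. an interconnect cap of a few hundred chips.
    sram_per_chip = 230e6     # bytes, approximate
    bytes_per_param = 2       # fp16/bf16

    for params in (7e9, 70e9, 400e9, 1.8e12):   # the last two are hypothetical frontier sizes
        chips = params * bytes_per_param / sram_per_chip
        print(f"{params / 1e9:>6.0f}B params -> ~{chips:,.0f} chips")
    # ~61, ~609, ~3,478, ~15,652: everything past the first couple of rows is far
    # beyond "a few hundred chips" on one interconnect domain.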