Groq CEO: 'We No Longer Sell Hardware' (www.eetimes.com)
13 points by frozenport | 2024-04-07 | 152 comments




Read: We're forcing someone's hand in acquiring us.

Groq is still under a 30-requests-per-minute rate limit, which drops to 10 requests per minute for sustained all-day usage.

Billing has been "coming soon" this whole time, and while they've built out hype-enabling features like function calling, somehow they can't set up a Stripe webhook to collect money for realistic rate limits.

They couldn't scream "we can't service the tiniest bit of our demand" any louder at this point.
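
(For a sense of how small the missing billing piece is, here is a minimal, illustrative sketch of a usage-billing webhook receiver using Stripe's Python SDK and Flask. The endpoint path, environment variable names, and the rate-limit bump on "invoice.paid" are all my assumptions, not anything Groq has published.)

  # Minimal illustrative sketch of a Stripe webhook receiver for usage billing.
  # Endpoint path, env var names, and event handling are assumptions.
  import os
  import stripe
  from flask import Flask, request, abort

  app = Flask(__name__)
  stripe.api_key = os.environ["STRIPE_API_KEY"]
  endpoint_secret = os.environ["STRIPE_WEBHOOK_SECRET"]

  @app.route("/stripe/webhook", methods=["POST"])
  def stripe_webhook():
      payload = request.get_data()
      sig = request.headers.get("Stripe-Signature", "")
      try:
          event = stripe.Webhook.construct_event(payload, sig, endpoint_secret)
      except (ValueError, stripe.error.SignatureVerificationError):
          abort(400)
      if event["type"] == "invoice.paid":
          pass  # e.g. raise the paying customer's rate limit here
      return "", 200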

_

Edit: For anyone looking for fast inference without the smoke and mirrors, I've been using Fireworks.ai in production and it's great. 200-300 tok/s is closer to Groq than it is to OpenAI and co.

And as a bonus they support PEFT with serverless pricing.


they don't even let us pay them, it's insane

I just have free API access with no ability to add a credit card.


What are you using all this for? What's the product?

I run an AI storytelling site and an AI ideation platform.

The storytelling site alone averaged 27k requests a day this week, about double what their current request limit allows, and it's honestly not even that popular a site.

You can't run much more than a toy project on their current rate limits.
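
(The arithmetic behind "about double," using the 10 requests/minute all-day figure mentioned upthread:)

  # Back-of-envelope check of the rate-limit math quoted above.
  per_minute_limit = 10                        # sustained all-day limit
  daily_capacity = per_minute_limit * 60 * 24  # 14,400 requests/day
  site_traffic = 27_000                        # requests/day, storytelling site
  print(site_traffic / daily_capacity)         # ~1.9, i.e. roughly double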


Wildly fast inference. And the current chips are 14 nm, so there's headroom to get a lot better.

Note that SRAM density doesn't scale at the same rate as logic density, and Groq's "secret sauce" is putting a ton of SRAM on their chips. Their stuff won't necessarily see the full benefits of switching to denser nodes if the bottleneck is how much SRAM they can pack onto each chip.

https://www.tomshardware.com/news/no-sram-scaling-implies-on...


They're giving the lie to the idea that you need bleeding-edge hardware for performance.

Five-year-old silicon (14 nm!!) and no HBM.

Their secret sauce seems to be an ahead-of-time compiler that statically lays out the entire computation, enabling zero contention at runtime. Basically, they stamp out all non-determinism.

https://wow.groq.com/isca-2022-paper
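
(A toy way to picture "stamping out non-determinism": the compiler emits a fixed cycle-by-cycle schedule ahead of time, and the hardware just replays it, so nothing is ever arbitrated at runtime. This is a conceptual sketch only, not Groq's actual compiler or ISA:)

  # Conceptual sketch: a statically scheduled program is just a fixed table of
  # (cycle -> operations) decided entirely at compile time. Nothing here
  # reflects Groq's real compiler or instruction set.
  from typing import Callable, Dict, List

  Schedule = Dict[int, List[Callable[[], None]]]

  def compile_static(ops: List[Callable[[], None]], latency: int) -> Schedule:
      # Assign each op a fixed issue cycle; conflicts are resolved here,
      # ahead of time, so the runtime has no queues, locks, or arbitration.
      return {i * latency: [op] for i, op in enumerate(ops)}

  def run(schedule: Schedule) -> None:
      for cycle in sorted(schedule):
          for op in schedule[cycle]:
              op()  # deterministic replay: same order and timing every run

  run(compile_static([lambda: print("load"),
                      lambda: print("matmul"),
                      lambda: print("store")], latency=1))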


No HBM because they use tons of fast SRAM instead. Isn't that the main driver for performance here?

(The way I understood it, it's still cost-effective at scale due to the throughput increase this brings.)


Cost-effective in what sense? Groq doesn't achieve high efficiency, only low latency, and that's not done in a cost-effective way. Compare SambaNova achieving the same performance with 8 chips instead of 568, and at higher precision.

The # of chips is not the most important metric.

Most important, even ignoring latency, is throughput (tokens) per $$$. And according to their own benchmark [1] (famous last words :)), they're quite cost efficient.

[1] https://www.semianalysis.com/p/groq-inference-tokenomics-spe...


> No HBM because they use tons of fast SRAM instead. Isn't that the main driver for performance here?

No doubt fast SRAM helps, but from a computation POV, IMHO it's that they've statically planned the computation and eliminated all locks.

Short explainer here: https://www.youtube.com/watch?v=H77tV1KcWIE (Based on their paper).


Thanks for putting this together! Will give it a watch now

So Itanium but functional?

From skimming the link above, it seems like they accepted it's extremely difficult (maybe impossible) to generate high ILP from a VLIW compiler on complex hardware (what Itanium tried to do).

So they attacked the "complex hardware" part and simplified the hardware. Mostly by eliminating memory-layer non-determinism / using time-synced global memory instructions as part of the ISA(?).

This apparently reduced the difficulty of the compiler problem to something manageable (but no doubt still "fun")... and voila, performance.


The problem set Groq is restricted to (known-size tensor manipulation) lends itself to much easier solutions than full-blown ILP. The general problem of compiling arbitrary Turing-complete algorithms to the arch is NP-hard. Tensor manipulation... that's a different story.

It's not really a lie though. They require 20x more chips than Nvidia GPUs (storing all the weights in SRAM instead of HBM is expensive!) for a ~2x speed increase. Overall, the power cost is higher for Groq than for GPUs.

Source? Or how do you know that?

This was discussed extensively in previous threads, e.g. https://news.ycombinator.com/item?id=39431989

Are power and the 20x 14 nm chip count the limiting factors currently?

It's not inconceivable that's a better trade-off than leading-node and HBM requirements.


Edit: 200x more chips, not 20x.

I wonder if the use of eDRAM (https://en.wikipedia.org/wiki/EDRAM), which essentially embeds DRAM into a chip made on a logic process, would be a good idea here.

eDRAM is essentially a tradeoff between SRAM and DRAM, offering much greater density than SRAM at the cost of somewhat worse throughput and latency.

There were a couple of POWER CPUs that used eDRAM as L3 cache, but it seems to have fallen out of favor.


It fell out of favor because it lost the density advantage in newer processes.

So unless there are new Groq datacenters coming, this is only interesting for North American users. Otherwise, H100-based latency-optimized solutions would be faster - in particular for time-to-first-token sensitive applications.

> latency optimized solutions would be faster - in particular for time-to-first-token sensitive applications

Do you have any idea how fast Groq is? Go try it. Consistently over 400 t/s for most of the models that they support, and extremely low latency.


time to first token != tokens per second

Remember that EU -> US is ~150 ms of unavoidable latency, for example. Then your comparison is a local H100 vs. Groq + 150 ms added to the time to first token.
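
(A rough way to frame the comparison: total response time is roughly network RTT + time to first token + output tokens divided by generation rate. Every number below is a placeholder for illustration, not a measurement:)

  # Rough model of end-to-end response time; all figures here are illustrative.
  def response_time(rtt_s, ttft_s, out_tokens, tok_per_s):
      return rtt_s + ttft_s + out_tokens / tok_per_s

  # EU user hitting a fast US-hosted backend vs. a hypothetical local H100 setup
  print(response_time(rtt_s=0.15, ttft_s=0.2, out_tokens=500, tok_per_s=450))  # ~1.5 s
  print(response_time(rtt_s=0.00, ttft_s=0.3, out_tokens=500, tok_per_s=80))   # ~6.6 s
  # For very short outputs, the rtt_s and ttft_s terms dominate instead.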


> time to first token != tokens per second

I said "and extremely low latency" because I know they are different. Groq's TTFT is still consistently competitive with any other provider, and lower than most of them. Here's some benchmarks: https://github.com/ray-project/llmperf-leaderboard#70b-model...


I'm in Australia. I have 249ms of unavoidable latency and I'd still use the groq API if I could. It's that much faster than other inference solutions.

Man, I want to appreciate a nice new hardware approach, but they say such BS that it is hard to read about them:

> “There might need to be a new term, because by the end of next year we’re going to deploy enough LPUs that compute-wise, it’s going to be the equivalent of all the hyperscalers combined,” he said. “We already have a non-trivial portion of that.”

Really? Does anyone seriously believe they are going to be the equivalent of all hyperscalers in compute next year? (Where Meta alone is at 1 million H100 equivalents.) In the same article where they say it's too hard for them to sell chips? And when they literally don't have a setup to even accept a credit card today?


You don't put a million-dollar rack on a credit card.

Interesting, I guess that is why I never got a response back from them about buying their stuff.

My guess is that they realized that just selling hardware is a lot harder than running it themselves. Deploying this level of compute is non-trivial, with very high rates of failure, as well as huge supply chain issues. If you have to sell the hardware and support people buying it, that is a world of trouble.

> no-one wants to take the risk of buying a whole bunch of hardware

I do!

Nobody has stated it yet, but this is probably great news for tenstorrent.

Disclosure: building a cloud compute provider starting with AMD MI300x, and eventually any other high end hardware that our customers are asking for.


That's interesting, something that I've been really wanting to get into as well, but where I am there is literally no venture capital to raise for this at the moment. I'd be interested to know more and/or bounce some ideas though.

Extremely capital intensive, but also requires relationships in the industry at many levels. Luckily, I happen to have both and they are crazy enough to put their trust in me. I feel very grateful for that.

That, or if you put chips in the hands of your customers, they may start to benchmark them against other equivalent solutions.

Funny you should mention that. ;-)

https://www.reddit.com/r/LocalLLaMA/comments/1bpgrdf/wanted_...

I've got about a dozen people signed up. Just working through some hardware issues right now (see above about high rate of failures), and hope to have this resolved next week, so that I can get people onto them and doing their testing.


Shiny! Have you guys gotten MI300Xs working for non-inference use cases?

We got them, then gave them to a customer to use for a week, got them back and now we are having some hardware issues that we are in the process of sorting out. I literally haven't had more than a few hours time on them yet.

They _should_ work fine for both training and inference, but since nobody has done much in the way of public in-depth benchmarks yet... I was hoping to get people to do it for us in order to stay as unbiased as possible.

noticing now: strange that my previous comment was downvoted. Would be nice to understand what someone didn't like about what I said!


That sounds like your machine broke after 1 week of intensive use.

Sadly, not even intensive.

The thing that nobody talks about is that there is a high rate of failures on this high end equipment. I've heard as high as 20%, in the first month. I'm not even talking about AMD here.

If anyone thinks they can just buy some accelerators and throw them into a rack and expect them to work flawlessly... they've got some hard lessons to learn.

This will be less of an issue as we grow as we will have plenty of stock to pull from, but it is a real bummer as we are starting as a proof of concept first. We started working on this business last August, before anyone knew whether or not AMD would even change course on AI.

The good news is that we onboarded a customer the day that we announced our availability, we passed that PoC challenge with flying colors and closed significant additional funding immediately after that. Onwards and upwards, just have to roll with the punches.


Ooooh yeah, from experience, even the usual (thorough) suspects who over-over-over-spec cooling and power (as anyone here who has heard an HPE Apollo 6500 at full fan speed can attest) have a hard time getting all the interconnects, PCIe, and firmware stuff up to snuff and running 24/7. Once it's set up it's amazing, but the bring-up can be rocky.

And I'm not even talking about the 100/400G network - wonderful, wonderful hardware, but good luck debugging and getting all the RoCE/RDMA/GPUDirect/StorageDirect/NCCL pieces working (already a bit of a pain on Nvidia, with a large installed base...).

Either you want to learn all this stuff (for reasons) or you're dumping a lot of money on fast-evolving tech.


> If you have to sell the hardware and support people buying it, that is a world of trouble.

What is the difference between this and having to sell the cloud access and supporting the people who buy a subscription?


A lot. It's the same reason Amazon doesn't sell servers and instead gives you access to a single instance that everyone pretends is the same but in reality is massively transient.

We give full bare-metal access. If you just want one GPU, we give you a VM with PCIe passthrough. If you take a whole box, we can give you BMC access and access to the PDU itself, to hard-reboot things remotely. It is as if you own the whole machine yourself.

Good question. Not much. If anything, what I'm doing is even harder because I will have multiple sources for the hardware. I have to deal with all of the hardware and data center issues, as well as the customers who rely on us to provide them access.

Good thing that I'm a glutton for punishment.


> What is the difference between this and having to sell the cloud access and supporting the people who buy a subscription

Margins.

Pricing for cloud compute is much higher and servicing and management for the provider is much cheaper.

If I sold hardware directly, then I'm often on the hook for support contracts, which can get pricey with hardware and distract from shipping future-facing product features, as customers who purchase directly have longer upgrade windows due to logistical overhead.


The pricing isn't necessarily much higher. The pricing is set up to amortize the cost of running things over a period of time. If you're only going to use it for a short time, you pay more than someone who commits to a 3 year contract.

It also isn't just the hardware capex, it is everything involved under the covers. Market pricing also factors into that as well. This is something I've struggled with myself quite a bit. I know all of my costs and what I'd like to charge, but because my offering is so brand new, until my competitors announced their pricing, I wasn't sure what the market would tolerate.

Selling hardware directly is hard for exactly what you state though. Service contracts are a pain in the butt. All of this latest AI hardware has high rates of failures too. Up until recently with AMD coming to market with a great offering, the only thing people want any sort of quantity on are nvidia products. Groq probably realized that people buying 1-2 cards at a time, wasn't going to be profitable.


> it is everything involved under the covers

Yep! You explained it better than me!

> Groq probably realized that people buying 1-2 cards at a time, wasn't going to be profitable

Yep! And it would have bogged them down by slowing R&D cycles and even order fulfillment, as they obviously are not placing orders the same size as Nvidia or AMD.

I wonder what this portends for SambaNova and other similar vendors as well.


> I wonder what this portends for SambaNova and other similar vendors as well.

Time will tell. This is definitely an interesting development.


>I wonder what this portends for SambaNova and other similar vendors as well.

They're focused on services-based full-stack deployments AFAIK.

They come in with a rack and sell you on models as well.


> What is the difference between this and having to sell the cloud access and supporting the people who buy a subscription

Knowledge/training.

If you're shipping a brand new hardware arch, exposed as raw hardware, then you're on the hook for training everyone in the world and fixing all their weird edge case uses.

I.e. are you willing to invest in Intel/AMD/Nvidia-scale QA and support?

If you're exposing a PaaS (or even IaaS), then you have some levers you can tweak / mask behind the scenes, so only your team need be experts at low-level operations.

For a fast-paced company, the latter model makes a lot more sense, at least until hardware+software stabilizes.


> I.e. are you willing to invest in Intel/AMD/Nvidia-scale QA and support?

At least in our experience, the first line of support is from the chassis vendors. You don't go to a store and buy MI300x. You buy them from someone like SMCI/Dell, who provides the support. Of course, behind the scenes, they might be talking to AMD. Even those chassis companies often have other providers of their gear (like Exxact) as another line of defense as well.

In the case of Groq, it would have been death by 1000 cuts to have to support end users directly, especially if they are selling small quantities. It is much easier to just build data centers full of gear, maintain it yourself and then just rent the time on the hardware.


How do y'all compare to https://tensorwave.com?

To be totally honest, I have no idea. When I first learned about them, I reached out to the CEO privately on LinkedIn, he asked what I was up to, I told him probably more than I should have (I come from an open source and transparent background), then he stopped talking to me entirely.

Since then, one of the co-founders blocked me on Twitter for pointing out that, despite their claims, they were not the first to put MI300x into production. Neither were we; ElioVP gets that trophy, then Lamini, then GigaIO, making us 4th and them 5th. I could go on and on with weird stuff I've seen them do, but it just isn't productive here.

Anyway, I think we have some overlap since we both are one of the few startups on the planet that actually has MI300x. But beyond that, I believe strongly that this space is large enough for multiple players and I don't see a need to be weird with each other. Apparently, I'm not on the same page though.

¯\_(ツ)_/¯


For what it’s worth, to me, your approach is the one I’d prefer as a customer.

Thank you. I've been on HN since 2009. The most successful people I've seen here, are the ones that are transparent, honest and ethical.

I'm not trying to point fingers, I'm just focused on building a sustainable business and listening to my customers' needs. The only way I can do that is by communicating with everyone around me as clearly and openly as I can. All our customers will know exactly where they stand, at all times.

I post a lot of open information on r/AMD_Stock and the feedback that I've gotten there has been exceptional. People are excited to see if AMD can claw back a bit of the market. For the safety and success of AI, we don't need team blue vs. team red, we need everyone to work towards having as many options as possible.

This is one way that I think we are going to differentiate ourselves. We won't just have MI300x, we will have every best-of-the-best chunk of hardware that we can get our hands on. No longer will super computers be tied up behind govt/edu grants. We want to democratize it. It has long been a goal of mine to build a super computer, and here is my chance. I'm excited.

One thing that sets us apart is that my co-founder and I have a ton of experience deploying, managing and optimizing 150,000 AMD GPUs and 20PB+ of storage. We did it ourselves, all through covid and all of the supply chain issues. I'm not sure many others have done that and this is something that we are well versed at doing.

I'm also seeing my competitors hiring a ton, while we are staying lean and mean with a very small team. I'd rather automate everything we deploy and focus all of our investors money on buying compute. We also have a pool of previous people we can hire from, which I think is quite an advantage over blanket hiring.


Just wanted to say that I like how you see things. Also, if you need help with infrastructure automation and are interested in using Nix for increased reproducibility, get in touch. I know you said you want to keep lean and already have a pipeline of potential candidates, but just in case.

Hi Lucian, thank you for stepping up to the plate. Recognized and appreciated. We aren't quite in the hiring phase today, still bootstrapping and focused on growing with customer demand, but I have absolutely added you to my list for the future. Cheers!

Sure. Thank you for answering. Good luck with what you're doing!

How's the overall software support for MI300 series? The hardware itself looks great.

(also, +100 to valuing honesty and transparency)


The hardware is actually pretty amazing. 192 GB (or 1.5 TB in a chassis) is a game changer.

I'll let you know once I get my hands on them again. There really isn't enough public information about them at all. So far, my friends at ElioVP [0] have published a blog post. Still with not enough detail for my taste, but I'm pretty sure he is limited by what he can talk about. Luckily, I am not.

I mention in another comment below that my current goal is to get a bunch of people to perform testing on them and then publish blog posts along with open source code. This way, we can start a repository of CI/CD tests to see how things improve with time. ROCm 6.1 is rumored to be quite an improvement.

[0] https://www.evp.cloud/post/diving-deeper-insights-from-our-l...


Nice! I'm pretty interested in GPGPU applications and MI300A, but I'm also just glad for more competition. Love that you hit up the LocalLLaMa sub.

Do you know if anyone's tested CuPy stuff on MI300X?


We haven't spec'd to buy A's quite yet as you're actually the first person I've heard even suggest them. If you're truly interested, hit me up personally.

By default, we are putting dual 9754's in the chassis, along with 3 TB of RAM and 155 TB of NVMe. A pretty beefy box. However, if you want to work with us, we can customize this to whatever customers need.

Effectively, we are the capex/opex for something that requires a lot of upfront funding and want to work with businesses that would rather focus on the software side of things.


Was mostly just checking to see if someone had already tested GPGPU, though I know some HPC labs like the MI300A. While I am starting a business, I'm not at the point of shipping software just yet (I wish!). Will definitely keep you in mind for if/when we get to AMD -- it's something I'd want, though that depends on achieving any modicum of success, haha.

Awesome, thanks for the follow up. I wish you the best with your business! It definitely isn't easy, but we will be here to help, when you need it.

It's basically that their minimum cluster size for a reasonable model requires 8-ish racks of compute.

SemiAnalysis did some cost estimates, and I did some too; you're likely paying somewhere in the $12 million range for the equipment to serve even a single query using Llama-70B. Compare that to a couple of GPUs, and it's easy to see why they are struggling to sell hardware: they can't scale down.

Since they didn't use HBM, you need to stitch enough cards together to get the memory to hold your model. It takes a lot of 256 MB cards to get to 64 GB, and there isn't a good way to try the tech out since a single rack really can't serve an LLM.

The cloud provider path sounds riskier since that's two capital-intensive businesses: chip design and production, and running a cloud service provider.


This is some fantastic insight. That would also inflate the opex (power/space/staffing) needs, and people need to consider all the networking gear to hook this stuff together. 400G NICs/switches/cables aren't cheap and in some cases are very hard to obtain in any sort of quantity.

It does seem like an odd move in that case. I liken this to a company like Bitmain. Why sell the miners when you could just run them yourselves? Well, fact is that they do both. But in this case, Groq is turning off the sales. Who knows, maybe it just ends up being a temporary thing until they can sort all of the pieces out.


The GroqCard (RS-GQ-GC1-0109) was in stock at mouser a few weeks ago and they are still taking orders.

> If customers come with requests for high volumes of chips for very large installations, Groq will instead propose partnering on data center deployment. Ross said that Groq has “signed a deal” with Saudi state-owned oil company Aramco, though he declined to give further details, saying only that the deal involved “a very large deployment of [Groq] LPUs.”

What? How does this make sense?


If you read on, Groq said they would only sell hardware to US companies and outside companies would get cloud services, not the LPUs. I think the US government told them to keep the LPUs in-house since they could be the secret sauce for scale.

I'm not questioning the deployment strategy, I'm wondering why Saudi Aramco wants to access so much compute power that is highly specialized(?) for generative AI workloads. Or is it more general than that?

> I'm wondering why Saudi Aramco wants to access so much compute power that is highly specialized(?) for generative AI workloads. Or is it more general than that?

For vanity reasons and because AI is the future (not every company acts that rationally for huge buying decisions).


Also diversification. They’re smart people; oil isn’t the future. Comparatively small investments to hedge make perfect sense.

Maybe Saudi doesn’t want to rely on OpenAI and other APIs and wants to run a fine-tuned Mixtral model on the cloud or their hardware. International companies will probably opt for an open source model since the data is sensitive and OpenAI could pass that to intelligence.

It's about materials discovery [0]; Aramco built a 250B model to help them improve efficiency [1].

[0] https://deepmind.google/discover/blog/millions-of-new-materi...

[1] https://www.aramco.com/en/news-media/speeches/2024/leap24--r...


What's the connection there? The second link says it's not about materials discovery at all but rather their GigaPOWERS model, which is a physics simulation used to optimize CO2 injection into their fields (i.e. optimizing recovery). POWERS is old; it has been in development for decades already. So it seems they don't plan to use Groq for LLMs but simply for its parallel computation abilities. I wonder to what extent this deal - if it goes through - will seriously drain Groq, actually, as POWERS would not be like the code Groq was designed to run, and so much of their performance comes from the way they tightly optimize for very specific calculations.

Hi there, I work for Groq.

Groq's system was designed to run arbitrary high performance numerical workloads. In the past it has been used for a variety of scientific computation tasks, including nuclear fusion and drug discovery.

https://www.alcf.anl.gov/news/argonne-deploys-new-groq-syste...


It's an assumption on my side that they will utilize GenAI in materials discovery for Aramco or SABIC [0]. Even if Groq doesn't fit that use case, a couple of billion with another hardware vendor is nothing if this pays off.

[0]https://en.wikipedia.org/wiki/SABIC


Partly supply-chain security I imagine. If generative AI does indeed become the next big thing much nicer to have a giant pile of hardware physically in your country than buying a drip feed from a foreign company.

Oil & gas has large data needs, they had petabyte-scale data 2 decades ago.

Sounds like they're looking to get bought up to me. I'm sure they could monetize their current hardware, and build to sell just like other niche hardware vendors. Anyone remember the hype around big "cloud" storage boxes 10 years back?

Given that their hardware is different I can kinda see how they don’t want to deal with supporting customers.

> what do you mean I can’t just drop a CUDA docker image in?


if you're a hardware startup that doesn't sell hardware, what are you?

> if you're a hardware startup that doesn't sell hardware, what are you?

A hardware startup that sells cloud access to its hardware. :-)


A hardware startup that produces superior hardware and extracts the benefit in-house?

That sucks. I wanted to save up for a couple years and get some hardware for home, but I guess the "AI" space moves so fast you barely get a couple months

Save up for Tenstorrent instead.

I'll look into it, though seeing "contact us" always makes me think they're not going to sell a single unit to a home user. (With that said, Groq probably wouldn't either. You can technically buy LPUs for 20k each, without an expectation of support, but it takes tens of them to run Mixtral.)

FWIW, as a consumer I have a Tenstorrent card in my machine.

That is great. Please hit me up privately, I'd love to chat with you about it.

Most of the low-level pieces are in Rust, the TUI is written in Python and most of the remaining pieces are getting lowered down to the Rust libraries over time.

(It was all Python up until ~6 months ago)

EDIT: Oh, and you can buy the Grayskull cards online now, without contacting anyone.


Even talking to people there, my experience is that they are super nice!

> EDIT: Oh, and you can buy the Grayskull cards online now, without contacting anyone.

I actually don't mind having to contact them; I only mind if they won't sell to me because it's a non-bulk order.

> Most of the low-level pieces are in Rust

That's awesome!


I figured I'd add a bit of clarification here. Tenstorrent has a couple of SW stacks. The lowest-level programming model (Metalium) is written in C++ and goes down to the register level. Higher-level entry points (ttlib, ttnn) are written on top of that in Python. I think it's the SMI tooling and other systems software that might be written in Rust.

You would need ~250 Groq cards to run a 7B model since their system doesn't scale down. So if you want to buy their hardware, you need a few million dollars.

Their hardware was never for people at home, but for cloud providers.


That doesn't sound right. Their public demo ran on 568 LPUs because they had Mixtral-8x7B and LLaMA-70B (45B and 70B respectively). IIRC their cards each have slightly over 200MB of SRAM so this almost exactly checks out.

A 7B model would then be able to run on about 60 LPUs. Even at $20,000 per card that would be only $1.2 million, and I highly doubt the cost is actually that high - that's just what DigiKey says an LPU costs if you're trying to buy just one :)
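
(For anyone who wants to sanity-check these figures, the arithmetic is just model size over per-card SRAM. The ~230 MB SRAM figure and the bytes-per-parameter choices below are assumptions for illustration, and real deployments also need room for activations and redundancy:)

  # Back-of-envelope card counts; SRAM per card and weight precision are assumptions.
  import math

  def min_cards(params_billions, bytes_per_param, sram_mb=230):
      weight_mb = params_billions * 1e9 * bytes_per_param / 1e6
      return math.ceil(weight_mb / sram_mb)

  print(min_cards(45 + 70, 1))      # Mixtral + LLaMA-70B at 8-bit -> ~500 cards
  print(min_cards(7, 2))            # 7B model at 16-bit -> ~61 cards
  print(min_cards(7, 2) * 20_000)   # ~$1.2M at the $20k list price per card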


Custom state-of-the-art silicon is ridiculously expensive.

For a minimum run of 100 wafers = 100k chips, Groq may be paying $1k/chip purely in amortized design costs.

Chip design (software + engineer time) and fabrication setup (lithography masks) grow exponentially [1][2] with smaller nodes, e.g., maybe $100M for Groq's current 14nm chips to ~$500M for their planned 4nm tapeout. Once you reach mass production (>>1000 wafers, which have ~150 large chips each), wafers are $4k-$16k each. (these same issues still exist on older slower nodes, albeit not as bad)

This could be reduced somewhat if chip design software were cheaper, but probably >20% of this cost is fundamental due to manufacturing difficulty.

[1] https://www.semianalysis.com/p/the-dark-side-of-the-semicond... [2] https://www.extremetech.com/computing/272096-3nm-process-nod...
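
(To make the amortization concrete, a parametric back-of-envelope using the rough figures above; none of these are Groq's actual numbers, and the yield fraction is my assumption:)

  # Rough amortization of design NRE plus wafer cost per chip.
  # All inputs are ballpark figures from the comment above, not real data.
  def cost_per_chip(nre_usd, total_chips, wafer_cost_usd, chips_per_wafer, yield_frac=0.8):
      return nre_usd / total_chips + wafer_cost_usd / (chips_per_wafer * yield_frac)

  print(round(cost_per_chip(100e6, 100_000, 6_000, 150)))     # ~1050: NRE dominates a small run
  print(round(cost_per_chip(500e6, 5_000_000, 16_000, 150)))  # ~233: wafer cost dominates at volume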


The report I read said that the latest TSMC node is $17k per wafer. How much less it is for 14 nm, I don't know.

The masks are the expensive part, not the wafers.

They are both fabulously expensive.

> Custom state-of-the-art silicon is ridiculously expensive.

Think about the amount of money being dumped into "AI" at this point. If you've got the technology and people to make stuff faster/better/cheaper, finding investors to dump money into your chip making business is probably not as hard as it was 2 years ago.

Groq is making this change for reasons other than the expense of taping out chips.


I don't support hardware development directly, but I'm a software infrastructure engineer working adjacent to the teams that do.

Can't comment on specifics, but IMO our hardware team punches above its weight class in terms of the number of people and the time spent on design.


The smoke and mirrors around Groq are finally clearing. Truth is that their system is insanely expensive to maintain: hundreds (>500 IIRC) of chips to get wild tokens/s, but the power and maintenance expense is crazy high for that number of chips. The TCO just isn't worth it.

You don't know that. For one thing, their silicon costs are going to be relatively cheap. It's an old, reliable 14nm process, and compared to even Google's TPU this is a relatively simple chip. For another, they _could_ be putting all that silicon to good use, and by all indications they are. Because there's far less local memory movement, and weights are distributed throughout the system, even this 14nm system could be energy efficient.

In a conventional system, 9/10ths of all power does not go towards compute - it's wasted in moving data back and forth. This is especially bad in transformers, which, because of their size, largely defeat the memory hierarchies the architects worked so hard to perfect. IOW, all your caches are useless and you're unnecessarily wasting 90% of your energy while also getting worse latency and worse throughput (due to memory bus bandwidth constraints). Oops.

These folks seem to be offering something that nobody else does - a feasible, proven way to get out of jail free. I wish them all the success they can get, because all the other currently available architectures are largely unsuitable for high-throughput transformer inference, and they work in spite of, instead of because of, their design.

Peak H100 power consumption is 700W. Average power consumption of the Groq card (from their own website) is 240W. With 576 chips it just doesn't look good. How much is that millisecond perf gain worth to end users?

That said I think their arch is super interesting. I just think that demo was way too hype when the actual system is pretty impractical.


So? They aren't performing the same computation. You can't compare the two. What you can compare is power draw at an equivalent tokens/sec on the same model. But you don't have that number.

I'm just estimating here from public numbers. My point is that the power consumption could be 100W on that workload and the Groq chip could cost $1k, both ridiculously optimistic, and the whole system would still be crazy expensive. H100s will not have latency as fast, but in terms of concurrent users and TCO, I really don't think Groq will be worth it. You could probably get the same concurrent users and throughput with something like 8 H100s. Latency won't be as good, but the price could be much lower.
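
(Just multiplying out the figures quoted in this subthread - 240 W per Groq card, 576 cards, 700 W per H100, and the rough "8 H100s for similar throughput" guess above. Whether that comparison is apples-to-apples is exactly what's disputed just above:)

  # Straight multiplication of the power figures quoted in this subthread.
  # The 8x H100 "similar throughput" figure is a rough guess, not a measurement.
  groq_total_w = 576 * 240   # ~138 kW for the demo-sized deployment
  h100_total_w = 8 * 700     # ~5.6 kW peak for a hypothetical 8x H100 box
  print(groq_total_w, h100_total_w, round(groq_total_w / h100_total_w))  # ~25x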

Why would they want to run it themselves if the TCO didn’t work out

I thought that was par for the course these days.

Operate at a loss. Get a big valuation. Cash out.


Because they'd rather operate at a loss with high revenue than have zero revenue and a loss?

I don't understand why the comments are trash-talking Groq. They are the fastest LLM inference provider by a big margin. Why would they sell their hardware to any other company for any price? Keep it all for themselves and take over the market. 95% of my LLM requests go to Groq these days. The other 5% to Claude Opus. Latency is king.

why don't you stream the results?

You still have to wait for the end of the streamed response until you can continue with your task.

What open source model are you using when you hit groq?

I just benchmarked some perf for some of my larger context window queries last week and groq's API took 1.6 seconds versus 1.8 to 2.2 for OpenAI GPT-3.5-turbo. So, it wasn't much faster. I almost emailed their support to see if I was doing something wrong. Would love to hear any details about your workload or the complexity of your queries.


> 1.6 vs 1.8-2.2 seconds

I believe certain companies would kill for 20% performance improvements on their main product.


"kill", .. why would anyone kill for a fraction of a second in this case? Informed folks know that LLM hosters aren't raking in the big bucks.

They're selling dreams and aspirations, and those are what's driving the funding.


Google has used LMs in search for years (just not trendy LLMs), and search is famously optimized to the millisecond. Visa uses LMs to perform fraud detection every time someone makes a transaction, which is also quite latency sensitive. I'm guessing "informed folks" aren't so informed about the broader market.

OpenAI and Anthropic's APIs are obviously not latency-driven. Same with comparable LLM API resellers like Azure. Most people are likely not expecting tight latency SLOs there. That said, chat experiences (esp. voice ones) would probably be even more valuable if they could react in "human time" instead of with a few seconds' delay.

Integrating specialized hardware that can shave inference to fractions of a second seems like something that could be useful in a variety of latency-sensitive opportunities. Especially if this allows larger language models to be used where traditionally they were too slow.


I wish things were so simple!

Reducing latency doesn't automatically translate to winning the market or even increased revenue. There are tons of other variables such as functionality, marketing, back-office sales deals and partnerships. Lots of times, users can't even tell which service is objectively better (even though you and I have the know how and tools to measure and better know reality).

Unfortunately the technical angle is only one piece of the puzzle.


I have lots of questions about how important latency is, since you may be replacing many minutes or hours of a person's time with what is undoubtedly a quicker response by any measure. This seems like a knee-jerk reaction, assuming latency is as important as it's been in advertising.

I'm not convinced latency matters as much as Groq's material tries to claim it does.


I guess it's tool calling? When you chain LLMs together?

When has latency ever not mattered?

Let alone 'chat' use cases - holding a response up N*1.2 longer than it could be ties up all sorts of other resources up- and downstream.


When it's already faster than I can absorb the response, which for me as an organic brain includes the normal token generation rate of the free tier of ChatGPT.

If I was using them to process far more text, e.g. summarise long documents, or if I was using it as an inline editing assistant, then I'd care more about the speed.


> When it's already faster than I can absorb the response

Streaming a response from a chatbot is only one use-case of LLMs.

I would argue the most interesting applications do not fall into this category.


Number of different use cases (categories) I'd agree; I'm not so sure about use (volume)…

…not yet anyway. Fast moving area, lots of blue water outside the chat interface.


Name one use case where there is a difference between latency of 200 t/s (fireworks.ai mixtral model) and 500 t/s (groq mixtral)? Not throughput and not time to first token, but latency.

Groq model shines at latency, not at the other two.


Depends on your application.

For example, if you're a game company and you want to use LLMs so your players can converse with nonplayer characters in natural language, replacing a multiple-choice conversation tree - you'd want that to be low latency, and you'd want it to be cheap.


But are people really going to do this? The cost here seems prohibitive unless you're doing a subscription type game (and even then I'm not sure). And the kinds of games that benefit from open ended dialogue attract players who just want to pay an upfront cost and have an adventure.

(All of a sudden I'm having nightmares of getting billed for the conversations I have in the single-player game I happen to be enjoying...)

If there is a future with this idea, it's gotta be just shipping the LLM with the game, right?


> If there is a future with this idea, it's gotta be just shipping the LLM with the game, right?

That might be a nice application for this library of mine: https://github.com/Const-me/Cgml/

That’s an open source Mistral ML model implementation which runs on GPUs (all of them, not just nVidia), takes 4.5GB on disk, uses under 6GB of VRAM, and optimized for interactive single-user use case. Probably fast enough for that application.

You wouldn’t want in-game dialogues with the original model though. Game developers would need to finetune, retrain and/or do something else with these weights and/or my implementation.


I understand there are games using LLMs for NPC dialog, yes [1]

> If there is a future with this idea, it's gotta be just shipping the LLM with the game, right?

Depends how high you can let your GPU requirements get :)

[1] https://www.youtube.com/watch?v=Kw51fkRiKZU


FWIW, for other confused readers trying to extract something from that video: it looks like this game [1] is using this stuff. Based solely on the reviews and the gameplay videos (while definitely acknowledging its technically in-development status), it kinda looks like long-term profitability is the least of their concerns here...

EDIT: Watching the videos, I am more and more confused by why this is even desirable. The complexity of dialogue in a game, it seems, needs to match the complexity of the more general possibilities and actions you can undertake in the game itself. Without that, it all just feels like you are in a little chatbot sandbox within the game, even if the dialogue is perfectly "in character." It all seems to feel less immersive with the LLMs.

1. https://store.steampowered.com/app/2240920/Vaudeville/


Absolutely on the mark with this comment. LLMs aren't magical end-goal technology. We have a while to go it seems before they've settled into all the use-cases and we've established what does and doesn't work.

It would probably look like an InfiniteCraft-style model, where conversation possibilities are saved, and new dialogue (edge nodes) is computed as needed.

Small, bounded conversations, with problematic lines trimmed over time, striking a balance between possibility and self-contradiction.

I could see it working really well in a Mass Effect-type game.
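
(A minimal sketch of that caching idea, assuming a hypothetical generate_line() stand-in for whatever LLM backend the game would use, local or API:)

  # Sketch of cached NPC dialogue: reuse generated lines for repeated
  # (npc, game_state, player_line) situations and only hit the LLM for new edges.
  # generate_line() is a hypothetical placeholder, not a real API.
  from typing import Dict, Tuple

  DialogueKey = Tuple[str, str, str]           # (npc_id, game_state, player_line)
  dialogue_cache: Dict[DialogueKey, str] = {}
  banned_lines: set = set()                    # problematic lines trimmed over time

  def generate_line(npc_id: str, game_state: str, player_line: str) -> str:
      # Placeholder for an LLM call (local model or hosted API).
      return f"[{npc_id} responds to '{player_line}' while {game_state}]"

  def npc_reply(npc_id: str, game_state: str, player_line: str) -> str:
      key = (npc_id, game_state, player_line)
      if key not in dialogue_cache or dialogue_cache[key] in banned_lines:
          dialogue_cache[key] = generate_line(npc_id, game_state, player_line)
      return dialogue_cache[key]

  print(npc_reply("innkeeper", "tavern_evening", "Any rumors lately?"))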


Google won search in large part because of their latency. I stopped using local models because of latency. I switched from OpenAI to VertexAI because of latency (and availability)

Model quality matters a ton too. They aren't serving OpenAI or Anthropic models, which are state of the art.

Research suggests most answers and use cases do not require the largest, most sophisticated models. When you start building more complex systems, the overall time increases from chaining, and you can pick different models for different points.

What is the killer-app product of an LLM play ATM that's not a loss leader?

What context did I miss that implies they are using an open source model?

If you go to GroqChat (which is like a demo app), they offer Gemma, Mistral, and LLaMa. These are all open-weights models.

It's not a lot faster for input, but it is something like 10x faster for output (Mixtral vs GPT-3.5). This could enable a completely new mode of interaction with LLMs, e.g. agents.

>> why the comments are trash-talking Groq

they probably bought NVDA stock :)


How do you decide which requests to send to gpt4/opus?

If I was developing an AI app, I'd care about quality first before speed. And the open-source models just aren't as good as the closed ones.

Another casualty of AI KYC.

I'm not able to get consistent replies from the API. It's lightning fast for like ten minutes and then starts freezing up for several seconds.

I want to use it, but it's been very unreliable. I have been using Claude 3 and thinking about together.ai with Mixtral.


Same, it's great when it's quick / available, but they seem underprovisioned for busy times and I often get long 10-30 second stalls.

IMHO, Groq is being shadow acquired by Google

This business model is bound to get attacked and suffer a painful exit soon. Here's why:

First, the whole system-of-chips architecture that everyone is talking about will solve for increasing the overall SRAM available, keeping more model state in super-fast memory and avoiding trips to slow memory.

Secondly, anyone serious about their data (enterprises) won't be okay with making API calls to Groq. Anyone serious about their data who also has a lot of scale (consumer internet) won't be okay with making expensive API calls to Groq at scale either.

Their cloud is attractive only as a place to use their API for experimentation and toy apps, to keep developing in this direction while the rest of the major industry players' system-of-chips architectures catch up and solve the SRAM-size and manufacturing-process bottlenecks; once that's solved, I get more powerful compute for cheaper $$ to deploy on-prem.

So, this cloud strategy is short-lived. I see another pivot on the horizon.


The same has been said of OpenAI for a couple of years now (that they're just a platform to prototype on before moving on to open source models)...

... and yet, they're still leading the field.

I think it's a bit early to think the field is getting commoditized yet.


>> won't be okay with making API calls to Groq

Linked article:

  If customers come with requests for high volumes of chips for very large installations, Groq will instead propose partnering on data center deployment

Totally saw this one coming! [1]

I think one major challenge they'll face is that their architecture is incredibly fast at running the ~10-100B parameter open-source models, but starts hitting scaling issues with state-of-the-art models. They need 10k+ chips for a GPT-4-class model, but their optical interconnect only supports a few hundred chips.

[1] https://www.zach.be/p/why-is-everybody-talking-about-groq


