Hacker Read

laserbeam | karma 2084 | avg karma 5.33 · 2024-04-08 04:56:58

> 1.6 vs 1.8-2.2 seconds

I believe certain companies would kill for 20% performance improvements on their main product.

metadat | karma 19104 | avg karma 3.2 · 2024-04-08 05:12:29

"kill", .. why would anyone kill for a fraction of a second in this case? Informed folks know that LLM hosters aren't raking in the big bucks.

They're selling dreams and aspirations, and those are what's driving the funding.

reply

vineyardmike | karma 7130 | avg karma 2.33 · 2024-04-08 07:56:33

Google has used LMs in search for years (just not trendy LLMs), and search is famously optimized to the millisecond. Visa uses LMs to perform fraud detection every time someone makes a transaction, which is also quite latency sensitive. I'm guessing "informed folks" aren't so informed about the broader market.

OpenAI and Anthropic's APIs are obviously not latency-driven. Same with comparable LLM API resellers like Azure. Most people are likely not expecting tight latency SLOs there. That said, chat experiences (esp. voice ones) would probably be even more valuable if they could react in "human time" instead of with few seconds delay.

Integrating specialized hardware that can shave inference to fractions of a second seems like something that could be useful in a variety of latency-sensitive opportunities. Especially if this allows larger language models to be used where traditionally they were too slow.

reply

metadat | karma 19104 | avg karma 3.2 · 2024-04-09 00:52:34

I wish things were so simple!

Reducing latency doesn't automatically translate to winning the market or even increased revenue. There are tons of other variables such as functionality, marketing, back-office sales deals and partnerships. Lots of times, users can't even tell which service is objectively better (even though you and I have the know how and tools to measure and better know reality).

Unfortunately the technical angle is only one piece of the puzzle.

reply

gpapilion | karma 1022 | avg karma 2.7 · 2024-04-08 05:41:59

I have lots of questions about how important latency is since you may be replacing many minutes or hours of a person’s time with undoubtedly a quicker response by any measure. This seems like a knee jerk reaction assuming latency is as important as it’s been with advertising.

I’m not convinced latency matters as much as groqs material tries to claim it does.

reply

frozenport | karma 2997 | avg karma 1.05 · 2024-04-08 06:07:47

I guess its tool calling? When you chain the LLMs together?

w-ll | karma 1868 | avg karma 2.85 · 2024-04-08 06:11:44

When has latency ever not mattered?

Let alone 'chat' use cases, but holding a reponse up for N*1.2 longer than it could holds all sorts of other resources up/down stream.

reply

ben_w | karma 20467 | avg karma 1.69 · 2024-04-08 08:03:17

When it's already faster than I can absorb the response, which for me as an organic brain includes the normal token generation rate of the free tier of ChatGPT.

If I was using them to process far more text, e.g. summarise long documents, or if I was using it as an inline editing assistant, then I'd care more about the speed.

reply

qeternity | karma 8257 | avg karma 3.52 · 2024-04-08 08:16:03

> When it's already faster than I can absorb the response

Streaming a response from a chatbot is only one use-case of LLMs.

I would argue the most interesting applications do not fall into this category.

reply

ben_w | karma 20467 | avg karma 1.69 · 2024-04-08 09:17:05

Number of different use cases (categories) I'd agree; I'm not so sure about use (volume)…

…not yet anyway. Fast moving area, lots of blue water outside the chat interface.

reply

boroboro4 | karma 23 | avg karma 1.28 · 2024-04-08 20:41:31

Name one use case where there is a difference between latency of 200 t/s (fireworks.ai mixtral model) and 500 t/s (groq mixtral)? Not throughput and not time to first token, but latency.

Groq model shines at latency, not at the other two.

reply

michaelt | karma 31037 | avg karma 4.1 · 2024-04-08 09:11:16

Depends on your application.

For example, if you're a game company and you want to use LLMs so your players can converse with nonplayer characters in natural language, replacing a multiple-choice conversation tree - you'd want that to be low latency, and you'd want it to be cheap.

reply

beepbooptheory | karma 1930 | avg karma 1.98 · 2024-04-08 11:49:15

But are people really going to do this? The cost here seems prohibitive unless you're doing a subscription type game (and even then I'm not sure). And the kinds of games that benefit from open ended dialogue attract players who just want to pay an upfront cost and have an adventure.

(All the sudden having nightmares of getting billed for the conversations I have in the single player game I happen to be enjoying...)

If there is a future with this idea, its gotta be just shipping the LLM with game right?

reply

Const-me | karma 5797 | avg karma 1.88 · 2024-04-08 13:13:23

> If there is a future with this idea, its gotta be just shipping the LLM with game right?

That might be a nice application for this library of mine: https://github.com/Const-me/Cgml/

That’s an open source Mistral ML model implementation which runs on GPUs (all of them, not just nVidia), takes 4.5GB on disk, uses under 6GB of VRAM, and optimized for interactive single-user use case. Probably fast enough for that application.

You wouldn’t want in-game dialogues with the original model though. Game developers would need to finetune, retrain and/or do something else with these weights and/or my implementation.

reply

michaelt | karma 31037 | avg karma 4.1 · 2024-04-08 14:35:29

I understand there are games using LLMs for NPC dialog, yes [1]

> If there is a future with this idea, its gotta be just shipping the LLM with game right?

Depends how high you can let your GPU requirements get :)

[1] https://www.youtube.com/watch?v=Kw51fkRiKZU

reply

beepbooptheory | karma 1930 | avg karma 1.98 · 2024-04-08 14:46:32

FWIW to confused others, trying to extract something from that video, it looks like this game [1] is using this stuff. Based solely on the reviews and the game play videos (while definitely acknowledging its technically in development status), it kinda looks like long term profitability is the least of their concerns here...

EDIT: Watching the videos, I am more and more confused by why this is even desirable. The complexity of dialogue in a game, it seems, needs to match the complexity of the more general possibilities and actions you can undertake in the game itself. Without that, it all just feels like you are in a little chatbot sandbox within the game, even if the dialogue is perfectly "in character." It all seems to feel less immersive with the LLMs.

1. https://store.steampowered.com/app/2240920/Vaudeville/

reply

razodactyl | karma 333 | avg karma 1.08 · 2024-04-09 16:58:54

Absolutely on the mark with this comment. LLMs aren't magical end-goal technology. We have a while to go it seems before they've settled into all the use-cases and we've established what does and doesn't work.

sdwr | karma 1719 | avg karma 1.81 · 2024-04-09 18:19:45

It would probably look like an InfiniteCraft-style model, where conversation possibilities are saved, and new dialogue (edge nodes) is computed as needed.

Small, bounded conversations, with problematic lines trimmed over time, striking a balance between possibility and self-contradiction.

I could see it working really well in a Mass Effect-type game.

reply

verdverm | karma 6501 | avg karma 0.84 · 2024-04-08 09:31:55

Google won search in large part because of their latency. I stopped using local models because of latency. I switched from OpenAI to VertexAI because of latency (and availability)

robrenaud | karma 2293 | avg karma 4.44 · 2024-04-08 05:46:17

Model quality matters a ton too. They aren't serving OpenAI or Anthropic models, which are state of the art.

verdverm | karma 6501 | avg karma 0.84 · 2024-04-08 09:33:57

Research suggest most answers and use cases do not require the largest, most sophisticated models. When you start building more complex systems, the overall time increases from chaining and you can pick different models for different points

cyanydeez | karma 1718 | avg karma 0.75 · 2024-04-08 20:06:12

What is the killer app product of a LLM Play ATM that's not a lossleader?