This is very impressive, but whilst it was very fast with Mixtral yesterday, today I waited 59.44s for a response. If I were to use your API, end-to-end latency would matter much more to me than the Output Tokens Throughput and Time to First Token metrics. Will you also publish average / minimum / maximum end-to-end times?
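For context, here's roughly how I measure it on my side; a minimal sketch where the endpoint URL, payload shape, and streaming format are all placeholders, not your actual API:

```typescript
// Sketch: measuring time-to-first-token (TTFT) vs. end-to-end latency
// for a streaming completion endpoint. Node 18+ (global fetch).
async function measure(url: string, body: unknown): Promise<void> {
  const start = performance.now();
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  const reader = res.body!.getReader();
  let firstChunk: number | null = null;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    if (firstChunk === null && value.length > 0) firstChunk = performance.now();
  }
  const end = performance.now();
  console.log(`time to first chunk: ${((firstChunk ?? end) - start).toFixed(0)} ms`);
  console.log(`end-to-end:          ${(end - start).toFixed(0)} ms`);
}
```

The two numbers can diverge wildly, which is why throughput and TTFT alone don't tell the whole story.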
Cool, glad to hear from an insider about how things are/were run. Do you have any more insight into what your performance budget is/was for a typical request/response cycle? We aim for sub-millisecond response times at the 99th percentile.
They cite "two to 15+ seconds" in this blog post for responses. Via the OpenAI API I've been seeing more like 45-60 seconds for responses (using GPT-3.5-turbo or GPT-4 in chat mode). Note: this is with ~3,500 tokens total.
I've had to extensively adapt to that latency in the UI of our product. Maybe I should start showing funny messages while the user is waiting (like I've seen porkbun do when you pay for domain names).
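Something like this is what I have in mind; just a sketch, with made-up messages and an arbitrary 3-second rotation:

```tsx
import { useEffect, useState } from "react";

const MESSAGES = [
  "Feeding the hamsters…",
  "Reticulating splines…",
  "Still faster than the post office…",
];

// Rotates through light-hearted messages while the real response loads.
function WaitingNotice() {
  const [index, setIndex] = useState(0);
  useEffect(() => {
    const id = setInterval(
      () => setIndex((i) => (i + 1) % MESSAGES.length),
      3000,
    );
    return () => clearInterval(id); // stop rotating on unmount
  }, []);
  return <p>{MESSAGES[index]}</p>;
}
```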
Yeah, but latency is still a factor here. Any follow-up question requires re-scanning the whole context, which often takes a long time. IIRC, when Google showed their demos for this use case, each request took over a minute for ~650k tokens.
Thanks! We try to have cool demos :) We're working on steadily tightening our latency guarantees, but for now we guarantee responses within 20 minutes. In practice it's usually much faster.
By default, Scale returns responses via webhooks/callbacks. We definitely don't want our API to be blocking, so we intentionally designed it to be asynchronous.
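To illustrate the shape of it, a generic sketch of the callback pattern (not our actual endpoints or payloads):

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Accept the task, acknowledge immediately, deliver the result later.
app.post("/tasks", (req, res) => {
  const { input, callbackUrl } = req.body; // caller supplies the webhook URL
  res.status(202).json({ status: "queued" }); // respond before doing the work

  // A real system would hand this to a durable queue, not setImmediate.
  setImmediate(async () => {
    const result = await doWork(input);
    await fetch(callbackUrl, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ status: "completed", result }),
    });
  });
});

async function doWork(input: unknown): Promise<string> {
  return `processed: ${JSON.stringify(input)}`; // stand-in for the slow part
}

app.listen(3000);
```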
I'm trying to experiment with the API, but the response time is always in the 15-25 second range. How are people getting any interesting work done with it?
I see others on the OpenAI dev forum complaining about this too, but no resolution.
But how fast? I see other companies advertising 1.5 second response times for GPT-J, but I now assume that's an average per token, since for, say, a 200-word prompt, response times can be well over a minute during heavy-use periods like weekends, when everyone is hitting their side projects.
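Back-of-envelope, assuming the per-token reading and the common rule of thumb of ~1.3 tokens per English word (both assumptions, not measurements):

```typescript
// If "1.5 seconds" is per generated token rather than per response,
// even a modest reply blows well past a minute.
const words = 50; // a short reply
const tokens = Math.round(words * 1.3); // ≈ 65 tokens
console.log(tokens * 1.5, "seconds"); // ≈ 97.5 s — already over a minute
```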
Couldn't you just hit the MixPanel API after the response has been flushed? That way responses are returned as fast as normal, and you don't increase the complexity of the whole thing by introducing other processes and queues that can become bottlenecks.
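Concretely, something like this; a sketch assuming an Express server, with `trackEvent` standing in for whatever Mixpanel client you use:

```typescript
import express from "express";

const app = express();

app.get("/page", (req, res) => {
  // "finish" fires once the response has been handed off to the OS,
  // so the analytics call never delays the user.
  res.on("finish", () => {
    trackEvent("page_view", { path: req.path }).catch(() => {
      // analytics failures shouldn't affect the request path
    });
  });
  res.send("hello"); // returned at normal speed
});

// Placeholder: POST to your analytics endpoint here.
async function trackEvent(name: string, props: Record<string, string>) {}

app.listen(3000);
```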
What on earth? Our entire app is React+Redux-based and gathers a bunch of unique user data per request from various APIs, and we still get ~40-120ms response times (and that's on CDN cache misses).
Hell, our Node server's connection timeout is 3 seconds, and we only ever hit that due to an API tanking or something.
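(For reference, that's just the stock Node knob; the handler here is a stand-in for the real app:)

```typescript
import http from "http";

const server = http.createServer((req, res) => {
  res.end("ok"); // stand-in for the real request handler
});
server.setTimeout(3000); // destroy sockets that stall for more than 3s
server.listen(8080);
```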
An initial test last night of a trivial growler app (i.e. one method that replies with a simple string) saw a median response time drop from ~200ms to ~80ms.
Not OP, but Athena returns results for most queries in a couple of seconds (the quote is somewhere in the blog post), so it would likely not be fast enough for your typical request/response flows.