This is very impressive, but whilst it was very fast with Mixtral yesterday, today I waited 59.44s for a response. If I were to use your API, end-to-end latency would matter much more to me than the Output Tokens Throughput and Time to First Token metrics. Will you also publish average / minimum / maximum end-to-end times?
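For context, here's roughly how I measure it on my side; a minimal sketch where the endpoint URL, payload shape, and streaming format are all placeholders, not your actual API:

```typescript
// Sketch: measuring time-to-first-token (TTFT) vs. end-to-end latency
// for a streaming completion endpoint. Node 18+ (global fetch).
async function measure(url: string, body: unknown): Promise<void> {
  const start = performance.now();
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  const reader = res.body!.getReader();
  let firstChunk: number | null = null;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    if (firstChunk === null && value.length > 0) firstChunk = performance.now();
  }
  const end = performance.now();
  console.log(`time to first chunk: ${((firstChunk ?? end) - start).toFixed(0)} ms`);
  console.log(`end-to-end:          ${(end - start).toFixed(0)} ms`);
}
```

The two numbers can diverge wildly, which is why throughput and TTFT alone don't tell the whole story.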
Cool, glad to hear from an insider about how things are/were run. Do you have any more insight into what your performance budget is/was for a typical request/response cycle? We aim for sub-millisecond response times at the 99th percentile.
They cite "two to 15+ seconds" in this blog post for responses. Via the OpenAI API I've been seeing more like 45-60 seconds for responses (using GPT-3.5-turbo or GPT-4 in chat mode). Note: this is with ~3,500 tokens total.
I've had to extensively adapt to that latency in the UI of our product. Maybe I should start showing funny messages while the user is waiting (like I've seen porkbun do when you pay for domain names).
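Something like this is what I have in mind; just a sketch, with made-up messages and an arbitrary 3-second rotation:

```tsx
import { useEffect, useState } from "react";

const MESSAGES = [
  "Feeding the hamsters…",
  "Reticulating splines…",
  "Still faster than the post office…",
];

// Rotates through light-hearted messages while the real response loads.
function WaitingNotice() {
  const [index, setIndex] = useState(0);
  useEffect(() => {
    const id = setInterval(
      () => setIndex((i) => (i + 1) % MESSAGES.length),
      3000,
    );
    return () => clearInterval(id); // stop rotating on unmount
  }, []);
  return <p>{MESSAGES[index]}</p>;
}
```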
Yeah, but latency is still a factor here. Any follow-up question requires re-scanning the whole context, which often takes a long time. IIRC, when Google showed their demos for this use case, each request took over a minute for ~650k tokens.
Thanks! We try to have cool demos :) We're working on steadily tightening our latency guarantees, but for now we guarantee responses within 20 minutes. In practice it's usually much faster.
By default, Scale returns responses via webhooks/callbacks. We definitely don't want our API to be blocking, so we intentionally designed it to be asynchronous.
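To illustrate the shape of it, a generic sketch of the callback pattern (not our actual endpoints or payloads):

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Accept the task, acknowledge immediately, deliver the result later.
app.post("/tasks", (req, res) => {
  const { input, callbackUrl } = req.body; // caller supplies the webhook URL
  res.status(202).json({ status: "queued" }); // respond before doing the work

  // A real system would hand this to a durable queue, not setImmediate.
  setImmediate(async () => {
    const result = await doWork(input);
    await fetch(callbackUrl, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ status: "completed", result }),
    });
  });
});

async function doWork(input: unknown): Promise<string> {
  return `processed: ${JSON.stringify(input)}`; // stand-in for the slow part
}

app.listen(3000);
```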
I'm trying to experiment with the API, but the response time is always in the 15-25 second range. How are people getting any interesting work done with it?
I see others on the OpenAI dev forum complaining about this too, but no resolution.
But how fast? I see other companies advertising 1.5 second response times for GPT-J, but I now assume that's an average per token, since for, say, a 200-word prompt, response times can be well over a minute during heavy-use periods like weekends, when everyone is hitting their side projects.
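Back-of-envelope, assuming the per-token reading and the common rule of thumb of ~1.3 tokens per English word (both assumptions, not measurements):

```typescript
// If "1.5 seconds" is per generated token rather than per response,
// even a modest reply blows well past a minute.
const words = 50; // a short reply
const tokens = Math.round(words * 1.3); // ≈ 65 tokens
console.log(tokens * 1.5, "seconds"); // ≈ 97.5 s — already over a minute
```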
Couldn't you just hit the MixPanel API after the response has been flushed? That way responses are returned as fast as normal, and you don't increase the complexity of the whole thing by introducing other processes and queues that can become bottlenecks.
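Concretely, something like this; a sketch assuming an Express server, with `trackEvent` standing in for whatever Mixpanel client you use:

```typescript
import express from "express";

const app = express();

app.get("/page", (req, res) => {
  // "finish" fires once the response has been handed off to the OS,
  // so the analytics call never delays the user.
  res.on("finish", () => {
    trackEvent("page_view", { path: req.path }).catch(() => {
      // analytics failures shouldn't affect the request path
    });
  });
  res.send("hello"); // returned at normal speed
});

// Placeholder: POST to your analytics endpoint here.
async function trackEvent(name: string, props: Record<string, string>) {}

app.listen(3000);
```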
What on earth? Our entire app is React+Redux-based and gathers a bunch of unique user data per request from various APIs, and we still get ~40-120ms response times (and that's on CDN cache misses).
Hell, our Node server's connection timeout is 3 seconds, and we only ever hit that due to an API tanking or something.
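(For reference, that's just the stock Node knob; the handler here is a stand-in for the real app:)

```typescript
import http from "http";

const server = http.createServer((req, res) => {
  res.end("ok"); // stand-in for the real request handler
});
server.setTimeout(3000); // destroy sockets that stall for more than 3s
server.listen(8080);
```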
An initial test last night of a trivial growler app (i.e. one method that replies with a simple string) saw a median response time drop from ~200ms to ~80ms.
Not OP, but Athena returns results for most queries in a couple of seconds (the quote is somewhere in the blog post), so it would likely not be fast enough for your typical request/response flows.