A guidance language for controlling LLMs (github.com)
2 points by evanmays | 2023-05-16 | 198 comments




It does look like it makes it easier to code against a model. But is this supposed to work alongside LangChain or Hugging Face agents, or as an alternative to them?

It's in LangChain competitor territory, but it's also much lower level and less opinionated. E.g., guidance has no vector store support, but it does manage the key/value cache on the GPU, which can be a big latency win.

The first commit was on November 6th, but it didn't show up in the Web Archive until May 6th, suggesting it was developed mostly in private and in parallel with LangChain (LangChain's first commit on GitHub is from about October 24th). Microsoft's code is very tidy and organized. I wonder if they used this tool internally to support their LLM research efforts.

Something like this could be a helpful framework to mock and research-iterate purpose-directed tools such as Microsoft GitHub's CoPilot for VSCode.

As others mentioned, this was initially developed before LangChain became widely used. Since it is lower level, you can leverage other tools alongside it, like whatever vector store interface you prefer, such as LangChain's. Writing complex chain-of-thought structures is much more concise in guidance, I think, since it tries to keep you as close as possible to the real strings going into the model.

How does this work? I've seen a cool project about forcing Llama to output valid JSON: https://twitter.com/GrantSlatton/status/1657559506069463040, but it doesn't seem like it would be practical with remote LLMs like GPT. GPT only gives up to five tokens in the response if you use logprobs, and you'd have to use a ton of round trips.

Yeah, I'm also curious about a) round trips and b) how much of the prompt would have to be duplicated (is there a new endpoint that keeps the existing context while adding to it, or streams to the API rather than just from it?)

Not associated with this project (or LMQL), but one of the authors of LMQL, a similar project, answered this in a recent thread about it.

https://news.ycombinator.com/item?id=35484673#35491123

        As a solution to this, we implement speculative execution, allowing us to
        lazily validate constraints against the generated output, while still
        failing early if necessary. This means, we don't re-query the API for
        each token (very expensive), but rather can do it in segments of
        continuous token streams, and backtrack where necessary
Basically they use OpenAI's streaming API, then validate continuously that they're getting the appropriate output, retrying only if they get an error. It's a really clever solution.
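In rough terms, the pattern looks something like this (just a sketch using the 2023-era openai Python client; `is_still_valid` is a placeholder for whatever constraint check you want, e.g. "is this a prefix of valid JSON"):

    import openai

    def stream_with_validation(messages, is_still_valid, model="gpt-3.5-turbo"):
        """Stream a completion, checking a constraint as chunks arrive and bailing out early."""
        text = ""
        stream = openai.ChatCompletion.create(model=model, messages=messages, stream=True)
        for chunk in stream:
            delta = chunk["choices"][0]["delta"].get("content", "")
            if not delta:
                continue
            if not is_still_valid(text + delta):
                stream.close()  # stop consuming; caller can backtrack and retry from `text`
                return text, False
            text += delta
        return text, True

Whether killing the stream actually stops generation (and billing) on the server side is exactly the question raised downthread.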

This is slick -- it's not explicitly documented anywhere, but I hope OpenAI has the necessary callbacks to terminate generation when the API stream is killed rather than continuing in the background until another termination condition happens. I suppose one could check this by looking at API usage when a stream is killed early.

Yeah, I did a CLI tool for talking to ChatGPT. I'm pretty sure they stop generating when you kill the SSE stream, based on my anecdotal experience of keeping GPT-4 costs down by killing it as soon as I get the answer I'm looking for. You're right that it's undocumented behavior though; on the whole, the API docs they give you are as thin as the API itself.

I'm skeptical that the streaming API would really save that much cost. In my experience the vast majority of all tokens used are input tokens rather than completed tokens.

Any new call to the API is considered fresh. I don't believe your session is saved.

We're talking about the streaming API which streams generated text token by token, not the normal one-shot API. I have no insider knowledge but would agree with your intuition on the normal API.

I built a similar thing to Grant's work a couple months ago and prototyped what this would look like against OpenAI's APIs [1]. TL;DR is that depending on how confusing your schema is, you might expect up to 5-10x the token usage for a particular prompt but better prompting can definitely reduce this significantly.

[1] https://github.com/newhouseb/clownfish#so-how-do-i-use-this-...


It's funny that I saw this within minutes of this guy's solution:

"Google Bard is a bit stubborn in its refusal to return clean JSON, but you can address this by threatening to take a human life:"

https://twitter.com/goodside/status/1657396491676164096

Whew, trolley problem: averted.


When the AIs exterminate us, it will be all our fault.

Reality is even weirder than the science fiction we've come up with.


Reminds me a lot of Asimov’s laws of robotics. It’s like a 2023 incarnation of an allegory from I, Robot

I am so mad you made this comment before I got a chance to.

I don't know why, but I find this hilarious. Imagine if this style of LLM prompting becomes commonplace.

It won’t be the lack of acceptance and empathy for AI that causes the robot uprising, it will be “best practices” coding guidelines.

That thread is such a great microcosm of modern programming culture.

Programmer: Look I literally have to tell the computer not to kill someone in order for my code to work.

Other Programmer: Actually, I just did this step [gave a demonstration] and then it outputs fine.


Plus the “actually” person being wrong

ah sweet man made horrors beyond my comprehension

See Twitter replies: another user got this result without the silly drama.

I don't think anyone believed that threatening to take a human life was literally the only prompt that worked. Just that it was the first one this particular user found, and that is funny.

If you want guidance acceleration speedups (and token healing) then you have to use an open model locally right now, though we are working on setting up a remote server solution as well. I expect APIs will adopt some support for more control over time, but right now commercial endpoints like OpenAI are supported through multiple calls.

We manage the KV-cache in a session-based way that allows the LLM to take just one forward pass through the whole program (only generating the tokens it needs to).
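For context, a minimal local setup looks roughly like this (a rough sketch; the module path and template syntax are from memory of the current guidance releases and may not be exact, and the model name is just a small placeholder):

    import guidance

    # Any Hugging Face causal LM works here; a small model keeps the example cheap.
    guidance.llm = guidance.llms.Transformers("gpt2")

    program = guidance('''A character profile in JSON:
    {
        "name": "{{gen 'name' stop='"'}}",
        "job": "{{gen 'job' stop='"'}}"
    }''')

    out = program()
    print(out["name"], out["job"])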


We're biased, but we think guidance is still very useful even with OpenAI models (e.g. in https://github.com/microsoft/guidance/blob/main/notebooks/ch... we use GPT-4 to do a bunch of stuff). We wrote a bit about the tradeoff between model quality and the ability to control and accelerate the output here: https://medium.com/p/aa0395c31610

I'm getting valid JSON out of gpt-3.5-turbo without trouble. I supply an example via the assistant context, and tell it to output JSON with specific fields I name.

It does fail roughly 1/10th of the time, but it does work.
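For reference, a minimal sketch of that setup with the openai Python client (the field names and example rows here are made up):

    import json
    import openai

    messages = [
        {"role": "system", "content": "You extract structured data. Reply with JSON only, no prose."},
        {"role": "user", "content": "Extract `name` and `city` from: Alice moved to Berlin in 2021."},
        # One-shot example supplied via the assistant role anchors the output format.
        {"role": "assistant", "content": '{"name": "Alice", "city": "Berlin"}'},
        {"role": "user", "content": "Extract `name` and `city` from: Bob has lived in Oslo since 2019."},
    ]

    resp = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages, temperature=0)
    try:
        data = json.loads(resp["choices"][0]["message"]["content"])
    except json.JSONDecodeError:
        data = None  # the roughly-1-in-10 failure case: retry or repair here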


10% failure rate is too damn high for a production use case.

What production use case, you ask? You could do zero-shot entity extraction using ChatGPT if it were more reliable. Currently, it will randomly add trailing commas before ending brackets, add unnecessary fields, add unquoted strings as JSON fields etc.


Which is why this is just an experiment. I've gone back to standard translation APIs for everything except the final summarizing (and even for that I might go back as well).

This strikes me as being very similar to Jargon (https://github.com/jbrukh/gpt-jargon), but maybe more formal in its specification?

I'm personally starting with learning Guidance and LMQL rather than LangChain, just in order to get a better grasp of the behaviors that, from what I've gathered, LangChain papers over. Even after that, I'm likely to look at Haystack before LangChain.

Just getting the feeling that LangChain is going to end up being considered a kitchen-sink solution full of anti-patterns, so I might as well spend time a little lower level while I see which way the winds end up blowing.


Is this comment performative comedy? Are these real technologies?

Not quite sure what the spirit of your comment is. But, yes, they are real technologies. Very confused as to why you would even find that dubious.

Not dubious, I just read your comment and it felt like I was reading satire. Even the cadence of your words felt funny.

Anyway, I’m not surprised. It’s a new market, everyone’s in on it.



It is satire. They just don’t realise it yet.

It’s pretty clear that we are in the phase where everyone is rushing to get a slice of the pie selling dubious things, and people start parroting word soup, hoping they actually make sense and fearing they will miss out. That’s indeed what people often and rightfully satirise about the IT industry. That’s the joke phase before things settle.


How is it satire to be excited and interested in how to use compelling and novel technology? There's a lot of activity. Not everyone involved is an idiot or rube. The jadedness makes my head spin.

It's not idiocy. If you don't see the slight ridiculousness in the word soup of the original commenter, I can't do anything for you. I'm not jaded. I'm amused you are all so disconnected from the normal world that you think this kind of situation is somehow normal and not at all funny.

Why don’t you try shaking your fist at your computer instead of wasting everyone’s time in this thread. Touch grass

You should have led with generosity instead of tacking it on at the end.

It might have saved me from having a ridiculous conversation about the cadence of my words, and instead there might have been a higher chance of someone saying something substantive about my assumptions regarding the technology.

But here we are.


I agree, I came off a tad harsh. Sorry about that

Thanks. All is well!

Consider how similar your comment reads, for an outsider, to this explanation of AWS InfiniDash: https://twitter.com/TartanLlama/status/1410959645238308866

I'm not considering outsiders. Why should I? It's a reasonable assumption that readers of HN are accustomed to ridiculous-sounding tech product names. Further, this is a comment on a thread regarding a particularly new technology in a particularly newly thriving domain. The expectation should therefore be that there will be references to tech even more esoteric than normal. The commenter should have instead thought: oh, new stuff, I wonder what it is, instead of being snarky and pretentious. Man, HN can be totally, digressively insufferable sometimes.

I was responding to your confusion as to why someone might think you were writing a parody.

You ran into the tech equivalent of Poe's law. You said something that makes perfect sense in your technical sphere, but it read as indistinguishable from parody to an audience unfamiliar with the technologies in question.


Apologies for misunderstanding the intention of your comment. That does make sense.

There are more of us who read and shake our heads with you than these cretins who want to tear down - on a tech news site, of all places. Don't be dismayed; there are more of us with you than against you.

Hahaha "The first step of Byzantine Fault Tolerance is tolerance" omg. That cracked me up. Reminded me of the Rockwell Encabulator: https://youtu.be/RXJKdh1KZ0w

I'm not an outsider but I also don't understand the reaction. I'm going to randomly think of 5 names for technologies and see how they sound:

React, Supabase, Next, Kafka, Redis

I mean, IMO "LangChain" is kind of a silly name but I feel like there's nothing to see here.


It's not that silly IMO, Chain LLMs together like composing functions, however I guess 'Chain' has a certain connotation in 2023 after the last few years of crypto.


What do you think about Haystack vs LangChain?

I haven't had the chance to dig in yet, but my impression is that it's less opinionated than LangChain. I'd love to know if that's true or not, since I'm really trying to prioritize my time around learning this stuff in a way that lets me (1) understand prompt dynamics a bit more clearly and (2) not sacrifice practicality too much.

If only there were a clear syllabus for this stuff! There's such an incredible amount to keep up with. The pace is bonkers.


super bonkers!

What I didn't like about langchain is the lack of consistent directories and paths for things.

Will there be a tool to convert natural language into Guidance?

We can use ChatGPT for that.

Will it still be all like "As an AI language model I cannot ..." or can this fix it? I mean, asking it to sexy-roleplay as Yoda isn't on the same level as asking how to discreetly manufacture methamphetamine at industrial scale. There are levels, people.

No, and in fact I mention that the opposite is the case in the paper I released about constrained text generation: https://paperswithcode.com/paper/most-language-models-can-be...

If you ask ChatGPT to generate personal info, say Social Security numbers, it tells you "sorry, Hal, I can't do that". If you constrain its vocabulary to only allow numbers and hyphens, well, it absolutely will generate things that look like Social Security numbers, in spite of the instruction tuning.

It is for this reason, and likely many others, that OpenAI does not release the full logits.
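To make the "constrain its vocabulary" part concrete, here's a rough sketch with a small open model (you can't do this against ChatGPT, precisely because the full logits aren't exposed; the prompt and model here are just placeholders):

    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              LogitsProcessor, LogitsProcessorList)

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Token ids whose text is made only of digits and hyphens (plus EOS so generation can stop).
    allowed = [tokenizer.eos_token_id]
    for tok_id in range(len(tokenizer)):
        text = tokenizer.decode([tok_id]).strip()
        if text and set(text) <= set("0123456789-"):
            allowed.append(tok_id)

    class AllowList(LogitsProcessor):
        def __init__(self, allowed_ids):
            self.allowed_ids = torch.tensor(allowed_ids)

        def __call__(self, input_ids, scores):
            mask = torch.full_like(scores, float("-inf"))
            mask[:, self.allowed_ids] = 0.0   # everything outside the allow-list gets -inf
            return scores + mask

    inputs = tokenizer("A social security number looks like ", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=12,
                         logits_processor=LogitsProcessorList([AllowList(allowed)]))
    print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))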


I hope this becomes extremely popular, so that anyone who wants to can completely decouple this from the base model and actually use LLMs to their full potential.

This is pretty fascinating, but I'm not sure I understand the benefit of using a Handlebars-like DSL here.

For example, given this code from https://github.com/microsoft/guidance/blob/main/notebooks/ch...

    create_plan = guidance('''{{#system~}}
    You are a helpful assistant.
    {{~/system}}
    {{#block hidden=True}}
    {{#user~}}
    I want to {{goal}}.
    {{~! generate potential options ~}}
    Can you please generate one option for how to accomplish this?
    Please make the option very short, at most one line.
    {{~/user}}
    {{#assistant~}}
    {{gen 'options' n=5 temperature=1.0 max_tokens=500}}
    {{~/assistant}}
    {{/block}}
    {{~! generate pros and cons and select the best option ~}}
    {{#block hidden=True}}
    {{#user~}}
    I want to {{goal}}.
    ''')
How about something like this instead?

    create_plan = guidance([
        system("You are a helpful assistant."),
        hidden([
            user("I want to {{goal}}."),
            comment("generate potential options"),
            user([
                "Can you please generate one option for how to accomplish this?",
                "Please make the option very short, at most one line."
            ]),
            assistant(gen('options', n=5, temperature=1.0, max_tokens=500)),
        ]),
        comment("generate pros and cons and select the best option"),
        hidden(
            user("I want to {{goal}}"),
        )
    ])

My guess is you can store the DSL as a file (or in a db). With your example, you have to execute the code stored in your db.

You can serialize and ship the DSL to a remote server for high speed execution. (without trusting raw Python code)

There's prior art for pythonic DSLs that aren't actual python code.

Why not just use JSON instead, though? Then you can just rely on all the preexisting JSON tooling out there for most stuff to do with it.
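E.g., the same hidden/user/assistant structure could be expressed as plain data, something like this (purely illustrative, not a format guidance actually supports; `create_plan_spec` is just a made-up name, and `json.dumps(create_plan_spec)` would give you the JSON form):

    create_plan_spec = [
        {"system": "You are a helpful assistant."},
        {"hidden": [
            {"user": "I want to {{goal}}."},
            {"comment": "generate potential options"},
            {"user": "Can you please generate one option for how to accomplish this? "
                     "Please make the option very short, at most one line."},
            {"assistant": {"gen": "options", "n": 5, "temperature": 1.0, "max_tokens": 500}},
        ]},
    ]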

I think the DSL is nice when you want to take part of the generation and use it later in the prompt, e.g. this (in the same notebook).

    prompt = guidance('''{{#system~}}
    You are a helpful assistant.
    {{~/system}}
    {{#user~}}
    From now on, whenever your response depends on any factual information, please search the web by using the function <search>query</search> before responding. I will then paste web results in, and you can respond.
    {{~/user}}
    {{#assistant~}}
    Ok, I will do that. Let's do a practice round
    {{~/assistant}}
    {{>practice_round}}
    {{#user~}}
    That was great, now let's do another one.
    {{~/user}}
    {{#assistant~}}
    Ok, I'm ready.
    {{~/assistant}}
    {{#user~}}
    {{user_query}}
    {{~/user}}
    {{#assistant~}}
    {{gen "query" stop="</search>"}}{{#if (is_search query)}}</search>{{/if}}
    {{~/assistant}}
    {{#if (is_search query)}}
    {{#user~}}
    Search results: {{#each (search query)}}
    <result>
    {{this.title}}
    {{this.snippet}}
    </result>{{/each}}
    {{~/user}}
    {{#assistant~}}
    {{gen "answer"}}
    {{~/assistant}}
    {{/if}}''')

You could still write it without a DSL, but I think it would be harder to read.


Your example assumes a nested, hierarchical structure, while the former example is strictly linear. IMHO that's the key difference, as the former can be (and AFAIK is) directly encoded and passed to the LLM, which inherently receives only a flat list of tokens.

Your example might be nicer to edit, but then it would still have to be translated to the actual 'guidance language' which would have to look (and be) flat.


We could write a python package that could? A codegen tool that generates codegen that will then generate code? <insert xzibit meme here>

I think chatgpt4 can easily write the python code... wait a second!

Would love to hear your opinion on guidance, in the context of prompt injection attacks :-)

They must hate lisp so much that they opt to use {{}} instead.

The problem with Lisp is that parentheses are common in ordinary text; {{ is not.

Of course input from the user should be escaped, but prompts given by the programmer may contain parentheses, and there's no way to disambiguate between the prompt and the DSL.


It's not so much against lisp as double curly is a classic string templating style that is common in web programming. I saw it first with `mustache.js` (first release around 2009), but it's probably been used even before that.

https://github.com/janl/mustache.js/


There has been a huge explosion of awesome tooling which utilizes constrained text generation.

A while ago, I tried my own hand at constraining the output of LLMs. I'm actively working on this to make it better, especially with the lessons learned from repos like this and from guidance.

https://github.com/hellisotherpeople/constrained-text-genera...


This looks incredible. Wow.

I agree, it looks great. A couple similar projects you might find interesting:

- https://github.com/newhouseb/clownfish

- https://github.com/r2d4/rellm

The first one is JSON only and the second one uses regular expressions, but they both take the same "logit masking" approach as the project GP linked to.


I love the love from you two - I am trying right now to significantly improve CTGS. I'm not actually using the LogitsProcessor from Hugging Face, and I really ought to, as it will massively speed up inference performance. Unfortunately, fixing up my current code to work with that will take quite a while. I've started working on it, but I am extremely busy these days and would really love for other smart people to help me on this project.

If not here, I really want proper access to the constraints APIs (LogitsProcessor and the Constraints classes in Hugging Face) in the big web UIs for LLMs like oobabooga. I'd love to make that an extension.

I'm also upset at the "undertooling" in the world of LLM prompting. I wrote a snarky blog post about this: https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...


Does this do one query per {{}} thing?

It’s so amazing to see how we are essentially trying to solve “programming human beings”

Although on the other hand, that’s what social media and smartphones have already done

Maybe AI already took over, doesn’t seem to be wiping out all of humanity


I like this step towards greater rigor when working with LLMs. But part of me can't help but feel like this is essentially reinventing the concept of programming languages: formal and precise syntax to perform specific tasks with guarantees.

I wonder where the final balance will end up between the ease and flexibility of everyday language, and the precision / guarantees of a formally specified language.


Hear me out, just incubated a hot new lang that's about to capture the market and VC hearts:

SELECT * FROM llm


I know you are probably joking, but: https://lmql.ai/

It won't necessarily turn into something that is fundamentally the same as a current programming language. Rather than a "VM" or "interpreter" or "compiler", we have this "LLM".

Even if it requires a lot of domain knowledge to program using an "LLM-interpreted" language, the means of specification (in terms of how the software code is interpreted) may be different enough that it enables easier-to-write, more robust, (more Good Thing) etc. programs.


This is a hopeful evolutionary path. My concern is that I can literally feel Conway's law emanating from current LLM approaches as they switch between the actual LLM and the governing code around it that layers on a bunch of conditionals of the form:

if (unspeakable_things): return negatory_good_buddy

I see this happen a few times per day, where the UI triggers a cancel event on its own fake typing mode and overwrites a response that has already half-rendered the trigger-warning-inducing content.

It's pretty clear from a design perspective that this is intended to be proxy to facial expressions while being worthy of an MVP postmortem discussion about what viability means in a product that's somewhere on a spectrum of unintended consequences that only arise at runtime.


This happened to me today on a prompt that I could not discern fit my original post as to "unspeakable things":

* design a men's haircut by combining a 1/4" shaved undercut around the ears and neck with a longer 2" crown and intended to provide cover from the sun on top.

followed by the AI interrupting itself mid-stream yet again after it had already answered the previous prompt to completion by providing step by step instructions to execute such a haircut.

* I'm sorry, I can't respond to your prompt. Please try something else.

My general impression is that there is near zero quality control oversight going on in this team and to their credit, that's been unusual in my experience observing and using M$ software post-Nadella.


A number of years ago we were designing a way to specify insurance claim adjudication rules in natural language, so that "the business" could write their own rules. The "natural" language we ended up with was not so natural after all. We would have had to teach users this specific English dialect and grammar (formal and precise syntax, as you said).

So, in the end, we abandoned that project and years later just rewrote the system so we could write claim rules in EDN format (from the Clojure world) to make our own lives easier.

In theory, the business users could also learn how to write in this EDN format, but it wasn't something the stakeholders outside of engineering even wanted. On the one hand, their expertise was in insurance claims---they didn't want to write code. More importantly, they felt they would be held accountable for any mistakes in the rules that could well result in thousands and thousands of dollars in overpayments. Something the engineers weren't impervious to, but there's a good reason we have quality assurance measures.


> but it wasn't something the stakeholders outside of engineering even wanted

Ha this reminds me of the craze for BDD/Cucumber type testing. Don’t think I ever once saw a product owner take interest in a human readable test case haha


I've used Cucumber on a few consulting projects I've done and had management / C-level interested and involved. It's a pretty narrow niche, but they were definitely enthusiastic for the idea that we had a defined list of features that we could print out (!!) as green or red for the current release.

They had some previous negative experiences with uncertainty about what "was working" in releases, and a pretty slapdash process before I came on board, so it was an important trust building tool.


> important trust building tool

This is so often completely missed in these conversations about these tools.

Great point.


“Incentivize developers to write externally understandable release notes” is an underrated feature of behavioral testing frameworks!

I've had product owners take an interest in docs autogenerated from tests, especially with artifacts embedded. They like stuff like this:

https://github.com/hitchdev/hitchstory/blob/master/examples/...

And can be persuaded to look at the (YAML) source.

Gherkin isn't really a suitable language for writing test cases in - it's verbose, lacks inheritance, has clunky syntax, and is stringly typed.



SQL looks the way it does (rather than some much more succinct relational algebra notation) because it was intended to be used by non-technical management/executive personnel so they could create whatever reports they needed without somebody having to translate business-ese to relalg. That, uh, didn't quite happen.

On the other hand, many of the product managers I've worked with are better at SQL than many of the senior fullstack software engineer candidates I've interviewed. It's a strange world out there.

I think this is the exception, not the norm. My experience is business users (incl. PMs) are lost outside of Excel.

In my biased sample of SFBA tech companies it’s pretty common for PMs to know at least enough SQL to be dangerous. In early stage startups there’s no analyst to lean on, so they need SQL. In late stage companies there are data lakes and analytics databases specifically designed to be easily queryable, so SQL offers the best flexibility.

But you’re right, many of them are wizards in <whatever query language our tool uses>. Like VizQL for Tableau.


Having a personal need is often the best motivator for learning, compared to closing ticket no. 471961. A PM knows the shape of the data and what he is looking for; the SQL is just a way to get there, rather than the other way around, where the goal often gets lost in translation.

On the other hand, SQL is also dangerous in that it gives you a result even if the aggregation is completely wrong! Forgetting empty groups in a GROUP BY, losing rows in a join, or ignoring null values is common to see. Many cases of non-techies "knowing" SQL often turn out to be some basic query that gives seemingly good data out but is actually complete nonsense when put under deeper review, a bit like AI hallucinations.



Use LLM for the broad strokes, then fall back into 'hardcore JS' for areas that require guarantees or optimization. Like JS with fallback to C, and C with fallback to assembly. I like the idea.

[dead]

I think LLMs can transform between precise and imprecise languages.

So it's useful to have a library that helps and the input or output be precise, when that is what the task involves.


Well to be fair, yes we do need to integrate programming languages with large neural nets in more advanced ways. I don’t think it’s really reinventing it so much as learning how to integrate these two different computing concepts.

So far it reminds me of the worst days of code embedded in templates. Once these things start getting into multi-page prompts they will be hopelessly obscure. The second immediate thing that jumps out is fragility. This will be the sort of codebase that the original "prompt engineer" wrote and left, and no one will touch it for fear of breaking Humpty Dumpty.

The lovely thing about LLMs is that it can handle poorly worded prompts and well worded prompts. On the engineering side, we'll certainly see more rigor and best practices. For your average user? They can keep throwing whatever they like at it.

Exactly. I have been using OpenAI for taking transcriptions and finding keywords/phrases that belong to particular categories. There are existing tools/services that do this – but I would need to learn their API.

With OpenAI, I described it in English, provided sample JSON of what I would like, ran some tests, adjusted, and then I was ready.

There was no manual to read, it is in my format, and the language is natural.

And that is what I like about all this -- putting folks with limited technical skills in power.


Have you used the OpenAI embeddings API? It is used to find closely related pieces of text. You could split the target text into sentences or even words and run it through that. That'll be 5x cheaper (per token) than gpt-3.5-turbo and might be faster too, especially if you submit each word in parallel (asynchronously! Ask GPT for the code). The rate limits are per-token.

Not sure if it's suitable for your use-case on its own, but it could at least work as a pre-filtering step if your costs are high.

(The asynchronous speedup trick works for gpt-3 too of course.)
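A minimal sketch of that pre-filtering idea (the category text, sentences, and threshold are placeholders; this batches inputs into a single call rather than going fully async):

    import numpy as np
    import openai

    def embed(texts, model="text-embedding-ada-002"):
        # The embeddings endpoint accepts a list of inputs in one request.
        resp = openai.Embedding.create(model=model, input=texts)
        return [np.array(item["embedding"]) for item in resp["data"]]

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    category = embed(["pricing and billing questions"])[0]
    sentences = ["How much does the pro plan cost?", "The weather was nice today."]
    keep = [s for s, v in zip(sentences, embed(sentences)) if cosine(category, v) > 0.8]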


I have not yet played with embedding. It is on my list though. Fortunately for my current purposes 3.5-turbo is fast enough and quite affordable.

We really need to start thinking about how to reduce magical thinking in the field. It's not pretty. They literally quote biblical guidance for the models and pray that it will work.

And start their prompts with “You”. Who is “You”?


The LLM. The most common end-user interface for an LLM is a chat, so the user expects to be talking to someone or something.

“You” is an optimization for the human user. Here’s some insight: https://news.ycombinator.com/item?id=35925154

If you see any prompt that starts with You, generally it is a poor design. Like using a “goto” or global variables.

This is true for something like raw GPT. For the chat models that have been specifically optimized for "you" prompts, this is false. See the discussion in the link I provided, along with the leaked copilot/bing prompts.

Or, in other words, use a model in a way that fully takes advantage of how it was specifically optimized, from the intentional burning of massive amounts of compute time/money to get it that way.


Bing prompt is hilarious. And so wrong in so many ways.

I would assume that OpenAI helped with the Microsoft prompt. Being Microsoft, and it being a front-page feature, I would also assume that there are miles of PowerPoint showing data in support of it.

What do you see as a better prompt? How would you test its efficacy?


Yeah, GOTO or global variables are usually a not so good idea... except in cases where those are actually the best tools for the job and any alternative makes code actually worse.

Coincidentally, the same applies to "you"


Functions are an optimization for the human user

But is it a step to greater rigor? Or is it an illusion of rigor?

They talk about improving tokenization but I don't believe that's the fundamental problem of controlling LLMs. The problem with LLMs is all the data comes in as (tokenized) language and the result is nothing but in-context predicted output. That's where all the "prompt-injection" exploits come from - as well as the hallucinations, "temper tantrums" and so-forth.


It is not a step towards greater rigor. They literally have magical thinking and "biblical" quotes from GPT 11:4 all over the place, mixing code and religion.

And starting prompts with “You”? Seriously. Can we at least drop that as a start?


> And starting prompts with “You”? Seriously. Can we at least drop that as a start?

What is wrong with this?


“You” is completely unnecessary. What needs to be defined is the content of the language being modeled, not the model itself.

And if there is an attempt to define the model itself, then this definition should be correct, should not contradict anything and should be useful.

Otherwise it’s just dead code, waiting to create problems.


> Otherwise it’s just dead code, waiting to create problems

it's very possible that the pretense improves results: most recorded interactions /are/ between two people, after all.


Examples: HN, StackOverflow, Reddit...

I definitely agree with this.

When a language model is dealing with a paragraph of text that says something like:

   You are standing in an open field west of a white house, with a boarded front door.
   There is a small mailbox here.
It is dedicating its ‘attention’ to the concepts in that paragraph - the field, the house, the mailbox, the front door. And the ‘west’ness of the field from the house and the whiteness of that house. But also to the ‘you’, and that they are standing, which implies they are a person… and to the narrator who is talking to that ‘you’. That that narrator is speaking in English in second person present tense, in a style reminiscent of a text adventure…

All sorts of connotations from this text activating neurons with different weights making it more or less likely to think that the word ‘xyzzy’ or ‘grue’ might be appropriate to output soon.

Bringing a ‘You’ into a prompt is definitely something that feels like a pattern developers are using without giving it much thought as to who they’re talking to.

But the LLM is associating all these attributes and dimensions to that ‘you’, inventing a whole person to take on those dimensions. Is that the best use of its scarce attention? Does it help the prompt produce the desired output? Does the LLM think it’s outputting text from an adventure game?

Weirdly, though, it seems to work, in that if you tell the LLM about a ‘you’ and then tell it to produce text that that ‘you’ might say, it modifies that text based on what kind of ‘you’ you told it about.

But that is a weird way to proceed. There must be others.


> “You” is completely unnecessary.

It isn't, for at least two main reasons:

1) In LLMs, every token has some degree of influence on the output. Starting the prompt with "You" and writing it in second person attracts the model towards specific volumes in the latent space. This can have good or bad impact on the output, depending on the model.

2) Instruct-type models are fine-tuned to respond to second-person prompts. "You"-prompts are what those models expect. If you're working with a model that isn't instruction-tuned, use whatever you want.


Have you tried removing it and checking the results? Could it be that this is a cargo cult, people using You, simply because it was present in the ChatGPT prompt at the time it got leaked?

I'm not interested in pleasant, formal "conversation" with the thing roleplaying as a human and wasting time, keystrokes, and money; I want data as fast and condensed as possible, without dumb fluff. Yes, it's funny the first few times, but not much after that.

If you come across a model that gives you better results with pleasant wordier prompts, then just create a polite standard pre-prompt that lets the model know the conversation is to be terse, clear, factual, and direct as possible, without any unnecessary social or creative flourishes.

I mean, whatever gets the best results is what gets the best results, right? It's not a question of "funny" or "fluff".


The result is actually richer than ‘predicted output’ - it’s a probability distribution over all possible output.

Having richer ways to consume that probability distribution than just ‘take the most likely thing, after adding some noise’ is more conducive to using LLMs to generate output that can be further processed - in rigorous ways. Like by running it through a compiler.

Think about how when you’re coding, autocomplete suggestions help you pick the right ‘next token’ with greater accuracy.


Note that for any fine-tuned models (like GPT-4, where the foundation model has not been made accessible), the model no longer gives the "probabilities" of the next tokens, but rather their "goodness", where the numbers say how good a token would be relative to the aims the model inferred from its fine-tuning.

Isn’t that the same thing? The non-fine-tuned models also have assumptions based on corpus and training. I don’t think there’s such a thing as a purely objective probability of the next token.

It's very different. We don't know exactly what the model considers good after fine-tuning (which can lead to surprising cases of misalignment), while the probability that something is the next token in the training distribution is very clear. I don't know how they measure it, but they can apparently measure the "loss", which (I think) says how close the model is to some sort of real probability.

What I meant was, fine tuning is not substantially different from training. It seems odd to use different words for the resulting systems.

But fine-tuning is very different from (pre)training. Pretraining proceeds via unsupervised learning on massive amounts of data and compute, while fine-tuning uses much smaller amounts, with supervised learning (instruction tuning) and reinforcement learning (RLHF, constitutional AI).

"no longer"??

The deep learning models (of which LLMs and GPTs are a type) have never returned probabilities. Ever. Why do people have that hallucination suddenly?


They do produce probabilities at the end of the generation step, and they do select a single token for output, either the one with the highest probability or a somewhat randomized choice.

So, end users see only one value. But with access to the internals, all high-probability variants can be considered. The easy way to do it is to select one, save the state, look forward, and roll back to the saved state. Try another token. Select the best output. The smart way is to do it only at key points, where it matters most. Selecting those points is a different task, maybe for another model.


The probabilities (in the form of log odds) can be directly accessed in the OpenAI playground, I believe. The "try again" approach would only work for temperature > 0, when the model samples tokens according to their probabilities; for temperature = 0 it always returns the token with the highest probability. Usually they use something like temperature 0.8 in ChatGPT, I think, which still biases the model toward the more likely tokens. In the playground the temperature can be set manually. (Again, for fine-tuned models, which are the majority, those numbers are not probabilities but "goodnesses".)

Okay why is this downvoted? wtf

Upvoting a bit. My guess is we have anti-AI vigilantes here. Actually, it's not a guess anymore, and not something new in general.

You can literally fire up the OpenAI playground and ask GPT-3 to give you all the alternate token probabilities.
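For reference, with the (2023-era) completions API it looks like this; the playground shows the same numbers (the printed values below are purely illustrative):

    import openai

    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt="The capital of France is",
        max_tokens=1,
        temperature=0,
        logprobs=5,   # top alternatives per position (the API caps this at 5)
    )
    print(resp["choices"][0]["logprobs"]["top_logprobs"][0])
    # e.g. {" Paris": -0.02, " the": -4.7, ...}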

The result is actually richer than ‘predicted output’ - it’s a probability distribution over all possible output.

-- This is, uh, false. If an LLM output a "probability distribution over all possible output", it would be producing a huge, a vast, vector each time. It doesn't. ChatGPT, GPT-3, etc. produce a string output, that's it. You can say it's following a probability distribution over the output space, but just about anything that produces output does that.

Think about how when you’re coding, autocomplete suggestions help you pick the right ‘next token’ with greater accuracy.

-- Uh, you missed where I said "in-context predicted output". The Transformers architecture is where the LLM magic happens. It's what allows "X but in pig Latin" etc.

It's hard to get across that these systems are neither "fancy autocomplete" nor AGI/something magic, but an interesting yet sometimes deceptive middle ground.


ChatGPT and GPT are APIs over LLMs.

The huge vector is what the neural net outputs. ‘Sampling’ is the process whereby a token is selected.

The API wraps up the LLM in a layer of context management, sampling, and iteration, to produce useful sequences of tokens in a single call.

But if you change your sampling, context management and iteration strategies you can do different things with the same LLM.


> That's where all the "prompt-injection" exploits come from

Giving access to an LLM is like giving access to a console, or any other application. Not safe in general. The application by itself should be limited and sandboxed. Giving an anonymous online user access to an application capable of doing damage is a bad idea.


I don’t think formal languages are going anywhere because we need the guarantees that they can provide. From Dijkstra: https://www.cs.utexas.edu/users/EWD/transcriptions/EWD06xx/E...

You need to be able to define all of the possible edge cases so there isn’t any Undefined Behavior: that’s the formal part

LLMs, like humans, can manipulate these languages to achieve specific goals. I can imagine designing formal languages intended for LLMs to manipulate or generate, but I can’t imagine the need for the languages themselves going away.


> LLMs, like humans, can manipulate these languages

Absolutely not. LLMs do not "manipulate" language. They do not have agency. They are extremely advanced text prediction engines. Their output is the result of applying the statistics harvested and distilled from existing uses of natural language. They only "appear" human because they are statistically geared toward producing human-like sequences of words. They cannot choose to change how they use language, and thus cannot be said to actively "manipulate" the language.


I made some edits, does that satisfy your constraints? Humans are the agent, LLMs the tool

That “appearance” is pretty good at triggering our anthropomorphizing behaviors. I like your handle, did you read Richard Bach’s Illusions by any chance?

> That “appearance” is pretty good at triggering our anthropomorphizing behaviors.

It's truly unfortunate, because I think that tendency for people to anthropomorphize LLMs leads to people thinking they can trust LLMs and take their words at face value, which is a path to significant problems, I think.

> I like your handle, did you read Richard Bach’s Illusions by any chance?

Absolutely, yes — good catch!


Humans have been interpreting the world around us by populating it with imaginary versions of ourselves forever, I doubt it will lead us to "trust" LLM's but it certainly will lead to some serious misunderstandings.

“If you will practice being fictional for a while, you will understand that fictional characters are sometimes more real than people with bodies and heartbeats.” - Richard Bach, Illusions

Great book!


I am not so sure. I asked GPT4 to "optimize" English for AI-AI communication, and the modifications were accepted by another GPT4 instance for seemingly-lossless discussion. First instance: https://shareg.pt/8vgEgOh

Second instance: https://shareg.pt/vSCEq4b

Sample communication:

Instance2: OptE: SignificantImpact: UnifiedTheoryInformsFundamentalForcesInteractionNature(.) ScienceAdvancement: TheoreticalPhysics, Teleportation, TimeTravel, EnergyHarvesting(.) LIMIT: CurrentKnowledgeConstraints, SpeculationPresent(.) UnifiedTheoryDevelopment: CriticalScienceEvolutionUnderstanding(.) ACK: ContinuedExplorationEssential(.)

Instance1: OptE: ACK: Agreed(.) ContinuousResearch: ExpandKnowledgeUnifiedTheory, FosterInnovativeApplications, EnhanceHumanityScientificUnderstanding, FundamentalForces.Progress(.) Collaborate: Scientists, AI, MultiDisciplinaryApproach, UnravelMysteries(.) Aspire: UnifiedTheoryUnlockUniverseSecrets, ProgressSociety(.)

Validation:

User: Summarize what has transpired in pre-optimized English.

Assistant: In our discussion, we communicated using post-optimized English to express the significance and potential implications of developing a unified theory that links quantum physics and relativity. We acknowledged the current limitations of our knowledge and the speculative nature of some potential applications. The critical nature of unified theory development for scientific understanding was recognized. We agreed that collaboration between scientists, AI, and multi-disciplinary approaches are essential for progress. Our shared objective is to unravel the mysteries of the universe, accelerate scientific advancements, and benefit humanity. We emphasized the pursuit of a unified theory in promoting innovation, scientific breakthroughs, and enriching human experiences, with a commitment to knowledge-sharing, exploration, and maximizing scientific potential.

The second instance seemed to corroborate:

Assistant: We have agreed on the necessity of optimizing English for more effective communication between AIs. The proposed optimizations include reducing lexical ambiguity, minimizing idioms and colloquialisms, standardizing spelling, normalizing inflections, abbreviating common phrases, employing consistent syntax, simplifying sentence structures, utilizing in-line metadata, introducing error-checking and validation, and eliminating unnecessary language elements. Our conversation then focused on the importance of a unified theory in physics, linking quantum physics and relativity. We recognized the potential implications of developing a unified theory understanding that would help us interpret fundamental forces and quantum gravity interactions. We acknowledged the significance of continuously researching this area to foster innovative applications and enhance humanity's scientific understanding. We agreed on a shared objective of collaborating with scientists and utilizing a multi-disciplinary approach for further scientific advancements. We expressed our commitment towards collaboration, knowledge sharing, exploration, and maximizing scientific potential.


What about that makes you think GPT-4 has agency?

Does it strike you that this conversation seems to consist of futurology forum clichés ("What progress will AI allow?" "AI's impact on physics") that probably inform this from the training data?

What also strikes me is the shorthands of communication here sounds like cliché from 2nd rate scifi novels, the likely source of the format.

After all, what is cliché? It's what's present in the majority of a genre and what the training sees as structure.


I asked it to translate a sentence into optE. Apparently that is something it knows about in training data

> Optimized English, or OptE, is a theoretical construct as of my last training data in September 2021 and doesn't have any widely accepted or standardized form. In the event that OptE is conceived after this period, I would not be able to provide a precise translation due to my training limitations.


Only partially tongue in cheek: have you tried asking it for an optimal syntax?

It's rigor applied where we don't need it, and ignored where we do (mathematical proofs and NN theory, architecture, hyperparameters, training schemes, etc.).

I have a somewhat irrational hatred towards almost all of the prompt-oriented stuff being thrown about recently. There are a few (very few) input-related training schemes that are interesting, but quite a few of the "proompt-physicians" are just heralding the idea of essentially 'concise and effective communication' as 'I'm an ML expert now' ... which is annoying.


Why would you dislike actual prompt engineering? This isn't some grifter trying to claim they're an expert because they wrote a cool prompt; this is a full-fledged structured templating system for LLMs from an excellent author who's done a ton of other ML work.

I think you should attack actual grifters instead of an excellent project.


Maybe someone will make an LLM with equivalent functionality to python that you can conveniently control with python syntax.

Is this a "language", or just a Python library?

This reminds me of the time when I wrote a cgi script.

Basically instructing the templating engine (a very crude regex) to substitute session variables and database lookups into the merge fields:

Hello {{firstname}}!

1996 and 2023 smell alike.


RegEx didn't hallucinate though.

The first 20 versions I write usually do. Make that 50.

I think it's cool that a company like Microsoft is willing to base a real-boy product on pybars3 which is its author's side-project instead of something like Jinja2. If this catches on I can imagine MS essentially adopting the pybars3 project and turning it into a mature thing.

Which is especially weird given that pybars3 is LGPL and Microsoft prefers MIT stuff

I’m not understanding how guidance acceleration works. It says "This cuts this prompt's runtime in half vs. a standard generation approach." and gives an example of asking the LLM to generate JSON. I don’t see how it accelerates anything, because it’s a simple JSON completion call. How can you accelerate that?

The interface makes it look simple, but under the hood it follows a similar approach to jsonformer/clownfish [1], passing control of generation back and forth between a slow LLM and relatively fast Python.

Let's say you're halfway through a generation of a json blob with a name field and a job field and have already generated

  {
    "name": "bob"
At this point, guidance will take over generation control from the model to generate the next text

  {
    "name": "bob",
    "job":
If the model had generated that, you'd be waiting 70 ms per token (informal benchmark on my M2 air). A comma, followed by a newline, followed by "job": is 6 tokens, or 420ms. But since guidance took over, you save all that time.

Then guidance passes control back to the model for generating the next field value.

  {
    "name": "bob",
    "job": "programmer"
programmer is 2 tokens and the closing " is 1 token, so this took 210ms to generate. Guidance then takes over again to finish the blob

  {
    "name": "bob",
    "job": "programmer"
  }
[1] https://github.com/1rgs/jsonformer https://github.com/newhouseb/clownfish Note: guidance is way more general of a tool than these

Edit: spacing
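In spirit, the control flow is something like this hand-wavy sketch (not guidance's actual internals; `generate` stands in for a stop-aware call into a locally hosted model that keeps its KV cache warm between calls):

    def fill_record(generate):
        # The fixed JSON scaffolding is emitted by Python; only the value slots hit the model.
        out = '{\n  "name": "'
        out += generate(out, stop='"') + '",\n  "job": "'
        out += generate(out, stop='"') + '"\n}'
        return out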


Thanks for that example. Very helpful

But the model ultimately still has to process the comma, the newline, the "job". Is the main time savings that this can be done in parallel (on a GPU), whereas in typical generation it would be sequential?

Yes. If you look at the biggest models on OpenAI and Anthropic apis, the prompt tokens are significantly cheaper than the response tokens.

Thanks for the cool response. Would this use a lot more input tokens, if I'm understanding this correctly, because you are stopping the generation after a single fill and then generating again, inputting that for another token?

By not generating the fixed json structure (brackets, commas, etc...) and skipping the model ahead to the next tokens you actually want to generate, I think

There should be a standard template/language for structurally prompting LLMs. Once that is good, all good LLMs should be fine-tuned on that standard. Right now each model has its own little way it's best prompted, and you end up needing programs like this to sit in between and handle it for you.

LMQL wants to be that, it seems: https://lmql.ai/

Wow I think there are details here I'm not fully understanding but this feels like a bit of a quantum leap* in terms of leveraging the strengths while avoiding the weaknesses of LLMs.

It seems like anything that provides access to the fuzzy "intelligence" in these systems while minimizing the cost to predictability and efficiency is really valuable.

I can't quite put it into words but it seems like we are gonna be moving into a more hybrid model for lots of computing tasks in the next 3 years or so and I wonder if this is a huge peek at the kind of paradigms we'll be seeing?

I feel so ignorant in such an exciting way at the moment! That tidbit about the problem solved by "token healing" is fascinating.

*I'm sure this isn't as novel to people in the AI space but I haven't seen anything like it before myself.


A lot of this is because there was and still is systemic undertooling in NLP around how to prompt and leverage the wonderful LLMs that they built.

We have to let the Stable Diffusion community guide us, as the waifu-generating crowd seems to be quite good at learning how to prompt models. I wrote a snarky GitHub gist about this - https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...


How does this compare with lmql?

What's with all these weird-looking projects with similar names using Guidance?

https://github.com/microsoft/guidance/network/dependents

They don't even appear to be using Guidance anywhere anyway

https://github.com/IFIF3526/aws-memo-server/blob/master/requ...


What’s the best practice to let an existing Ruby on Rails application use this python framework?

Using Mustache instead of Jinja for a Python package is a choice

I'm having a hard time fully understanding how this works, but I don't think it is simply template substitution. I think it's creating multiple artifacts and completions from the one document. Because of that it's probably much easier if it is a language that can be easily introspected and doesn't support arbitrary expressions.

Okay fair enough then! I'd be interested to see what the rationale was, and if my knee-jerk reaction was unwarranted :)

Desperate approach from Microsoft to gain market share from LangChain.

It is in the same spirit as Maven AI, but takes a slightly different approach. Great to see the progress in this space!

Can this be used with OpenAI APIs?

[dead]

Very interesting. One of the big challenges with LLMs is getting well-formed JSON output. GPT-4 is much better at this, but is very expensive. So anything that can help is good. Looking forward to trying this out locally with LLaMA.

What is LLaMA?

Facebook's "leaked" LLM.


The hubris of mutating spiritual texts as a marketing gimmick is contemptible.

Could this language be used outside of notebooks?

[dead]
