Outlines guided generation #1539

Merged
merged 34 commits into from
Feb 15, 2024

drbh
Collaborator

@drbh drbh commented Feb 8, 2024

This WIP PR starts to add grammar support via outlines. Currently it supports only very simple regex grammars and does not precompile or cache grammar FSMs.

todo:

  • add simple outlines guidance to NextTokenChooser
  • update protos for grammar
  • update generation params API
  • constrain simple grammar
  • support parsing more complex grammar into fsm
  • support all grammar types supported by outlines
  • explore optimizations to avoid recompiling grammars
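For readers new to the idea, here is a minimal, self-contained sketch of FSM-constrained decoding. It is not the outlines or TGI implementation: the tiny character-level automaton (for the pattern `ab+c`), the toy vocabulary, and the helpers `walk`, `mask_logits`, and `guided_greedy_decode` are assumptions made purely for illustration. The names `allowed_token_ids` and `advance` echo calls quoted in the review comments further down, but the signatures used here are hypothetical.

```python
import math

# Hypothetical character-level automaton for the pattern `ab+c`.
# TRANSITIONS[state][char] gives the next state; state 3 is accepting.
TRANSITIONS = {0: {"a": 1}, 1: {"b": 2}, 2: {"b": 2, "c": 3}, 3: {}}
FINAL_STATE = 3

# Toy vocabulary; a token may span several characters.
VOCAB = ["a", "b", "bb", "c", "ab", "bc", "x"]


def walk(state, text):
    """Follow `text` character by character; return None if it leaves the automaton."""
    for ch in text:
        state = TRANSITIONS.get(state, {}).get(ch)
        if state is None:
            return None
    return state


def allowed_token_ids(state):
    """Token ids whose full text can be consumed starting from `state`."""
    return [i for i, tok in enumerate(VOCAB) if walk(state, tok) is not None]


def advance(token_id, state):
    """Return the automaton state reached after emitting `token_id`."""
    next_state = walk(state, VOCAB[token_id])
    if next_state is None:
        raise ValueError(f"token {VOCAB[token_id]!r} is not allowed in state {state}")
    return next_state


def mask_logits(logits, state):
    """Set the logits of disallowed tokens to -inf so they can never be picked."""
    allowed = set(allowed_token_ids(state))
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def guided_greedy_decode(logits_per_step, state=0):
    """Greedy decoding where each step is restricted by the automaton."""
    pieces = []
    for logits in logits_per_step:
        if state == FINAL_STATE:
            break
        masked = mask_logits(logits, state)
        token_id = max(range(len(masked)), key=masked.__getitem__)
        pieces.append(VOCAB[token_id])
        state = advance(token_id, state)
    return "".join(pieces), state
```

Even if the unconstrained model strongly prefers the token `"x"` at every step, the mask forces the output to stay inside the pattern.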

guided request

curl -s 'http://localhost:3000/generate' \
--header 'Content-Type: application/json' \
--data-raw '{
    "inputs": "make an email for david: \n",
    "parameters": {
        "max_new_tokens": 6,
        "grammar": "[\\w-]+@([\\w-]+\\.)+[\\w-]+"
    }
}' | jq

response

{
  "generated_text": "[email protected]"
}

unguided request

curl -s 'http://localhost:3000/generate' \
--header 'Content-Type: application/json' \
--data '{
    "inputs": "make an email for david: \n",
    "parameters": {
        "max_new_tokens": 6
    }
}' | jq

response

{
  "generated_text": "    email = 'david"
}

@drbh drbh force-pushed the outlines-guided-generation branch 2 times, most recently from a0c8b9a to d4de402 Compare February 8, 2024 17:59
@drbh
Collaborator Author

drbh commented Feb 8, 2024

updates:

  • support parsing more complex grammar into fsm
  • support all (serializable) grammar types supported by outlines
  • explore optimizations to avoid recompiling grammars (rely on lru_cache)
  • add grammar to all token choosers (should work with all/most model architectures)
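As a rough illustration of the "rely on lru_cache" item, the sketch below caches an expensive compile step by its grammar string. It is only a sketch under assumptions: `compile_grammar` and `cache_key_for_schema` are hypothetical names, and Python's `re.compile` stands in for the real FSM build, which this thread describes as far more expensive.

```python
import functools
import json
import re

compile_calls = 0  # counts real compilations so the cache effect is visible


@functools.lru_cache(maxsize=32)
def compile_grammar(grammar: str) -> re.Pattern:
    """Stand-in for the expensive FSM build; here it only compiles a regex."""
    global compile_calls
    compile_calls += 1
    return re.compile(grammar)


def cache_key_for_schema(schema: dict) -> str:
    """Serialize a JSON schema deterministically.

    lru_cache needs hashable arguments, so a dict schema has to become a
    string first; sorting keys makes equal schemas share one cache entry.
    """
    return json.dumps(schema, sort_keys=True, separators=(",", ":"))


email = r"[\w-]+@([\w-]+\.)+[\w-]+"
compile_grammar(email)  # first call pays the compile cost
compile_grammar(email)  # second call is served from the cache
```

The cache only helps when the same grammar is sent again, so the first request with a new grammar still pays the full cost.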

JSON schemas are supported and can be used like:

curl -s 'http://localhost:3000/generate' \
--header 'Content-Type: application/json' \
--data '{
    "inputs": "info: david holtz like trees and has two cats. ",
    "parameters": {
        "max_new_tokens": 100,
        "grammar": {
            "$id": "https://example.com/person.schema.json",
            "$schema": "https://json-schema.org/draft/2020-12/schema",
            "title": "Person",
            "type": "object",
            "properties": {
                "firstName": {
                    "type": "string",
                    "description": "The person'\''s first name."
                },
                "lastName": {
                    "type": "string",
                    "description": "The person'\''s last name."
                },
                "hobby": {
                    "description": "The person'\''s hobby.",
                    "type": "string"
                },
                "numCats": {
                    "description": "The number of cats the person has.",
                    "type": "integer",
                    "minimum": 0
                }
            },
            "required": ["firstName", "lastName", "hobby", "numCats"]
        }
    }
}' | jq .

response

{
  "generated_text": "{\"firstName\": \"david\", \"hobby\": \"trees\", \"lastName\": \"holtz\", \"numCats\": 2}"
}

regex strings are still supported as well

curl --location 'http://localhost:3000/generate' \
--header 'Content-Type: application/json' \
--data-raw '{
    "inputs": "name: david. email:  ",
    "parameters": {
        "max_new_tokens": 20,
        "grammar": "[\\w-]+@([\\w-]+\\.)+[\\w-]+"
    }
}'
{
  "generated_text":"[email protected]_number_1.1234567890.phone_"
}

notes:

Building the FSM from the grammar is very computationally expensive and is required for the first generation. Wait times can be ~10 seconds with complex grammars; this performance impact (along with other things) needs to be taken into account before adding this feature.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@drbh drbh force-pushed the outlines-guided-generation branch from 370e47f to cadb0a9 Compare February 10, 2024 03:22
prefill_tokens_indices[
out_start_index : out_end_index - 1
] = batch.input_ids[start_index + 1 : start_index + out_length]
prefill_tokens_indices[out_start_index : out_end_index - 1] = (
Collaborator

okay we're going to need to force lint everything with some standard instead of using each other's editor's default :)

Collaborator Author

ahh I totally agree we should standardize our formatters. Currently I'm using Black out of the box. Is there an existing config I should use?

Collaborator

Black is fine, we just need to enforce it repo wide in the CI. (We need to pin a revision though, black is not great at backward compatibility)

proto/generate.proto Outdated Show resolved Hide resolved
Collaborator

@Narsil Narsil left a comment

Nice PR overall.

I've seen weird issues with the grammar leading out of regex returns (I'm guessing it has to do with the llama hack).

server/text_generation_server/utils/tokens.py Outdated Show resolved Hide resolved
server/text_generation_server/utils/tokens.py Outdated Show resolved Hide resolved
Comment on lines 401 to 403
self.fsm_grammar_states[i] = self.grammar_processor.advance(
next_ids[i].item(), self.fsm_grammar_states[i], self.grammars[i]
)
Collaborator

The whole point of HeteregenousProcessor is to process all the tokens at once, without making any CPU calls.

This .item() defeats the purpose.

I don't have any good ideas aside from moving this bit to the CPU loop in causal_lm/ flash_causal_lm.
Which is probably error prone.

@OlivierDehaene Do you have better suggestions ?

server/text_generation_server/utils/tokens.py Outdated Show resolved Hide resolved
@Narsil
Collaborator

Narsil commented Feb 12, 2024

explore optimizations to avoid recompiling grammars (rely on lru_cache)

Seems to work great.
Let's roll as-is. Regarding compilation performance, we'll investigate later how to optimize this. The ideal situation would be to use the tokenization workers to do the compilation and send the compiled objects directly to the Python backend (so we don't overload Python with CPU cycles, which would slow down the GPUs).

We can also let server owners disable grammar as a simple way to keep complexity low and latency highly predictable.


fsm = self.compile_fsm(grammars[i], self.tokenizer)
allowed_tokens = fsm.allowed_token_ids(fsm_grammar_states[i])
mask = torch.full((logits.shape[-1],), -math.inf, device=self.device)
Collaborator

We could generate that at the start of the loop and reuse it over and over, no?

Collaborator

Why is this resolved? Did you resolve and forget to push, maybe?

Collaborator Author

@Narsil I think I may have misunderstood the original comment, I moved the fsm creation into the HeterogeneousGrammarLogitProcessor.__init__ and access self.fsms[i] in the HeterogeneousGrammarLogitProcessor.__call__.

This should reduce the number of times the fsm is generated to the number of times HeterogeneousNextTokenChooser is initialized instead of on each call.

Is there a different location I should move the fsm compilation to? Thank you!

Collaborator

mask is being generated over and over.
Reusing the mask seems better here (this kind of stuff makes a difference unfortunately).

Just create the mask once, and fill_ it to clean its values.
To be fair, resetting the values of the mask is costly anyway, so maybe this is premature optimization, but allocation in a loop is really an easy one to fix.

For reference, I won 2ms/token on the mamba + cuda graphs by moving the n_layers tensors into a single tensor so that the copies would be a single kernel launch (that's how bad launching any single op is).
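A minimal sketch of the "create the mask once, then reset it" suggestion. To stay dependency-free it uses a plain Python list in place of a tensor, so it only shows the allocation pattern, not GPU behaviour; with torch the in-place reset would presumably be the `fill_` call mentioned above. `ReusableMask` and `apply_mask` are hypothetical names, not code from this PR.

```python
import math


class ReusableMask:
    """Holds one preallocated mask buffer that is reset in place on every step."""

    def __init__(self, vocab_size: int):
        # Allocated once, outside the decoding loop.
        self.values = [-math.inf] * vocab_size

    def fill_allowed(self, allowed_token_ids):
        """Reset the buffer to -inf, then open up only the allowed token ids."""
        for i in range(len(self.values)):
            self.values[i] = -math.inf
        for token_id in allowed_token_ids:
            self.values[token_id] = 0.0
        return self.values


def apply_mask(logits, mask_values):
    """Add the mask to the logits; disallowed positions become -inf."""
    return [logit + m for logit, m in zip(logits, mask_values)]
```

The buffer returned by `fill_allowed` is always the same object, so nothing is allocated inside the loop; the cost that remains is the in-place reset noted in the comment.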

@drbh drbh force-pushed the outlines-guided-generation branch from cadb0a9 to fc689e0 Compare February 12, 2024 15:44
@drbh drbh marked this pull request as ready for review February 13, 2024 01:56
server/text_generation_server/utils/tokens.py Outdated Show resolved Hide resolved
server/text_generation_server/models/flash_causal_lm.py Outdated Show resolved Hide resolved
Comment on lines 515 to 516
except json.JSONDecodeError:
pass
Member

Is this a try catch to support schemas that are already a regex? If so can you add a comment?
It could be possible to move this compile to the router with PyO3 if this takes too much time.

Collaborator Author

Yes it is, and I just added a comment in the latest commit. I agree we should move the compilation out of the server and into the router.

I like the idea of using PyO3 and it seems relatively straightforward (glanced at the docs). I'll start a new PR moving that logic as a follow-up.
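For context, the fallback being discussed can be sketched roughly as follows. This is an assumption about the shape of the logic, not the code under review: `classify_grammar` is a hypothetical helper that tries to parse the grammar as JSON and otherwise treats it as a regex.

```python
import json


def classify_grammar(grammar: str):
    """Guess whether a grammar string is a JSON schema or a regex.

    Strings that parse as a JSON object are treated as schemas; anything
    else, including strings that fail to parse, is treated as a regex.
    """
    try:
        parsed = json.loads(grammar)
    except json.JSONDecodeError:
        # Not valid JSON, so assume the caller sent a regex.
        return ("regex", grammar)
    if isinstance(parsed, dict):
        return ("json", parsed)
    # Valid JSON but not an object (for example "[1]"), which is ambiguous.
    return ("regex", grammar)
```

The last branch shows why guessing is fragile: a regex such as `[1]` is also valid JSON, so the caller's intent cannot be recovered from the string alone.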

Collaborator

While doing that, maybe we should ask for more explicitness from the users, and force the grammar type to be specified.

That would avoid try/except abuse and probably make error messages more readable. Wdyt?

Collaborator

{"grammar": {"type": "regex", "content": "..."}}
{"grammar": {"type": "json", "content": "..."}}

Also higher level API might be even better to expose to users: https://www.anyscale.com/blog/anyscale-endpoints-json-mode-and-function-calling-features
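A sketch of what validating such an explicitly typed grammar could look like. It is only illustrative: `parse_typed_grammar` is a hypothetical helper, the error strings are made up, and Python's `re.compile` is used as a stand-in syntax check. The key name is a parameter because this comment proposes `content`, while the payloads later in the thread use `value`.

```python
import json
import re

SUPPORTED_GRAMMAR_TYPES = ("json", "regex")


def parse_typed_grammar(grammar, content_key="content"):
    """Validate an explicitly typed grammar and return (type, content)."""
    if not isinstance(grammar, dict):
        raise ValueError("grammar must be an object with a type and a content field")
    if "type" not in grammar:
        raise ValueError("grammar: missing field 'type'")
    grammar_type = grammar["type"]
    if grammar_type not in SUPPORTED_GRAMMAR_TYPES:
        raise ValueError(f"grammar: unknown type {grammar_type!r}")
    if content_key not in grammar:
        raise ValueError(f"grammar: missing field {content_key!r}")
    content = grammar[content_key]

    if grammar_type == "json":
        # Accept the schema either as an object or as a JSON-encoded string.
        if isinstance(content, str):
            try:
                content = json.loads(content)
            except json.JSONDecodeError as err:
                raise ValueError(f"grammar: invalid JSON schema: {err}") from err
        if not isinstance(content, dict):
            raise ValueError("grammar: JSON schema must be an object")
        return grammar_type, content

    # grammar_type == "regex"
    if not isinstance(content, str):
        raise ValueError("grammar: regex must be a string")
    try:
        re.compile(content)  # syntax check only, using Python's regex dialect
    except re.error as err:
        raise ValueError(f"grammar: invalid regex: {err}") from err
    return grammar_type, content
```

Because the type is stated up front, each failure maps to one specific message instead of a swallowed exception.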

Collaborator

It's okay to leave things for future PRs, but nothing that we tag in a revision can be removed later. So all the surface must be correct for us to do a release.

@drbh
Collaborator Author

drbh commented Feb 13, 2024

can be run with the --grammar-support cli flag or GRAMMAR_SUPPORT=true env variable

**update: --grammar-support is deprecated in favor of --disable-grammar-support. Grammar support is available by default.

text-generation-launcher \
  --model-id HuggingFaceH4/zephyr-7b-beta \
  --grammar-support

Requests

With grammar support:

**updated snippet

curl -s 'http://localhost:3000/generate' \
--header 'Content-Type: application/json' \
--data '{
    "inputs": "[INST]convert to JSON: I saw a puppy a cat and a raccoon during my bike ride in the park [/INST]",
    "parameters": {
        "max_new_tokens": 200,
        "repetition_penalty": 1.3,
        "grammar": {
            "type": "json",
            "value": {
                "properties": {
                    "location": {
                        "type": "string"
                    },
                    "activity": {
                        "type": "string"
                    },
                    "animals_seen": {
                        "type": "integer",
                        "minimum": 1,
                        "maximum": 5
                    },
                    "animals": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        }
                    }
                },
                "required": ["location", "activity", "animals_seen", "animals"]
            }
        }
    }
}' | jq .
{
  "generated_text": "{\n\"activity\": \"biking\",\n\"animals\": [\"puppy\",\"cat\",\"raccoon\"]\n  , \"animals_seen\": 3,\n   \"location\":\"park\"}"
}

Without grammar support:

If support is not toggled on then sending a grammar in the parameters will result in

{
  "error": "Input validation error: grammar is not supported",
  "error_type": "validation"
}
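The behaviour described above can be sketched as a small validation step. This is a hedged illustration only: the real check lives in the router, and `validate_grammar_parameter` and its `grammar_support_enabled` argument are hypothetical names; only the error body is taken from the response shown above.

```python
def validate_grammar_parameter(parameters: dict, grammar_support_enabled: bool = True):
    """Return an error body if a grammar is sent while support is off, else None."""
    if parameters.get("grammar") is not None and not grammar_support_enabled:
        return {
            "error": "Input validation error: grammar is not supported",
            "error_type": "validation",
        }
    # Requests without a grammar, or with support enabled, pass validation.
    return None
```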

@Narsil
Collaborator

Narsil commented Feb 14, 2024

--grammar-support

I would have done it the other way personally, --disable-grammar-support, just so that the default is to support it.
Most users will be happy to have the feature; the defence is only intended for power users, no? (We could even remove the flag when we find a clever solution to slow down only the grammar requests and not all users at the same time.)

Member

@OlivierDehaene OlivierDehaene left a comment

Nice :)
Have you verified if it works well with speculation?

integration-tests/conftest.py Outdated Show resolved Hide resolved
router/src/lib.rs Outdated Show resolved Hide resolved
server/text_generation_server/models/flash_causal_lm.py Outdated Show resolved Hide resolved
@drbh drbh force-pushed the outlines-guided-generation branch from 0c9b22f to 63b0917 Compare February 14, 2024 16:06
@drbh drbh force-pushed the outlines-guided-generation branch from 63b0917 to f0cdd9c Compare February 14, 2024 16:08
@drbh
Collaborator Author

drbh commented Feb 14, 2024

Nice :) Have you verified if it works well with speculation?

Now it should 🙂

@drbh
Collaborator Author

drbh commented Feb 14, 2024

--grammar-support

I would have done it the other way personally, --disable-grammar-support, just so that the default is to support it. Most users will be happy to have the feature; the defence is only intended for power users, no? (We could even remove the flag when we find a clever solution to slow down only the grammar requests and not all users at the same time.)

great points! I've updated the PR to use --disable-grammar-support instead in the latest commits, thank you!

Collaborator

@Narsil Narsil left a comment

LGTM.

I think we can still improve further, let's do this in other PRs.

@Narsil Narsil merged commit cef0553 into main Feb 15, 2024
9 checks passed
@Narsil Narsil deleted the outlines-guided-generation branch February 15, 2024 09:28
@PawelFaron

PawelFaron commented Feb 18, 2024

can be run with the --grammar-support cli flag or GRAMMAR_SUPPORT=true env variable

text-generation-launcher \
  --model-id HuggingFaceH4/zephyr-7b-beta \
  --grammar-support

Requests

With grammar support:

curl -s 'http://localhost:3000/generate' \
--header 'Content-Type: application/json' \
--data '{
    "inputs": "[INST]convert to JSON: I saw a puppy a cat and a raccoon during my bike ride in the park [/INST]",
    "parameters": {
        "max_new_tokens": 200,
        "repetition_penalty": 1.3,
        "grammar": {
            "properties": {
                "location": {
                    "type": "string"
                },
                "activity": {
                    "type": "string"
                },
                "animals_seen": {
                    "type": "integer",
                    "minimum": 1,
                    "maximum": 5
                },
                "animals": {
                    "type": "array",
                    "items": {
                        "type": "string"
                    }
                }
            },
            "required": ["location", "activity", "animals_seen", "animals"]
        }
    }
}' | jq .
{
  "generated_text": "{\n\"activity\": \"biking\",\n\"animals\": [\"puppy\",\"cat\",\"raccoon\"]\n  , \"animals_seen\": 3,\n   \"location\":\"park\"}"
}

Without grammar support:

If support is not toggled on then sending a grammar in the parameters will result in

{
  "error": "Input validation error: grammar is not supported",
  "error_type": "validation"
}

Did something change in the final implementation? I'm getting this error with this exact request:
Failed to deserialize the JSON body into the target type: parameters.grammar: missing field 'type' at line 27 column 4

@paulcx

paulcx commented Feb 19, 2024

can be run with the --grammar-support cli flag or GRAMMAR_SUPPORT=true env variable

text-generation-launcher \
  --model-id HuggingFaceH4/zephyr-7b-beta \
  --grammar-support

Requests

With grammar support:

curl -s 'http://localhost:3000/generate' \
--header 'Content-Type: application/json' \
--data '{
    "inputs": "[INST]convert to JSON: I saw a puppy a cat and a raccoon during my bike ride in the park [/INST]",
    "parameters": {
        "max_new_tokens": 200,
        "repetition_penalty": 1.3,
        "grammar": {
            "properties": {
                "location": {
                    "type": "string"
                },
                "activity": {
                    "type": "string"
                },
                "animals_seen": {
                    "type": "integer",
                    "minimum": 1,
                    "maximum": 5
                },
                "animals": {
                    "type": "array",
                    "items": {
                        "type": "string"
                    }
                }
            },
            "required": ["location", "activity", "animals_seen", "animals"]
        }
    }
}' | jq .
{
  "generated_text": "{\n\"activity\": \"biking\",\n\"animals\": [\"puppy\",\"cat\",\"raccoon\"]\n  , \"animals_seen\": 3,\n   \"location\":\"park\"}"
}

Without grammar support:
If support is not toggled on then sending a grammar in the parameters will result in

{
  "error": "Input validation error: grammar is not supported",
  "error_type": "validation"
}

Did something change in the final implementation? I'm getting this error with this exact request: Failed to deserialize the JSON body into the target type: parameters.grammar: missing field 'type' at line 27 column 4

@Narsil I got the same error with the demo schema:

{
    "properties": {
        "location": {
            "type": "string"
        },
        "activity": {
            "type": "string"
        },
        "animals_seen": {
            "type": "integer",
            "minimum": 1,
            "maximum": 5
        },
        "animals": {
            "type": "array",
            "items": {
                "type": "string"
            }
        }
    },
    "required": [
        "location",
        "activity",
        "animals_seen",
        "animals"
    ]
}

Besides that, I also found some additional errors at the top of fastapi docs ui like this:

Resolver error at paths./generate.post.requestBody.content.application/json.schema.properties.parameters.properties.grammar.allOf.0.$ref
Could not resolve reference: Could not resolve pointer: /components/schemas/GrammarType does not exist in document
Resolver error at paths./generate.post.requestBody.content.application/json.schema.properties.parameters.properties.grammar.$ref
Could not resolve reference: Could not resolve pointer: /components/schemas/GrammarType does not exist in document

@drbh
Collaborator Author

drbh commented Feb 19, 2024

Hi @PawelFaron and @paulcx thank you both for the feedback and for testing the new feature!

In order to resolve the issues above please note that

  1. the flag --grammar-support has been deprecated in favor of --disable-grammar-support. Grammar support is available by default and does not require a launcher flag to use
  2. the request format changed slightly and is now nested inside a new object. The grammar includes a type and value at the top level; please see the updated example payload below.

**notes: the grammar can be type json or regex. The grammar needs to compile on the first request, which can take a couple of seconds, but subsequent requests should be faster.

{
    "inputs": "[INST]convert to JSON: I saw a puppy a cat and a raccoon during my bike ride in the park [/INST]",
    "parameters": {
        "max_new_tokens": 200,
        "repetition_penalty": 1.3,
        "grammar": {
            "type": "json",
            "value": {
                "properties": {
                    "location": {
                        "type": "string"
                    },
                    "activity": {
                        "type": "string"
                    },
                    "animals_seen": {
                        "type": "integer",
                        "minimum": 1,
                        "maximum": 5
                    },
                    "animals": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        }
                    }
                },
                "required": ["location", "activity", "animals_seen", "animals"]
            }
        }
    }
}

I hope this is helpful, please let me know if there are any other issues!
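For anyone calling the endpoint from Python rather than curl, the payload shape above could be built along these lines. This is a sketch under assumptions, not an official client: `build_generate_payload` and `post_generate` are hypothetical helpers, and the URL is simply the local one used in the curl examples in this thread.

```python
import json
import urllib.request


def build_generate_payload(inputs, grammar_type, grammar_value, **parameters):
    """Build a /generate body with the grammar nested as {"type", "value"}."""
    if grammar_type not in ("json", "regex"):
        raise ValueError(f"unsupported grammar type: {grammar_type!r}")
    return {
        "inputs": inputs,
        "parameters": {
            **parameters,
            "grammar": {"type": grammar_type, "value": grammar_value},
        },
    }


def post_generate(payload, url="http://localhost:3000/generate"):
    """POST the payload as JSON and return the decoded response body."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Regex grammars use the same wrapper; the value is the pattern string.
payload = build_generate_payload(
    "name: david. email:  ",
    "regex",
    r"[\w-]+@([\w-]+\.)+[\w-]+",
    max_new_tokens=20,
)
```

Calling `post_generate(payload)` requires a running server, so it is left out of the snippet above.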

@paulcx

paulcx commented Feb 20, 2024

Thanks @drbh. The demo provided above works. Btw, would you please have a look at the errors?

20240220080527

@Jason-CKY
Contributor

Jason-CKY commented Feb 20, 2024

Hi, I've tested this feature on orca-13b and llama-2-13b, both of which just generate <unk> tokens until max_new_tokens is reached when trying to enforce grammar.

UPDATE
I managed to get the grammar working by removing certain generation parameters in the request body.
Adding any of the below request parameters will result in <UNK> token generation alongside grammar:

  • top_k
  • top_p
  • typical_p

@OlivierDehaene
Member

@Jason-CKY #1578 will fix this issue.

@paulcx
Copy link

paulcx commented Feb 21, 2024

@Jason-CKY #1578 will fix this issue.

errors exist in #1578 as shown

image

kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this pull request Apr 29, 2024
7 participants