Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

json: fix additionalProperties, allow space after enum/const #7840

Merged
merged 21 commits into from
Jun 26, 2024

Conversation

ochafik
Copy link
Collaborator

@ochafik ochafik commented Jun 9, 2024

Addresses the 3 main issues in #7789:

  • "additionalProperties": true should now be fixed.

    • The code was previously forcing a least one additional property
    • Properties can now be any JSON value, not just objects
    • Now skips the names already in properties (before, could essentially bypass the types defined in properties), generating an exclusion pattern w/ a trie:
      • If properties just had "a", the additional properties' key will have a rule like:

        ["] ( [a] char+ | [^"a] char* )? ["]

      • with "ab" & "ac":

        ["] ( [a] ([b] char+ | [c] char+ | [^"bc] char*) | [^"a] char* )? ["]

  • additionalProperties now defaults to true as it should

  • space now allowed after enum / const values.

TODO:

cc/ @HanClinto

Important note for Pydantic users

By default, pydantic's JSON schemas allow for additional properties, which will start being honored by llama.cpp once this PR gets merged.

While extra properties may be useful for backwards compatibility (e.g. deserializing values from newer app version w/ an old version), for the purpose of generating data, it will let the LLM sample potentially many useless properties.

You can disable additional properties with two lines of extra config for each model class:

# pip install pydantic
import json
from pydantic import BaseModel, Extra
class QAPair(BaseModel):
        class Config:
            extra = 'forbid'  # triggers additionalProperties: false in the JSON schema
        question: str
        concise_answer: str
        justification: str
print(json.dumps(QAPair.model_json_schema()))
Show output
{
  "additionalProperties": false,
  "properties": {
    "question": {
      "title": "Question",
      "type": "string"
    },
    "concise_answer": {
      "title": "Concise Answer",
      "type": "string"
    },
    "justification": {
      "title": "Justification",
      "type": "string"
    }
  },
  "required": [
    "question",
    "concise_answer",
    "justification"
  ],
  "title": "QAPair",
  "type": "object"
}

@HanClinto
Copy link
Collaborator

Nice work!

I know it's not spec, but I'm wondering if you had it right the first time. Do you think that we should default to additionalProperties being false? It certainly seems like the main reason someone would want to use schema-constrained generation.

@ochafik ochafik changed the title json schemas: fix additionalProperties, allow space after enum/const json: fix additionalProperties, allow space after enum/const Jun 9, 2024
@ochafik
Copy link
Collaborator Author

ochafik commented Jun 9, 2024

I know it's not spec, but I'm wondering if you had it right the first time. Do you think that we should default to additionalProperties being false? It certainly seems like the main reason someone would want to use schema-constrained generation.

Yeah, been having these guilty thoughts too.

There is a distinct risk that most users will forget to add "additionalProperties": false (or the extra Pydantic config), which may derail / slow down their generation sometimes without them seeing why (esp. through pydantic, since it drops those extra props by default).

On the other hand, people should ideally really get more fluent about JSON schemas and this is a very important feature of them. How about we double down on the compliant behaviour in the upcoming docs, and we revisit the default at the first report of users tripping on this?

(Edit: I'm also afraid of people using the same Pydantic models for generation and other uses, I guess they could still say extra = 'allow' to get explicit "additionalProperties": true - just checked - but it's not great to be that tool that forces alterations to the schema to match its quirks - which we'll already have a fair bit of anyway)

@ochafik
Copy link
Collaborator Author

ochafik commented Jun 10, 2024

Actually additionalProperties is more broken than I remembered. It should only allow property names not defined in properties, otherwise it may defeat the typing of said properties (shouldn't be too hard, preparing a fix w/ a prefix tree)

@HanClinto
Copy link
Collaborator

I know it's not spec, but I'm wondering if you had it right the first time. Do you think that we should default to additionalProperties being false? It certainly seems like the main reason someone would want to use schema-constrained generation.

Yeah, been having these guilty thoughts too.

There is a distinct risk that most users will forget to add "additionalProperties": false (or the extra Pydantic config), which may derail / slow down their generation sometimes without them seeing why (esp. through pydantic, since it drops those extra props by default).

On the other hand, people should ideally really get more fluent about JSON schemas and this is a very important feature of them. How about we double down on the compliant behaviour in the upcoming docs, and we revisit the default at the first report of users tripping on this?

(Edit: I'm also afraid of people using the same Pydantic models for generation and other uses, I guess they could still say extra = 'allow' to get explicit "additionalProperties": true - just checked - but it's not great to be that tool that forces alterations to the schema to match its quirks - which we'll already have a fair bit of anyway)

GBNF is already our application-specific version of BNF / EBNF -- it's not the worst to have our particular syntax.

I've used some other online checkers of json schemas, and noticed quirks in those as well (for instance, if you specify an object, but allow additionalProperties, can you submit an array instead? I think some validators permit this, and some do not).

So yeah, I'm definitely in two minds about this, and I could see it going either way. If we include additionalProperties=true as the default behavior, then I feel like the default in basically all of our examples will need to include additionalProperties=false.

@ochafik ochafik changed the title json: fix additionalProperties, allow space after enum/const [WIP] json: fix additionalProperties, allow space after enum/const Jun 11, 2024
@mofosyne mofosyne added the Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level label Jun 12, 2024
@ochafik
Copy link
Collaborator Author

ochafik commented Jun 13, 2024

I know it's not spec, but I'm wondering if you had it right the first time. Do you think that we should default to additionalProperties being false? It certainly seems like the main reason someone would want to use schema-constrained generation.

Yeah, been having these guilty thoughts too.
There is a distinct risk that most users will forget to add "additionalProperties": false (or the extra Pydantic config), which may derail / slow down their generation sometimes without them seeing why (esp. through pydantic, since it drops those extra props by default).
On the other hand, people should ideally really get more fluent about JSON schemas and this is a very important feature of them. How about we double down on the compliant behaviour in the upcoming docs, and we revisit the default at the first report of users tripping on this?
(Edit: I'm also afraid of people using the same Pydantic models for generation and other uses, I guess they could still say extra = 'allow' to get explicit "additionalProperties": true - just checked - but it's not great to be that tool that forces alterations to the schema to match its quirks - which we'll already have a fair bit of anyway)

GBNF is already our application-specific version of BNF / EBNF -- it's not the worst to have our particular syntax.

@HanClinto Agree custom syntax is fine when it's about writing grammars by hand, but I think a big use case for schemas is to have them generated by tools (e.g. Pydantic, Zod), so spec compliance does matter a lot more.

I've used some other online checkers of json schemas, and noticed quirks in those as well (for instance, if you specify an object, but allow additionalProperties, can you submit an array instead? I think some validators permit this, and some do not).

The specs are actually an enjoyable read, finally started diving into them.

additionalProperties only applies to object type, not array (see keywords that apply to subschema). If the type has alternatives, keywords will only apply to the alternatives they're defined for (see Assertions and Instance Primitive Types and type).

So yeah, I'm definitely in two minds about this, and I could see it going either way. If we include additionalProperties=true as the default behavior, then I feel like the default in basically all of our examples will need to include additionalProperties=false.

I think people who write JSON schemas by hand should know this very core feature of schemas, it's alright to add to all our examples (the JSON conversion tests all have it already, and they're linked from the grammar readme as kinda ground truth of what we support).

I think in practice it's unlikely models will decide to add too many extra properties anyway, it's good practice to add the schema to the prompt and without seeing an explicit additionalProperties: true I doubt they'll want to add extra cruft.

@ochafik ochafik marked this pull request as ready for review June 22, 2024 19:56
@ochafik
Copy link
Collaborator Author

ochafik commented Jun 22, 2024

@HanClinto feels very good to uncomment the tests you kindly prepared for this! 🙏 (aaaand they surfaced some bugs I've just fixed 😅)

@HanClinto
Copy link
Collaborator

@HanClinto feels very good to uncomment the tests you kindly prepared for this! 🙏 (aaaand they surfaced some bugs I've just fixed 😅)

Excellent -- love to hear it, thank you! :)

I've mulled it over for a while, and I think I agree with you -- unless we have a strong reason to break spec-compliance, we might as well stick with the spec, and having additionalProperties default to true (in alignment with the spec) is probably worthwhile.

Nice work on all this! Ready for re-review?

@HanClinto
Copy link
Collaborator

Nice work on all this! Ready for re-review?

Hah -- just noticed:

ochafik marked this pull request as ready for review 8 minutes ago

Nevermind my question -- I'll get to reviewing! :)

})""",
// Passing strings
{
"{\"a\": 42}",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Would it be worth switch these to R"""(-type strings?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced most of them

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced most of them

@HanClinto
Copy link
Collaborator

HanClinto commented Jun 23, 2024

Here's a schema with a few failing test cases:

    test_schema(
        "required + optional + additional props",
        // Schema
        R"""({
            "type": "object",
            "properties": {
                "and": {"type": "number"},
                "also": {"type": "number"}
            },
            "required": ["and"],
            "additionalProperties": {"type": "string"}
        })""",
        // Passing strings
        {
            R"""({"and": 42})""",
            R"""({"and": 42, "also": 43})""",
            R"""({"and": 42, "andes": "mountains"})""",
            // TODO: Currently fails, but should pass.
            R"""({"and": 42, "an": "hour"})""",
            // TODO: Currently fails, but should pass.
            R"""({"sdf":"hour", "and": 42})""",
            // TODO: Currently fails, but should pass.
            R"""({"and": 42, "als":"hour"})""",
        },
        // Failing strings
        {
            R"""({})""", // Missing all required properties
            R"""({"and": 42, "also": null})""", // Cannot override defined property to a new type
            R"""({"and": 42, "andes": 15})""", // Additional properties cannot be numbers
            R"""({"an":null, "and": 42})""", // Additional properties cannot be null
        }
    );

Note that I'm not entirely sure how hard we need to work to fix these or not. Mainly I only think we need to address these if it exposes an underlying shortcoming -- but the fact of the matter is that the LLM can generate valid outputs even in its current form -- there may just be a few additional cases that it's prevented from generating. Not sure that that's a big deal.

@HanClinto
Copy link
Collaborator

Re: the second of the new fail examples, I think that I don't mind the idea that all required fields should come first.

If it makes a simpler (and faster) grammar, then I wonder how many other similar shortcuts and assumptions we could accept.

Do we need our grammars to be able to generate every conceivable possible matching json object, or is it okay if we generate only a smaller subset?

@ochafik
Copy link
Collaborator Author

ochafik commented Jun 23, 2024

Re: the second of the new fail examples, I think that I don't mind the idea that all required fields should come first.

If it makes a simpler (and faster) grammar, then I wonder how many other similar shortcuts and assumptions we could accept.

Do we need our grammars to be able to generate every conceivable possible matching json object

Yeah I don't think that's realistic with a stateless grammar. Constraining the order, not only of required fields (which already happened before #5978), but also of optional ones, might be the only way to avoid exploding the alternatives and/or the grammar size.

(Also it still allows for a custom order within each class of required/optional matching the order properties are defined, which removes the need to provide something like the legacy prop_order mechanism in the Python version)

or is it okay if we generate only a smaller subset?

I think the strong contract should be all outputs must match the schema (except for unsupported features).

The main drawback I can think of with this is the order of properties might be surprising to a model. For instance if you wanna generate a ReAct-style object {thought: ...., action: ...}, clearly the thought must come first, so either be required or both be optional but thought defined first in the schema. And general, other cases may be addressed by adding the schema to the prompt, which is a common practice anyway. This is important to document tho, so I'll send a PR w/ edits to grammars/README once the few grammar related changes are merged.

@HanClinto
Copy link
Collaborator

Yeah I don't think that's realistic with a stateless grammar. Constraining the order, not only of required fields (which already happened before #5978), but also of optional ones, might be the only way to avoid exploding the alternatives and/or the grammar size.

Yeah, I think you're right.

(Also it still allows for a custom order within each class of required/optional matching the order properties are defined, which removes the need to provide something like the legacy prop_order mechanism in the Python version)

I think this is a reasonable approach. If the grammar expects fields to be defined in the same order that they're defined in the JSON, then I think that's an expectation we can live with.

or is it okay if we generate only a smaller subset?

I think the strong contract should be all outputs must match the schema (except for unsupported features).

The main drawback I can think of with this is the order of properties might be surprising to a model. For instance if you wanna generate a ReAct-style object {thought: ...., action: ...}, clearly the thought must come first, so either be required or both be optional but thought defined first in the schema. And general, other cases may be addressed by adding the schema to the prompt, which is a common practice anyway. This is important to document tho, so I'll send a PR w/ edits to grammars/README once the few grammar related changes are merged.

The thought / action example is a very good one. I don't know what limitations Pydantic or other generators might have when creating schemas, but if it doesn't do things like force fields to be in alphabetical order, then as long as the user puts "thought" before "action" in the schema object, then our generated grammar will always restrict its objects to be in the same order.

That feels pretty reasonable to me, and seems like one of the most efficient ways to create a grammar (that doesn't entirely explode in branching permutations, which I think is currently one of our bigger enemies).

@ochafik ochafik changed the title [WIP] json: fix additionalProperties, allow space after enum/const json: fix additionalProperties, allow space after enum/const Jun 24, 2024
@HanClinto
Copy link
Collaborator

What do you think about simplifying this down?

The Trie that you made is absolutely amazing (you are a wizard when it comes to pushing the limits of stateless grammars!!) -- I'm wondering if this will produce grammars that are too inefficient.

Instead, if we set some basic limitations of our grammars, such as:

  • All fields will be generated in the order they are listed in the grammar
  • Optional fields may be skipped, but required fields will be non-optional
  • If additionalProperties are allowed (not recommended), then they must come after ALL defined fields

Would this let us avoid using the trie structure (or would we still need it to verify that additional properties don't share names with predefined properties)? Would this simplify the grammars and help with generation speed?

I think this may be pretty close to what you've already done I think...? I'll confess I haven't fully dug into it to learn all the nooks and crannies, so apologies if this is already the direction you've been going.

@ochafik
Copy link
Collaborator Author

ochafik commented Jun 24, 2024

What do you think about simplifying this down?

The Trie that you made is absolutely amazing (you are a wizard when it comes to pushing the limits of stateless grammars!!) -- I'm wondering if this will produce grammars that are too inefficient.

Thanks!! 😊

Given the constraints, the trie should to produce as few alternatives as possible. There is a definite cost to allowing additionalProperties and it should be caveated appropriately.

Instead, if we set some basic limitations of our grammars, such as:

  • All fields will be generated in the order they are listed in the grammar

Yes, and also required properties will be generated before optional ones

  • Optional fields may be skipped, but required fields will be non-optional

Not a limitation :-)

  • If additionalProperties are allowed (not recommended), then they must come after ALL defined fields

Yep! This should all be documented soon in the grammars/README (along w/ performance implications of allowing extra properties), your bullets are a good start actually.

And to visualize what this means in grammar space:

echo '{
  "type": "object",        
  "properties": {
    "a": {"type": "string"},
    "b": {"type": "string"},
    "c": {"type": "string"},
    "d": {"type": "string"} 
  },
  "required": ["c", "d"]
}' | examples/json_schema_to_grammar.py -
root ::= "{" space 
  c-kv "," space
  d-kv
  ( "," space ( 
    a-kv a-rest
    | b-kv b-rest
    | additional-kv ( "," space additional-kv )* 
  ) )?
  "}" space
c-kv ::= "\"c\"" space ":" space string
d-kv ::= "\"d\"" space ":" space string
a-kv ::= "\"a\"" space ":" space string
a-rest ::= ( "," space b-kv )? b-rest
b-kv ::= "\"b\"" space ":" space string
b-rest ::= ( "," space additional-kv )*
additional-k ::= ["] ( [a] char+ | [b] char+ | [c] char+ | [d] char+ | [^"abcd] char* )? ["] space
additional-kv ::= additional-k ":" space value
...

Would this let us avoid using the trie structure (or would we still need it to verify that additional properties don't share names with predefined properties)?

We still need to verify there's no name collisions, unfortunately, see spec for additionalProperties

"Validation with "additionalProperties" applies only to the child values of instance names that do not appear in the annotation results of either "properties" or "patternProperties"."

(otherwise defeats the typing of explicit props)

Would this simplify the grammars and help with generation speed?

With both this name exclusion trie and the min/max changes from #7797 I've tried to keep alternatives in check. The users will have the final control over speed by deciding to allow additional props.

example of combining min/max, additionalProperties, min/maxItems
// schema.json
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "a": {
        "type": "integer",
        "minimum": 0
      },
      "b": {
        "type": "integer",
        "maximum": -100
      },
      "c": {
        "type": "integer",
        "minimum": -100,
        "maximum": 2000
      },
      "d": {
        "type": "integer",
        "exclusiveMinimum": 0
      }
    },
    "required": ["c", "d"]
  },
  "minItems": 10,
  "maxItems": 20
}
hyperfine --warmup 1 --runs 3 \
  -L branch master,json-integ \
    --setup 'git checkout {branch} && \
             make clean && \
             make -j LLAMA_CURL=1 llama-cli' \
    'BRANCH={branch} \
      ./llama-cli -j "$( cat schema.json )" \
        -mu https://huggingface.co/bartowski/Phi-3-mini-4k-instruct-GGUF/resolve/main/Phi-3-mini-4k-instruct-Q8_0.gguf \
        -c 0 --seed 124 \
        -p "Please generate values using the following schema: $( cat schema.json ); Make sure there are values for a and b and an other value, e.g. {\"c\": 10, \"d\": 20, \"a\": 1, \"b\": -2000, \"other\": \"none\"}"'

I think this may be pretty close to what you've already done I think...? I'll confess I haven't fully dug into it to learn all the nooks and crannies, so apologies if this is already the direction you've been going.

Hehe, yes it is, I'm glad our minds went in the same direction again 😜😎

@HanClinto
Copy link
Collaborator

  • All fields will be generated in the order they are listed in the grammar

Yes, and also required properties will be generated before optional ones

I'm rolling this one around, and I'm not sure that this should be the case. How might it change things if we kept them in the order they are listed in the json spec, regardless of whether they're required or not?

For your earlier thought / action example, I might not require "thought", but I always want to have an "action". If I have a "thought", it makes sense to generate it before "action".

If we add the requirement that all required attributes must come first, then it feels like it removes the flexibility to do things like the above.

  • Optional fields may be skipped, but required fields will be non-optional

Not a limitation :-)

hah, true. :)

I think this may be pretty close to what you've already done I think...? I'll confess I haven't fully dug into it to learn all the nooks and crannies, so apologies if this is already the direction you've been going.

Hehe, yes it is, I'm glad our minds went in the same direction again 😜😎

Nice. :D

@ochafik
Copy link
Collaborator Author

ochafik commented Jun 24, 2024

  • All fields will be generated in the order they are listed in the grammar

Yes, and also required properties will be generated before optional ones

I'm rolling this one around, and I'm not sure that this should be the case. How might it change things if we kept them in the order they are listed in the json spec, regardless of whether they're required or not?

@HanClinto Hah, thanks for questioning this, think you might be right, the code needs to be shuffled a tiny bit but we might end up with less max alternatives in some cases, and simpler grammars.

Still with the {a?: string, b?: string, c: string, d: string, ...extra} example, the current required -> optional -> extras flow has 4 alternatives after d:

graph TD;
    OPEN["{"]-->c;
    c-->d;
    d-->END["}"];
    d-->a;
    d-->b;
    d-->extra;
    a-->END["}"];
    a-->b;
    b-->END["}"];
    b-->extra;
    extra-->extra;
    extra-->END["}"];
Loading

While keeping the natural order then additional props maxes out at 3 alternatives:

graph TD;
    OPEN["{"]-->a;
    a-->b;
    a-->c;
    OPEN["{"]-->b;
    b-->c;
    OPEN["{"]-->c;
    c-->d;
    d-->extra;
    extra-->extra;
    d-->END["}"];
    extra-->END["}"];
Loading

Which could be implemented w/ these rules:

root ::= "{" space (
    a-kv a-rest
    | b-kv b-rest
    | c-kv c-rest
  )? "}" space

a-rest ::= ( "," space b-kv )? b-rest
b-rest ::= ( "," space c-kv ) c-rest
c-rest ::= ( "," space d-kv ) d-rest
d-rest ::= ( "," space additional-kv )*

(essentially, max alternatives is uninterrupted length of optional properties run + 2, rather than total number of optional properties + 2; the +2 is 1 to bypass run of optionals and 1 to enter extras loop, if the run of optionals comes after the last required property)

I'll start incubating an update 🤗

Edit: Actually, allowing a unified order is orthogonal with this PR, I think we should send these additionalProperties fixes as they are.

@HanClinto
Copy link
Collaborator

HanClinto commented Jun 25, 2024

Edit: Actually, allowing a unified order is orthogonal with this PR, I think we should send these additionalProperties fixes as they are.

Sounds good to me! You've done amazing work on all of this -- I'm looking forward to seeing these merged in!

@ochafik ochafik merged commit 6777c54 into ggerganov:master Jun 26, 2024
53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jun 30, 2024
…anov#7840)

* json: default additionalProperty to true

* json: don't force additional props after normal properties!

* json: allow space after enum/const

* json: update pydantic example to set additionalProperties: false

* json: prevent additional props to redefine a typed prop

* port not_strings to python, add trailing space

* fix not_strings & port to js+py

* Update json-schema-to-grammar.cpp

* fix _not_strings for substring overlaps

* json: fix additionalProperties default, uncomment tests

* json: add integ. test case for additionalProperties

* json: nit: simplify condition

* reformat grammar integ tests w/ R"""()""" strings where there's escapes

* update # tokens in server test: consts can now have trailing space
MagnusS0 pushed a commit to MagnusS0/llama.cpp-normistral-tokenizer that referenced this pull request Jul 1, 2024
…anov#7840)

* json: default additionalProperty to true

* json: don't force additional props after normal properties!

* json: allow space after enum/const

* json: update pydantic example to set additionalProperties: false

* json: prevent additional props to redefine a typed prop

* port not_strings to python, add trailing space

* fix not_strings & port to js+py

* Update json-schema-to-grammar.cpp

* fix _not_strings for substring overlaps

* json: fix additionalProperties default, uncomment tests

* json: add integ. test case for additionalProperties

* json: nit: simplify condition

* reformat grammar integ tests w/ R"""()""" strings where there's escapes

* update # tokens in server test: consts can now have trailing space
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
examples python python script changes Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level server testing Everything test related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants