json: fix additionalProperties, allow space after enum/const #7840

Conversation
Nice work! I know it's not spec, but I'm wondering if you had it right the first time. Do you think that we should default to `additionalProperties` being false? It certainly seems like the main reason someone would want to use schema-constrained generation.
Yeah, been having these guilty thoughts too. There is a distinct risk that most users will forget to add `"additionalProperties": false`. On the other hand, people should ideally get more fluent with JSON schemas, and this is a very important feature of them. How about we double down on the compliant behaviour in the upcoming docs, and we revisit the default at the first report of users tripping on this? (Edit: I'm also afraid of people using the same Pydantic models for generation and other uses; I guess they could still disable extra properties explicitly.)
Actually, GBNF is already our application-specific version of BNF / EBNF -- it's not the worst thing to have our own particular syntax. I've used some other online JSON schema checkers and noticed quirks in those as well (for instance, if you specify an object but allow additionalProperties, can you submit an array instead? I think some validators permit this, and some do not). So yeah, I'm definitely in two minds about this, and I could see it going either way. If we include `"additionalProperties": false` in all of our examples, that would at least mitigate the risk.
@HanClinto Agreed, custom syntax is fine when it's about writing grammars by hand, but a big use case for schemas is having them generated by tools (e.g. Pydantic, Zod), so spec compliance matters a lot more there.

The specs are actually an enjoyable read; I've finally started diving into them.

I think people who write JSON schemas by hand should know this very core feature of schemas, and it's alright to add it to all our examples (the JSON conversion tests all have it already, and they're linked from the grammar README as kind of a ground truth of what we support). In practice it's unlikely models will decide to add many extra properties anyway: it's good practice to include the schema in the prompt, and without seeing an explicit `additionalProperties` there, the model has little reason to invent extras.
@HanClinto feels very good to uncomment the tests you kindly prepared for this! 🙏 (aaaand they surfaced some bugs I've just fixed 😅)
Excellent -- love to hear it, thank you! :) I've mulled it over for a while, and I think I agree with you -- unless we have a strong reason to break spec-compliance, we might as well stick with the spec and have `additionalProperties` default to `true`. Nice work on all this! Ready for re-review?
Hah -- just noticed:
Nevermind my question -- I'll get to reviewing! :)
tests/test-grammar-integration.cpp (outdated)

```cpp
        })""",
        // Passing strings
        {
            "{\"a\": 42}",
```
Nit: Would it be worth switching these to `R"""(...)"""`-type strings?
Replaced most of them
Here's a schema with a few failing test cases:

```cpp
    test_schema(
        "required + optional + additional props",
        // Schema
        R"""({
            "type": "object",
            "properties": {
                "and": {"type": "number"},
                "also": {"type": "number"}
            },
            "required": ["and"],
            "additionalProperties": {"type": "string"}
        })""",
        // Passing strings
        {
            R"""({"and": 42})""",
            R"""({"and": 42, "also": 43})""",
            R"""({"and": 42, "andes": "mountains"})""",
            // TODO: Currently fails, but should pass.
            R"""({"and": 42, "an": "hour"})""",
            // TODO: Currently fails, but should pass.
            R"""({"sdf":"hour", "and": 42})""",
            // TODO: Currently fails, but should pass.
            R"""({"and": 42, "als":"hour"})""",
        },
        // Failing strings
        {
            R"""({})""", // Missing all required properties
            R"""({"and": 42, "also": null})""", // Cannot override defined property to a new type
            R"""({"and": 42, "andes": 15})""", // Additional properties cannot be numbers
            R"""({"an":null, "and": 42})""", // Additional properties cannot be null
        }
    );
```

Note that I'm not entirely sure how hard we need to work to fix these. I only think we need to address them if they expose an underlying shortcoming -- the fact of the matter is that the LLM can generate valid outputs even in its current form; there may just be a few additional cases that it's prevented from generating. Not sure that's a big deal.
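For reference, the three TODO cases above are indeed valid per the spec. A minimal, illustrative validator covering just the keywords this schema uses (a sketch only -- use a real JSON Schema library in practice) makes the expected pass/fail results concrete:

```python
SCHEMA = {
    "type": "object",
    "properties": {
        "and": {"type": "number"},
        "also": {"type": "number"},
    },
    "required": ["and"],
    "additionalProperties": {"type": "string"},
}

def check(instance, schema):
    """Tiny subset of JSON Schema semantics: object/number/string types,
    required, properties, additionalProperties (spec default: true)."""
    t = schema.get("type")
    if t == "object":
        if not isinstance(instance, dict):
            return False
        if any(name not in instance for name in schema.get("required", [])):
            return False
        props = schema.get("properties", {})
        addl = schema.get("additionalProperties", True)
        for key, value in instance.items():
            if key in props:
                # keys named in `properties` are validated by their own schema
                if not check(value, props[key]):
                    return False
            elif addl is False:
                return False
            elif isinstance(addl, dict) and not check(value, addl):
                return False
        return True
    if t == "number":
        # bool is an int subclass in Python; exclude it explicitly
        return isinstance(instance, (int, float)) and not isinstance(instance, bool)
    if t == "string":
        return isinstance(instance, str)
    return True  # keywords this sketch doesn't model are treated as passing
```

Running the strings above through `check` confirms that key order and prefix-sharing names like `"an"` don't matter at the spec level; only types and required-ness do.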
Re: the second of the new fail examples, I don't mind the idea that all required fields should come first. If it makes a simpler (and faster) grammar, then I wonder how many other similar shortcuts and assumptions we could accept. Do we need our grammars to be able to generate every conceivable matching JSON object, or is it okay if we generate only a smaller subset?
Yeah, I don't think that's realistic with a stateless grammar. Constraining the order, not only of required fields (which already happened before #5978) but also of optional ones, might be the only way to avoid exploding the alternatives and/or the grammar size. (It also still allows a custom order within each required/optional class, matching the order in which properties are defined, which removes the need for something like the legacy prop_order mechanism in the Python version.)

I think the strong contract should be: all outputs must match the schema (except for unsupported features). The main drawback I can think of is that the order of properties might be surprising to a model -- for instance, if you want to generate a ReAct-style object where `thought` should come before `action`.
Yeah, I think you're right.
I think this is a reasonable approach. If the grammar expects fields in the same order they're defined in the JSON schema, that's an expectation we can live with.

The thought/action example is a very good one. I don't know what limitations Pydantic or other generators might have when creating schemas, but as long as they don't do things like force fields into alphabetical order, then if the user puts "thought" before "action" in the schema object, our generated grammar will always restrict its objects to that order. That feels pretty reasonable to me, and seems like one of the most efficient ways to create a grammar that doesn't explode in branching permutations -- which I think is currently one of our bigger enemies.
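The two orderings under discussion can be sketched in a few lines of Python (`property_order` is a hypothetical helper for illustration, not llama.cpp code):

```python
def property_order(props, required, required_first=True):
    """Order in which a schema-derived grammar would emit properties.

    props: property names in schema definition order.
    required_first=True models the constraint where all required
    properties come before any optional ones; False models the
    natural definition order.
    """
    if not required_first:
        return list(props)  # e.g. an optional `thought` stays before `action`
    req = [p for p in props if p in required]
    opt = [p for p in props if p not in required]
    return req + opt
```

With `required_first=False`, an optional `thought` defined before a required `action` keeps its leading position, which is the flexibility being argued for here.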
What do you think about simplifying this down? The trie that you made is absolutely amazing (you are a wizard when it comes to pushing the limits of stateless grammars!!), but I'm wondering if this will produce grammars that are too inefficient. Instead, we could set some basic limitations on our grammars, such as:

Would this let us avoid the trie structure (or would we still need it to verify that additional properties don't share names with predefined properties)? Would this simplify the grammars and help with generation speed? I think this may be pretty close to what you've already done...? I'll confess I haven't fully dug into all the nooks and crannies, so apologies if this is already the direction you've been going.
Thanks!! 😊 Given the constraints, the trie should produce as few alternatives as possible. There is a definite cost to allowing additionalProperties, and it should be caveated appropriately.

Yes, and also required properties will be generated before optional ones.
Not a limitation :-)
Yep! This should all be documented soon in grammars/README (along w/ performance implications of allowing extra properties); your bullets are a good start actually. And to visualize what this means in grammar space:

```sh
echo '{
  "type": "object",
  "properties": {
    "a": {"type": "string"},
    "b": {"type": "string"},
    "c": {"type": "string"},
    "d": {"type": "string"}
  },
  "required": ["c", "d"]
}' | examples/json_schema_to_grammar.py -
```

```
root ::= "{" space
  c-kv "," space
  d-kv
  ( "," space (
    a-kv a-rest
    | b-kv b-rest
    | additional-kv ( "," space additional-kv )*
  ) )?
  "}" space
c-kv ::= "\"c\"" space ":" space string
d-kv ::= "\"d\"" space ":" space string
a-kv ::= "\"a\"" space ":" space string
a-rest ::= ( "," space b-kv )? b-rest
b-kv ::= "\"b\"" space ":" space string
b-rest ::= ( "," space additional-kv )*
additional-k ::= ["] ( [a] char+ | [b] char+ | [c] char+ | [d] char+ | [^"abcd] char* )? ["] space
additional-kv ::= additional-k ":" space value
...
```
We still need to verify there are no name collisions, unfortunately -- see the spec for additionalProperties: "Validation with "additionalProperties" applies only to the child values of instance names that do not appear in the annotation results of either "properties" or "patternProperties"." (Otherwise additional properties would defeat the typing of the explicit props.)
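The exclusion rule can be sketched in Python (an illustrative reimplementation of the trie idea, not the PR's actual `_not_strings`; it assumes property names need no escaping inside `[...]` character classes, and it also permits names that are strict prefixes of excluded ones -- the case the TODO tests above exercised):

```python
def not_strings(strings):
    """GBNF-style rule matching any quoted string EXCEPT the given names.

    Assumes `char` and `space` rules are defined elsewhere in the grammar.
    """
    # Trie of forbidden names; "" marks end-of-name.
    trie = {}
    for s in strings:
        node = trie
        for ch in s:
            node = node.setdefault(ch, {})
        node[""] = {}

    def visit(node):
        chars = sorted(c for c in node if c)
        alts = []
        for c in chars:
            child = node[c]
            if set(child) == {""}:
                # c completes a forbidden name: any non-empty extension is fine
                alts.append(f"[{c}] char+")
            else:
                alts.append(f"[{c}] {visit(child)}")
        # any character that doesn't continue a forbidden name, then anything
        alts.append(f'[^"{"".join(chars)}] char*')
        group = "( " + " | ".join(alts) + " )"
        # If no forbidden name ends exactly here, stopping here is allowed too.
        return group if "" in node else group + "?"

    return f'["] {visit(trie)} ["] space'
```

For single-character names `a`..`d` this reproduces the shape of the `additional-k` rule above; for `ab`/`ac` it produces the nested alternative shown in the PR description, with an extra `?` so that the prefix `"a"` itself remains a legal key.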
With both this name-exclusion trie and the min/max changes from #7797, I've tried to keep alternatives in check. Users will have the final control over speed by deciding whether to allow additional props.

Example of combining min/max, additionalProperties, min/maxItems -- schema.json:

```json
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "a": {
        "type": "integer",
        "minimum": 0
      },
      "b": {
        "type": "integer",
        "maximum": -100
      },
      "c": {
        "type": "integer",
        "minimum": -100,
        "maximum": 2000
      },
      "d": {
        "type": "integer",
        "exclusiveMinimum": 0
      }
    },
    "required": ["c", "d"]
  },
  "minItems": 10,
  "maxItems": 20
}
```

```sh
hyperfine --warmup 1 --runs 3 \
  -L branch master,json-integ \
  --setup 'git checkout {branch} && \
    make clean && \
    make -j LLAMA_CURL=1 llama-cli' \
  'BRANCH={branch} \
    ./llama-cli -j "$( cat schema.json )" \
    -mu https://huggingface.co/bartowski/Phi-3-mini-4k-instruct-GGUF/resolve/main/Phi-3-mini-4k-instruct-Q8_0.gguf \
    -c 0 --seed 124 \
    -p "Please generate values using the following schema: $( cat schema.json ); Make sure there are values for a and b and an other value, e.g. {\"c\": 10, \"d\": 20, \"a\": 1, \"b\": -2000, \"other\": \"none\"}"'
```
Hehe, yes it is, I'm glad our minds went in the same direction again 😜😎
I'm rolling this one around, and I'm not sure that this should be the case. How might it change things if we kept fields in the order they are listed in the JSON schema, regardless of whether they're required or not? For your earlier thought/action example, I might not require "thought", but I always want an "action". If I have a "thought", it makes sense to generate it before the "action". If we require all required attributes to come first, it removes the flexibility to do things like the above.
hah, true. :)
Nice. :D
@HanClinto Hah, thanks for questioning this -- I think you might be right. The code needs to be shuffled a tiny bit, but we might end up with fewer max alternatives in some cases, and simpler grammars. Still with the required-first order, the graph looks like:

```mermaid
graph TD;
  OPEN["{"]-->c;
  c-->d;
  d-->END["}"];
  d-->a;
  d-->b;
  d-->extra;
  a-->END["}"];
  a-->b;
  b-->END["}"];
  b-->extra;
  extra-->extra;
  extra-->END["}"];
```

While keeping the natural order, additional props max out at 3 alternatives:

```mermaid
graph TD;
  OPEN["{"]-->a;
  a-->b;
  a-->c;
  OPEN["{"]-->b;
  b-->c;
  OPEN["{"]-->c;
  c-->d;
  d-->extra;
  extra-->extra;
  d-->END["}"];
  extra-->END["}"];
```

Which could be implemented w/ these rules:

```
root ::= "{" space (
    a-kv a-rest
    | b-kv b-rest
    | c-kv c-rest
  )? "}" space
a-rest ::= ( "," space b-kv )? b-rest
b-rest ::= ( "," space c-kv ) c-rest
c-rest ::= ( "," space d-kv ) d-rest
d-rest ::= ( "," space additional-kv )*
```

(Essentially, max alternatives is the uninterrupted length of an optional-properties run + 2, rather than the total number of optional properties + 2; the +2 is 1 to bypass the run of optionals and 1 to enter the extras loop, if the run of optionals comes after the last required property.)

I'll start incubating an update 🤗

Edit: Actually, allowing a unified order is orthogonal to this PR; I think we should ship these additionalProperties fixes as they are.
Sounds good to me! You've done amazing work on all of this -- I'm looking forward to seeing these merged in!
…anov#7840)

* json: default additionalProperty to true
* json: don't force additional props after normal properties!
* json: allow space after enum/const
* json: update pydantic example to set additionalProperties: false
* json: prevent additional props to redefine a typed prop
* port not_strings to python, add trailing space
* fix not_strings & port to js+py
* Update json-schema-to-grammar.cpp
* fix _not_strings for substring overlaps
* json: fix additionalProperties default, uncomment tests
* json: add integ. test case for additionalProperties
* json: nit: simplify condition
* reformat grammar integ tests w/ R"""()""" strings where there's escapes
* update # tokens in server test: consts can now have trailing space
Addresses the 3 main issues in #7789:

* `"additionalProperties": true` should now be fixed: additional properties can no longer redefine the types declared in `properties` (before, they could essentially bypass them). We generate an exclusion pattern w/ a trie: if `properties` just had `"a"`, the additional properties' key will have a rule like `["] ( [a] char+ | [^"a] char* )? ["]`; with `"ab"` & `"ac"`: `["] ( [a] ([b] char+ | [c] char+ | [^"bc] char*) | [^"a] char* )? ["]`
* `additionalProperties` now defaults to `true`, as it should
* space is now allowed after enum / const values
TODO:
cc/ @HanClinto
Important note for Pydantic users

By default, Pydantic's JSON schemas allow additional properties, which llama.cpp will start honoring once this PR is merged. While extra properties can be useful for backwards compatibility (e.g. deserializing values from a newer app version with an older one), for the purpose of generating data they let the LLM sample potentially many useless properties. You can disable additional properties with two lines of extra config for each model class:
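The collapsed output isn't reproduced here; as a sketch, in Pydantic v2 the extra config looks roughly like this (`QAPair` is a made-up model name, and the exact snippet in the PR may differ):

```python
from pydantic import BaseModel, ConfigDict

class QAPair(BaseModel):
    # Forbid extra keys: the exported JSON schema then contains
    # "additionalProperties": false, so the llama.cpp grammar won't
    # let the model invent additional properties.
    model_config = ConfigDict(extra="forbid")

    question: str
    answer: str
```

`QAPair.model_json_schema()` then includes `"additionalProperties": false`, which the grammar converter honors.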