Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple repeat operators #37

Open
Krzmbrzl opened this issue Nov 23, 2023 · 4 comments
Open

Multiple repeat operators #37

Krzmbrzl opened this issue Nov 23, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@Krzmbrzl
Copy link

The following snippet

A <- B C?
B <- "B"
C <- [a-z]+

produces the error

pe._errors.GrammarError: multiple repeat operators

upon compiling it.

It seems like pe thinks that the ? and the + operator follow each other immediately and ignores the rule indirection. However, I think a grammar like the above seems perfectly normal.
Therefore, this seems like a bug to me.

@goodmami goodmami added the bug Something isn't working label Nov 23, 2023
@goodmami
Copy link
Owner

Hmm, I think you're right, if we only relax the check for an outer ?. The purpose of this error is to avoid infinite loops. For the three repeat operators *, +, and ?, we have the following combinations:

Pattern Effect in pe* Allowed by re**
(X*)* infinite loop a**
(X*)+ infinite loop a*+ ✔️
(X*)? redundant a*? ✔️
(X+)* redundant a+*
(X+)+ redundant a++ ✔️
(X+)? equivalent to X* a+? ✔️
(X?)* infinite loop a?*
(X?)+ infinite loop a?+ ✔️
(X?)? redundant a?? ✔️

* You might not see the infinite loop in pe because the packrat parser has a simple guard for loops that do not advance the position, and, secondly, optimized grammars might turn the pattern into a single regex which does not have the infinite loop.

** The re module gives an error for a**, a+*, and a?*, but you can avoid the error with a group, like (a*)* or (?:a*)*, and using it to match does not result in an infinite loop.

In spite of protections against it, there are some ways to induce the infinite loop. For instance, if you use a capture (~) to prevent the grammar error and then use the machine parser, it will get stuck:

import pe
pe.match('(~"a"*)*', "b", parser="machine")  # DANGER: infinite loop

I'm prepared to allow (X+)?, because even if the effect is the same as X*, it seems natural to write grammars like this:

Something <- Token?
Token     <- Char+

For the "redundant" cases above, it should be safe to allow them, perhaps with a GrammarWarning instead of a GrammarError. For now I'd rather continue raising a GrammarError on the infinite loop cases.

(Aside: while investigating this issue, I came across another bug. See #38)

@Krzmbrzl
Copy link
Author

Krzmbrzl commented Nov 24, 2023

I think it would be good to always error on immediate chaining of those operators as in e.g. A*+ regardless of whether or not this can be interpreted in a reasonable non-infinite-loop-producing way.
However, as soon as we deal with indirections across rules, I think all repetition operators should be allowed as long as the referenced rule can not match an empty string. If it can, all of those operators either don't make sense to begin with (?) or can create an infinite loop (* and +).
However, as long as the referenced rule definitely matches at least one character/byte, applying a repetition operator to it makes sense.

The fact that some of this chaining is redundant and can be rewritten with only a single operator should be an implementation detail of pe when it optimizes the grammar/parser.
However, from a user's POV it's quite convenient to write a rule matching a specific input and then reuse that rule elsewhere with different repetition operators. Just think of e.g. an ID rule such as

ID <- [a-zA-Z_0-9]+

which can then be used as e.g.

name <- ID ID? # first name and optional last name
name_list <- ID+ # Multiple names after another

Furthermore, I would like to point out that e.g.

A < B+
B <- [a-z]+

is not equal to

A < [a-z]+

nor to

A <- [a-z]+

this is due to auto-ignoring (of whitespace).

@goodmami
Copy link
Owner

I think it would be good to always error on immediate chaining of those operators as in e.g. A*+ regardless of whether or not this can be interpreted in a reasonable non-infinite-loop-producing way.

These are currently blocked at the syntax level (aside: maybe we need a new error class like GrammarSyntaxError), as they are not part of Ford's original PEG metasyntax. I'm not entirely opposed to allowing it if we allow (A+)?, etc.

However, as soon as we deal with indirections across rules, I think all repetition operators should be allowed as long as the referenced rule can not match an empty string.

One part of pe's design is that non-terminals are not inherently different from inline forms. That is, the A1 and A2 productions below have exactly equivalent behavior:

A1 <- B C
B  <- "b"
C  <- "c"

A2 <- "b" "c"

This design is what enables pe to reason about when to optimize patterns. Semantic changes, such as using a ~ capture or attaching an action to a rule, introduce differences that block optimization.

Thus, if we allow

A <- B?
B <- "b"+

Then we must also allow

A <- ("b"+)?

This is also why the capture in (~"a"*)* above causes the infinite loop. So maybe we need the ability to see across these semantic effects for detecting potential infinite loops.

However, as long as the referenced rule definitely matches at least one character/byte, applying a repetition operator to it makes sense.

Apologies if I misunderstood, but I want to be able to detect these things at compile time, not parse time. So I don't want to care if, at parse time, A* matches more than zero characters so that we can allow a second * to repeat over it. I want to detect at compile time that this pattern can potentially match an empty string infinitely.

Furthermore, I would like to point out that [...] is not equal to [...] nor to [...] this is due to auto-ignoring (of whitespace).

Yes, I did not bring up auto-ignore in my reply above because I wanted to keep the conversation focused, but I agree that we would need to consider its effects in the multiple repeat operator issue. It's a relatively new feature of pe and I'm still not as certain about its ramifications.

@Krzmbrzl
Copy link
Author

Krzmbrzl commented Nov 25, 2023

I want to detect at compile time that this pattern can potentially match an empty string infinitely.

Indeed that is what I meant. If the formal specification of the rule allows it to potentially match an empty string, don't allow repetition operators on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants