Merge pull request #48 from goodmami/gh-46-bounded-repetitions
goodmami authored Jun 19, 2024
2 parents 89ccdfd + 0b61886 commit ed9789c
Showing 17 changed files with 266 additions and 68 deletions.
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,20 @@
## [Unreleased][unreleased]


## [v0.5.3][]

**Release date: 2024-06-18**

### Added

* `pe.operators.Repeat` ([#46])
* `e{n}` and `e{m,n}` syntax ([#46])

### Changed

* Bounded repetitions via `e{n}` and `e{m,n}` forms ([#46])


## [v0.5.2][]

**Release date: 2024-03-28**
@@ -183,6 +197,7 @@ descent parser and a work-in-progress state-machine parser.
[v0.5.0]: ../../releases/tag/v0.5.0
[v0.5.1]: ../../releases/tag/v0.5.1
[v0.5.2]: ../../releases/tag/v0.5.2
[v0.5.3]: ../../releases/tag/v0.5.3

[#6]: https://github.com/goodmami/pe/issues/6
[#7]: https://github.com/goodmami/pe/issues/7
@@ -199,3 +214,4 @@ descent parser and a work-in-progress state-machine parser.
[#36]: https://github.com/goodmami/pe/issues/36
[#38]: https://github.com/goodmami/pe/issues/38
[#44]: https://github.com/goodmami/pe/issues/44
[#46]: https://github.com/goodmami/pe/issues/46
2 changes: 2 additions & 0 deletions README.md
@@ -87,6 +87,8 @@ e # exactly one
e? # zero or one (optional)
e* # zero or more
e+ # one or more
e{5} # exactly 5
e{3,5} # three to five
# combining expressions
e1 e2 # sequence of e1 and e2
3 changes: 3 additions & 0 deletions docs/api/pe.operators.md
@@ -80,6 +80,9 @@ the rest are described by the [specification](../specification.md).
* pe.operators.**<a id="Plus" href="#Plus">Plus</a>**
(*expression*)

* pe.operators.**<a id="Repeat" href="#Repeat">Repeat</a>**
(*expression, count=-1, min=0, max=-1*)

* pe.operators.**<a id="Nonterminal" href="#Nonterminal">Nonterminal</a>**
(*name*)

44 changes: 39 additions & 5 deletions docs/specification.md
@@ -34,6 +34,10 @@ following sections.
| `p?` | [Optional] | [Quantified] | Match `p` zero or one times |
| `p*` | [Star] | [Quantified] | Match `p` zero or more times |
| `p+` | [Plus] | [Quantified] | Match `p` one or more times |
| `p{n}` | [Repeat] | [Quantified] | Match `p` exactly *n* times |
| `p{m,n}` | [Repeat] | [Quantified] | Match `p` from *m* to *n* times |
| `p{,n}` | [Repeat] | [Quantified] | Match `p` up to *n* times |
| `p{m,}` | [Repeat] | [Quantified] | Match `p` at least *m* times |
| `q` | (default) | [Valued] | Match quantified term `q`; consume input; pass up values |
| `&q` | [And] | [Valued] | Succeed if `q` matches; consume no input; suppress values |
| `!q` | [Not] | [Valued] | Fail if `q` matches; consume no input; suppress values |
@@ -69,7 +73,10 @@ Valued <- (prefix:Prefix)? Quantified
Prefix <- AND / NOT / TILDE / Binding
Binding <- Identifier COLON
Quantified <- Primary (quantifier:Quantifier)?
Quantifier <- QUESTION / STAR / PLUS
Quantifier <- QUESTION / STAR / PLUS / Repeat
Repeat <- LEFTBRACE RepeatSpec RIGHTBRACE
RepeatSpec <- (min:Integer)? COMMA (max:Integer)?
/ count:Integer
Primary <- Name / Group / Literal / Class / DOT
Name <- Identifier !Operator
Group <- OPEN Expression CLOSE
@@ -93,8 +100,12 @@ Char <- '\\' [tnvfr"'-\[\\\]]
Oct <- [0-7]
Hex <- [0-9a-fA-F]
Integer <- ~( [0-9]+ ) Spacing
LEFTARROW <- '<-' Spacing
LEFTANGLE <- '<' Space Spacing
LEFTBRACE <- '{' Spacing
RIGHTBRACE <- '}' Spacing
SLASH <- '/' Spacing
AND <- '&' Spacing
NOT <- '!' Spacing
@@ -105,6 +116,7 @@ STAR <- '*' Spacing
PLUS <- '+' Spacing
OPEN <- '(' Spacing
CLOSE <- ')' Spacing
COMMA <- ',' Spacing
DOT <- '.' Spacing
Spacing <- (Space / Comment)*
@@ -133,8 +145,8 @@ the only expression type that may be quantified.
Quantified expressions indicate how many times they must occur for the
expression to match. The default (unannotated) quantified expression
must occur exactly once. The [Optional], [Star], and [Plus] operators
change this number.
must occur exactly once. The [Optional], [Star], [Plus], and [Repeat]
operators change this number.
##### Valued
[Valued]: #valued
@@ -192,6 +204,8 @@ expressions compared to the equivalent in-situ expression.
| `e?` | [Optional] | 5 | [Quantified] |
| `e*` | [Star] | 5 | [Quantified] |
| `e+` | [Plus] | 5 | [Quantified] |
| `e{n}` | [Repeat] | 5 | [Quantified] |
| `e{m,n}` | [Repeat] | 5 | [Quantified] |
| `&e` | [And] | 4 | [Valued] |
| `!e` | [Not] | 4 | [Valued] |
| `~e` | [Capture] | 4 | [Valued] |
@@ -278,7 +292,7 @@ The following ASCII punctuation characters, in addition to
not necessarily inside [string literals](#literal) or [character
classes](#class); for these see below):
! " # & ' ( ) * + . / : ? [ \ ] _ ~
! " # & ' ( ) * + . / : ? [ \ ] _ { } ~

Special characters inside [string literals](#literal) and [character
classes](#class) are different. For both, the `\` character is used
@@ -293,7 +307,7 @@ escaped:
The other ASCII punctuation characters are currently unused but are
reserved in expressions for potential future uses:

$ % , - ; < = > @ ` { | }
$ % , - ; < = > @ ` |
### Whitespace
@@ -538,6 +552,26 @@ while bound values get overwritten (therefore only the value bound by
the last match is passed up).


### Repeat
[Repeat]: #repeat

- Notation: `e{n}`, `e{m,n}`, `e{,n}`, or `e{m,}`
- Function: [Repeat](api/pe.operators.md#Repeat)(*expression, count=-1, min=0, max=-1*)
- Type: [Quantified]

The Repeat operator succeeds if the given expression matches exactly
*n* times in the `e{n}` form, or between *m* and *n* times (inclusive)
in the `e{m,n}` form. Either bound may be omitted: `e{,n}` matches up
to *n* times and `e{m,}` matches at least *m* times. Emitted values of
the given expression are accumulated, while bound values are
overwritten, so only the value bound by the last match is passed up.

Repeat generalizes the quantifiers above: it can express
[Optional](#optional) (`e{0,1}`), [Star](#star) (`e{0,}`), and
[Plus](#plus) (`e{1,}`), and it also allows fixed-count repetition and
bounded repetition with minimum and maximum counts greater than one.
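
The counting behavior of Repeat can be sketched in plain Python. This is
a minimal illustration under stated assumptions, not pe's implementation:
the `literal` and `repeat` helpers and the `min_count`/`max_count`
parameter names are hypothetical, and `-1` stands for "no upper bound",
mirroring the `max=-1` default in the Repeat signature. The sketch only
tracks positions; it does not model emitted or bound values.

```python
from typing import Callable

# A matcher takes (string, position) and returns the new position,
# or -1 if it fails at that position.
Matcher = Callable[[str, int], int]


def literal(text: str) -> Matcher:
    """Return a matcher for a fixed string (hypothetical helper)."""
    def match(s: str, pos: int) -> int:
        return pos + len(text) if s.startswith(text, pos) else -1
    return match


def repeat(e: Matcher, s: str, pos: int = 0,
           min_count: int = 0, max_count: int = -1) -> int:
    """Greedily match e between min_count and max_count times.

    max_count=-1 means unbounded. Returns the end position, or -1 if
    fewer than min_count matches were found.
    """
    count = 0
    while max_count == -1 or count < max_count:
        end = e(s, pos)
        if end < 0:
            break
        if end == pos:
            # A zero-width match would repeat forever at the same
            # position, so treat the minimum as satisfied and stop.
            count = max(count + 1, min_count)
            break
        pos = end
        count += 1
    return pos if count >= min_count else -1


a = literal("a")
print(repeat(a, "aaaa", min_count=2, max_count=3))  # e{2,3}: prints 3
print(repeat(a, "a", min_count=2, max_count=2))     # e{2}: prints -1
print(repeat(a, "aaaa", min_count=1))               # e{1,}: prints 4
```

As in other PEG quantifiers, the loop is greedy and never gives back
matches: `e{2,3}` on `aaaa` stops after the third `a` and leaves the
fourth for whatever expression follows.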


### And
[And]: #and

1 change: 1 addition & 0 deletions pe/_constants.py
@@ -24,6 +24,7 @@ class Operator(enum.Enum):
OPT = (_auto(), 5, 'Quantified') # (OPT, (expr,))
STR = (_auto(), 5, 'Quantified') # (STR, (expr,))
PLS = (_auto(), 5, 'Quantified') # (PLS, (expr,))
RPT = (_auto(), 5, 'Quantified') # (RPT, (expr, min, max))
AND = (_auto(), 4, 'Valued') # (AND, (expr,))
NOT = (_auto(), 4, 'Valued') # (NOT, (expr,))
CAP = (_auto(), 4, 'Valued') # (CAP, (expr,))
58 changes: 38 additions & 20 deletions pe/_cy_machine.pyx
@@ -46,6 +46,7 @@ cdef enum OpCode:
cdef struct State:
int opidx
int pos
int count
int mark
int argidx
int kwidx
@@ -55,6 +56,7 @@ cdef struct State:
cdef State* push(
int opidx,
int pos,
int count,
int mark,
int argidx,
int kwidx,
@@ -65,6 +67,7 @@ cdef State* push(
raise MemoryError()
state.opidx = opidx
state.pos = pos
state.count = count
state.mark = mark
state.argidx = argidx
state.kwidx = kwidx
@@ -100,6 +103,7 @@ cdef class Instruction:
OpCode opcode
short oploc
Scanner scanner
short maxcount
bint marking
bint capturing
object action
@@ -110,6 +114,7 @@ cdef class Instruction:
OpCode opcode,
short oploc=1,
Scanner scanner=None,
short maxcount=1,
bint marking=False,
bint capturing=False,
object action=None,
@@ -118,6 +123,7 @@ cdef class Instruction:
self.opcode = opcode
self.oploc = oploc
self.scanner = scanner
self.maxcount = maxcount
self.marking = marking
self.capturing = capturing
self.action = action
@@ -128,6 +134,7 @@ cdef class Instruction:
self.opcode,
self.oploc,
scanner=self.scanner,
maxcount=self.maxcount,
marking=self.marking,
capturing=self.capturing,
action=self.action,
@@ -207,8 +214,8 @@ cdef class _Parser:
) except -2:
cdef State* state
cdef int retval = -1
state = push(0, 0, -1, 0, 0, NULL) # failure (top backtrack entry)
state = push(-1, -1, -1, 0, 0, state) # success
state = push(0, 0, 0, -1, 0, 0, NULL) # failure (top backtrack entry)
state = push(-1, -1, 1, -1, 0, 0, state) # success
try:
state = self._match(idx, s, pos, args, kwargs, memo, state)
retval = state.pos
@@ -242,20 +249,20 @@ cdef class _Parser:
instr = pi[idx]

if instr.marking:
state = push(0, -1, pos, len(args), len(kwargs), state)
state = push(0, -1, 0, pos, len(args), len(kwargs), state)

if instr.opcode == SCAN:
pos = instr.scanner._scan(s, pos, slen)
if pos < 0:
idx = FAILURE

elif instr.opcode == BRANCH:
state = push(idx + instr.oploc, pos, -1, len(args), len(kwargs), state)
state = push(idx + instr.oploc, pos, 0, -1, len(args), len(kwargs), state)
idx += 1
continue

elif instr.opcode == CALL:
state = push(idx + 1, -1, -1, -1, -1, state)
state = push(idx + 1, -1, 0, -1, -1, -1, state)
idx = instr.oploc
continue

@@ -265,10 +272,14 @@ cdef class _Parser:
continue

elif instr.opcode == UPDATE:
state.pos = pos
state.argidx = len(args)
state.kwidx = len(kwargs)
idx += instr.oploc
if instr.maxcount == -1 or state.count < instr.maxcount:
state.pos = pos
state.argidx = len(args)
state.kwidx = len(kwargs)
idx += instr.oploc
else:
state = pop(state)
idx += 1
continue

elif instr.opcode == RESTORE:
@@ -330,7 +341,7 @@ cdef class _Parser:
idx += 1

if not state:
state = push(0, -1, -1, 0, 0, NULL)
state = push(0, -1, 0, -1, 0, 0, NULL)
else:
state.pos = pos
return state
@@ -395,11 +406,12 @@ def _opt(defn):
Instruction(COMMIT, 1)]


def _str(defn): return _rpt(defn, 0)
def _pls(defn): return _rpt(defn, 1)
def _str(defn): return _loop(defn, 0, -1)
def _pls(defn): return _loop(defn, 1, -1)
def _rpt(defn): return _loop(defn, defn.args[1], defn.args[2])


def _rpt(defn, mincount):
def _loop(defn, mincount, maxcount):
pis = _parsing_instructions(defn.args[0])
if (len(pis) == 1
and pis[0].opcode == SCAN
@@ -409,16 +421,20 @@ def _rpt(defn, mincount):
):
pi = pis[0]
pi.scanner.mincount = mincount
pi.scanner.maxcount = -1
pi.scanner.maxcount = maxcount
return [Instruction(SCAN,
scanner=pi.scanner,
maxcount=1, # scanner has maxcount
marking=pi.marking,
capturing=pi.capturing,
action=pi.action)]
return [*(pi.copy() for _ in range(mincount) for pi in pis),
Instruction(BRANCH, len(pis) + 2),
*pis,
Instruction(UPDATE, -len(pis))]
return [
# unrolling mincount copies can blow up for large or nested counts
# (risk of billion laughs attack)
*(pi.copy() for _ in range(mincount) for pi in pis),
Instruction(BRANCH, len(pis) + 2),
*pis,
Instruction(UPDATE, -len(pis), maxcount=maxcount)
]


def _sym(defn):
@@ -508,6 +524,7 @@ _op_map = {
Operator.OPT: _opt,
Operator.STR: _str,
Operator.PLS: _pls,
Operator.RPT: _rpt,
Operator.SYM: _sym,
Operator.AND: _and,
Operator.NOT: _not,
@@ -639,6 +656,7 @@ cdef class Regex(Scanner):
# print(i,
# _OpCodeNames[pi.opcode],
# f'{pi.oploc:+}',
# f'{pi.maxcount=}',
# 'marking' if pi.marking else '',
# 'capturing' if pi.capturing else '',
# 'name' if pi.name else '')
@@ -648,8 +666,8 @@ cdef class Regex(Scanner):
# states = []
# while state:
# states.append(
# f'<State (opidx={state.opidx}, pos={state.pos}, mark={state.mark},'
# f' argidx={state.argidx}, kwidx={state.kwidx})>'
# f'<State (opidx={state.opidx}, pos={state.pos}, count={state.count},'
# f' mark={state.mark}, argidx={state.argidx}, kwidx={state.kwidx})>'
# )
# state = state.prev
# print(f'stack ({len(states)} entries):')
14 changes: 14 additions & 0 deletions pe/_definition.py
@@ -13,6 +13,7 @@
SYM = Operator.SYM
OPT = Operator.OPT
STR = Operator.STR
RPT = Operator.RPT
PLS = Operator.PLS
AND = Operator.AND
NOT = Operator.NOT
@@ -79,10 +80,22 @@ def _format_nonterminal(defn: Definition, prev_op: Operator) -> str:
return defn.args[0]


def _format_repetition(defn: Definition, prev_op: Operator) -> str:
    subdef, _min, _max = defn.args
    if _min == _max:
        # fixed count: e{n}
        body = str(_min)
    else:
        # bounded form; omit a bound of 0 (min) or -1 (max): e{m,n}, e{,n}, e{m,}
        body = f"{'' if _min == 0 else _min},{'' if _max == -1 else _max}"
    return f"{_format(subdef, defn.op)}{{{body}}}"


_format_decorators: Dict[Operator, Tuple[str, str, str]] = {
# OP: (prefix, delim, suffix)
OPT: ('', '', '?'),
STR: ('', '', '*'),
PLS: ('', '', '+'),
RPT: ('', '', ''),
AND: ('&', '', ''),
NOT: ('!', '', ''),
CAP: ('~', '', ''),
@@ -123,6 +136,7 @@ def _format_debug(defn: Definition, prev_op: Operator) -> str:
OPT: _format_recursive,
STR: _format_recursive,
PLS: _format_recursive,
RPT: _format_repetition,
AND: _format_recursive,
NOT: _format_recursive,
CAP: _format_recursive,