Syntax.txt

= RUBY REGULAR EXPRESSION SYNTAX


== Syntax Elements

[\]     escape (enable or disable meta character meaning)
[|]     alternation
[(...)] group
[[...]] character class  


== Characters

[\t]           horizontal tab (0x09)
[\v]           vertical tab   (0x0B)
[\n]           newline        (0x0A)
[\r]           return         (0x0D)
[\b]           back space     (0x08)

               \b is effective in character class [...] only
[\f]           form feed      (0x0C)
[\a]           bell           (0x07)
[\e]           escape         (0x1B)
[\nnn]         octal char            (encoded byte value)
[\xHH]         hexadecimal char      (encoded byte value)
[\x{7HHHHHHH}] wide hexadecimal char (character code point value)
[\cx]          control char          (character code point value)
[\C-x]         control char          (character code point value)
[\M-x]         meta  (x|0x80)        (character code point value)
[\M-\C-x]      meta control char     (character code point value)


== Character types

[.]        any character (except newline)
[\w]       word character

           Not Unicode:
           * alphanumeric, "_" and multibyte char. 
           Unicode:
           * General_Category -- (Letter|Mark|Number|Connector_Punctuation)
[\W]       non word char
[\s]       whitespace char

           Not Unicode:
           * \t, \n, \v, \f, \r, \x20
           Unicode:
           * 0009, 000A, 000B, 000C, 000D, 0085(NEL), 
           * General_Category:
             * -- Line_Separator
             * -- Paragraph_Separator
             * -- Space_Separator
[\S]       non whitespace char
[\d]       decimal digit char

           Unicode: General_Category -- Decimal_Number
[\D]       non decimal digit char
[\h]       hexadecimal digit char   [0-9a-fA-F]
[\H]       non hexadecimal digit char


== Character Properties

 \p{property-name}
 \p{^property-name}    (negative)
 \P{property-name}     (negative)

=== property-name:

Works on all encodings:
* Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower,
  Print, Punct, Space, Upper, XDigit, Word, ASCII,
Works on EUC_JP, Shift_JIS:
* Hiragana, Katakana
Works on UTF8, UTF16, UTF32:
* Any, Assigned, C, Cc, Cf, Cn, Co, Cs, L, Ll, Lm, Lo, Lt, Lu,
  M, Mc, Me, Mn, N, Nd, Nl, No, P, Pc, Pd, Pe, Pf, Pi, Po, Ps,
  S, Sc, Sk, Sm, So, Z, Zl, Zp, Zs, 
  Arabic, Armenian, Bengali, Bopomofo, Braille, Buginese,
  Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic,
  Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian,
  Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul,
  Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana,
  Kharoshthi, Khmer, Lao, Latin, Limbu, Linear_B, Malayalam,
  Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian,
  Oriya, Osmanya, Runic, Shavian, Sinhala, Syloti_Nagri, Syriac,
  Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan,
  Tifinagh, Ugaritic, Yi

== Quantifiers

=== Greedy

[?]       1 or 0 times
[*]       0 or more times
[+]       1 or more times
[{n,m}]   at least n but not more than m times
[{n,}]    at least n times
[{,n}]    at least 0 but not more than n times ({0,n})
[{n}]     n times

=== Reluctant

[??]      1 or 0 times
[*?]      0 or more times
[+?]      1 or more times
[{n,m}?]  at least n but not more than m times  
[{n,}?]   at least n times
[{,n}?]   at least 0 but not more than n times (== {0,n}?)

=== Possessive (greedy and does not backtrack after repeated)

[?+]      1 or 0 times
[*+]      0 or more times
[++]      1 or more times

({n,m}+, {n,}+, {n}+ are possessive op. in ONIG_SYNTAX_JAVA only)


== Anchors

[^]       beginning of the line
[$]       end of the line
[\b]      word boundary
[\B]      not word boundary
[\A]      beginning of string
[\Z]      end of string, or before newline at the end
[\z]      end of string
[\G]      matching start position 


== Character class

[^...]    negative class (lowest precedence operator)
[x-y]     range from x to y
[[...]]   set (character class in character class)
[..&&..]  intersection (low precedence at the next of ^)
          
If you want to use '[', '-', ']' as a normal character
in a character class, you should escape these characters by '\'.


POSIX bracket ([:xxxxx:], negate [:^xxxxx:])

=== Not Unicode Case:

[alnum]    alphabet or digit char
[alpha]    alphabet
[ascii]    code value: [0 - 127]
[blank]    \t, \x20
[cntrl]    control
[digit]    0-9
[graph]    include all of multibyte encoded characters
[lower]    lower case
[print]    include all of multibyte encoded characters
[punct]    punctuation
[space]    \t, \n, \v, \f, \r, \x20
[upper]    upper case
[xdigit]   0-9, a-f, A-F
[word]     alphanumeric, "_" and multibyte characters


=== Unicode Case:

[alnum]    Letter | Mark | Decimal_Number
[alpha]    Letter | Mark
[ascii]    0000 - 007F
[blank]    Space_Separator | 0009
[cntrl]    Control | Format | Unassigned | Private_Use | Surrogate
[digit]    Decimal_Number
[graph]    [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
[lower]    Lowercase_Letter
[print]    [[:graph:]] | [[:space:]]
[punct]    Connector_Punctuation | Dash_Punctuation | Close_Punctuation |
           Final_Punctuation | Initial_Punctuation | Other_Punctuation |
           Open_Punctuation
[space]    Space_Separator | Line_Separator | Paragraph_Separator |
           0009 | 000A | 000B | 000C | 000D | 0085
[upper]    Uppercase_Letter
[xdigit]   0030 - 0039 | 0041 - 0046 | 0061 - 0066
           (0-9, a-f, A-F)
[word]     Letter | Mark | Decimal_Number | Connector_Punctuation


== Extended groups

[(?#...)]            comment
[(?imx-imx)]         option on/off:
                     * i: ignore case
                     * m: multi-line (dot(.) match newline)
                     * x: extended form
[(?imx-imx:subexp)]  option on/off for subexp
[(?:subexp)]         not captured group
[(subexp)]           captured group
[(?=subexp)]         look-ahead
[(?!subexp)]         negative look-ahead
[(?<=subexp)]        look-behind
[(?<!subexp)]        negative look-behind

                     Subexp of look-behind must be fixed character length.
                     But different character length is allowed in top level
                     alternatives only.
                     ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.
                     
                     In negative-look-behind, captured group isn't allowed, 
                     but shy group(?:) is allowed.
[(?>subexp)]         atomic group
                     don't backtrack in subexp.
[(?<name>subexp)]    define named group
                     (All characters of the name must be a word character.)
                     
                     Not only a name but a number is assigned like a captured
                     group.

                     Assigning the same name as two or more subexps is allowed.
                     In this case, a subexp call can not be performed although
                     the back reference is possible.


== Back reference

[\n]          back reference by group number (n >= 1)
[\k<name>]    back reference by group name
In the back reference by the multiplex definition name,
a subexp with a large number is referred to preferentially.
(When not matched, a group of the small number is referred to.)

* Back reference by group number is forbidden if named group is defined 
  in the pattern and ONIG_OPTION_CAPTURE_GROUP is not setted.


=== Back reference with nest level

[\k<name+n>]     n: 0, 1, 2, ...
[\k<name-n>]     n: 0, 1, 2, ...

Destinate relative nest level from back reference position.    

Examples:
   /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")

   r = ORegexp.compile(<<'__REGEXP__'.strip, :options => Oniguruma::EXTENDED)
   (?<element> \g<stag> \g<content>* \g<etag> ){0}
   (?<stag> < \g<name> \s* > ){0}
   (?<name> [a-zA-Z_:]+ ){0}
   (?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
   (?<etag> </ \k<name+1> >){0}
   \g<element>
   __REGEXP__
   
   p r.match('<foo>f<bar>bbb</bar>f</foo>').captures


=== Subexp call ("Tanaka Akira special")

[\g<name>]    call by group name
[\g<n>]       call by group number (n >= 1)

* left-most recursive call is not allowed.

  Example:
    (?<name>a|\g<name>b)   => error
    (?<name>a|b\g<name>c)  => OK
* Call by group number is forbidden if named group is defined in the pattern
  and Oniguruma::OPTION_CAPTURE_GROUP is not set.
* If the option status of called group is different from calling position
  then the group's option is effective.
  
  Example:
    (?-i:\g<name>)(?i:(?<name>a)){0}  <i>matches "A"</i>


== Captured group

Behavior of the no-named group (...) changes with the following conditions.
(But named group is not changed.)

[case 1]                         <code>ORegexp.new( '...' )</code>  (named group is not used, no option)

                                 ... is treated as a captured group.
[case 2]                         <code>ORegexp.new( '...', :options => OPTION_DONT_CAPTURE_GROUP )</code> (named group is not used, 'g' option)

                                 ... is treated as a no-captured group (?:...).

[case 3]                         <code>ORegexp.new( '...(?<name>...)...' )</code>  (named group is used, no option)

                                 (?<name>...) is treated as a no-captured group (?:...)
                                 
                                 numbered-backref/call is not allowed.

[case 2]                         <code>ORegexp.new( '...', :options => OPTION_CAPTURE_GROUP )</code> (named group is used, 'G' option)

                                 (?<name>...) is treated as a captured group (?:...)

                                 numbered-backref/call is allowed.

where
* g: OPTION_DONT_CAPTURE_GROUP
* G: OPTION_CAPTURE_GROUP

('g' and 'G' options are argued in ruby-dev ML)


== Syntax dependent options

=== ONIG_SYNTAX_RUBY

[(?m)] dot(.) match newline

=== ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA

[(?s)] dot(.) match newline
[(?m)] ^ match after newline, $ match before newline

== Original extensions

* hexadecimal digit char type  \h, \H
* named group                  (?<name>...)
* named backref                \k<name>
* subexp call                  \g<name>, \g<group-num>


== Lacking features compare with perl 5.8.0

* \N{name}
* \l,\u,\L,\U, \X, \C
* (?{code})
* (??{code})
* (?(condition)yes-pat|no-pat)
* \Q...\E

  This is effective on ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.


== Differences with Japanized GNU regex(version 0.12) of Ruby 1.8

* add character property (\p{property}, \P{property})
* add hexadecimal digit char type (\h, \H)
* add look-behind

  (?<=fixed-char-length-pattern), (?<!fixed-char-length-pattern)
* add possessive quantifier. ?+, *+, ++
* add operations in character class. [], &&

  ('[' must be escaped as an usual char in character class.)
* add named group and subexp call.
* octal or hexadecimal number sequence can be treated as 
  a multibyte code char in character class if multibyte encoding
  is specified.
  
  (ex. <code>[\xa1\xa2], [\xa1\xa7-\xa4\xa1]</code>)
* allow the range of single byte char and multibyte char in character
  class.
  
  ex. <code>[a-<<any EUC-JP character>>]</code> in EUC-JP encoding.
* effect range of isolated option is to next ')'.
  ex. (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
* isolated option is not transparent to previous pattern.
  ex. <code>a(?i)*</code> is a syntax error pattern.
* allowed incompleted left brace as an usual string.
  ex. /{/, /({)/, /a{2,3/ etc...
* negative POSIX bracket [:^xxxx:] is supported.
* POSIX bracket [:ascii:] is added.
* repeat of look-ahead is not allowed.
  ex. <code>(?=a)*</code>, <code>(?!b){5}</code>
* Ignore case option is effective to numbered character.
  ex. <code>/\x61/i =~ "A"<code>
* In the range quantifier, the number of the minimum is omissible.

  <code>/a{,n}/ == /a{0,n}/<code>
  
  The simultanious abbreviation of the number of times of the minimum
  and the maximum is not allowed. (/a{,}/)
* <code>a{n}?<code> is not a non-greedy operator.
  <code>/a{n}?/ == /(?:a{n})?/<code>
* invalid back reference is checked and cause error.
  /\1/, /(a)\2/
* Zero-length match in infinite repeat stops the repeat,
  then changes of the capture group status are checked as stop condition.
   /(?:()|())*\1\2/ =~ ""
   /(?:\1a|())*/ =~ "a"


== Problems

* Invalid encoding byte sequence is not checked in UTF-8.

* Invalid first byte is treated as a character.
   /./u =~ "\xa3"

* Incomplete byte sequence is not checked.
   /\w+/ =~ "a\xf3\x8ec"