Support embedded expressions/braces in double quoted strings #26

Open
wants to merge 3 commits into
base: main
from

Nicholas-Lin
Contributor

@Nicholas-Lin Nicholas-Lin commented Aug 5, 2021

Summary

This PR adds support for embedded expressions and embedded braces in double-quoted strings. It addresses a similar issue to PR #25, but it also adds support for embedded expressions, and the implementation is done entirely in grammar.json (not scanner.cc).

Here are some examples of the constructs that are now supported:

"$var";
"$var[subscript]";
"$var->member";
"{$var->prop}";
"{$var->prop["key"]}";

I also added support for escape character sequences so the following examples should parse correctly:

"\$notavar";  <-- string literal
"\\\\\$notavar"; <-- string literal
"\\\{$embedexp}";  <-- should be identified as an embedded expression since the brace is escaped

Initially there were some issues with the parser incorrectly interpreting instances of #, //, /* in the string as a comment, but this should not be a problem anymore!
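
To make the escape cases above concrete, here is a minimal sketch in plain JavaScript (not the grammar itself) of how a string body could be split into escape sequences, embedded variables, and literal text. The function name `classifySegments` and the segment labels are made up for illustration; only the simple `$name` form is handled, and a backslash plus any following character is treated as one escape unit, which is a simplification.

```javascript
// Hypothetical helper: splits the *body* of a double-quoted string
// (without the surrounding quotes) into labeled segments.
function classifySegments(body) {
  const segments = [];
  let text = '';
  let i = 0;

  // Push any pending literal text as its own segment.
  const flush = () => {
    if (text) segments.push({ kind: 'text', value: text });
    text = '';
  };

  while (i < body.length) {
    const ch = body[i];

    // Simplification: a backslash always consumes the next character,
    // so "\$" can never start a variable.
    if (ch === '\\' && i + 1 < body.length) {
      flush();
      segments.push({ kind: 'escape', value: body.slice(i, i + 2) });
      i += 2;
      continue;
    }

    // Only the simple "$name" form is recognized in this sketch.
    const m = /^\$[a-zA-Z_][a-zA-Z0-9_]*/.exec(body.slice(i));
    if (m) {
      flush();
      segments.push({ kind: 'variable', value: m[0] });
      i += m[0].length;
      continue;
    }

    text += ch;
    i += 1;
  }
  flush();
  return segments;
}
```

With this sketch, the body `\$notavar` yields an escape segment followed by a text segment, while `sometext$var` yields a text segment followed by a variable segment.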

Requirements (place an x in each [ ])

@CLAassistant

CLAassistant commented Aug 5, 2021

CLA assistant check
All committers have signed the CLA.

@Nicholas-Lin Nicholas-Lin changed the title Support embedded exp/braces in double quoted strings Support embedded expressions/braces in double quoted strings Aug 5, 2021
@Nicholas-Lin Nicholas-Lin marked this pull request as ready for review August 5, 2021 01:36
@Nicholas-Lin Nicholas-Lin requested a review from aosq August 5, 2021 01:36
@Nicholas-Lin Nicholas-Lin marked this pull request as draft August 5, 2021 14:32
@Nicholas-Lin Nicholas-Lin marked this pull request as ready for review August 5, 2021 17:34
@Nicholas-Lin
Contributor Author

Nicholas-Lin commented Aug 5, 2021

I noticed that embedded braces do not support scoped identifiers. For example the following test case will fail:

"{$var::get()}";

Not sure if this should be addressed in this PR or we can make a separate PR for it since this one has quite a few changes already.

Comment on lines +421 to +425
choice.rep($._string_character, $._escape_sequence, $.embedded_expression, $.embedded_brace_expression),
'"',
),

_string_character: $ => choice(token(/([^"\\])/), token(prec(1, choice('#', '//', '/*')))),
Member

From last time we talked, did you try putting the repeating $._string_character in its own rule, so we can just have a $.string_body rule that encompasses a full run of consecutive string characters?

string_body: $ => rep1($._string_character)

or

string_body: $ => rep1(choice(token(/([^"\\])/), token(prec(1, choice('#', '//', '/*')))))

Member

Also, this is a nit, but I don't think you need token around /([^"\\])/. And that rule also doesn't have to use a regex group anymore () since there's only one option.

Contributor Author

When I try that it creates a conflict. Any suggestions for resolving it?

Unresolved conflict for symbol sequence:

  '"'  string_body_repeat1  •  '_string_character_token1'  …

Possible interpretations:

  1:  '"'  (string_body  string_body_repeat1)  •  '_string_character_token1'  …
  2:  '"'  (string_body_repeat1  string_body_repeat1  •  string_body_repeat1)

Possible resolutions:

  1:  Specify a left or right associativity in `string_body`
  2:  Add a conflict for these rules: `string_body`

I suspect that adding this rule as is might lead to some problems. Tree-sitter prioritizes matching based on length. Say for example we have:

"sometext$var"

I think the $ will be captured as part of the string body instead of an embedded expression, and therefore the whole thing will be parsed as a string body. We might need to make some modifications to accommodate this.
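
The concern can be reproduced outside tree-sitter with plain regular expressions. This sketch is only an analogy: JavaScript regexes are not the tree-sitter lexer, and the names `greedyBody` and `guardedBody` are made up here. It shows that a run of "anything except a quote or backslash" swallows `$var`, while a run that also excludes `$` stops in front of it.

```javascript
const input = 'sometext$var';

// Analogue of repeating /[^"\\]/: any run of characters except '"' and '\'.
const greedyBody = /^[^"\\]+/;

// Same run, but '$' is excluded so that a variable can start a new token.
const guardedBody = /^[^"\\$]+/;

const greedyMatch = greedyBody.exec(input)[0];   // 'sometext$var'
const guardedMatch = guardedBody.exec(input)[0]; // 'sometext'
const rest = input.slice(guardedMatch.length);   // '$var'
```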

Member

@aosq aosq Aug 19, 2021

I think you'd want right associativity for $.string_body since that'll capture the largest node. Left associativity would capture the $.string_body on first match so it'd end up just capturing individual characters like it does without the repeat1.
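
A minimal sketch of what the right-associative version of the rule could look like. The `repeat1` and `prec` definitions below are stand-ins added only so the snippet runs under plain Node; in a real grammar.js they come from tree-sitter, and `rep1` is the shorthand already used in this thread. The object shapes produced by the stand-ins are an assumption for illustration.

```javascript
// Stand-ins for the tree-sitter DSL so this sketch runs under plain Node.
// In a real grammar.js, tree-sitter supplies `repeat1` and `prec`.
const repeat1 = (rule) => ({ type: 'REPEAT1', content: rule });
const prec = {
  right: (rule) => ({ type: 'PREC_RIGHT', value: 0, content: rule }),
};
const rep1 = repeat1; // shorthand used in this thread

const rules = {
  // Right associativity: on a conflict, prefer to keep extending the
  // current string_body rather than closing it and starting another.
  string_body: $ => prec.right(rep1($._string_character)),
};

// Placeholder for the `$` object that tree-sitter passes to each rule.
const $ = { _string_character: { type: 'SYMBOL', name: '_string_character' } };
const stringBody = rules.string_body($);
```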

If we run into problems with $ and {, we could do something like we do for $.xhp_comment where we exclude problematic token characters and only allow them in specific sequences. Something like,

  string_body: $ =>
    repeat1(
      choice(
        seq(opt('\\'), choice(/[^"\\${]/, token(prec(1, choice('#', '//', '/*'))))),
        seq(
          // $ is allowed only if not followed by an identifier character
          repeat1('$'),
          opt('\\'),
          choice(/[^"\\${a-zA-Z_\x80-\xff]/, token(prec(1, choice('#', '//', '/*')))),
        ),
        seq(
          // { is allowed only if not followed by $
          repeat1('{'),
          opt('\\'),
          choice(/[^"\\${]/, token(prec(1, choice('#', '//', '/*')))),
        ),
      ),
    ),
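
As a rough illustration of the intent behind the rule above (not a translation of it), the following plain JavaScript sketch decides whether a `$` or `{` at a given position would stay inside the string body. The helper names are hypothetical, escapes are not handled, and the identifier-start class is the same `[a-zA-Z_\x80-\xff]` class used above.

```javascript
// Characters that can begin an identifier, per the class used in the rule above.
const IDENT_START = /[a-zA-Z_\x80-\xff]/;

// Hypothetical helper: true if the character at index `i` may remain
// plain string body in this simplified model.
function staysInBody(body, i) {
  const ch = body[i];
  const next = body[i + 1] || '';
  if (ch === '$') return !IDENT_START.test(next); // "$5" is text, "$var" is not
  if (ch === '{') return next !== '$';            // "{x" is text, "{$" is not
  return ch !== '"' && ch !== '\\';               // quotes and escapes end the run
}

// Hypothetical helper: length of the longest body run starting at index 0.
function bodyPrefixLength(body) {
  let i = 0;
  while (i < body.length && staysInBody(body, i)) i += 1;
  return i;
}
```

For `sometext$var`, the run stops after `sometext`, which leaves `$var` available to be matched as an embedded expression.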

I checked out your branch and tried adding right associativity to the repeating $._string_character, and it doesn't seem to conflict with $.variable (honestly surprised it doesn't cause issues, at least not immediate ones).

@frankeld
Member

frankeld commented Aug 10, 2021

This is good progress, but there are some extended test cases that seem to break it.
"Test $var->tester- Hello"; errors, but the parser should read $var->tester as an embedded member selection expression.

Also, we have some inconsistency with the way we nest expression items. Consider the following test case/output:

"{$var->fun->yum}";

$var->fun->yum;
(selection_expression [5, 1] - [5, 15]
  (variable [5, 1] - [5, 5])
  (selection_expression [5, 7] - [5, 15]
    (qualified_identifier [5, 7] - [5, 10]
      (identifier [5, 7] - [5, 10]))
    (qualified_identifier [5, 12] - [5, 15]
      (identifier [5, 12] - [5, 15]))))))

(selection_expression [7, 0] - [7, 14]
  (selection_expression [7, 0] - [7, 9]
    (variable [7, 0] - [7, 4])
    (qualified_identifier [7, 6] - [7, 9]
      (identifier [7, 6] - [7, 9])))
  (qualified_identifier [7, 11] - [7, 14]
    (identifier [7, 11] - [7, 14])))))

In the case of the double quoted string, we have the variable in the level between the two selection expressions. This is incorrect, as the selection of the variable isn't against the value of fun->yum. The non-embedded version gets parsed correctly, as the leading variable identifier is in the deepest level of the nested selection. This inconsistency also happens with Heredoc variable substitution, which may be where we are inheriting it from.
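
The difference between the two shapes can be sketched with a small fold in plain JavaScript. This is illustrative only: the node objects and the `nestLeft`, `nestRight`, and `depthOf` helpers are invented here and do not correspond to tree-sitter's API. Folding the member chain from the left puts the variable in the deepest node (the shape of the non-embedded parse), while the right-leaning shape leaves the variable at the top level next to a selection of the members (the shape reported for the embedded case).

```javascript
// Hypothetical node constructor for a selection such as `a->b`.
const sel = (object, member) => ({ type: 'selection_expression', object, member });

// Left fold: (($var -> fun) -> yum). The variable ends up deepest.
function nestLeft(variable, members) {
  return members.reduce((acc, m) => sel(acc, m), variable);
}

// Right-leaning shape: ($var -> (fun -> yum)). The variable stays at the top.
function nestRight(variable, members) {
  if (members.length === 0) return variable;
  const tail = members.slice(1).reduce((acc, m) => sel(acc, m), members[0]);
  return sel(variable, tail);
}

// Depth at which `target` appears in the tree, or -1 if it is absent.
function depthOf(node, target, depth = 0) {
  if (node === target) return depth;
  if (typeof node !== 'object' || node === null) return -1;
  return Math.max(
    depthOf(node.object, target, depth + 1),
    depthOf(node.member, target, depth + 1),
  );
}

const left = nestLeft('$var', ['fun', 'yum']);
const right = nestRight('$var', ['fun', 'yum']);
```

In `left` the variable sits two levels down, and in `right` it sits one level down, which mirrors the two parse trees shown above.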

@cfroystad

In case it could be helpful, I've implemented string parsing for PHP in the tree-sitter-php repository. Please use whatever is useful to you: tree-sitter/tree-sitter-php#72

@aosq
Member

aosq commented Aug 19, 2021

Also, we have some inconsistency with the way we nest expression items

Started looking into this and you're right that the inconsistency comes from $.heredoc. I originally wrote the custom embedded braced expression rules (instead of, say, reusing $.call_expression, $.subscript_expression, $.selection_expression) for two reasons: embedded braced expressions are restricted to expressions that start with a $.variable, and a valid embedded braced expression can't have a space between { and $.

Reusing existing call/subscript/selection definitions
Previously, I thought reusing the existing call/subscript/selection rules would allow invalid scenarios. Thinking on this a little more I realized that's not the case: #29.

tree-sitter-hack/grammar.js

Lines 154 to 164 in 8ac0c52

embedded_braced_expression: $ =>
  seq(
    '{',
    choice(
      $.variable,
      $.call_expression,
      $.subscript_expression,
      $.selection_expression,
    ),
    '}',
  ),

Replacing $.embedded_brace_expression with the already defined call/subscript/selection rules fixes the issue you described for heredocs, but I think this only works because heredocs use a scanner. Don't think we could apply the same fix to $.string without a scanner.

Scanner hack
One way to make the simplified version of $.embedded_brace_expression work for both heredoc and string, without resorting to a scanner for string content, is to create a scanner node just for the { character of the embedded braced expression. This would allow us to use a simplified $.embedded_brace_expression but restrict the internal expressions to start with $.variable, like we do today for heredocs.

Fixing custom call/subscript/selection definitions
I don't see a way to do this (yet) that doesn't require some gnarly copy-pasting of existing definitions and modifying them further to restrict them to the embedded braced expression case.

This was referenced Aug 19, 2021