Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--[lang]-query can read tree-sitter (scm) query files #144

Merged
merged 12 commits into from
Nov 5, 2024

Conversation

bheylin
Copy link
Contributor

@bheylin bheylin commented Oct 3, 2024

Relates to #143

@alexpovel I've added a proof of concept to this PR.
The most logical place to add the file check and read is to the struct CustomRustQuery.
If so the QueryError needs to be wrapped in a compound Error that can report
io::Error's along with the parsing errors QueryError is now designed to report.

What do you think?

@bheylin bheylin changed the title Add proof of concept lang-query can read files Oct 3, 2024
@bheylin bheylin changed the title lang-query can read files lang-query can read tree-sitter (scm) query files Oct 3, 2024
@bheylin bheylin changed the title lang-query can read tree-sitter (scm) query files --[lang]-query can read tree-sitter (scm) query files Oct 3, 2024
src/scoping/langs/rust.rs Outdated Show resolved Hide resolved
Copy link
Owner

@alexpovel alexpovel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, thanks.

I was able to give it a first quick look, but not fully groked yet.

The CodeQuery enum was always bizarre, never liked it. It had this weird cyclic feel to it regarding trait impls. I get the feeling your refactor cleared that up.

It might be 2-3 weeks until I have time to look again properly, sorry! Feel free to reiterate, I will still take a look and unlock CI runs.

src/main.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
src/scoping/langs/mod.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
@alexpovel
Copy link
Owner

Main / Execute release chores (pull_request) is an expected failure, no worries there.

@codecov-commenter
Copy link

codecov-commenter commented Oct 13, 2024

Codecov Report

Attention: Patch coverage is 70.00000% with 12 lines in your changes missing coverage. Please review.

Project coverage is 86.61%. Comparing base (dda6ea4) to head (d1ef82d).

Files with missing lines Patch % Lines
src/main.rs 80.76% 5 Missing ⚠️
src/scoping/langs/c.rs 0.00% 2 Missing ⚠️
src/scoping/langs/csharp.rs 0.00% 2 Missing ⚠️
src/scoping/langs/hcl.rs 0.00% 2 Missing ⚠️
src/scoping/langs/typescript.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #144      +/-   ##
==========================================
- Coverage   86.72%   86.61%   -0.11%     
==========================================
  Files          32       32              
  Lines        1876     1898      +22     
==========================================
+ Hits         1627     1644      +17     
- Misses        249      254       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Owner

@alexpovel alexpovel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good, thank you so much!

The PhantomData approach existed to be extra strict. I expected those types to be shuffled about a lot, where confusing them would be bad. But turns out that doesn't really happen, with construction only occurring in a single place.

Getting rid of the weirdly cyclic CodeQuery constructs is a breath of fresh air and appreciated. I'm happy to move forward with this PR if you are.

tests/langs/mod.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
@alexpovel alexpovel linked an issue Oct 13, 2024 that may be closed by this pull request
@bheylin
Copy link
Contributor Author

bheylin commented Oct 13, 2024

The changes look good, thank you so much!

No worries, happy to help out.

The PhantomData approach existed to be extra strict. I expected those types to be shuffled about a lot, where confusing them would be bad. But turns out that doesn't really happen, with construction only occurring in a single place.

Getting rid of the weirdly cyclic CodeQuery constructs is a breath of fresh air and appreciated. I'm happy to move forward with this PR if you are.

I'm happy to merge this into main when you are 😉

I'll fix up the docs first and we should consider a test for the file opening on Windows.

@alexpovel
Copy link
Owner

There are still some doc test failures in CI: https://github.com/alexpovel/srgn/actions/runs/11314113061/job/31463705601?pr=144#step:7:928

@bheylin
Copy link
Contributor Author

bheylin commented Oct 13, 2024

There are still some doc test failures in CI: https://github.com/alexpovel/srgn/actions/runs/11314113061/job/31463705601?pr=144#step:7:928

Fixed in 2b2ea9d

@bheylin
Copy link
Contributor Author

bheylin commented Oct 13, 2024

@alexpovel I'll also add documentation for the new file mode of --lang-query.

@alexpovel
Copy link
Owner

alexpovel commented Oct 13, 2024

Nice, thank you. One possible approach is:

  • place a file with a language query somewhere in docs/, e.g, docs/query
  • write Markdown with some dummy source code, with the file=somefile.someextension syntax you'll find for the other code blocks in the README, e.g.

    srgn/README.md

    Line 123 in 9238fc1

    ```rust file=music.rs
  • invoke in a console Markdown code block as srgn --some-lang-query docs/query --more-args, then show results

This will all be picked up as part of unit testing via

srgn/tests/readme.rs

Lines 1 to 4 in 9238fc1

//! This module contains a small bash interpreter, capable of interpreting simple pipes
//! of programs such as `echo 'Hello World!' | some-program --some-option | some-program
//! arg1 arg2`. It knows how to process and forward stdin and stdout, and will assert
//! commands run terminate as expected, with stdout as expected.

Lastly, docs like

srgn/src/main.rs

Line 1303 in 9238fc1

/// Scope Python code using a custom tree-sitter query.

and

srgn/src/main.rs

Line 1224 in 9238fc1

const TREE_SITTER_QUERY_VALUE_NAME: &str = "TREE-SITTER-QUERY";

(should read something like TREE-SITTER-QUERY OR FILENAME I guess) also need adjustment. That will trigger the README doc tests to fail, which you can then update (a manual process, sorry):

srgn/README.md

Line 1341 in 9238fc1

$ srgn --help

@alexpovel
Copy link
Owner

Oh and I'll squash all commits! If you feel like it you can do that yourself as well, otherwise I'll squash and come up with a suitable commit message.

Ideally and alternatively, instead of squashing into 1 commit, the languages refactor is one chore commit, and the addition of parsing queries from files is a second feat one, but it's fine if you don't feel like doing that work.

@bheylin
Copy link
Contributor Author

bheylin commented Oct 13, 2024

Oh and I'll squash all commits! If you feel like it you can do that yourself as well, otherwise I'll squash and come up with a suitable commit message.

Ideally and alternatively, instead of squashing into 1 commit, the languages refactor is one chore commit, and the addition of parsing queries from files is a second feat one, but it's fine if you don't feel like doing that work.

Yea no worries I'll squash the commits down to something reasonable.

@bheylin
Copy link
Contributor Author

bheylin commented Oct 13, 2024

All good, I'll give this a try and maybe come back with questions

@bheylin
Copy link
Contributor Author

bheylin commented Oct 13, 2024

@alexpovel as a start of making a test for the --[lang]-query flag, what do you think about moving the [lang]::PreparedQuery strings out to the data/queries/[lang]/*.scm location?

See 94f68dc

This will make the query data easily available for various uses.
As a bonus we can then remove the #[allow(clippy::too_long_lines)].

@alexpovel
Copy link
Owner

I am convinced 🙂 This guy will cause trouble though:

pub(super) const IGNORE: &str = "_SRGN_IGNORE";

It is interpolated via formatcp! in places. I am fine with hard-coding the string in .scm files though. Would not be DRY as it is now (_SRGN_IGNORE is defined in a single place only) but easy to find and adjust.

@bheylin
Copy link
Contributor Author

bheylin commented Oct 13, 2024

I am convinced 🙂 This guy will cause trouble though:

pub(super) const IGNORE: &str = "_SRGN_IGNORE";

It is interpolated via formatcp! in places. I am fine with hard-coding the string in .scm files though. Would not be DRY as it is now (_SRGN_IGNORE is defined in a single place only) but easy to find and adjust.

Yea there's no way around the use of formatcp! I'm afraid.
It's not necessary to move all the prepared strings out to *.scm files, I'm trying to avoid creating queries in the docs when we already have a lot of prepared queries, they're just bound up in the source code.

It may be worth it already though as, I think we could then modify the #[rstest]s for each [lang]::PreparedQuery to read in the entire data/queries/[lang]/* dir instead of enumerating each [lang]::PreparedQuery varinant:

#[rstest]
#[case(
    "base.py_comments",
    include_str!("python/base.py"),
    python::CompiledQuery::new(&python::PreparedQuery::Comments.into()),
)]

I don't have a lot of experience with rstest, but I think that's possible?

Then the existing prepared query tests could be rephrased to test the given query as a literal string and read the query from the file. That might be OTT though, as we don't need to test every single prepared query twice.

So maybe we just modify the existing prepared query tests to only test the queries contained in the files.
Then we know each query is valid and that the file loading though [lang]-query is working at least in the case where there is a file present.

@bheylin
Copy link
Contributor Author

bheylin commented Oct 13, 2024

And having the queries as files would allow the --[lang]-query sections in the README.md to be generated.

      --rust <RUST>
          Scope Rust code using a prepared query.
          
          [env: RUST=]
          [aliases: rs]

          Possible values:
          - comments:         Comments (line and block styles; excluding doc comments;
            comment chars incl.)
          - doc-comments:     Doc comments (comment chars included)
          ...

@bheylin
Copy link
Contributor Author

bheylin commented Oct 13, 2024

@alexpovel We could move the prepared (--rust) and file based (--rust-query) tests into the cli tests mod.

It's moving to a more black-box test, but you can always choose do perform the tests on the lib API and the CLI.

See: e24d015

@bheylin
Copy link
Contributor Author

bheylin commented Oct 13, 2024

@alexpovel it's probably a good idea to keep testing the entire set of prepared queries in the lib itself so that the lib's API is tested. The CLI mod can just test that a single prepared, inline and file based query works.

@alexpovel
Copy link
Owner

A couple thoughts:

  • things like copy-pasting new rstest cases such as

    srgn/tests/langs/mod.rs

    Lines 72 to 77 in 3b8abfa

    #[rstest]
    #[case(
    "base.py_comments",
    include_str!("python/base.py"),
    Python::new(CodeQuery::Prepared(PreparedPythonQuery::Comments)),
    )]
    I am not worried about. That takes seconds to minutes for new features or languages. I understand your pain right now but it's not a common thing!

  • similarly for e.g. generating the srgn --help output in the README. I've been adjusting that manually, which is also quick to do.

  • as for exporting queries to scm files, I'm leaning towards not doing that. _SRGN_IGNORE is one example of where it starts to become a problem. The bigger issue I have is looking into the future, I want to remain dynamic, e.g. with string interpolation as it is now (formatcp), but also perhaps interpolation at runtime. I want to keep that option open. It would allow for a whole new level of functionality srgn can offer over base tree-sitter (queries). One example is pre- and post-processing for finer control.

    Long story short, "outsourced" scm files seem to stand in the way of that more than what they're helping

  • testing prepared queries in the Rust library is a good point! That we can look into. On the other hand, I prefer true-as-possible end-to-end tests wherever possible. At the end of the day, srgn is a CLI first and we should guarantee correctness there as much as possible. The current test suite has very high coverage of all sorts of scenarios (snapshot tests make it ergonomic), and importantly is tested on all three platforms in CI. Moving more tests into just the library will take away from that coverage/thoroughness and decrease confidence (I've been quite happy and confident that when CI is green, srgn is good to go)

@bheylin
Copy link
Contributor Author

bheylin commented Oct 14, 2024

A couple thoughts:

* things like copy-pasting new `rstest` cases such as https://github.com/alexpovel/srgn/blob/3b8abfa38fe7ce94b3cdd446d76e85ecc4c2cf86/tests/langs/mod.rs#L72-L77
   I am not worried about. That takes seconds to minutes for new features or languages. I understand your pain right now but it's not a common thing!

Oh I wasn't worried about the effort it takes to add a new test, more that we could reduce the boilerplate.
But I take your point.

* similarly for e.g. generating the `srgn --help` output in the README. I've been adjusting that manually, which is also quick to do.

All good, it's not a big issue.

* as for exporting queries to `scm` files, I'm leaning towards not doing that. `_SRGN_IGNORE` is one example of where it starts to become a problem. The bigger issue I have is looking into the future, I want to remain dynamic, e.g. with string interpolation as it is now (`formatcp`), but also perhaps interpolation at runtime. I want to keep that option open. It would allow for a whole new level of functionality `srgn` can offer over base tree-sitter (queries). One example is pre- and post-processing for finer control.
  Long story short, "outsourced" `scm` files seem to stand in the way of that more than what they're helping

I'll roll back the last two commits and pursue another solution.

* testing prepared queries in the Rust library is a good point! That we can look into. On the other hand, I prefer true-as-possible end-to-end tests wherever possible. At the end of the day, `srgn` is a CLI first and we should guarantee correctness there as much as possible. The current test suite has very high coverage of all sorts of scenarios (snapshot tests make it ergonomic), and importantly is tested on all three platforms in CI. Moving more tests into just the library will take away from that coverage/thoroughness and decrease confidence (I've been quite happy and confident that when CI is green, `srgn` is good to go)

It's good to know your attitude here, that the CLI tests are the most important.

@bheylin
Copy link
Contributor Author

bheylin commented Oct 14, 2024

Oh and I'll squash all commits! If you feel like it you can do that yourself as well, otherwise I'll squash and come up with a suitable commit message.

Ideally and alternatively, instead of squashing into 1 commit, the languages refactor is one chore commit, and the addition of parsing queries from files is a second feat one, but it's fine if you don't feel like doing that work.

@alexpovel I think the handiest is for me to make a new PR with the major renaming etc and rebase that onto this PR.
Then the history will be clean.

The working week has begun here so I'll try make some time for it during the week some evening.

@bheylin
Copy link
Contributor Author

bheylin commented Oct 29, 2024

@alexpovel After merging #151 this is the resulting change needed to be able to provide a query as a file to --[lang]-query.

I'll add a test as discussed before and address comment #144 (comment).

@bheylin
Copy link
Contributor Author

bheylin commented Oct 29, 2024

@alexpovel I've added a test to the README.md. I had the quote the file arg as the readme.rs parser does not support files with a leading dir (srgn --python-query docs/python_cond_query.scm). It took me a while to figure out what was going on there.

I'll update the docs for the CLI args now and call it a day.

How do you recommend testing the case where the file path doesn't exist etc?
srgn --python-query docs/nonexistent-file.scm.

Copy link
Owner

@alexpovel alexpovel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be fine not testing the "non-existent file" scenario, as it's kinda "obviously correct" (bubbles the error up, as many other I/O errors do that can occur in main.rs, most of which aren't tested).

The error message for a non-existent file is a bit confusing of course:

Error: Error occurred while creating a tree-sitter query: Query error at 1:1. Invalid syntax:
docs/python_cond_query-file-does-not-exist.scm
^

but...

  • if you're already passing a file name, you must have read the docs which mention that this is even possible. So you know what you're doing at will quickly be able to tell that the file doesn't exist

  • I don't know how to report this better. On "file not found" error, we could check "does this LOOK like a file path", but I think that's fraught with peril... maybe a very simple, best-effort regex roughly like /?([a-zA-Z0-9]+/)+, and then if that matches warn!() the user.

    Alternatively, is it possible the queries ALWAYS have to start with (? If yes, that could also be a give-away.

src/main.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
src/main.rs Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
src/main.rs Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
@bheylin
Copy link
Contributor Author

bheylin commented Oct 31, 2024

* I don't know how to report this better. On "file not found" error, we could check "does this LOOK like a file path", but I think that's fraught with peril... maybe a very simple, best-effort regex roughly like `/?([a-zA-Z0-9]+/)+`, and then if that matches `warn!()` the user.
  Alternatively, is it possible the queries ALWAYS have to start with `(`? If yes, that could also be a give-away.

Yea I would avoid trying to guess if the string "looks" like a file.
I like the suggestion of having a leading symbol. That leading symbol actually ties in with
#158. I was going to suggest replacing --query and --prepared with a single --query arg
that uses a prefix symbol when using a prepared query. Something like:

# run literal query, 
# The first byte being a `(` signals that the content is intended as a literal query. 
srgn -l rust --query '((macro_invocation macro: (identifier) @_SRGN_IGNORE_name) @macro)'
# run `test-mod` prepared query
# The first byte being a `#` (or whatever we chose) signals that the content is intended as a prepared query.
# This change is clearly for another PR if it's agreed upon. 
srgn -l rust --query '#test-mod'
# run contents of `./query/file.scm`
# If the string doesn't start with a `(` or a `#` then we presume it's a file path
srgn -l rust --query ./query/file.scm`

Copy link
Owner

@alexpovel alexpovel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI still has minor problems, but happy to merge otherwise.

I might look into improving the error message/query-file-path-disambiguation more in the future.

README.md Outdated Show resolved Hide resolved
@bheylin
Copy link
Contributor Author

bheylin commented Nov 1, 2024

CI still has minor problems, but happy to merge otherwise.

I might look into improving the error message/query-file-path-disambiguation more in the future.

All good, I think this PR is a nice contained change. The disambiguation issue is definitely for another PR.

@bheylin bheylin marked this pull request as ready for review November 1, 2024 10:47
bheylin and others added 11 commits November 1, 2024 22:16
Co-authored-by: Alex Povel <git@alexpovel.de>
Co-authored-by: Alex Povel <git@alexpovel.de>
Co-authored-by: Alex Povel <git@alexpovel.de>
Co-authored-by: Alex Povel <git@alexpovel.de>
Co-authored-by: Alex Povel <git@alexpovel.de>
@alexpovel
Copy link
Owner

I asked about what tree-sitter queries can potentially look like at: tree-sitter/tree-sitter#3862 (comment) . The reply makes me think disambiguation isn't going to be straightforward. Even if we get it correct now, updates to tree-sitter might break us.

Currently, I'm leaning more towards a dedicated flag, like --<LANG>-query-file, again. That would get use niceties from clap as well, as it can natively handle PathBuf. We'd be more verbose but 100% safe going forward, leaving the normal tree-sitter --<LANG>-query untouched.

Apologies for beating around the bush here, and changing my mind. I think I originally suggested overloading the existing option.

What do you think about merging this PR as a chore, then I can base the feat off it? I'm happy to take that work, having changed my mind on you and all that (and still not being sure what to do 😄 )

@bheylin
Copy link
Contributor Author

bheylin commented Nov 5, 2024

I asked about what tree-sitter queries can potentially look like at: tree-sitter/tree-sitter#3862 (comment) . The reply makes me think disambiguation isn't going to be straightforward. Even if we get it correct now, updates to tree-sitter might break us.

Currently, I'm leaning more towards a dedicated flag, like --<LANG>-query-file, again. That would get use niceties from clap as well, as it can natively handle PathBuf. We'd be more verbose but 100% safe going forward, leaving the normal tree-sitter --<LANG>-query untouched.

Yea I agree, it's best to be as explicit as possible.

Apologies for beating around the bush here, and changing my mind. I think I originally suggested overloading the existing option.

No worries, that's how exploring an implementation works.

What do you think about merging this PR as a chore, then I can base the feat off it? I'm happy to take that work, having changed my mind on you and all that (and still not being sure what to do 😄 )

I can also make the change, it's no problem at all.

@alexpovel alexpovel merged commit fc49f41 into alexpovel:main Nov 5, 2024
15 checks passed
@alexpovel
Copy link
Owner

Alright! Thanks for your work again and your patience.

Your change is merged now. The first of us to get to it can now build on top of that to implement something like --<lang>-query-file (just an example). I want to focus my time on bug fixes and other features for the immediate time being though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read query from file
3 participants