
[Question] Supporting custom facts in oxc-resolver to make it work without needing fs #236

Open
tarun-dugar opened this issue Aug 26, 2024 · 13 comments

Comments

@tarun-dugar

Hello!

Context

We (my team) are working on a huge yarn workspace repo with hundreds of thousands of files. We have a use case of resolving imports for all of these files. We tried oxc-resolver for it and it worked reliably, apart from the fact that it took a lot of time (~20s), which is expected for such a big repo.

We profiled the hotspots and found that reading files using fs takes the most time. One way to get around this is to use the virtual filesystem provided by oxc-resolver. Correct me if I am wrong, but that would require us to store string representations of all the files in memory, which is not feasible for us given the large number of files we have.

Proposed alternative

We have a custom dependency graph solution which stores facts about JS/TS files and package.jsons. For example:

Map {
   "@some-workspace": {
      "path": "some-path/package.json",
      "exports": ...,
      "main": ...,
      ...
   }
}

All of these facts are available before the resolution process. Using this information, we can bypass fs in oxc-resolver, because all the facts about the files are already there and we just need to plug them in at the correct places. We would be happy to contribute this feature but would need your help and guidance with reviews / approvals. I can raise a quick sample PR as well to demo how it could work. It would be something along the lines of:

Resolver.resolve(path, specifier, Option<facts>) 

and then use the facts in the places where we currently read the fs to get facts about files.
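To make the shape concrete, here is a minimal sketch of what a facts-backed lookup could look like, assuming a hypothetical `PackageFacts` struct and `Facts` map; none of these names exist in oxc-resolver today and the real API would differ.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Precomputed metadata for one package, gathered before resolution starts.
/// (Hypothetical type for illustration only.)
struct PackageFacts {
    /// Location of the package's package.json on disk.
    package_json_path: PathBuf,
    /// Raw `exports` field, stored as a JSON string here for simplicity.
    exports: Option<String>,
    /// Raw `main` field.
    main: Option<String>,
}

/// Facts keyed by package name, e.g. "@some-workspace".
type Facts = HashMap<String, PackageFacts>;

/// Resolve `specifier` from `dir`, consulting `facts` instead of the
/// filesystem. Returns `None` when no fact is available.
fn resolve_with_facts(_dir: &Path, specifier: &str, facts: Option<&Facts>) -> Option<PathBuf> {
    let pkg = facts?.get(specifier)?;
    // A real implementation would apply the `exports`/`main` resolution rules
    // here instead of reading package.json from disk.
    Some(pkg.package_json_path.parent()?.join(pkg.main.as_deref()?))
}
```

The caller would fall back to the normal fs-backed resolution whenever `facts` has no entry for the specifier.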

Boshen self-assigned this Aug 26, 2024
@Boshen (Member) commented Aug 26, 2024

Can you try a local build with your proposed solution and report back the performance improvement?

By the way, how much of a performance improvement did you observe when you switched from your existing resolver?

@tarun-dugar (Author)

@Boshen we actually saw a performance regression, because our existing custom resolution logic avoided fs. That logic was hacky, though: it didn't account for all the cases supported by Node's resolution algorithm, which is why we wanted to move to a production-grade resolver.

@Boshen (Member) commented Aug 26, 2024

enhanced-resolve has a few options for the cache; would you be able to improve things if oxc-resolver exposed some APIs for the cache as well?

@tarun-dugar (Author)

@Boshen I am assuming you mean caching requests based on the path and specifier? That would help, but it would still be much less performant than supporting custom facts, IMO. I can do a benchmark comparison between the two approaches if the caching API is exposed.

Anyway, I'll raise a sample PR with support for custom facts this week for you folks to check. I'll share the performance numbers there as well.
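For context on what a request-level cache buys, here is a rough sketch of memoizing by (directory, specifier), assuming a plain `HashMap`; this is not an existing oxc-resolver API, and the point is that the first lookup for each pair still pays the filesystem cost.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Memoizes full resolution results so repeated (dir, specifier) pairs skip
/// the resolution algorithm, while the first request for each pair still
/// touches the filesystem. (Illustrative only.)
#[derive(Default)]
struct RequestCache {
    entries: HashMap<(PathBuf, String), Option<PathBuf>>,
}

impl RequestCache {
    /// Look up a cached result, or run `resolve` and remember its answer.
    fn resolve_cached(
        &mut self,
        dir: PathBuf,
        specifier: String,
        resolve: impl FnOnce(&PathBuf, &str) -> Option<PathBuf>,
    ) -> Option<PathBuf> {
        self.entries
            .entry((dir, specifier))
            .or_insert_with_key(|(d, s)| resolve(d, s.as_str()))
            .clone()
    }
}
```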

@Boshen (Member) commented Aug 26, 2024


Thank you.

@tarun-dugar (Author)

@Boshen I was unable to create a branch here, so I forked the repo and created a branch there: #246

Using facts, we are able to process 430k .resolve calls in ~2s. Without facts, it takes ~8s. This is for a production build. Thanks for the support so far! Let me know your thoughts.

@Boshen (Member) commented Sep 10, 2024


I'll take a look tomorrow.


Quick thought:

This is actually a really interesting approach for enterprise / huge monorepo support. I'm also interested in whether we can make the "produce precomputed file facts" part reproducible, either built in or through other tools.

I'll ask around.

@tomgasson commented Sep 10, 2024

@Boshen, for reference, this is an implementation of the same Facts pattern used in Hacklang, used (in a more complex form) in turborepo, and behind how Haste operates at Meta.

The overall structure is as follows:

  • You get a (Path, SHA) index of the repo from an existing updated-on-write source (git, watchman)
  • You parse each file in raw form (e.g. imports stay as "../../foo.ts" and are explicitly not resolved) to get just the important bits of info
  • You re-use the previous cache for unchanged files (keyed by Path|SHA) and parse any newly seen files
  • You write out the cache

Depending on what info you extract, this leaves us with ~50MB of serialisable state which can answer all the cross-file questions we have on our large repo.
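As a rough illustration of the shape described above (not the actual implementation in any of the tools mentioned), a cache keyed by (path, content SHA) might look like this, with a hypothetical `FileFacts` payload:

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Per-file facts extracted from a raw parse: unresolved import specifiers,
/// exported names, and whatever else downstream steps need. (Hypothetical.)
#[derive(Clone)]
struct FileFacts {
    imports: Vec<String>, // kept raw, e.g. "../../foo.ts"
    exports: Vec<String>,
}

/// Cache keyed by (path, content SHA): a file whose SHA is unchanged since
/// the last run is served from the cache instead of being re-parsed.
#[derive(Default)]
struct FactsCache {
    entries: HashMap<(PathBuf, String), FileFacts>,
}

impl FactsCache {
    /// Return cached facts for an unchanged file, or parse and remember them.
    fn get_or_parse(
        &mut self,
        path: PathBuf,
        sha: String,
        parse: impl FnOnce(&PathBuf) -> FileFacts,
    ) -> FileFacts {
        self.entries
            .entry((path, sha))
            .or_insert_with_key(|(p, _sha)| parse(p))
            .clone()
    }
}
```

Serialising `entries` at the end of a run gives the on-disk cache that CI can publish and a watch mode can patch incrementally.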

Locally, devs rotate this cache (cache -> update -> cache). We also have our CI output it (just a dumb zstd S3 store of it, which packs to ~15MB), which allows our devs to start from their mergebase-with-master's cache when they switch branches or take large jumps across many commits (as described in watchman's SCMQuery docs).
The same state can be updated in delta form in a watch mode / from a daemon, as we find that just reading the index is the slowest part.

For some numbers, with oxc_parser we can do this in 8 seconds (without a cache), 2 seconds (with a cache) and 200ms for a delta update.

Downstream steps (resolution, linting etc) can read that cache and do the work they need, already knowing the most important info for them, which is why we're now limited by resolution (which we do at run-time for each downstream step, since it is not cacheable).

We use (/ plan to use) this for:

  • understanding the module graph
  • quickly identifying code cycles
  • extracting feature flags used in files
  • extracting translatable messages
  • extracting atomic CSS styles and building a single stylesheet
  • assertions on uniqueness of certain things (event names, error message keys etc)

@Boshen (Member) commented Sep 11, 2024

@tomgasson Can you check your Twitter DMs? I'd like to know more about the things you guys are building.

@Boshen (Member) commented Sep 19, 2024

This has not been forgotten, I'll come back to work on the resolver once I finish my other work, probably next week.

@Boshen (Member) commented Sep 25, 2024

I spent some time studying the implementation.

To me, it is mounting another cache on top of the existing one with a few interceptors in the resolution algorithm.

There are two directions we can go:

  1. Expose the cache and add a few callbacks in the resolver.
  2. Add first class support with a proper specification.

I'm leaning towards 1, unless there are other companies who want to collaborate on and standardize this solution.
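Just to make direction 1 concrete, here is a hand-wavy sketch of what "expose the cache and add a few callbacks" could mean, using hypothetical hook names; oxc-resolver has no such option today and the real design would likely differ.

```rust
use std::path::Path;

/// Hooks consulted before the resolver falls back to the real filesystem.
/// Returning `None` from a hook means "no precomputed answer, read the fs".
/// (Hypothetical API for illustration only.)
struct ResolveHooks {
    /// Supply package.json contents without touching disk.
    read_package_json: Box<dyn Fn(&Path) -> Option<String> + Send + Sync>,
    /// Answer file-existence checks from precomputed facts.
    file_exists: Box<dyn Fn(&Path) -> Option<bool> + Send + Sync>,
}
```

A facts store like the one proposed in this issue would then just be one implementation of those hooks.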

Please feel free to chat in oxc's Discord channel.

@tarun-dugar (Author)

Hey @Boshen, thanks a lot for this ❤️ I'll reach out to you in the Discord channel when I get the chance.

Boshen removed their assignment Dec 12, 2024
@Boshen (Member) commented Dec 14, 2024

Any updates?
