Consider an option to generate WACZ files after a crawl is done for better replay with ReplayWeb.page #179

ikreymer · 2021-02-21T20:04:10Z

It would be helpful for folks using grab-site and then replaying via replayweb.page to have grab-site generate a WACZ file after the crawl is done. (This workflow is mentioned in webrecorder/replayweb.page#6)

WACZ (https://github.com/webrecorder/wacz-format) provides a way to package the WARC, CDX and an optional page list into a single file (a zip file) such that it can be loaded quickly for replay.
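Since a WACZ is an ordinary zip file, standard zip tooling can show what a given package contains. Here is a minimal sketch using only the Python standard library. The helper name `group_zip_members` is hypothetical, and the sketch does not assume any particular directory layout inside the package; it only reports what is there.

```python
import zipfile
from collections import defaultdict


def group_zip_members(path):
    """Group the file entries of a zip-based package by top-level directory.

    Hypothetical helper for illustration; it is not part of the wacz library.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip explicit directory entries
            top, sep, _rest = name.partition("/")
            # Files stored at the root of the zip are grouped under "".
            groups[top if sep else ""].append(name)
    return dict(groups)
```

Calling `group_zip_members("crawl.wacz")` on a generated package (the file name here is a placeholder) would show where the WARC, index and page list ended up.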

The Python wacz library (https://pypi.org/project/wacz, source at https://github.com/webrecorder/wacz-format/tree/main/py-wacz) can be used to create the WACZ package.

I think grab-site should just be able to call the create command from:
https://github.com/webrecorder/wacz-format/blob/main/py-wacz/wacz/main.py#L19

It might make sense to pass in a page list, and there is an experimental option to do full-text extraction on pages as well.
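As a rough sketch of what a post-crawl hook could look like, the snippet below shells out to the `wacz` command-line tool instead of importing the library's internals, since the Python entry point may change while the library is new. Everything here is an assumption rather than grab-site or py-wacz code: the helper names are hypothetical, and the flags (`-o` for the output file, `-p` for a page list, `--text` for full-text extraction, WARC paths as positional arguments) follow my reading of the py-wacz README and may differ between versions (early releases may expect the WARC via `-f`).

```python
import subprocess


def build_wacz_command(warc_paths, output_path, pages_path=None, extract_text=False):
    """Build the argv for `wacz create` (flag names are assumptions, see above)."""
    cmd = ["wacz", "create", *warc_paths, "-o", output_path]
    if pages_path:
        # Optional page list to include in the package.
        cmd += ["-p", pages_path]
    if extract_text:
        # Experimental full-text extraction on pages.
        cmd.append("--text")
    return cmd


def create_wacz(warc_paths, output_path, pages_path=None, extract_text=False):
    """Run `wacz create` after a crawl; raises CalledProcessError on failure."""
    cmd = build_wacz_command(warc_paths, output_path, pages_path, extract_text)
    subprocess.run(cmd, check=True)
```

Keeping the argument building separate from the `subprocess.run` call makes it easy to check the command line against whatever `wacz create --help` reports for the installed version before wiring it into a crawl.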

The library is still new, so we can definitely make any changes needed to support integration!

ivan · 2021-02-23T03:57:22Z

grab-site currently doesn't really have anyone developing it (I just try to keep the install steps working), but I have no objections to the addition of WACZ support.
