It would be helpful for folks using grab-site and then replaying via replayweb.page to have grab-site generate a WACZ file after the crawl is done. (This workflow is mentioned in webrecorder/replayweb.page#6)
WACZ (https://github.com/webrecorder/wacz-format) provides a way to package the WARC, CDX and an optional page list into a single file (a zip file) such that it can be loaded quickly for replay.
grab-site currently doesn't really have anyone developing it (I just try to keep the install steps working), but I have no objections to the addition of WACZ support.
The Python `wacz` library (https://pypi.org/project/wacz) can be used to create the WACZ package (https://github.com/webrecorder/wacz-format/tree/main/py-wacz). I think grab-site should just be able to call the create command from:
https://github.com/webrecorder/wacz-format/blob/main/py-wacz/wacz/main.py#L19
It might make sense to pass in a page list, and there is an experimental option to do full-text extraction on pages as well.
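To make that concrete, here is a rough sketch of what a post-crawl step could look like if it shelled out to the `wacz` CLI rather than importing the library. Treat all of it as assumptions: the helper names `build_wacz_command` and `create_wacz` are made up for illustration, the flags (`-o` for the output file, `-p` for a page list, `-t` for text extraction) and the positional WARC inputs should be checked against `wacz create --help` for the installed version, and it assumes the crawl directory holds the `.warc.gz` files directly.

```python
import subprocess
from pathlib import Path


def build_wacz_command(crawl_dir, output, pages=None, extract_text=False):
    """Build the argv for a `wacz create` call over the WARCs in crawl_dir.

    Hypothetical helper: the flag names are assumptions about the wacz CLI.
    """
    # Collect the WARC files in a stable order.
    warcs = sorted(str(p) for p in Path(crawl_dir).glob("*.warc.gz"))
    if not warcs:
        raise FileNotFoundError(f"no .warc.gz files found in {crawl_dir}")

    cmd = ["wacz", "create", *warcs, "-o", str(output)]
    if pages is not None:
        cmd += ["-p", str(pages)]  # assumed flag: optional page list
    if extract_text:
        cmd.append("-t")  # assumed flag: experimental full-text extraction
    return cmd


def create_wacz(crawl_dir, output, **options):
    """Run the command; raises CalledProcessError if wacz exits non-zero."""
    subprocess.run(build_wacz_command(crawl_dir, output, **options), check=True)
```

Keeping the command construction separate from the `subprocess.run` call means the argument handling can be tested without `wacz` installed, and swapping in a direct library call later would only touch `create_wacz`.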
The library is still new, so we can definitely make any changes needed to support integration!