integrity integration #10
base: master
Conversation
A few remarks:
While ia-hadoop-tools is a public repository, crawl-tools isn't. Please keep in mind that it may be frustrating for anybody reading this issue if they cannot access the information in the linked issue. So all information related to this issue should be shared here or in other public repositories. And, of course, it's possible to link to a public repository from a private one, for example to discuss the integration of the new job into internal tools and workflows.
…ile yet, local env not available today.
Hi @jt55401, thanks! Just a few comments, didn't try to run it.
if (path.endsWith(".gz")) {
    watOutputBasename = inputBasename.substring(0, inputBasename.length() - 3) + ".wat.gz";
    wetOutputBasename = inputBasename.substring(0, inputBasename.length() - 3) + ".wet.gz";
    cdxWatOutputBasename = inputBasename.substring(0, inputBasename.length() - 3) + ".cdxwat.gz";
This results in:
name.warc.gz name.cdx.gz
name.warc.wat.gz name.warc.wat.cdxwat.gz
name.warc.wet.gz name.warc.wet.cdxwet.gz
- "wat" is given twice
- ".warc" is removed for CDX files derived from WARC files
  - it should be the same for WAT/WET files
- a CDX file does not follow the WARC format
  - (a WAT or WET file does)
Maybe the following looks better?
name.warc.gz name.cdx.gz
name.warc.wat.gz name.wat.cdx.gz
name.warc.wet.gz name.wet.cdx.gz
} else {
    watOutputBasename = inputBasename + ".wat.gz";
    wetOutputBasename = inputBasename + ".wet.gz";
    cdxWatOutputBasename = inputBasename + ".cdxwat.gz";
See above.
}
String watOutputFileString = basePath.toString() + "/wat/" + watOutputBasename;
String wetOutputFileString = basePath.toString() + "/wet/" + wetOutputBasename;
String cdxWetOutputFileString = basePath.toString() + "/cdxwet/" + cdxWetOutputBasename;
This is a fixed output path. Do we want to have the CDX files for WAT/WET there?
The configuration for the CDX indexing uses different output paths, cf. https://github.com/commoncrawl/webarchive-indexing/blob/main/run_index_hadoop.sh
@sebastian-nagel - I've made a few commits and this code should be more to your liking.
I compiled and tested this with a pretty vanilla Java 11 environment, and everything seemed to work fine. The only issue I ran into is that the internetarchive Maven repository serves several dependencies over http (as opposed to https), which Maven blocks by default. I overrode this behavior in my local Maven settings, and everything worked fine. In ~/.m2/settings.xml:
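The settings.xml snippet itself isn't included in the conversation. For reference, Maven 3.8.1+ blocks http repositories via a built-in blocking mirror, and the documented workaround is an unblocked mirror entry in ~/.m2/settings.xml. A sketch of that workaround follows; the mirror id and repository URL here are placeholders, not values confirmed by this PR:

```xml
<settings>
  <mirrors>
    <mirror>
      <!-- Hypothetical override: allow http for external repositories.
           Narrow the mirrorOf pattern and URL to the actual repository in use. -->
      <id>allow-http-example</id>
      <mirrorOf>external:http:*</mirrorOf>
      <url>http://example.org/maven2</url>
      <blocked>false</blocked>
    </mirror>
  </mirrors>
</settings>
```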
Hi @jt55401, I've run a test on a Hadoop single-node cluster. The job run by
finished with status success. However, the generated CDX files index the WARC file rather than the WAT and WET files, respectively:
This needs to be fixed. I'd start by trying to implement this in ia-web-commons, extending the classes org.archive.extract.WATExtractorOutput (resp. WETExtractorOutput) into, say, WatCdxExtractorOutput. I'm not 100% sure whether this approach works; it needs a try. I've observed three more points which can be ignored for now:
OK, I will take a look, @sebastian-nagel, thank you. Did you just run single-node Hadoop with the config in our Nutch project (and feed it a small seed list of one site or something), or did you do something more to test this? (I will try to set aside some time to set this up for myself as well.)
A plain single-node setup with minimal configuration, see nutch-test-single-node-cluster, but without Nutch installed. For testing I took one WARC file from April 2024 and copied it from local disk to HDFS via:
See above for the command to launch the job. Output is then in hdfs:/user/$USER/{cdx,wat,wet}/
The intention of this PR is to enable integrity file output.
This will involve:
I've started by laying out TODOs in the places where I think we will need to make changes.
This will need to coordinate with a PR in crawl-tools: https://github.com/commoncrawl/crawl-tools/pull/37