Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for sitemap index #65

Merged
merged 2 commits into from
Oct 5, 2020

Conversation

jswilson
Copy link
Contributor

As first mentioned in this issue, sitemaps over 50,000 URLs should be split into multiple sitemap files, with a single master index file pointing to all of the sitemap files. This PR adds support for outputting a sitemap index and multiple sitemap files.

  • Right now, --output is not a required parameter; but outputting an index and multiple sitemap files without writing them to files isn't quite sensible. The index contains pointers to files, so what would be the contents of the index when no output file is specified? Because of this, I'm currently requiring --output when using the new --as-index flag.

  • In order to output an index, you have to include the --as-index flag. If you don't include the --as-index flag, then the sitemap will be written to a single file, even if there are more than 50,000 URLs. My thinking was this would maintain backward compatibility; presumably everyone using the library right now is happy with the output so why change it? Another possibility would be to take this away as an option and just always write an index if there are more than 50,000 URLs, since this would be in line with the specification. If we do go with this, we would likely need to make --output required due to the first bullet described above.

@Garrett-R
Copy link
Contributor

I personally like your choice of requiring including the --as-index flag, but curious to hear @c4software's take on that.

One idea is that if the sitemap ends up being 50K+ links, then at the end of the run, if they ran it without --as-index, you could output a note to the terminal with a link to some educational resource on why they should consider breaking it into multiple sitemaps with --as-index.

crawler.py Outdated
# index: zero-based index from which to start writing url strings contained in
# self.url_strings_to_output
try:
sitemap_file = open(filename, 'w')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonder if it would be good to use a context manager here, like with open(filename, 'w') as sitemap_file:?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, definitely; changed.

@c4software
Copy link
Owner

I'm quite busy right now.

I will also take a look at it this week.

Sorry for the delay.


num_sitemap_files = math.ceil(len(self.url_strings_to_output) / self.MAX_URLS_PER_SITEMAP)
sitemap_filenames = []
for i in range(0, num_sitemap_files):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably doesn't matter, but I notice that in the Google documentation, they 1-index the sitemaps whereas they're 0-indexed there. Curious to hear your thoughts there.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its really matter ?

Copy link
Contributor

@Garrett-R Garrett-R Sep 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm pretty certain it doesn't matter, so either way is fine I reckon.

@c4software
Copy link
Owner

Seems great for me. @Garrett-R Did you also confirm ?

@Garrett-R
Copy link
Contributor

Yup, looks good to me, and I've used this branch already.

@c4software c4software merged commit 1bf1f7f into c4software:master Oct 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants