Improve SEO and maintenance of documentation versions #3741
And on the topic of the flyout, here's my thinking: I think I have become numb to the whole Now, look at what happens when I change the default version to be
By having the number in the URL and also in the flyout by default, I think it's more obvious how the user should go and switch to their version of choice. @stichbury in your opinion, do you think this would make our docs journey more palatable? |
I think this is good, but doesn't it mean that you have to remember to increment the version number for |
It does... but sadly RTD doesn't allow much customization of the versioning rules for now. It's a small price to pay though, since it would happen only a handful of times per year. |
TIL: |
To note, RTD has automation rules https://docs.readthedocs.io/en/stable/automation-rules.html#actions-for-versions although the |
I think the Here's the 📣 proposal
The only thing we need to understand is what the impact on indexing and SEO would be. cc @noklam @ankatiyar Thoughts, @stichbury? |
I've somewhat lost track of what your I would personally consider if it's sufficient to just keep |
In principle this is related to our indexing strategy. Let's chat next week |
Renamed this issue to better reflect what we should do here. In readthedocs/readthedocs.org#10648 (comment), RTD staff gave an option to inject meta It's clear that we have to shift our strategy by:
|
Today I had to manually index https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.0 on Google (maybe there are no inbound links?) and I couldn't index 3.0.1 (it's currently blocked by our |
Summary of things to do here:
Refs: https://www.stevenhicks.me/blog/2023/11/how-to-deindex-your-docs-from-google/, https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls |
Today I've been researching this again (yeah, I have weird hobbies...) I noticed that projects hosted on https://docs.rs don't seem to exhibit these SEO problems, and also that they seemingly take a basic, but effective, approach. Compare https://docs.rs/clap/latest/clap/ with https://docs.rs/clap/2.34.0/clap/. There is no trace of What they do though is have very lean sitemaps. If you look at https://docs.rs/-/sitemap/c/sitemap.xml, there are only 2 entries for
<url>
<loc>https://docs.rs/clap/latest/clap/</loc>
<lastmod>2024-08-10T00:24:50.344647+00:00</lastmod>
<priority>1.0</priority>
</url>
<url>
<loc>https://docs.rs/clap/latest/clap/all.html</loc>
<lastmod>2024-08-10T00:24:50.344647+00:00</lastmod>
<priority>0.8</priority>
</url>
Compare it with https://docs.kedro.org/sitemap.xml, which is, in comparison... less than ideal:
<url>
<loc>https://docs.kedro.org/en/stable/</loc>
<lastmod>2024-08-01T18:53:11.571849+00:00</lastmod>
<changefreq>weekly</changefreq>
<priority>1</priority>
</url>
<url>
<loc>https://docs.kedro.org/en/latest/</loc>
<lastmod>2024-08-09T09:39:27.628501+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>https://docs.kedro.org/en/0.19.7/</loc>
<lastmod>2024-08-01T18:53:11.647322+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://docs.kedro.org/en/0.19.6/</loc>
<lastmod>2024-05-27T16:32:42.584307+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.7</priority>
</url>
<url>
<loc>https://docs.kedro.org/en/0.19.5/</loc>
<lastmod>2024-04-22T11:56:55.928132+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.6</priority>
</url>
<url>
<loc>https://docs.kedro.org/en/0.19.4.post1/</loc>
<lastmod>2024-05-17T12:25:27.050615+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.5</priority>
</url>
...
The way I read this is that RTD is treating tags as long-lived branches, and as a result telling search engines that docs of old versions will be updated monthly, which in our current scheme is incorrect. I am not sure if this is something worth reporting to RTD, but maybe we should look at uploading a custom |
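A quick way to see how many pinned releases a sitemap advertises is to parse it and list the entries whose path looks like a version number. The following is only a sketch: the `SAMPLE` document is a trimmed, hand-written stand-in for the entries quoted above (not a live download), and `PINNED` and `pinned_entries` are hypothetical names introduced for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Standard sitemap namespace; the sample below is assumed to use it.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hand-written sample, trimmed from the entries quoted in this thread.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.kedro.org/en/stable/</loc><changefreq>weekly</changefreq></url>
  <url><loc>https://docs.kedro.org/en/latest/</loc><changefreq>daily</changefreq></url>
  <url><loc>https://docs.kedro.org/en/0.19.7/</loc><changefreq>monthly</changefreq></url>
  <url><loc>https://docs.kedro.org/en/0.19.4.post1/</loc><changefreq>monthly</changefreq></url>
</urlset>
"""

# A path segment that looks like a pinned release, e.g. /0.19.7/ or /0.19.4.post1/.
# Assumption: slugs such as kedro-datasets-3.0.0 would need a different pattern.
PINNED = re.compile(r"/\d+(?:\.\d+)+(?:\.post\d+)?/")


def pinned_entries(sitemap_xml):
    """Return (loc, changefreq) pairs for entries that point at a pinned release."""
    root = ET.fromstring(sitemap_xml)
    found = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS)
        freq = url.findtext("sm:changefreq", default="", namespaces=SITEMAP_NS)
        if PINNED.search(loc):
            found.append((loc, freq))
    return found


for loc, freq in pinned_entries(SAMPLE):
    print(f"{loc} is advertised as changing {freq}")
```

Running the same function against the real sitemap would only require fetching the XML first; that step is left out here to keep the sketch offline.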
Reopening until we solve the issue (whether improving the sitemaps, retroactively changing the tags, or painting a pentagon with a turkey's head...) |
Added @DimedS to our Google Search Console, hope this will help! |
Thank you for bringing this new idea and providing me access to GSC, @astrojuanlu! What I understand after investigation:
This file is large and somewhat inconvenient. More importantly, it does not disallow many older versions, such as everything related to
This configuration means we will disallow indexing anything except the stable version of Kedro and Viz docs, and the latest version of Datasets docs (since we do not have a stable version of them). If for some reason we are unhappy with the latest Datasets approach, we can start with this and create an additional ticket to explore alternative solutions for Datasets docs versioning, such as canonical tags. |
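To make the effect of such a configuration concrete, here is a minimal sketch that checks a few paths against an allow-list style robots.txt using Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT` are hypothetical and are not the project's real file; the kedro-viz path and the `is_crawlable` helper are also assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, NOT the real docs.kedro.org robots.txt.
# Allow lines come first because Python's parser applies the first matching
# rule; Google applies the most specific rule, which gives the same result here.
ROBOTS_TXT = """\
User-agent: *
Allow: /en/stable/
Allow: /projects/kedro-viz/en/stable/
Allow: /projects/kedro-datasets/en/latest/
Disallow: /en/
Disallow: /projects/
"""


def is_crawlable(path, robots_txt=ROBOTS_TXT):
    """Return True if a generic crawler may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", "https://docs.kedro.org" + path)


for path in (
    "/en/stable/index.html",
    "/en/0.19.7/",
    "/projects/kedro-datasets/en/latest/",
    "/projects/kedro-datasets/en/kedro-datasets-3.0.0/",
):
    print(path, "->", "crawlable" if is_crawlable(path) else "blocked")
```

Note that this only answers whether a crawler may fetch a URL. Whether a blocked URL can still end up in search results is a separate question.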
Thanks for the investigation @DimedS. One very important thing to keep in mind is this:
https://support.google.com/webmasters/answer/7440203#indexed_though_blocked_by_robots_txt
https://developers.google.com/search/docs/crawling-indexing/block-indexing
Google is very clear: blocking a page in With this in mind, addressing your comments:
Can we instead try generating a
<url>
<loc>https://docs.kedro.org/en/stable/</loc>
<lastmod>2024-08-01T18:53:11.571849+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://docs.kedro.org/en/latest/</loc>
<lastmod>2024-08-09T09:39:27.628501+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>0.5</priority>
</url>
and nothing else? |
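A two-entry sitemap like the one proposed above could be produced with a few lines of standard-library Python. This is a sketch under assumptions: the `ENTRIES` values are copied from the snippet above, `build_sitemap` is a hypothetical helper, and how the resulting file would be served by RTD is not addressed here.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Values copied from the two entries proposed in this comment.
ENTRIES = [
    {
        "loc": "https://docs.kedro.org/en/stable/",
        "lastmod": "2024-08-01T18:53:11.571849+00:00",
        "changefreq": "monthly",
        "priority": "1.0",
    },
    {
        "loc": "https://docs.kedro.org/en/latest/",
        "lastmod": "2024-08-09T09:39:27.628501+00:00",
        "changefreq": "daily",
        "priority": "0.5",
    },
]


def build_sitemap(entries):
    """Serialize the given entries as a sitemap.xml document (returned as str)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            ET.SubElement(url, tag).text = entry[tag]
    body = ET.tostring(urlset, encoding="unicode")
    # Write the declaration by hand so it does not depend on locale settings.
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


print(build_sitemap(ENTRIES))
```

Keeping the entries in a plain list makes it easy to review what is being advertised to search engines before anything is uploaded.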
And notice that generating the sitemap is not even guaranteed to work, but the next step is quite complicated: to retrofit |
And also I'm guessing that generating the |
Opened readthedocs/readthedocs.org#11584 upstream. |
Thank you for your valuable insights about I agree that Here are my thoughts:
Therefore, I support your proposal to give |
I fully agree 👍🏼 Let's at least try to apply the scientific method, and change 1 bit at a time. If a custom |
xref in case it's useful https://github.com/jdillard/sphinx-sitemap |
Would it make sense to manually create the sitemap first and see if it works as expected? If successful, we could then consider incorporating an automated generation process in the next step, if needed. |
For reference, I tried the redirection trick described in #3741 (comment) for kedro-datasets #4145 (comment) and it seems to be working. I don't want to boil the ocean right now because we're in the middle of a delicate SEO experimentation phase, but when the dust settles, I will propose this for all our projects. |
The sitemap hasn't changed 😬 https://docs.kedro.org/sitemap.xml |
Newsflash: RTD now excludes hidden versions from the automatically generated sitemap readthedocs/readthedocs.org#11675 |
After a discussion with @astrojuanlu and an unsuccessful attempt to apply a custom sitemap.xml to the Kedro documentation in issue #4261, we changed all Kedro documentation versions, except for "stable" and "latest," to hidden in the Read the Docs (RTD) web dashboard. This immediately updated our However, there is still an issue with subfolders, "viz" and "datasets." Hiding versions for these subfolders does not affect |
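The rule applied in the dashboard (everything hidden except "stable" and "latest") can be written down as a tiny function, which may help when reviewing a long version list before clicking through it. This is only an illustration: the change described above was made by hand in the RTD web dashboard, and `plan_visibility`, `sitemap_slugs` and the sample slugs are hypothetical.

```python
# Versions that should stay visible, as described in this comment.
VISIBLE = frozenset({"stable", "latest"})


def plan_visibility(version_slugs, visible=VISIBLE):
    """Map each version slug to True if it should be hidden, False otherwise."""
    return {slug: slug not in visible for slug in version_slugs}


def sitemap_slugs(plan):
    """Slugs expected to remain in the generated sitemap (the non-hidden ones)."""
    return [slug for slug, hidden in plan.items() if not hidden]


# Sample slugs taken from the sitemap entries quoted earlier in the thread.
plan = plan_visibility(["stable", "latest", "0.19.7", "0.19.6", "0.19.4.post1"])
print(plan)
print(sitemap_slugs(plan))
```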
I received an answer from the RTD team:
If I understand correctly, this means that to implement a manual I think we should give this a try. What do you think, @astrojuanlu? After we hid all versions of the main Kedro project, the search results improved for Kedro, but for datasets and Viz, it still seems to be referencing old versions. For example, if I search "kedro matplotlib dataset" on Google, I see everything except the correct link:
|
From my understanding, removing all old versions from the sitemap didn't hide them from the search results: In fact, none of these URLs are referenced in any of our current sitemaps. Not even
Long story short, the hypothesis I proposed in #3741 (comment) has been disproven. Just limiting the Now, if we use The method suggested by Google has 2 flavors:
|
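For reference, Google documents two ways of expressing noindex: a robots meta tag in the page's HTML, and an `X-Robots-Tag` HTTP response header. The sketch below shows both as functions of a version slug. It assumes that only "stable" should be indexed, as the issue description proposes; `noindex_meta` and `noindex_header` are hypothetical helpers, and wiring them into a Sphinx theme or into RTD's response headers is not covered here.

```python
# Assumption: only these version slugs should appear in search results.
INDEXED_VERSIONS = frozenset({"stable"})


def noindex_meta(version_slug, indexed=INDEXED_VERSIONS):
    """Flavor 1: HTML tag to place in <head>, or "" if the version is indexed."""
    if version_slug in indexed:
        return ""
    return '<meta name="robots" content="noindex">'


def noindex_header(version_slug, indexed=INDEXED_VERSIONS):
    """Flavor 2: extra HTTP response headers, or {} if the version is indexed."""
    if version_slug in indexed:
        return {}
    return {"X-Robots-Tag": "noindex"}


for slug in ("stable", "latest", "0.19.7"):
    print(slug, repr(noindex_meta(slug)), noindex_header(slug))
```

Either flavor only works if the crawler is allowed to fetch the page: a URL blocked in robots.txt is never downloaded, so the noindex signal is never seen.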
In #2980 we discussed the fact that too many Kedro versions appear in search results.
We fixed that in #3030 by manually controlling which versions we wanted to be indexed.
This caused a number of issues though, most importantly #3710: we had accidentally excluded our subprojects from our search results.
We fixed that in #3729 in a somewhat unsatisfactory fashion. In particular, there are concerns about consistency and maintainability #3729 (comment) (see also #2600 (comment) about the problem of projects under kedro-org/kedro-plugins not having a stable version).
In addition, my mind has evolved a bit and I think we should only index 1 version in search engines: stable. There were concerns about users not understanding the flyout menu #2980 (comment) and honestly the latest part is also quite confusing (#2823, readthedocs/readthedocs.org#10674) but that's a whole separate discussion.
For now, the problems we want to solve are robots.txt, ideally by not having to ever touch it again.