Added a rate limiter for load reduction on the website #52

Open · wants to merge 3 commits into master

Conversation

@Bash- commented Dec 13, 2018

We found that some websites considered the scraper too resource-intensive. I have therefore added this configurable rate limiter, which makes it possible to decrease the number of requests per time period.
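For illustration, a minimal sketch of how the new config options could drive the ratelimit decorator in crawler.py (the fetch_url function and its body are hypothetical; only number_calls, call_period, and crawler_user_agent come from this PR's config):

    from urllib.request import Request, urlopen

    from ratelimit import limits, sleep_and_retry

    import config  # number_calls and call_period are added by this PR


    @sleep_and_retry  # when the limit is hit, sleep until the period resets
    @limits(calls=config.number_calls, period=config.call_period)
    def fetch_url(url):
        # With the defaults below (1 call per 15 s), the crawler issues
        # at most one request every fifteen seconds.
        req = Request(url, headers={"User-Agent": config.crawler_user_agent})
        return urlopen(req).read()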

@c4software (Owner)

Hi,

It's a nice addition; however, ratelimit does not seem to be part of the standard library. Is it an additional library (https://pypi.org/project/ratelimit/)?

@c4software self-requested a review December 13, 2018 15:58
@Bash- (Author) commented Dec 13, 2018

> Hi,
>
> It's a nice addition; however, ratelimit does not seem to be part of the standard library. Is it an additional library (https://pypi.org/project/ratelimit/)?

Thank you. Yes, that is indeed the library that is needed.

@@ -9,3 +9,6 @@
xml_footer = "</urlset>"

crawler_user_agent = 'Sitemap crawler'

number_calls = 1 # number of requests per call period
call_period = 15 # time in seconds per number of requests
Contributor:

The default should be no rate limiting, right?

Author:

No limit would be in line with how the original code works, yes. I could not find a default option in the ratelimit package; as a workaround, you could set the default to a very high number, though.

Contributor:

One possibility would be to have something like this in crawler.py:

from ratelimit import limits
import config

if config.number_calls is None or config.call_period is None:
    rate_limit_decorator = lambda func: func  # pass-through: no rate limiting
else:
    rate_limit_decorator = limits(calls=config.number_calls, period=config.call_period)

I haven't dealt with @sleep_and_retry, but I believe it should be possible to combine that into rate_limit_decorator as well; see the sketch below. Of course, this strategy comes at the cost of increasing the complexity of the code.
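A minimal sketch of what that combination could look like, assuming the config names from the diff above (one possible composition, not code from the PR):

    from ratelimit import limits, sleep_and_retry
    import config

    if config.number_calls is None or config.call_period is None:
        # No limits configured: keep the original, unthrottled behaviour.
        rate_limit_decorator = lambda func: func
    else:
        def rate_limit_decorator(func):
            # sleep_and_retry makes the wrapped function sleep until the
            # period resets instead of raising RateLimitException.
            return sleep_and_retry(
                limits(calls=config.number_calls, period=config.call_period)(func)
            )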

@Garrett-R (Contributor)

Coincidentally, I'm adding a requirements.txt file in this PR. If that is deemed desirable and the PR is merged, then you could add a line to that file:

ratelimit==2.2.1

Without this, I reckon this PR can't be merged; otherwise, it won't work for people who don't happen to have ratelimit pre-installed.

@Bash- (Author) commented Jun 27, 2020

@Garrett-R I think it would still be a nice addition.
@c4software Shall we include the ratelimiter?

@c4software (Owner)

Hi,

Rate limiting is a good addition, but I'm not a big fan of a third-party library. Not because it's a third-party library, but because I really like the idea of a « bare metal » tool.

What do you think, @Garrett-R @Bash-? Is having to rely on a third-party library not a problem for you?

@Garrett-R
Copy link
Contributor

Yeah, it's an interesting point, and I wasn't sure what you'd think of introducing a requirements.txt file. There's definitely some benefit to having no dependencies, and going from 0 to 1 dependencies is a big difference (while going from 1 to 2 is not). I think it's really up to you and your vision for this already successful project.

A couple of factors I'd be weighing if I were you:

  • Watch out for "boiling the frog". Given that the project currently has 0 dependencies, adding the first one might not seem worth it for any particular change, but if that decision is made multiple times, the changes in aggregate may well be worth adding dependencies for.
  • Dependencies make it a bit harder to build the repo from source and to use or contribute (this favors not adding dependencies).
  • How many more dependencies might we benefit from? If these are roughly the last 2, then we have an idea of the total benefit of adding dependencies. On the other hand, if we think there could be more in the future, adding these will make those contributions easier. I don't have a good sense here.
  • How hard is it to do without the dependencies? (It definitely seems doable to make both this change and my change without them, BTW.)

On the point about how tough the package is to use for folks, I actually think we should put this on PyPI (I made an issue here). In that case folks would just have to do:

pip install python-sitemap

and the extra dependencies will be handled automatically.

I'd probably cast my vote for having dependencies, but I can definitely see benefits to both approaches, so I support whatever you think is best.

@Garrett-R (Contributor)

You know, in light of the xz backdoor, I have a new appreciation for avoiding dependencies...

Maybe close this one?
