Replies: 3 comments 2 replies
-
As far as I can see, the strange HTML class names originate from Google and Whoogle just inherits them?
-
Yes, all classes are inherited from Google directly. They also change fairly frequently, as I've discovered in the past. It'd probably be better to use BeautifulSoup to look for patterns in the HTML rather than classes (i.e. extract all
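To make the "patterns rather than classes" idea concrete, here is a minimal sketch, not a tested scraper. It assumes that result links on the Whoogle page are plain absolute `http(s)` URLs pointing away from the instance, which may not hold for every version. The helper name `extract_external_links`, the `own_host` parameter and the sample markup are all made up for illustration.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup


def extract_external_links(html, own_host):
    """Return (text, url) pairs for every absolute link that leaves own_host.

    Hypothetical helper: it ignores class names entirely and relies only on
    the structure of the links themselves.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for anchor in soup.find_all("a", href=True):
        url = anchor["href"]
        parsed = urlparse(url)
        # Skip relative links (pagination, settings, etc.).
        if parsed.scheme not in ("http", "https"):
            continue
        # Skip links that point back to the search instance itself.
        if parsed.hostname == own_host:
            continue
        text = anchor.get_text(" ", strip=True)
        if text:
            results.append((text, url))
    return results


# Hypothetical markup; the class names are deliberately meaningless.
SAMPLE_HTML = """
<div class="x1"><a href="https://example.org/page"><h3>Example title</h3></a></div>
<div class="x1">no link here</div>
<div class="x2"><a href="/search?q=test">Next</a></div>
"""

links = extract_external_links(SAMPLE_HTML, own_host="localhost")
print(links)
```

Because nothing here depends on a class name, a rename on Google's side should not break it, although a change in how links are emitted still would.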
-
That seems to be quite evil of Google...
Thanks for the suggestion. I'll see what I can do.
-
Describe the bug
I am using Whoogle to scrape Google search results. I use Whoogle rather than Google directly because Google regularly asks my scraper to solve a reCAPTCHA.
My technical setup is Python with Selenium and Firefox as the webdriver. The rendered HTML gets parsed with Beautiful Soup.

The naming scheme of the HTML classes on Whoogle's search result page makes it very difficult to scrape and parse what I am actually looking for. I would like to scrape the website's title, the description and the URL.
For example, the URLs on the search results page are contained in `<div class="kCrYT">`. However, there are other elements of the same class that don't contain a link. It would be very helpful for the `<a href="">` tag to have its own class.

The website description is contained in `<div class="BNeawe s3v9rd AP7Wnd">`. The links to the previous page and the next page share the same `class="nBDE1b G5eFlf"`. It would be helpful for both page links to have different classes or specific ids.

To Reproduce
Steps to reproduce the behavior:
Deployment Method
Download git repo and use the `run` executable

Version of Whoogle Search
Whoogle Search v0.3.0
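As a stopgap for the class-name problem described above, the `kCrYT` blocks that actually wrap a link can be separated from the ones that don't with a CSS selector. This is only a sketch: the sample markup is hypothetical, it assumes the `<a>` is a direct child of the `div`, and the class names are the ones quoted in this report, which may change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class names quoted in this report.
SAMPLE_HTML = """
<div class="kCrYT"><a href="https://example.org/"><h3>Example</h3></a></div>
<div class="kCrYT"><div class="BNeawe s3v9rd AP7Wnd">An example description.</div></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only the kCrYT blocks that directly wrap a link are matched;
# kCrYT blocks without an <a href> child are skipped.
result_urls = [a["href"] for a in soup.select("div.kCrYT > a[href]")]

# Match elements carrying all three description classes, in any order.
descriptions = [
    d.get_text(strip=True) for d in soup.select("div.BNeawe.s3v9rd.AP7Wnd")
]

print(result_urls)
print(descriptions)
```

If the link turns out to be nested deeper, `div.kCrYT a[href]` (descendant instead of child) would be the looser variant.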
Notes to myself: