cookies.txt is not properly used during crawls #21

Open

systwi-again opened this issue May 12, 2022 · 5 comments

systwi-again commented May 12, 2022

What I wanted/expected: Cookies read from the provided cookies.txt to be used during crawls with wpull.

What happened: wpull ignores the provided cookies.txt file and crawls without it.

The command or website that causes the problem: --load-cookies=/absolute/path/to/cookies.txt

Operating system: Debian GNU/Linux 11 (x86_64)

Python version: 3.8.13

Wpull version: 3.0.9

Options used with wpull (obtained using grab-site's --which-wpull-args-partial):

-U 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'
--header 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
--header 'Accept-Language: en-US,en;q=0.5'
--no-check-certificate
--no-robots
--inet4-only
--dns-timeout 20
--connect-timeout 20
--read-timeout 900
--session-timeout 172800
--tries 3
--waitretry 5
--max-redirect 8
--output-file wpull.log
--database wpull.db
--save-cookies cookies.txt
--delete-after
--page-requisites
--concurrent 2
--warc-file example.com-2022-05-12-099f53ca
--warc-max-size 5368709120
--warc-cdx
--strip-session-id
--escaped-fragment
--level inf
--page-requisites-level 5
--span-hosts-allow page-requisites,linked-pages
--debug-manhole
--sitemaps
--load-cookies=/absolute/path/to/cookies.txt
--keep-session-cookies
https://example.com/

Further details and a temporary workaround are described here.

Even after giving cookies.txt 777 permissions, wpull still refuses to use the cookies it contains during crawls.

The filesystem used for everything is ext4, has no I/O errors, has ample free space, passes fsck.ext4, and the absolute path contains no spaces or special characters of any kind (just lowercase a-z).

cookies.txt was exported using version 0.3 of this Firefox extension under Firefox 78.15.0esr on the same OS, and was not modified after exporting.
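One way to separate "file unreadable" from "cookies loaded but not sent" is to try parsing the file directly with Python's standard cookie jar, which wpull's cookie handling is, as far as I can tell, built on (worth confirming against the wpull source). A minimal sketch, using the placeholder path from this report:

```python
# Parse check for the exported cookies.txt, assuming wpull's loader behaves
# like the stdlib http.cookiejar.MozillaCookieJar it builds on.
from http.cookiejar import LoadError, MozillaCookieJar

jar = MozillaCookieJar("/absolute/path/to/cookies.txt")  # placeholder path
try:
    # ignore_discard/ignore_expires roughly mirror --keep-session-cookies,
    # so session cookies are not dropped at load time.
    jar.load(ignore_discard=True, ignore_expires=True)
except LoadError as err:
    print(f"cookies.txt did not parse: {err}")
else:
    print(f"parsed {len(jar)} cookie(s)")
    for cookie in jar:
        print(cookie.domain, cookie.name)
```

If this raises LoadError, the file itself is the problem; if it prints the expected cookies, the failure is happening later, inside wpull.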

TheTechRobo commented May 12, 2022

I can't test this (busy), sorry.

To potentially narrow down when this bug happens, could you go to http://thetechrobo.ca:1111, verify it says that the cookie isn't set, then go to http://thetechrobo.ca:1111/set to set the cookie? Then export the cookies, run wpull on http://thetechrobo.ca:1111 with the cookies, and see if it says the cookie is set.
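That check can also be approximated outside wpull by loading the exported file with the stdlib cookie jar and requesting the test URL with it. A rough sketch, again with a placeholder path:

```python
# Load the exported cookies.txt and request the test URL with it; the
# response body should report whether the cookie was sent.
import urllib.request
from http.cookiejar import MozillaCookieJar

jar = MozillaCookieJar("/absolute/path/to/cookies.txt")  # placeholder path
jar.load(ignore_discard=True, ignore_expires=True)  # keep session cookies too

opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
with opener.open("http://thetechrobo.ca:1111") as response:
    print(response.read().decode())
```

If the response still reports the cookie as unset, the exported file (rather than wpull) is the likely culprit.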

systwi-again (Author) commented:

Hmm, oddly enough, your site did work with --load-cookies. I can't explain why it's an outlier...

TheTechRobo commented:

Again, I've had --load-cookies work before, like with Planet French, but not with Infos-Ados.

Are you sure there aren't any #HttpOnly lines in the cookies.txt...?

systwi-again (Author) commented May 19, 2022

Okay, I thought maybe it was an issue with my particular setup for some reason.

Regarding #HttpOnly lines, there is just one instance. The site is my school's proprietary web portal, which I'm trying to save. I can only save it using the aforementioned workaround, which doesn't send any #HttpOnly cookies anyway, so I take it that cookie isn't that important. ¯\_(ツ)_/¯ I don't know.
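A plausible explanation for why that one line matters: the stdlib MozillaCookieJar treats every line beginning with # as a comment, so entries exported with the #HttpOnly_ prefix are silently dropped at load time, and wpull's loader, assuming it builds on the stdlib jar, would inherit that behavior. If the portal's session cookie is the HttpOnly one, the crawl would behave exactly as described here. A common workaround is to strip the prefix before loading; a sketch with hypothetical file names:

```python
# Browsers export HttpOnly cookies with a "#HttpOnly_" prefix, e.g.:
#   #HttpOnly_.example.com	TRUE	/	FALSE	0	sessionid	abc123
# The stdlib MozillaCookieJar skips every line starting with "#", so such
# entries never load. Stripping the prefix turns them into normal entries.
import re
from http.cookiejar import MozillaCookieJar

def strip_httponly_prefix(src_path: str, dst_path: str) -> None:
    """Copy a cookies.txt, removing the "#HttpOnly_" comment prefix."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(re.sub(r"^#HttpOnly_", "", line))

# Hypothetical file names for illustration:
strip_httponly_prefix("cookies.txt", "cookies.stripped.txt")
jar = MozillaCookieJar("cookies.stripped.txt")
jar.load(ignore_discard=True, ignore_expires=True)
print(f"loaded {len(jar)} cookie(s)")
```

This only rules out the load step; whether wpull then sends the cookie on matching requests is a separate question.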


TheTechRobo commented May 19, 2022

What was that workaround? I can't find it. Never mind, found it.
