Cookies not staying #187

Open
TheTechRobo opened this issue Jun 24, 2021 · 5 comments
Comments

@TheTechRobo
Contributor

TheTechRobo commented Jun 24, 2021

I'm trying to archive infos-ados.com. After I get the cookies (to stay signed in) and pass them through to grab-site, the WARCs aren't signed in, and my session in the browser expires. How do I fix this?

@TheTechRobo
Contributor Author

TheTechRobo commented Jun 24, 2021

Update: Actually no, maybe my session doesn't expire??

In any case, I still can't stay signed in by adding the cookies, even after adding my user-agent.

@systwi-again

thuban and I were troubleshooting this very issue on #archiveteam-bs a while ago. It seems to be an issue with wpull, where it does not import the cookies.txt file as instructed.
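For reference, this is roughly the sort of invocation that doesn't seem to pick the cookies up (the cookies.txt path here is just a placeholder):

#!/bin/bash
# Pass a Netscape-format cookies.txt through to wpull using its
# wget-style options; per the above, the file does not appear to be imported.
~/gs-venv/bin/grab-site --1 --wpull-args='--load-cookies /path/to/cookies.txt --keep-session-cookies' 'https://auth.example.com/home.html'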

There is a workaround for the time being, however (steps written assuming you're using Firefox):

  1. Launch a web browser that supports copying the cURL request of a loaded resource (e.g. Firefox)
  2. Press F12 (or fn+F12 on some keyboards), or click on Tools > Web Developer > Toggle Tools to show the web developer toolbox
  3. In the web developer toolbox, click the "Network" tab
  4. Enter and load the website/webpage you wish to crawl in that browser tab/window (you need to be logged in, of course)
  5. In the filter text box, paste in the same URL and click the "All" type filter button (or "HTML")
    5.5. If nothing comes up, gradually truncate the end of the URL until something appears in the list
  6. Right click (or Control-click on macOS) on the entry and click Copy > Copy as cURL
  7. Paste the clipboard data into a new text document and look for -H 'Cookie: ...'. If you don't see this, try choosing a different entry in the list (there's an example of a copied cURL request after these steps)
  8. Remove everything else from that curl command, keeping only the entire cookie entry (Cookie: ...)
  9. Craft your grab-site command in the text file like the example below:
#!/bin/bash
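# Note: the '"'"' sequences are just a shell trick to embed literal single
# quotes inside the single-quoted --wpull-args value, so wpull ends up seeing:
#   --header 'Cookie: SESSIONID=...; EXPIRE=1652325000' --keep-session-cookies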
~/gs-venv/bin/grab-site --1 --wpull-args='--header '"'"'Cookie: SESSIONID=848a0415-98c0-45fc-b281-b805e470b714; EXPIRE=1652325000'"'"' --keep-session-cookies' 'https://auth.example.com/home.html'

  10. Save the text file, chmod +x it and run it. The page should then be saved using the provided cookies.
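For illustration, the copied cURL request will look something like this (the header values here are just placeholders); the quoted Cookie: header is the only part you need to carry over into the script above:

curl 'https://auth.example.com/home.html' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0' \
  -H 'Accept: text/html,application/xhtml+xml' \
  -H 'Cookie: SESSIONID=848a0415-98c0-45fc-b281-b805e470b714; EXPIRE=1652325000'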

It's also worth noting that some cookie lines, primarily (or solely?) ones that begin with #, are a newer Mozilla addition outside the Netscape spec, and are thus not supported by grab-site, wpull or even wget at the time of writing (1652325299). If your website happens to require such cookies, your crawl may not work at all. As an extra archival measure, I also export a cookies.txt using this Firefox extension and move it into the grab-site output directory when the crawl is complete. It's better than nothing, I suppose.
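If you do find #HttpOnly_ lines in your export, one workaround I've seen suggested (I haven't verified it with wpull myself) is to strip that prefix so the parser no longer treats those lines as comments:

# Strip the "#HttpOnly_" prefix so Netscape-format parsers stop treating
# those lines as comments. Keep the original file as a backup.
sed 's/^#HttpOnly_//' cookies.txt > cookies-fixed.txt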

@TheTechRobo
Contributor Author

Cookies work for me most of the time; I've recently crawled Planet French, which requires login. Infos-Ados didn't work, and I don't have the cookies anymore to check whether there are any #HttpOnly lines.

Yeah, the #HttpOnly ones gave me headaches in my DeviantArt scraper.

@TomLucidor

@TheTechRobo wait, DeviantArt? Isn't that the job of Gallery-DL, or are you trying to get other things from them?

@TheTechRobo
Contributor Author

@TomLucidor I wasn't aware of DeviantArt back then.
