Relative URLs are parsed incorrectly #48

Open
ghost opened this issue Jul 6, 2018 · 2 comments

ghost commented Jul 6, 2018

If http://domain/dir/page1.html contains a link to page2.html, the parser resolves it to http://domain/page2.html. The correct result is http://domain/dir/page2.html.

Furthermore, on a page containing references to parent directories (..), self.clean_link changes these to a single dot (.).

I recommend using urllib.parse.urljoin(crawling_url, link) to convert a link into an absolute URL. This will handle everything except "//" in the path.
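
A minimal sketch of that suggestion, assuming crawling_url is the URL of the page being crawled and link is the raw href value found on it. The helper name to_absolute is hypothetical and is not part of the project's code:

```python
from urllib.parse import urljoin


def to_absolute(crawling_url, link):
    """Resolve `link` against the page it was found on.

    `to_absolute` is a hypothetical helper used for illustration only;
    it simply delegates to the standard library's urljoin.
    """
    return urljoin(crawling_url, link)


if __name__ == "__main__":
    page = "http://domain/dir/page1.html"

    # Sibling page: stays inside /dir/
    print(to_absolute(page, "page2.html"))      # http://domain/dir/page2.html

    # Parent directory reference: ".." is resolved, not turned into "."
    print(to_absolute(page, "../page2.html"))   # http://domain/page2.html

    # Root-relative link: replaces the whole path
    print(to_absolute(page, "/page2.html"))     # http://domain/page2.html

    # Already absolute link: returned unchanged
    print(to_absolute(page, "http://other/x.html"))  # http://other/x.html
```

Since urljoin resolves against the full URL of the current page rather than the site root, both the missing directory and the ".." cases described above should come out right without extra string handling.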


ghost commented Jul 11, 2018

I found another program that suits my needs.

Nevertheless, thanks to @c4software for this nice piece of software.

ghost closed this as completed Jul 11, 2018

c4software (Owner) commented

Sorry I didn't respond to your issue sooner.

I will fix the problem you found as soon as possible.

c4software reopened this Jul 11, 2018
c4software self-assigned this Jul 11, 2018