Relative URLs are parsed incorrectly #48

ghost · 2018-07-06T14:33:58Z

If http://domain/dir/page1.html contains a link to page2.html, the parser resolves it to http://domain/page2.html; the correct result is http://domain/dir/page2.html.

Furthermore, on a page containing references to parent directories (..), self.clean_link changes these to a single dot (.).

I recommend using urllib.parse.urljoin(crawling_url, link) to turn a link into an absolute URL. This will handle everything except "//" in the path.
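
A minimal sketch of that suggestion, using the example URLs from above. The helper name to_absolute is hypothetical and not part of the project; crawling_url and link are the names used in this report. It does not cover the "//" case.

```python
from urllib.parse import urljoin


def to_absolute(crawling_url, link):
    """Resolve `link` against the URL of the page it was found on.

    Hypothetical helper for illustration only; it simply delegates to
    urllib.parse.urljoin from the standard library.
    """
    return urljoin(crawling_url, link)


crawling_url = "http://domain/dir/page1.html"

# A sibling page stays in the same directory as the crawled page.
print(to_absolute(crawling_url, "page2.html"))     # http://domain/dir/page2.html

# A parent-directory reference (..) is resolved, not replaced by "."
print(to_absolute(crawling_url, "../page2.html"))  # http://domain/page2.html
```

Because the base URL of the current page is passed in explicitly, the result does not depend on any separate cleanup of the link before joining.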


ghost · 2018-07-11T07:44:12Z

I found another program that suits my needs.

Nevertheless thanks to @c4software for this nice piece of software.

c4software · 2018-07-11T08:01:54Z

Sorry I didn't answer your issue sooner…

I will fix the problem you found ASAP.

ghost closed this as completed Jul 11, 2018

c4software reopened this Jul 11, 2018

c4software self-assigned this Jul 11, 2018
