lxml.etree.ParserError: Document is empty #207
Comments
I'm facing the same issue for CNN articles (e.g. https://edition.cnn.com/2023/08/09/politics/georgia-medicaid-eligibility-work-requirements/index.html). It seems that |
I did some more analysis. It seems that for the same article goose3 works well while extruct crashes. Both libraries use Additionally, there is a To summarize, either of the two worked for me:
from goose3.utils.encoding import smart_str
html = '...'
extruct.extract(smart_str(html), syntaxes=['json-ld'])

from lxml.html import soupparser
html = '...'
extruct.extract(soupparser.fromstring(html), syntaxes=['json-ld']) |
Interesting! Thanks for the info. I'll definitely try that! |
Another option is to parse the HTML on your end and pass an already parsed tree (in lxml.html format) to the extruct library; most syntaxes support that in the latest release. For example, we're internally using an HTML5 parser https://github.com/kovidgoyal/html5-parser/ passing |
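To make the "parse it yourself, then pass the tree" suggestion concrete, here is a minimal sketch. The helper name `extract_from_tree` and its `parse`/`extract` parameters are hypothetical and only for illustration; it assumes, as the comment above states, that `extruct.extract` accepts an already parsed lxml.html tree in recent releases, and it defaults to `lxml.html.fromstring` as the parser.

```python
def extract_from_tree(html, parse=None, extract=None, **extract_kwargs):
    """Parse `html` on our side, then hand the parsed tree to extruct.

    `parse` is any callable that turns HTML into an lxml.html tree;
    `extract` is the extraction function. Both are hypothetical hooks
    so a different parser can be swapped in.
    """
    if parse is None:
        # Assumption: lxml is installed. Imported lazily so the helper
        # can be defined even where lxml is missing.
        import lxml.html
        parse = lxml.html.fromstring
    if extract is None:
        # Assumption: this extruct release accepts a parsed tree.
        import extruct
        extract = extruct.extract
    tree = parse(html)
    return extract(tree, **extract_kwargs)
```

With both libraries installed, a call could look like `extract_from_tree(r.content, syntaxes=["json-ld"])`. Keeping the parser as a parameter means a stricter HTML5 parser can be tried without touching the extraction call.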
Hi everyone, Example:
import requests
import extruct
u = "https://edition.cnn.com/2023/08/09/politics/georgia-medicaid-eligibility-work-requirements/index.html"
r = requests.get(u) # note that r.content is a bytes object
# crashes
extruct.extract(r.content.decode("utf-8"))
# works
extruct.extract(r.content) |
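Building on the observation above that passing `r.content` (bytes) works while the decoded string crashes, a small wrapper can normalise the input before calling extruct. This is only a sketch of that workaround, not a fix for the underlying cause: `ensure_bytes` and `extract_metadata` are hypothetical names, and UTF-8 is an assumed default encoding.

```python
def ensure_bytes(html, encoding="utf-8"):
    """Return `html` as bytes, encoding it first if it is a str."""
    if isinstance(html, bytes):
        return html
    return html.encode(encoding)


def extract_metadata(html, base_url=None):
    """Call extruct with bytes input, per the workaround reported above."""
    # Assumption: extruct is installed; imported lazily so that
    # ensure_bytes stays usable on its own.
    import extruct
    return extruct.extract(ensure_bytes(html), base_url=base_url)
```

For example, `extract_metadata(response.text, base_url=url)` would re-encode the text before parsing. Passing `response.content` directly avoids the decode and re-encode round trip when the raw bytes are available.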
I wrote a small script to try out the scraping process.
When I use extruct from the CLI, I get lots of information about the extracted schema:
extruct [url]
However, if I run the following for the same URL:
schema = extruct.extract(html_content, base_url=url)
I get the error "lxml.etree.ParserError: Document is empty".
The URL is valid, and the content of html_content (response.text) is valid and complete.
I also tried a fresh Python environment with only extruct installed, and I still get the error.
Any insights into why it fails when called from Python code?