lxml.etree.ParserError: Document is empty #207

Open
lironesamoun opened this issue Jul 3, 2023 · 5 comments

lironesamoun commented Jul 3, 2023

I made a small script to try the scraping process.

When I use extruct as a CLI, I get lots of information about the extracted schema:
extruct [url]

However, if for the same URL I call schema = extruct.extract(html_content, base_url=url), I get the error "lxml.etree.ParserError: Document is empty".
The URL is valid, and the content of html_content (response.text) is valid and complete.

I also tried with a fresh Python environment in which I installed only extruct, and I still get the error.

import requests
import sys
import extruct

def get_html(url):
    # Return the page body on HTTP 200, otherwise None
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        return None

# Check if URL is provided as a command-line argument
if len(sys.argv) < 2:
    print("Please provide a URL as a command-line argument.")
    sys.exit(1)

url = sys.argv[1]  # Get the URL from the command-line argument
html_content = get_html(url)
if html_content:
    #print(html_content)
    schema = extruct.extract(html_content, base_url=url)
    print(schema)
else:
    print("Failed to retrieve HTML.")

Any insights into why it fails when using the Python code?

@Vasniktel

I'm facing the same issue with CNN articles (e.g. https://edition.cnn.com/2023/08/09/politics/georgia-medicaid-eligibility-work-requirements/index.html). It seems that lxml.etree.fromstring(resp.text, parser=lxml.html.HTMLParser()) returns None for some reason. I haven't investigated further, but it looks like an issue with lxml.
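
A minimal sketch that reproduces this observation (assuming requests and lxml are installed; the page content may have changed since this was reported):

import requests
import lxml.etree
import lxml.html

url = ("https://edition.cnn.com/2023/08/09/politics/"
       "georgia-medicaid-eligibility-work-requirements/index.html")
resp = requests.get(url)

# lxml returns None here instead of raising, which is presumably why
# extruct then fails with "Document is empty" on the same input.
tree = lxml.etree.fromstring(resp.text, parser=lxml.html.HTMLParser())
print(tree)  # expected: an Element; observed per the report: None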

Vasniktel commented Aug 9, 2023

I did some more analysis. It seems that for the same article goose3 works well while extruct crashes, and both libraries use lxml. The only difference is that goose3 applies its goose3.utils.encoding.smart_str function first (https://github.com/goose3/goose3/blob/d3c404a79e0e15b7957355083bd5a7590d4103ba/goose3/parsers.py#L59). I checked it manually, and it seems to do the trick for me.

Additionally, there is an lxml.html.soupparser module that can also be used.

To summarize, either of the two worked for me:

# Option 1: re-encode the HTML with goose3's smart_str
import extruct
from goose3.utils.encoding import smart_str

html = '...'
extruct.extract(smart_str(html), syntaxes=['json-ld'])

# Option 2: pre-parse with BeautifulSoup via lxml.html.soupparser
import extruct
from lxml.html import soupparser

html = '...'
extruct.extract(soupparser.fromstring(html), syntaxes=['json-ld'])
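
For what it's worth, both options appear to sidestep the failing lxml string parse in different ways: smart_str re-encodes the text to a UTF-8 byte string before lxml sees it, while soupparser builds the tree with BeautifulSoup and hands extruct an already-parsed document.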

@lironesamoun (Author)

Interesting! Thanks for the info. I'll definitely try that!

lopuhin (Member) commented Aug 11, 2023

Another option is to parse the HTML on your end and pass an already-parsed tree (in lxml.html format) to the extruct library; most syntaxes support that as of the last release. For example, we internally use the HTML5 parser https://github.com/kovidgoyal/html5-parser/ with treebuilder='lxml_html' (happy to share more details), which is more compatible than the default lxml.html parser.
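
A sketch of that approach (assuming html5-parser is installed via pip install html5-parser, and that extruct.extract accepts a pre-parsed tree, as described above; the URL and markup here are placeholders):

import extruct
from html5_parser import parse

url = 'https://example.com/'   # hypothetical page
html = b'<html>...</html>'     # raw response bytes

# Build an lxml.html tree with the more lenient HTML5 parser,
# then hand extruct the tree rather than the raw markup.
tree = parse(html, treebuilder='lxml_html')
data = extruct.extract(tree, base_url=url, syntaxes=['json-ld'])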

trifle commented Oct 3, 2023

Hi everyone,
I believe this is fundamentally an encoding issue, as @Vasniktel suggested. To prevent it, feed extruct bytes directly instead of (mistakenly) UTF-8-decoded strings.

Example:

import requests
import extruct

u = "https://edition.cnn.com/2023/08/09/politics/georgia-medicaid-eligibility-work-requirements/index.html"
r = requests.get(u) # note that r.content is a bytes object

# crashes
extruct.extract(r.content.decode("utf-8"))
# works
extruct.extract(r.content)
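
If you also need base_url as in the original script, extruct.extract(r.content, base_url=u) should work the same way.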
