Remove DIVs, style stuff and normalize HTML preserving structure information

zytedata/clear-html

clear-html

Clean and normalize HTML while preserving embedded content (e.g. Twitter and Instagram posts).

Install the library with pip:

pip install clear-html

Example usage with lxml:

from lxml.html import fromstring
from clear_html import clean_node, cleaned_node_to_html

html = """
        <div style="color:blue" id="main_content">
            Some text to be
            <div>cleaned up!</div>
        </div>
     """
node = fromstring(html)
cleaned_node = clean_node(node)
cleaned_html = cleaned_node_to_html(cleaned_node)
print(cleaned_html)

Example usage with Parsel:

from parsel import Selector
from clear_html import clean_node, cleaned_node_to_html

selector = Selector(text="""<html>
                            <body>
                                <h1>Hello!</h1>
                                <div style="color:blue" id="main_content">
                                    Some text to be
                                    <div>cleaned up!</div>
                                </div>
                            </body>
                            </html>""")
selector = selector.css("#main_content")
cleaned_node = clean_node(selector[0].root)
cleaned_html = cleaned_node_to_html(cleaned_node)
print(cleaned_html)

Both approaches above print the following:

<article>

<p>Some text to be</p>

<p>cleaned up!</p>

</article>

Other interesting functions:

  • cleaned_node_to_text: convert the cleaned node to plain text
  • formatted_text.clean_doc: low-level method that gives control over more aspects of the cleanup
