- `UrlStore.get_download_urls()`: `timelimit` removed, fix type hints (#119, 19c580e)
- `extract_links()`: deprecate `base_url` parameter (#121)
- setup: simplify workflow (#118)
- `UrlStore` compression: make bz2 and zlib optional, update pickle protocol (#113)
- `extract_links()`: review and document, add deprecation warning for `base_url` argument (#115)
- maintenance: add `__all__` to `__init__.py` and lint code (#116)
- parsing: validate netloc with port number by @naz-theori in #104
- cleaning: fix handling of apostrophes (#107)
- maintenance: deprecate Python 3.6 & 3.7, add `pyproject.toml` setup file (#59, #105)
- more compact UrlStore: use bytes instead of str for URL paths (#88)
- UrlStore maintenance: deprecate `timelimit` argument (#101)
- maintenance: simplify code (#103)
- support for Python 3.13
- replace `langcodes` by `babel` and use its information on locales (#89, #92)
- simplified and faster code: domain extraction, cleaning, filters and UrlStore (#90, #93, #94, #95)
- UrlStore: better URL batches, replace `timelimit` parameter by `time_limit` (#91)
- maintenance: update readme and convert it to markdown (#97)
- license change from GPLv3+ to Apache 2.0 (#81)
- UrlStore: `write()` method and `load_store()` function added (#83)
- add parameter `trailing_slash` to keep or discard slashes at the end of URLs (#52)
- maintenance: fix whitespace in `clean_url()` (#77), simplify code (#79)
- IRI to URI normalization: encode path, query and fragments (#58, #60)
- normalization: strip common trackers (#65)
- new function `is_valid_url()` (#63)
- hardening of domain filter (#64)
- new UrlStore functions: `add_from_html()` (#42), `discard()` (#44), `get_unvisited_domains`
- CLI: removed `--samplesize`, use `--sample` with an integer instead (#54)
- added plausibility filter for domains/hosts (#48)
- speedups and more efficient processing (#47, #49, #50)
- fixed handling of relative URLs with @feltcat in #46
- fixed bugs and ensured compatibility (#41, #43, #51, #56)
- official support for Python 3.12
- more efficient URL parsing (#33)
- refined link extraction and link filters (#30, #36)
- more efficient normalization (#32)
- more efficient sampling strategy (#31, #35)
- added meta function to clear LRU caches (#34)
- added parallel option in command-line interface (#37, #39)
- added `get_unvisited_domains()` method to `UrlStore` (#40)
- add blogspot archives to type filter
- maintenance: upgrade `urllib3` and review code
- network tests: larger throughput
- UrlStore: optional compression of rules (#21), added `reset()` (#22) and `get_all_counts()` methods
- UrlStore fixes: `signal` in #18, `total_url_number`
- updated Readme
- hardening of filters and URL parses (#14)
- normalize punycode to unicode
- methods added to `UrlStore`: `get_crawl_delay()`, `print_unvisited_urls()`
- `UrlStore` now triggers exit code 1 when interrupted
- argument added to `extract_links()`: `no_filter`
- code refactoring: simplifications
- fixed bug in domain name extraction
- uniform logging parameters
- full type hinting
- maintenance: code linted
- add type annotations and check with `mypy`
- `url_filter()` function moved from Trafilatura
- code style: use `black`
- performance optimizations
- fast track for domain extraction (`extract_domain(url, fast=True)`), now taking subdomains into account
- UrlStore: threading lock and convenience functions added
- bug in sampling fixed
- UrlStore: validation by default
- UrlStore class added: data store containing URLs with relevant information
- code cleaning and maintenance (bugs, simplification)
- reviewed code base: simplicity and execution speed
- dropped support for Python 3.5
- more complex language heuristics, use langcodes
- extended blacklists and whitelists
- more precise filters and more efficient code
- support for Python 3.10
- enhanced cleaning
- fixed language filter
- keep trailing slashes to avoid redirection
- fixes: normalization and crawlable URLs
- URL manipulation tools added: extract parts, fix relative URLs
- filters added: language, navigation and crawls
- more robust link handling and extraction
- removed support for Python 3.4
- improve filter precision
- reduced dependencies: replace requests with bare urllib3, and tldextract with tld for Python 3.6 upwards
- better path and fragment normalization
- Python 3.9 compatibility
- Simplified imports
- Bug fixes
- English and German language filters
- Function to detect external links
- Support for domain blacklisting
- Less aggressive strict filters
- CLI bug fixed
- Cleaner and more efficient filtering
- Helper functions to scrub, clean and normalize
- Removed two dependencies with more extensive usage of urllib.parse
- Cleaning and filtering targeting non-spam HTML pages with primarily text
- URL validation
- Sampling by domain name
- Command-line interface (CLI) and Python tool
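
Several entries above concern URL normalization: the `trailing_slash` parameter, IRI to URI encoding of paths and fragments, and the stripping of common trackers. The following is a minimal standard-library sketch of what such a normalization step can look like. It is not the library's implementation: the function name `normalize_url` and the `TRACKER_PREFIXES` list are hypothetical and chosen only for illustration.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit

# Hypothetical tracker prefixes for illustration; not the library's actual list.
TRACKER_PREFIXES = ("utm_", "fbclid", "gclid")


def normalize_url(url: str, trailing_slash: bool = True) -> str:
    """Strip tracker parameters, percent-encode path and fragment,
    and optionally discard the trailing slash."""
    parts = urlsplit(url)
    # Keep only the query parameters that do not look like trackers.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKER_PREFIXES)
    ]
    # Percent-encode non-ASCII characters; "%" stays safe so that
    # already-encoded sequences are not encoded a second time.
    path = quote(parts.path, safe="/%")
    fragment = quote(parts.fragment, safe="%")
    # Discard trailing slashes on request, but never reduce "/" to "".
    if not trailing_slash and len(path) > 1:
        path = path.rstrip("/")
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), path, urlencode(kept), fragment)
    )


print(normalize_url("https://Example.org/caf\u00e9/?utm_source=x&id=1", trailing_slash=False))
```

Keeping `%` in the safe set is a deliberate choice in this sketch: it makes the function idempotent on input that is already percent-encoded.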