Skip to content

Latest commit

 

History

History
222 lines (129 loc) · 5.18 KB

HISTORY.md

File metadata and controls

222 lines (129 loc) · 5.18 KB

History / Changelog

1.3.2

  • UrlStore.get_download_urls(): timelimit removed, fix type hints (#119, 19c580e)
  • extract_links(): deprecate base_url parameter (#121)
  • setup: simplify workflow (#118)

1.3.1

  • UrlStore compression: make bz2 and zlib optional, update pickle protocol (#113)
  • extract_links(): review and document, add deprecation warning for base_url argument (#115)
  • maintenance: add __all__ to init.py and lint code (#116)

1.3.0

  • parsing: validate netloc with port number by @naz-theori in #104
  • cleaning: fix handling of apostrophes (#107)
  • maintenance: deprecate Python 3.6 & 3.7, add pyproject.toml setup file (#59, #105)

1.2.0

  • more compact UrlStore: use bytes instead of str for URL paths (#88)
  • UrlStore maintenance: deprecate timelimit argument (#101)
  • maintenance: simplify code (#103)
  • support for Python 3.13

1.1.0

  • replace langcodes by babel and use its information on locales (#89, #92)
  • simplified and faster code: domain extraction, cleaning, filters and UrlStore (#90, #93, #94, #95)
  • UrlStore: better url batches, replace timelimit parameter by time_limit (#91)
  • maintenance: update readme and convert it to markdown (#97)

1.0.0

  • license change from GPLv3+ to Apache 2.0 (#81)
  • UrlStore: write() method and load_store() function added (#83)
  • add parameter trailing_slash to keep of discard slashes at the end of URLs (#52)
  • maintenance: fix whitespace in clean_url() (#77), simplify code (#79)

0.9.5

  • IRI to URI normalization: encode path, query and fragments (#58, #60)
  • normalization: strip common trackers (#65)
  • new function is_valid_url() (#63)
  • hardening of domain filter (#64)

0.9.4

  • new UrlStore functions: add_from_html() (#42), discard() (#44), get_unvisited_domains
  • CLI: removed --samplesize, use --sample with an integer instead (#54)
  • added plausibility filter for domains/hosts (#48)
  • speedups and more efficient processing (#47, #49, #50)
  • fixed handling of relative URLs with @feltcat in #46
  • fixed bugs and ensured compatibility (#41, #43, #51, #56)
  • official support for Python 3.12

0.9.3

  • more efficient URL parsing (#33)
  • refined link extraction and link filters (#30, #36)
  • more efficient normalization (#32)
  • more efficient sampling strategy (#31, #35)
  • added meta function to clear LRU caches (#34)
  • added parallel option in command-line interface (#37, #39)
  • added get_unvisited_domains() method to UrlStore (#40)

0.9.2

  • add blogspot archives to type filter
  • maintenance: upgrade urllib3 and review code

0.9.1

  • network tests: larger throughput
  • UrlStore: optional compression of rules (#21), added reset() (#22) and get_all_counts() methods
  • UrlStore fixes: signal in #18, total_url_number
  • updated Readme

0.9.0

  • hardening of filters and URL parses (#14)
  • normalize punicode to unicode
  • methods added to UrlStore: get_crawl_delay(), print_unvisited_urls()
  • UrlStore now triggers exit code 1 when interrupted
  • argument added to extract_links(): no_filter
  • code refactoring: simplifications

0.8.3

  • fixed bug in domain name extraction
  • uniform logging parameters

0.8.2

  • full type hinting
  • maintenance: code linted

0.8.1

  • add type annotations and check with mypy
  • url_filter() function moved from Trafilatura
  • code style: use black

0.8.0

  • performance optimizations
  • fast track for domain extraction (extract_domain(url, fast=True)), now taking subdomains into account

0.7.2

  • UrlStore: threading lock and convenience functions added

0.7.1

  • bug in sampling fixed
  • UrlStore: validation by default

0.7.0

  • UrlStore class added: data store containing URLs with relevant information
  • code cleaning and maintenance (bugs, simplification)

0.6.0

  • reviewed code base: simplicity and execution speed
  • dropped support for Python 3.5

0.5.0

  • more complex language heuristics, use langcodes
  • extended blacklists and whitelists
  • more precise filters and more efficient code
  • support for Python 3.10

0.4.2

  • enhanced cleaning
  • fixed language filter

0.4.1

  • keep trailing slashes to avoid redirection
  • fixes: normalization and crawlable URLs

0.4.0

  • URL manipulation tools added: extract parts, fix relative URLs
  • filters added: language, navigation and crawls
  • more robust link handling and extraction
  • removed support for Python 3.4

0.3.1

  • improve filter precision

0.3.0

  • reduced dependencies: replace requests with bare urllib3, and tldextract with tld for Python 3.6 upwards
  • better path and fragment normalization

0.2.3

  • Python 3.9 compatibility
  • Simplified imports
  • Bug fixes

0.2.2

  • English and German language filters
  • Function to detect external links
  • Support for domain blacklisting

0.2.1

  • Less aggressive strict filters
  • CLI bug fixed

0.2.0

  • Cleaner and more efficient filtering
  • Helper functions to scrub, clean and normalize
  • Removed two dependencies with more extensive usage of urllib.parse

0.1.0

  • Cleaning and filtering targeting non-spam HTML pages with primarily text
  • URL validation
  • Sampling by domain name
  • Command-line interface (CLI) and Python tool