Releases: apache/incubator-stormcrawler

Apache StormCrawler 3.1.0 (Incubating)

13 Sep 09:35

Disclaimer

Apache StormCrawler is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Release Summary

This is our second release after joining the ASF Incubator as a podling. It contains the new Playwright module, which can be used for scraping dynamic content.

What's Changed

New Contributors

  • @sigee made their first contribution in #1255
  • @github-actions made their first contribution in #1280

Full Changelog: stormcrawler-3.0...stormcrawler-3.1.0

Apache StormCrawler 3.0 (Incubating)

07 May 09:04

Disclaimer

Apache StormCrawler is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Release Summary

This is our first release after joining the ASF Incubator as a podling. It is a breaking release: the group ids have been renamed and the elasticsearch module has been removed.

What's Changed

New Contributors

Full Changelog: 2.11...stormcrawler-3.0

StormCrawler 2.11

02 Jan 12:57

Disclaimer

This is a Pre-ASF release and did not undergo a formal review by the PMC.

What's Changed

New Contributors

Full Changelog: 2.10...2.11

What's new in StormCrawler 2.10

25 Oct 13:58

Disclaimer

This is a Pre-ASF release and did not undergo a formal review by the PMC.

What's Changed

and a lot more!

Full Changelog: 2.9...2.10

See https://digitalpebble.blogspot.com/2023/10/focus-on-protocol-improvements-in.html for more details on the protocol improvements

What's new in StormCrawler 2.9

04 Sep 13:52

Disclaimer

This is a Pre-ASF release and did not undergo a formal review by the PMC.

What's Changed

New Contributors

Full Changelog: 2.8...2.9

What's new in StormCrawler 2.8

28 Mar 15:17

Disclaimer

This is a Pre-ASF release and did not undergo a formal review by the PMC.

What's Changed

New Contributors

Full Changelog: 2.7...2.8

What's new in StormCrawler 2.7

20 Dec 15:28

Disclaimer

This is a Pre-ASF release and did not undergo a formal review by the PMC.

What's Changed

  • Dependency upgrades #1016
  • OpenSearch module in #1011
  • Maven archetype for OpenSearch
  • [WARC] Backward compatible storage of HTTP/2 headers by @sebastian-nagel in #1010
  • Ignore empty fields indexer in #1019
  • Handle single quotes in value of http-equiv="refresh" #1020
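
To illustrate the last item, here is a minimal sketch of extracting the redirect target from the content value of an http-equiv="refresh" meta tag when the URL is wrapped in single quotes, double quotes, or no quotes at all. It is not the StormCrawler implementation: the class name MetaRefreshSketch, the method extractUrl, and the regular expression are all hypothetical and only show the kind of input the fix addresses.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of StormCrawler.
public class MetaRefreshSketch {

    // Optional delay, optional ';' or ',' separator, "url=", then the target,
    // which may be single-quoted, double-quoted, or unquoted.
    private static final Pattern REFRESH =
            Pattern.compile(
                    "^\\s*\\d*(?:\\.\\d+)?\\s*[;,]?\\s*url\\s*=\\s*"
                            + "(?:'([^']*)'|\"([^\"]*)\"|([^'\"\\s]\\S*))\\s*$",
                    Pattern.CASE_INSENSITIVE);

    /** Returns the redirect target found in a refresh content value, if any. */
    public static Optional<String> extractUrl(String content) {
        if (content == null) {
            return Optional.empty();
        }
        Matcher m = REFRESH.matcher(content);
        if (!m.matches()) {
            return Optional.empty();
        }
        // Exactly one of the three capture groups is set on a match.
        for (int g = 1; g <= 3; g++) {
            String value = m.group(g);
            if (value != null) {
                return Optional.of(value.trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractUrl("0; url='https://example.com/next'"));
        System.out.println(extractUrl("5;URL=\"https://example.com/a\""));
        System.out.println(extractUrl("0; url=https://example.com/b"));
        System.out.println(extractUrl("30")); // delay only, no redirect target
    }
}
```

A value that carries only a delay, such as "30", yields an empty Optional, so a caller can tell a plain page reload apart from a redirect.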

Full Changelog: 2.6...2.7

What's new in StormCrawler 2.6

28 Nov 10:19

Disclaimer

This is a Pre-ASF release and did not undergo a formal review by the PMC.

Highlights

Full Changelog: storm-crawler-2.5...2.6

What's new in StormCrawler 2.5

31 Aug 13:28

Disclaimer

This is a Pre-ASF release and did not undergo a formal review by the PMC.

In a nutshell

  • various dependency upgrades (JSoup, CrawlerCommons, Tika, Elasticsearch)
  • Java 11
  • bugfix: AggregationSpout sometimes does not release the IsInQuery boolean
  • various improvements to URLFrontier module

In more detail

  • FEATURE-964: custom crawl delay per page by @juli-alvarez in #967
  • Issue 970 HttpProtocol doesn't consider http.content.limit in test for filesize by @wowasa in #972
  • Add ChannelManager for local channel management and constants to Spout.java by @FelixEngl in #982
  • Fix error when spaces in path to test-resources of StatusBoltTest in ElasticSearch-Module by @FelixEngl in #985
  • Add unit test basics for URLFrontier. by @FelixEngl in #984
  • Fix starvation and busy waiting of StatusUpdaterBolt.java, add Constants. by @FelixEngl in #983
  • Fix starvation and busy waiting of ES StatusUpdaterBolt (Fixes #986) by @FelixEngl in #988
  • Fix starvation and busy waiting of ES IndexerBolt by @FelixEngl in #989
  • HttpProtocol use the md protocol.set-headers to add custom header by url by @Mikwiss in #993

New Contributors

Full Changelog: 2.4...storm-crawler-2.5

StormCrawler 2.4

13 Apr 10:46

Disclaimer

This is a Pre-ASF release and did not undergo a formal review by the PMC.

  • Upgrade to Apache Storm 2.4
  • Upgrade to Elasticsearch 7.17.2
  • bugfix Setting "maxDepth": 0 in urlfilter.json prevents ES seed injection #959
  • Allow compatibility.mode for rest client to connect to ES8+ #962

Full Changelog: 2.3...2.4