Releases: apache/incubator-stormcrawler

Apache StormCrawler 3.2.0 (Incubating)

22 Nov 21:39
949ff8d

What's Changed

  • Release 3.1.0 by @rzo1 in #1316
  • Bump Apache Storm from 3.1.1 to 2.6.4 & archetype 3.0 to 3.1.0 by @kunalpal97 in #1319
  • #1299 - Add DISCLAIMER to JAR files by @rzo1 in #1320
  • #1300 - Fix "files in jars have odd dates" by @rzo1 in #1321
  • Bump org.yaml:snakeyaml from 2.2 to 2.3 by @dependabot in #1307
  • Bump org.awaitility:awaitility from 4.2.0 to 4.2.2 by @dependabot in #1310
  • Bump org.jacoco:jacoco-maven-plugin from 0.8.11 to 0.8.12 by @dependabot in #1305
  • Bump org.netpreserve:jwarc from 0.29.0 to 0.30.0 by @dependabot in #1304
  • Bump org.apache.maven.plugins:maven-surefire-plugin from 3.2.1 to 3.5.0 by @dependabot in #1308
  • Bump aws.version from 1.12.663 to 1.12.772 by @dependabot in #1302
  • Bump org.apache.solr:solr-solrj from 9.6.1 to 9.7.0 by @dependabot in #1309
  • Bump com.microsoft.playwright:playwright from 1.46.0 to 1.47.0 by @dependabot in #1306
  • Bump org.wiremock:wiremock from 3.5.4 to 3.9.1 by @dependabot in #1311
  • Bump selenium.version from 4.24.0 to 4.25.0 by @dependabot in #1314
  • #1323 Update archetype Storm version to 2.6.4 by @mvolikas in #1325
  • Regenerated License file after dependency upgrades by @github-actions in #1322
  • Bump OpenSearch to 2.17 + fix archetype version in README by @jnioche in #1324
  • Bump org.mockito:mockito-core from 5.13.0 to 5.14.0 by @dependabot in #1334
  • Bump junit.version from 5.11.0 to 5.11.1 by @dependabot in #1333
  • Bump org.apache.maven.plugins:maven-archetype-plugin from 3.2.1 to 3.3.0 by @dependabot in #1332
  • Bump org.apache.maven.archetype:archetype-packaging from 3.2.1 to 3.3.0 by @dependabot in #1330
  • Regenerated License file after dependency upgrades by @github-actions in #1326
  • Regenerated License file after dependency upgrades by @github-actions in #1335
  • Bump log4j2.version from 2.23.0 to 2.24.1 by @dependabot in #1328
  • Regenerated License file after dependency upgrades by @github-actions in #1337
  • Bump org.jetbrains:annotations from 24.1.0 to 25.0.0 by @dependabot in #1331
  • Regenerated License file after dependency upgrades by @github-actions in #1338
  • Bump com.github.crawler-commons:urlfrontier-API from 2.3.1 to 2.4 by @dependabot in #1327
  • Regenerated License file after dependency upgrades by @github-actions in #1340
  • Store metadata as WARC Metadata records by @jnioche in #1341
  • Improve robustness of WARC generation by @jnioche in #1342
  • Bump org.apache.maven.plugins:maven-surefire-plugin from 3.5.0 to 3.5.1 by @dependabot in #1350
  • Bump junit.version from 5.11.1 to 5.11.2 by @dependabot in #1345
  • Fix configuration for Github's linguist by @mvolikas in #1344
  • Bump testcontainers.version from 1.20.1 to 1.20.2 by @dependabot in #1346
  • Bump org.mockito:mockito-core from 5.14.0 to 5.14.1 by @dependabot in #1349
  • Bump aws.version from 1.12.772 to 1.12.773 by @dependabot in #1351
  • Bump org.apache.maven.plugins:maven-javadoc-plugin from 3.10.0 to 3.10.1 by @dependabot in #1347
  • Regenerated License file after dependency upgrades by @github-actions in #1352
  • #1354 Fix: fix some typos in project by @psxjoy in #1355
  • Fix #1312 "Sha512 hash of source release is missing the file part" by @rzo1 in #1356
  • Bump de.thetaphi:forbiddenapis from 3.7 to 3.8 by @dependabot in #1359
  • Bump org.jetbrains:annotations from 25.0.0 to 26.0.0 by @dependabot in #1358
  • Regenerated License file after dependency upgrades by @github-actions in #1360
  • Trivial: version number in warc/README fix #1317 by @jnioche in #1363
  • Bugfix nofollow instructions in rel tags ignored by @jnioche in #1362
  • Bump org.jetbrains:annotations from 26.0.0 to 26.0.1 by @dependabot in #1368
  • Bump com.microsoft.playwright:playwright from 1.47.0 to 1.48.0 by @dependabot in #1366
  • Connect to a remote instance using web sockets by @jnioche in #1361
  • Bump aws.version from 1.12.773 to 1.12.776 by @dependabot in #1367
  • Bump org.mockito:mockito-core from 5.14.1 to 5.14.2 by @dependabot in #1369
  • Regenerated License file after dependency upgrades by @github-actions in #1370
  • Bump tika.version from 2.9.2 to 3.0.0 by @dependabot in #1365
  • Apache Storm 2.7.0 by @rzo1 in #1371
  • Regenerated License file after dependency upgrades by @github-actions in #1372
  • #1353 Fix for URLFrontier spout not taking into account the crawl ID by @klockla in #1373
  • Bump junit.version from 5.11.2 to 5.11.3 by @dependabot in #1375
  • Bump com.ibm.icu:icu4j from 75.1 to 76.1 by @dependabot in #1376
  • Bump aws.version from 1.12.776 to 1.12.777 by @dependabot in #1377
  • Bump org.wiremock:wiremock from 3.9.1 to 3.9.2 by @dependabot in #1378
  • Bump testcontainers.version from 1.20.2 to 1.20.3 by @dependabot in #1379
  • Remove references to ES in OpenSearch module by @jnioche in #1374
  • Regenerated License file after dependency upgrades by @github-actions in #1380
  • Fix #1313 "Exclude "__files" from Source Release Artifacts" by @rzo1 in #1384
  • #1301 - add build doc for the source release by @rzo1 in #1383
  • [1385] bugfix - check for null before the for-each loop by @jnioche in #1386
  • Sync conf files in root and archetype + explicit values for sniff conf by @jnioche in #1388
  • Detect multi addresses separated by ; in a single String. Fixes #1382 by @jnioche in #1387
  • Bump org.apache.maven.plugins:maven-archetype-plugin from 3.3.0 to 3.3.1 by @dependabot in #1390
  • Bump selenium.version from 4.25.0 to 4.26.0 by @dependabot in #1393
  • Bump org.apache.maven.plugins:maven-surefire-plugin from 3.5.1 to 3.5.2 by @dependabot in #1392
  • Bump org.apache.maven.plugins:maven-javadoc-plugin from 3.10.1 to 3.11.1 by @dependabot in #1394
  • Bump org.apache.maven.archetype:archetype-packaging from 3.3.0 to 3.3.1 by @dependabot in #1395
  • Regenerated License file after dependency upgrades by @github-actions in #1398
  • #620 Add support for shards - SolrSpout by @mvolikas in #1343
  • #1403 - Downgrade log4j2 to Storm's version. Fixes #1403 by @tballison in #1404
  • #140...

Apache StormCrawler 3.1.0 (Incubating)

13 Sep 09:35

Disclaimer

Apache StormCrawler is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Release Summary

This is our second release after joining the ASF incubator as a podling. It contains the new Playwright module, which can be used for scraping dynamic content.

What's Changed

New Contributors

  • @sigee made their first contribution in #1255
  • @github-actions made their first contribution in #1280

Full Changelog: stormcrawler-3.0...stormcrawler-3.1.0

Apache StormCrawler 3.0 (Incubating)

07 May 09:04

Disclaimer

Apache StormCrawler is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Release Summary

This is our first release after joining the ASF incubator as a podling. It is a breaking release: the group IDs have been renamed and the elasticsearch module has been removed.

What's Changed

New Contributors

Full Changelog: 2.11...stormcrawler-3.0

StormCrawler 2.11

02 Jan 12:57

Disclaimer

This is a Pre-ASF release and did not undergo a formal review by the PMC.

What's Changed

New Contributors

Full Changelog: 2.10...2.11

What's new in StormCrawler 2.10

25 Oct 13:58

Disclaimer

This is a Pre-ASF release and did not undergo a formal review by the PMC.

What's Changed

and a lot more!

Full Changelog: 2.9...2.10

See https://digitalpebble.blogspot.com/2023/10/focus-on-protocol-improvements-in.html for more details on the protocol improvements

What's new in StormCrawler 2.9

04 Sep 13:52

Disclaimer

This is a Pre-ASF release and did not undergo a formal review by the PMC.

What's Changed

New Contributors

Full Changelog: 2.8...2.9

What's new in StormCrawler 2.8

28 Mar 15:17

Disclaimer

This is a Pre-ASF release and did not undergo a formal review by the PMC.

What's Changed

New Contributors

Full Changelog: 2.7...2.8

What's new in StormCrawler 2.7

20 Dec 15:28

Disclaimer

This is a Pre-ASF release and did not undergo a formal review by the PMC.

What's Changed

  • Dependency upgrades #1016
  • OpenSearch module in #1011
  • Maven archetype for OpenSearch
  • [WARC] Backward compatible storage of HTTP/2 headers by @sebastian-nagel in #1010
  • Ignore empty fields indexer in #1019
  • Handle single quotes in value of http-equiv="refresh" #1020

Full Changelog: 2.6...2.7

What's new in StormCrawler 2.6

28 Nov 10:19

Disclaimer

This is a Pre-ASF release and did not undergo a formal review by the PMC.

Highlights

Full Changelog: storm-crawler-2.5...2.6

What's new in StormCrawler 2.5

31 Aug 13:28

Disclaimer

This is a Pre-ASF release and did not undergo a formal review by the PMC.

In a nutshell

  • various dependency upgrades (JSoup, CrawlerCommons, Tika, Elasticsearch)
  • Java 11
  • bugfix: AggregationSpout sometimes fails to release the IsInQuery boolean
  • various improvements to URLFrontier module

In more detail

  • FEATURE-964: custom crawl delay per page by @juli-alvarez in #967
  • Issue 970 HttpProtocol doesn't consider http.content.limit in test for filesize by @wowasa in #972
  • Add ChannelManager for local channel management and constants to Spout.java by @FelixEngl in #982
  • Fix error when spaces in path to test-resources of StatusBoltTest in ElasticSearch-Module by @FelixEngl in #985
  • Add unit test basics for URLFrontier. by @FelixEngl in #984
  • Fix starvation and busy waiting of StatusUpdaterBolt.java, add Constants. by @FelixEngl in #983
  • Fix starvation and busy waiting of ES StatusUpdaterBolt (Fixes #986) by @FelixEngl in #988
  • Fix starvation and busy waiting of ES IndexerBolt by @FelixEngl in #989
  • HttpProtocol use the md protocol.set-headers to add custom header by url by @Mikwiss in #993

New Contributors

Full Changelog: 2.4...storm-crawler-2.5