Releases: apache/beam
Beam 2.33.0 release
We are happy to present the new 2.33.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.33.0, check out the detailed release
notes.
Highlights
- Go SDK is no longer experimental, and is officially part of the Beam release process.
- Matching Go SDK containers are published on release.
- Batch usage is well supported, and tested on Flink, Spark, and the Python Portable Runner.
- SDK Tests are also run against Google Cloud Dataflow, but this doesn't indicate reciprocal support.
- The SDK supports Splittable DoFns, Cross Language transforms, and most Beam Model basics.
- Go Modules are now used for dependency management.
- This is a breaking change; see Breaking Changes for resolution.
- Easier path to contribute to the Go SDK: no need to set up a GOPATH.
- Minimum Go version is now Go v1.16.
- See the announcement blogpost for full information once published.
New Features / Improvements
- Projection pushdown in SchemaIO (BEAM-12609).
- Upgrade Flink runner to Flink versions 1.13.2, 1.12.5 and 1.11.4 (BEAM-10955).
Breaking Changes
- Since release 2.30.0, "The AvroCoder changes for BEAM-2303 [changed] the reader/writer from the Avro ReflectDatum* classes to the SpecificDatum* classes" (Java). This default behavior change has been reverted in this release. Use the useReflectApi setting to control it (BEAM-12628).
Deprecations
- Python GBK will stop supporting unbounded PCollections that have global windowing and a default trigger in Beam 2.34. This can be overridden with --allow_unsafe_triggers. (BEAM-9487).
- Python GBK will start requiring safe triggers or the --allow_unsafe_triggers flag starting with Beam 2.34. (BEAM-9487).
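For pipelines that need the grace period, a minimal sketch of passing the flag as a Python pipeline option (the pipeline body here is only illustrative):

```python
# A minimal sketch: opting into unsafe triggers via the pipeline option
# named above. The Create/GroupByKey pipeline body is hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(['--allow_unsafe_triggers'])
with beam.Pipeline(options=options) as p:
    _ = (p
         | beam.Create([('k', 1), ('k', 2)])
         | beam.GroupByKey())
```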
Bugfixes
- UnsupportedOperationException when reading from BigQuery tables and converting TableRows to Beam Rows (Java) (BEAM-12479).
- SDFBoundedSourceReader behaves much slower compared with the original behavior of BoundedSource (Python) (BEAM-12781).
- ORDER BY column not in SELECT crashes (ZetaSQL) (BEAM-12759).
Known Issues
- Spark 2.x users will need to update Spark's Jackson runtime dependencies (spark.jackson.version) to at least version 2.9.2, due to Beam updating its dependencies.
- See a full list of open issues that affect this version.
- Go SDK jobs may produce "Failed to deduce Step from MonitoringInfo" messages following successful job execution. The messages are benign and don't indicate job failure. These are due to not yet handling PCollection metrics.
List of Contributors
According to git shortlog, the following people contributed to the 2.33.0 release. Thank you to all contributors!
Ahmet Altay,
Alex Amato,
Alexey Romanenko,
Andreas Bergmeier,
Andres Rodriguez,
Andrew Pilloud,
Andy Xu,
Ankur Goenka,
anthonyqzhu,
Benjamin Gonzalez,
Bhupinder Sindhwani,
Chamikara Jayalath,
Claire McGinty,
Daniel Mateus Pires,
Daniel Oliveira,
David Huntsperger,
Dylan Hercher,
emily,
Emily Ye,
Etienne Chauchot,
Eugene Nikolaiev,
Heejong Lee,
iindyk,
Iñigo San Jose Visiers,
Ismaël Mejía,
Jack McCluskey,
Jan Lukavský,
Jeff Ruane,
Jeremy Lewi,
KevinGG,
Ke Wu,
Kyle Weaver,
lostluck,
Luke Cwik,
Marwan Tammam,
masahitojp,
Mehdi Drissi,
Minbo Bae,
Ning Kang,
Pablo Estrada,
Pascal Gillet,
Pawas Chhokra,
Reuven Lax,
Ritesh Ghorse,
Robert Bradshaw,
Robert Burke,
Rodrigo Benenson,
Ryan Thompson,
Saksham Gupta,
Sam Rohde,
Sam Whittle,
Sayat,
Sayat Satybaldiyev,
Siyuan Chen,
Slava Chernyak,
Steve Niemitz,
Steven Niemitz,
tvalentyn,
Tyson Hamilton,
Udi Meiri,
vachan-shetty,
Venkatramani Rajgopal,
Yichi Zhang,
zhoufek
Beam 2.32.0 release
We are happy to present the new 2.32.0 release of Apache Beam. This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.32.0, check out the detailed release notes.
Highlights
- The Beam DataFrame API is no longer experimental! We've spent the time since the 2.26.0 preview announcement implementing the most frequently used pandas operations (BEAM-9547), improving documentation and error messages, adding examples, integrating DataFrames with interactive Beam, and of course finding and fixing bugs. Leaving experimental just means that we now have high confidence in the API and recommend its use for production workloads. We will continue to improve the API, guided by your feedback.
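As a minimal, hedged sketch of the now-stable DataFrame API (the GCS paths and the "word" column are hypothetical):

```python
# A minimal sketch of the DataFrame API using pandas-style operations.
# Paths and column names are hypothetical placeholders.
import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as p:
    df = p | read_csv('gs://my-bucket/words*.csv')  # deferred DataFrame
    counts = df.groupby('word').sum()               # pandas-style aggregation
    counts.to_csv('gs://my-bucket/counts')          # written when the pipeline runs
```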
I/Os
- Added ability to use JdbcIO.Write.withResults without statement and preparedStatementSetter. (BEAM-12511)
- Added ability to register URI schemes to use the S3 protocol via FileIO. (BEAM-12435).
- Respect number of shards set in SnowflakeWrite batch mode. (BEAM-12715)
- Java SDK: Update Google Cloud Healthcare IO connectors from using v1beta1 to using the GA version.
New Features / Improvements
- Add support to convert Beam Schema to Avro Schema for JDBC LogicalTypes: VARCHAR, NVARCHAR, LONGVARCHAR, LONGNVARCHAR, DATE, TIME (Java) (BEAM-12385).
- Reading from JDBC source by partitions (Java) (BEAM-12456).
- PubsubIO can now write to a dead-letter topic after a parsing error (Java)(BEAM-12474).
- New append-only option for Elasticsearch sink (Java) (BEAM-12601)
Breaking Changes
- ListShards (with DescribeStreamSummary) is used instead of DescribeStream to list shards in Kinesis streams. Due to this change, as mentioned in AWS documentation, for fine-grained IAM policies it is required to update them to allow calls to ListShards and DescribeStreamSummary APIs. For more information, see Controlling Access to Amazon Kinesis Data Streams (BEAM-12225).
Deprecations
Bugfixes
- Fixed race condition in RabbitMqIO causing duplicate acks (Java) (BEAM-6516)
List of Contributors
According to git shortlog, the following people contributed to the 2.32.0 release. Thank you to all contributors!
Ahmet Altay, Ajo Thomas, Alex Amato, Alexey Romanenko, Alex Koay, allenpradeep, Anant Damle, Andrew Pilloud, Ankur Goenka, Ashwin Ramaswami, Benjamin Gonzalez, BenWhitehead, Blake Williams, Boyuan Zhang, Brian Hulette, Chamikara Jayalath, Daniel Oliveira, Daniel Thevessen, daria-malkova, David Cavazos, David Huntsperger, dennisylyung, Dennis Yung, dmkozh, egalpin, emily, Esun Kim, Gabriel Melo de Paula, Harch Vardhan, Heejong Lee, heidimhurst, hoshimura, Iñigo San Jose Visiers, Ismaël Mejía, Jack McCluskey, Jan Lukavský, Justin King, Kenneth Knowles, KevinGG, Ke Wu, kileys, Kyle Weaver, Luke Cwik, Maksym Skorupskyi, masahitojp, Matthew Ouyang, Matthias Baetens, Matt Rudary, MiguelAnzoWizeline, Miguel Hernandez, Nikita Petunin, Ning Ding, Ning Kang, odidev, Pablo Estrada, Pascal Gillet, rafal.ochyra, raphael.sanamyan, Reuven Lax, Robert Bradshaw, Robert Burke, roger-mike, Ryan McDowell, Sam Rohde, Sam Whittle, Siyuan Chen, Teng Qiu, Tianzi Cai, Tobias Hermann, Tomo Suzuki, tvalentyn, Tyson Hamilton, Udi Meiri, Valentyn Tymofieiev, Vitaly Terentyev, Yichi Zhang, Yifan Mai, yoshiki.obata, Yu Feng, YuqiHuai, yzhang559, Zachary Houfek, zhoufek
Beam 2.31.0 release
We are happy to present the new 2.31.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.31.0, check out the detailed release notes.
Highlights
I/Os
- Fixed bug in ReadFromBigQuery when a RuntimeValueProvider is used as value of table argument (Python) (BEAM-12514).
New Features / Improvements
- CREATE FUNCTION DDL statement added to Calcite SQL syntax. JAR and AGGREGATE are now reserved keywords. (BEAM-12339).
- Flink 1.13 is now supported by the Flink runner (BEAM-12277).
- DatastoreIO: Write and delete operations now follow automatic gradual ramp-up, in line with best practices (Java/Python) (BEAM-12260, BEAM-12272).
- Python TriggerFn has a new may_lose_data method to signal potential data loss. Default behavior assumes safe (necessary for backwards compatibility). See Deprecations for potential impact of overriding this. (BEAM-9487).
Breaking Changes
- Python Row objects are now sensitive to field order, so Row(x=3, y=4) is no longer considered equal to Row(y=4, x=3) (BEAM-11929). See the sketch after this list.
- Kafka Beam SQL tables now ascribe meaning to the LOCATION field; previously it was ignored if provided.
- TopCombineFn disallows compare as its argument (Python) (BEAM-7372).
- Drop support for Flink 1.10 (BEAM-12281).
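A minimal sketch illustrating the Row field-order change described above:

```python
# Field order now matters for Row equality (Python).
import apache_beam as beam

r1 = beam.Row(x=3, y=4)
r2 = beam.Row(y=4, x=3)
# Prior to 2.31.0 these compared equal; they no longer do because the
# field order (x, y) differs from (y, x).
print(r1 == r2)  # False
```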
Deprecations
- Python GBK will stop supporting unbounded PCollections that have global windowing and a default trigger in Beam 2.33. This can be overridden with --allow_unsafe_triggers. (BEAM-9487).
- Python GBK will start requiring safe triggers or the --allow_unsafe_triggers flag starting with Beam 2.33. (BEAM-9487).
Known Issues
- See a full list of issues that affect this version.
List of Contributors
According to git shortlog, the following people contributed to the 2.31.0 release. Thank you to all contributors!
Ahmet Altay, ajo thomas, Alan Myrvold, Alex Amato, Alexey Romanenko,
AlikRodriguez, Anant Damle, Andrew Pilloud, Benjamin Gonzalez, Boyuan Zhang,
Brian Hulette, Chamikara Jayalath, Daniel Oliveira, David Cavazos,
David Huntsperger, David Moravek, Dmytro Kozhevin, dpcollins-google, Emily Ye,
Ernesto Valentino, Evan Galpin, Fernando Morales, Heejong Lee, Ismaël Mejía,
Jan Lukavský, Josias Rico, jrynd, Kenneth Knowles, Ke Wu, kileys, Kyle Weaver,
masahitojp, Matthias Baetens, Maximilian Michels, Milena Bukal,
Nathan J. Mehl, Pablo Estrada, Peter Sobot, Reuven Lax, Robert Bradshaw,
Robert Burke, roger-mike, Sam Rohde, Sam Whittle, Stephan Hoyer, Tom Underhill,
tvalentyn, Uday Singh, Udi Meiri, Vitaly Terentyev, Xinyu Liu, Yichi Zhang,
Yifan Mai, yoshiki.obata, zhoufek
Beam 2.30.0 release
We are happy to present the new 2.30.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.30.0, check out the detailed release notes.
Highlights
- Legacy Read transform (non-SDF based Read) is used by default for non-FnAPI open source runners. Use the use_sdf_read experimental flag to re-enable SDF based Read transforms (BEAM-10670).
- Upgraded vendored gRPC dependency to 1.36.0 (BEAM-11227)
I/Os
- Fixed the issue that WriteToBigQuery with batch file loads does not respect schema update options when there are multiple load jobs (BEAM-11277)
- Fixed the issue that the job didn't properly retry since BigQuery sink swallows HttpErrors when performing streaming inserts (BEAM-12362)
New Features / Improvements
- Added capability to declare resource hints in Java and Python SDKs (BEAM-2085); see the sketch after this list
- Added Spanner IO Performance tests for read and write in Python SDK (BEAM-10029)
- Added support for accessing GCP PubSub Message ordering keys, message IDs and message publish timestamp in Python SDK (BEAM-7819)
- DataFrame API: Added support for collecting DataFrame objects in interactive Beam (BEAM-11855)
- DataFrame API: Added apache_beam.examples.dataframe module (BEAM-12024)
- Upgraded the GCP Libraries BOM version to 20.0.0 (BEAM-11205). For Google Cloud client library versions set by this BOM, see this table
- Added sdkContainerImage flag to (eventually) replace workerHarnessContainerImage (BEAM-12212)
- Added support for Dataflow update when schemas are used (BEAM-12198)
- Fixed the issue that ZipFiles.zipDirectory leaks native JVM memory (BEAM-12220)
- Fixed the issue that Reshuffle.withNumBuckets creates (N*2)-1 buckets (BEAM-12361)
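Relating to the resource hints item above, a minimal sketch for the Python SDK; it assumes the with_resource_hints method and the min_ram hint name are available in this release, and the transform shown is only illustrative:

```python
# A minimal sketch of declaring a resource hint on a Python transform.
# Assumes the with_resource_hints API and the min_ram hint; values are illustrative.
import apache_beam as beam

with beam.Pipeline() as p:
    _ = (p
         | beam.Create([1, 2, 3])
         | beam.Map(lambda x: x * x).with_resource_hints(min_ram='4GB'))
```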
Breaking Changes
- Drop support for Flink 1.8 and 1.9 (BEAM-11948)
- MongoDbIO: Read.withFilter() and Read.withProjection() have been removed; they had been deprecated since Beam 2.12.0 (BEAM-12217)
- RedisIO.readAll() has been removed; it had been deprecated since Beam 2.13.0. Please use RedisIO.readKeyPatterns() for the equivalent functionality (BEAM-12214)
- MqttIO.create() with clientId constructor has been removed; it had been deprecated since Beam 2.13.0 (BEAM-12216)
Known Issues
- See a full list of open issues that affect this version.
List of Contributors
According to git shortlog, the following people contributed to the 2.30.0 release. Thank you to all contributors!
Ahmet Altay, Alex Amato, Alexey Romanenko, Anant Damle, Andreas Bergmeier, Andrew Pilloud, Ankur Goenka,
Anup D, Artur Khanin, Benjamin Gonzalez, Bipin Upadhyaya, Boyuan Zhang, Brian Hulette, Bulat Shakirzyanov,
Chamikara Jayalath, Chun Yang, Daniel Kulp, Daniel Oliveira, David Cavazos, Elliotte Rusty Harold, Emily Ye,
Eric Roshan-Eisner, Evan Galpin, Fabien Caylus, Fernando Morales, Heejong Lee, Iñigo San Jose Visiers,
Isidro Martínez, Ismaël Mejía, Ke Wu, Kenneth Knowles, KevinGG, Kyle Weaver, Ludovic Post, MATTHEW Ouyang (LCL),
Mackenzie Clark, Masato Nakamura, Matthias Baetens, Max, Nicholas Azar, Ning Kang, Pablo Estrada, Patrick McCaffrey,
Quentin Sommer, Reuven Lax, Robert Bradshaw, Robert Burke, Rui Wang, Sam Rohde, Sam Whittle, Shoaib Zafar,
Siyuan Chen, Sruthi Sree Kumar, Steve Niemitz, Sylvain Veyrié, Tomo Suzuki, Udi Meiri, Valentyn Tymofieiev,
Vitaly Terentyev, Wenbing, Xinyu Liu, Yichi Zhang, Yifan Mai, Yueyang Qiu, Yunqing Zhou, ajo thomas, brucearctor,
dmkozh, dpcollins-google, emily, jordan-moore, kileys, lostluck, masahitojp, roger-mike, sychen, tvalentyn,
vachan-shetty, yoshiki.obata
Beam 2.29.0 Release
NOTE: This version was originally released on 2021-04-29 and added to GitHub releases late.
We are happy to present the new 2.29.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.29.0, check out the detailed release notes.
Highlights
- Spark Classic and Portable runners officially support Spark 3 (BEAM-7093).
- Official Java 11 support for most runners (Dataflow, Flink, Spark) (BEAM-2530).
- DataFrame API now supports GroupBy.apply (BEAM-11628).
I/Os
- Added support for S3 filesystem on AWS SDK V2 (Java) (BEAM-7637)
- GCP BigQuery sink (file loads) uses runner determined sharding for unbounded data (BEAM-11772)
- KafkaIO now recognizes the partition property in writing records (BEAM-11806)
- Support for Hadoop configuration on ParquetIO (BEAM-11913)
New Features / Improvements
- DataFrame API now supports pandas 1.2.x (BEAM-11531).
- Multiple DataFrame API bugfixes (BEAM-12071, BEAM-11929)
- DDL supported in SQL transforms (BEAM-11850)
- Upgrade Flink runner to Flink version 1.12.2 (BEAM-11941)
Breaking Changes
- Deterministic coding enforced for GroupByKey and Stateful DoFns. Previously non-deterministic coding was allowed, resulting in keys not properly being grouped in some cases. (BEAM-11719) To restore the old behavior, one can register FakeDeterministicFastPrimitivesCoder with beam.coders.registry.register_fallback_coder(beam.coders.coders.FakeDeterministicFastPrimitivesCoder()) or use the allow_non_deterministic_key_coders pipeline option.
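A minimal sketch of the two opt-outs named above, for pipelines that relied on the old behavior (the keyed data is hypothetical; whether either opt-out is appropriate depends on your pipeline):

```python
# A minimal sketch of restoring pre-2.29.0 key-coding behavior.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Option 1: register the fallback coder named in the release note.
beam.coders.registry.register_fallback_coder(
    beam.coders.coders.FakeDeterministicFastPrimitivesCoder())

# Option 2: pass the pipeline option instead (shown here as a flag).
options = PipelineOptions(['--allow_non_deterministic_key_coders'])
with beam.Pipeline(options=options) as p:
    _ = (p
         | beam.Create([({'a': 1}, 1), ({'a': 1}, 2)])  # dict keys use non-deterministic coding
         | beam.GroupByKey())
```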
Deprecations
- Support for Flink 1.8 and 1.9 will be removed in the next release (2.30.0) (BEAM-11948).
Known Issues
- See a full list of open issues that affect this version.
List of Contributors
According to git shortlog, the following people contributed to the 2.29.0 release. Thank you to all contributors!
Ahmet Altay, Alan Myrvold, Alex Amato, Alexander Chermenin, Alexey Romanenko,
Allen Pradeep Xavier, Amy Wu, Anant Damle, Andreas Bergmeier, Andrei Balici,
Andrew Pilloud, Andy Xu, Ankur Goenka, Bashir Sadjad, Benjamin Gonzalez, Boyuan
Zhang, Brian Hulette, Chamikara Jayalath, Chinmoy Mandayam, Chuck Yang,
dandy10, Daniel Collins, Daniel Oliveira, David Cavazos, David Huntsperger,
David Moravek, Dmytro Kozhevin, Emily Ye, Esun Kim, Evgeniy Belousov, Filip
Popić, Fokko Driesprong, Gris Cuevas, Heejong Lee, Ihor Indyk, Ismaël Mejía,
Jakub-Sadowski, Jan Lukavský, John Edmonds, Juan Sandoval, 谷口恵輔, Kenneth
Jung, Kenneth Knowles, KevinGG, Kiley Sok, Kyle Weaver, MabelYC, Mackenzie
Clark, Masato Nakamura, Milena Bukal, Miltos, Minbo Bae, Miraç Vuslat Başaran,
mynameborat, Nahian-Al Hasan, Nam Bui, Niel Markwick, Niels Basjes, Ning Kang,
Nir Gazit, Pablo Estrada, Ramazan Yapparov, Raphael Sanamyan, Reuven Lax, Rion
Williams, Robert Bradshaw, Robert Burke, Rui Wang, Sam Rohde, Sam Whittle,
Shehzaad Nakhoda, Shehzaad Nakhoda, Siyuan Chen, Sonam Ramchand, Steve Niemitz,
sychen, Sylvain Veyrié, Tim Robertson, Tobias Kaymak, Tomasz Szerszeń, Tomasz
Szerszeń, Tomo Suzuki, Tyson Hamilton, Udi Meiri, Valentyn Tymofieiev, Yichi
Zhang, Yifan Mai, Yixing Zhang, Yoshiki Obata
Beam 2.28.0 release
We are happy to present the new 2.28.0 release of Apache Beam. This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.28.0, check out the
detailed release notes.
Highlights
- Many improvements related to Parquet support (BEAM-11460, BEAM-8202, and BEAM-11526)
- Hash Functions in BeamSQL (BEAM-10074)
- Hash functions in ZetaSQL (BEAM-11624)
- Create ApproximateDistinct using HLL Impl (BEAM-10324)
I/Os
- SpannerIO supports using BigDecimal for Numeric fields (BEAM-11643)
- Add Beam schema support to ParquetIO (BEAM-11526)
- Support ParquetTable Writer (BEAM-8202)
- GCP BigQuery sink (streaming inserts) uses runner determined sharding (BEAM-11408)
- PubSub support types: TIMESTAMP, DATE, TIME, DATETIME (BEAM-11533)
New Features / Improvements
- ParquetIO adds the methods readGenericRecords and readFilesGenericRecords, which can read files with an unknown schema. See PR-13554 and BEAM-11460.
- Added support for thrift in KafkaTableProvider (BEAM-11482)
- Added support for HadoopFormatIO to skip key/value clone (BEAM-11457)
- Support Conversion to GenericRecords in Convert.to transform (BEAM-11571).
- Support writes for Parquet Tables in Beam SQL (BEAM-8202).
- Support reading Parquet files with unknown schema (BEAM-11460)
- Support user configurable Hadoop Configuration flags for ParquetIO (BEAM-11527)
- Expose commit_offset_in_finalize and timestamp_policy to ReadFromKafka (BEAM-11677); see the sketch after this list
- S3 options not provided to boto3 client while using FlinkRunner and Beam worker pool container (BEAM-11799)
- HDFS not deduplicating identical configuration paths (BEAM-11329)
- Hash Functions in BeamSQL (BEAM-10074)
- Create ApproximateDistinct using HLL Impl (BEAM-10324)
- Add Beam schema support to ParquetIO (BEAM-11526)
- Add a Deque Encoder (BEAM-11538)
- Hash functions in ZetaSQL (BEAM-11624)
- Refactor ParquetTableProvider ()
- Add JVM properties to JavaJobServer (BEAM-8344)
- Single source of truth for supported Flink versions ()
- Use metric for Python BigQuery streaming insert API latency logging (BEAM-11018)
- Use metric for Java BigQuery streaming insert API latency logging (BEAM-11032)
- Upgrade Flink runner to Flink versions 1.12.1 and 1.11.3 (BEAM-11697)
- Upgrade Beam base image to use Tensorflow 2.4.1 (BEAM-11762)
- Create Beam GCP BOM (BEAM-11665)
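Relating to the ReadFromKafka item above, a minimal sketch of the newly exposed commit_offset_in_finalize parameter (timestamp_policy is exposed the same way but omitted here); the broker address and topic are hypothetical, and running this cross-language transform requires a Kafka expansion service:

```python
# A minimal sketch of ReadFromKafka with commit_offset_in_finalize.
# Broker and topic are hypothetical placeholders.
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka

with beam.Pipeline() as p:
    _ = (p
         | ReadFromKafka(
             consumer_config={'bootstrap.servers': 'localhost:9092'},
             topics=['my_topic'],
             # Newly exposed: commit consumed offsets when bundles are finalized.
             commit_offset_in_finalize=True)
         | beam.Map(print))
```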
Breaking Changes
- The Java artifacts "beam-sdks-java-io-kinesis", "beam-sdks-java-io-google-cloud-platform", and "beam-sdks-java-extensions-sql-zetasql" now declare a Guava 30.1-jre dependency (it was 25.1-jre in Beam 2.27.0). This new Guava version may introduce dependency conflicts if your project or dependencies rely on removed APIs. If affected, pin an appropriate Guava version via dependencyManagement in Maven or force in Gradle.
List of Contributors
According to git shortlog, the following people contributed to the 2.28.0 release. Thank you to all contributors!
Ahmet Altay, Alex Amato, Alexey Romanenko, Allen Pradeep Xavier, Anant Damle, Artur Khanin,
Boyuan Zhang, Brian Hulette, Chamikara Jayalath, Chris Roth, Costi Ciudatu, Damon Douglas,
Daniel Collins, Daniel Oliveira, David Cavazos, David Huntsperger, Elliotte Rusty Harold,
Emily Ye, Etienne Chauchot, Etta Rapp, Evan Palmer, Eyal, Filip Krakowski, Fokko Driesprong,
Heejong Lee, Ismaël Mejía, janeliulwq, Jan Lukavský, John Edmonds, Jozef Vilcek, Kenneth Knowles,
Ke Wu, kileys, Kyle Weaver, MabelYC, masahitojp, Masato Nakamura, Milena Bukal, Miraç Vuslat Başaran,
Nelson Osacky, Niel Markwick, Ning Kang, omarismail94, Pablo Estrada, Piotr Szuberski,
ramazan-yapparov, Reuven Lax, Reza Rokni, rHermes, Robert Bradshaw, Robert Burke, Robert Gruener,
Romster, Rui Wang, Sam Whittle, shehzaadn-vd, Siyuan Chen, Sonam Ramchand, Tobiasz Kędzierski,
Tomo Suzuki, tszerszen, tvalentyn, Tyson Hamilton, Udi Meiri, Xinbin Huang, Yichi Zhang,
Yifan Mai, yoshiki.obata, Yueyang Qiu, Yusaku Matsuki
Beam 2.27.0 release
We are happy to present the new 2.27.0 release of Apache Beam. This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.27.0, check out the
detailed release notes.
Highlights
- Java 11 Containers are now published with all Beam releases.
- There is a new transform ReadAllFromBigQuery that can receive multiple requests to read data from BigQuery at pipeline runtime. See PR 13170 and BEAM-9650.
I/Os
- ReadFromMongoDB can now be used with MongoDB Atlas (Python) (BEAM-11266.)
- ReadFromMongoDB/WriteToMongoDB will mask password in display_data (Python) (BEAM-11444.)
- There is a new transform ReadAllFromBigQuery that can receive multiple requests to read data from BigQuery at pipeline runtime. See PR 13170 and BEAM-9650.
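A minimal sketch of the new transform, assuming the ReadFromBigQueryRequest helper shown here; the project, dataset, and table names are hypothetical:

```python
# A minimal sketch of ReadAllFromBigQuery handling multiple read requests
# at pipeline runtime. Names are hypothetical placeholders.
import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadAllFromBigQuery, ReadFromBigQueryRequest

with beam.Pipeline() as p:
    requests = p | beam.Create([
        ReadFromBigQueryRequest(table='my_project:my_dataset.table_a'),
        ReadFromBigQueryRequest(query='SELECT * FROM my_dataset.table_b'),
    ])
    rows = requests | ReadAllFromBigQuery()
```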
New Features / Improvements
- Beam modules that depend on Hadoop are now tested for compatibility with Hadoop 3 (BEAM-8569). (Hive/HCatalog pending)
- Publishing Java 11 SDK container images now supported as part of Apache Beam release process. (BEAM-8106)
- Added Cloud Bigtable Provider extension to Beam SQL (BEAM-11173, BEAM-11373)
- Added a schema provider for thrift data (BEAM-11338)
- Added combiner packing pipeline optimization to Dataflow runner. (BEAM-10641)
Breaking Changes
- HBaseIO's hbase-shaded-client dependency should now be provided by users (BEAM-9278).
- The --region flag in amazon-web-services2 was replaced by --awsRegion (BEAM-11331).
List of Contributors
According to git shortlog, the following people contributed to the 2.27.0 release. Thank you to all contributors!
Ahmet Altay, Alan Myrvold, Alex Amato, Alexey Romanenko, Aliraza Nagamia, Allen Pradeep Xavier,
Andrew Pilloud, andreyKaparulin, Ashwin Ramaswami, Boyuan Zhang, Brent Worden, Brian Hulette,
Carlos Marin, Chamikara Jayalath, Costi Ciudatu, Damon Douglas, Daniel Collins,
Daniel Oliveira, David Huntsperger, David Lu, David Moravek, David Wrede,
dennis, Dennis Yung, dpcollins-google, Emily Ye, emkornfield,
Esun Kim, Etienne Chauchot, Eugene Nikolaiev, Frank Zhao, Haizhou Zhao,
Hector Acosta, Heejong Lee, Ilya, Iñigo San Jose Visiers, InigoSJ,
Ismaël Mejía, janeliulwq, Jan Lukavský, Kamil Wasilewski, Kenneth Jung,
Kenneth Knowles, Ke Wu, kileys, Kyle Weaver, lostluck,
Matt Casters, Maximilian Michels, Michal Walenia, Mike Dewar, nehsyc,
Nelson Osacky, Niels Basjes, Ning Kang, Pablo Estrada, palmere-google,
Pawel Pasterz, Piotr Szuberski, purbanow, Reuven Lax, rHermes,
Robert Bradshaw, Robert Burke, Rui Wang, Sam Rohde, Sam Whittle,
Siyuan Chen, Tim Robertson, Tobiasz Kędzierski, tszerszen,
Valentyn Tymofieiev, Tyson Hamilton, Udi Meiri, vachan-shetty, Xinyu Liu,
Yichi Zhang, Yifan Mai, yoshiki.obata, Yueyang Qiu
Beam 2.26.0 release
We are happy to present the new 2.26.0 release of Apache Beam. This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.26.0, check out the
detailed release notes.
Highlights
- Splittable DoFn is now the default for executing the Read transform for Java based runners (Spark with bounded pipelines) in addition to existing runners from the 2.25.0 release (Direct, Flink, Jet, Samza, Twister2). The expected output of the Read transform is unchanged. Users can opt out using --experiments=use_deprecated_read. The Apache Beam community is looking for feedback on this change as the community is planning to make this change permanent with no opt-out. If you run into an issue requiring the opt-out, please send an e-mail to user@beam.apache.org specifically referencing BEAM-10670 in the subject line and explaining why you needed to opt out. (Java) (BEAM-10670)
I/Os
- Java BigQuery streaming inserts now have timeouts enabled by default. Pass --HTTPWriteTimeout=0 to revert to the old behavior. (BEAM-6103)
New Features / Improvements
- Added support for avro payload format in Beam SQL Kafka Table (BEAM-10885)
- Added support for json payload format in Beam SQL Kafka Table (BEAM-10893)
- Added support for protobuf payload format in Beam SQL Kafka Table (BEAM-10892)
- Added support for avro payload format in Beam SQL Pubsub Table (BEAM-5504)
- Added option to disable unnecessary copying between operators in Flink Runner (Java) (BEAM-11146)
- Added CombineFn.setup and CombineFn.teardown to Python SDK. These methods let you initialize the CombineFn's state before any of the other methods of the CombineFn is executed and clean that state up later on. If you are using Dataflow, you need to enable Dataflow Runner V2 by passing --experiments=use_runner_v2 before using this feature. (BEAM-3736)
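A minimal sketch of the new CombineFn lifecycle methods; the "client" set up here is a hypothetical stand-in for expensive per-worker state:

```python
# A minimal sketch of CombineFn.setup/teardown in the Python SDK.
import apache_beam as beam

class SumWithClient(beam.CombineFn):
    def setup(self):
        # Runs before any other CombineFn method; initialize expensive state here.
        self._client = object()  # hypothetical, e.g. open a connection

    def create_accumulator(self):
        return 0

    def add_input(self, accumulator, element):
        return accumulator + element

    def merge_accumulators(self, accumulators):
        return sum(accumulators)

    def extract_output(self, accumulator):
        return accumulator

    def teardown(self):
        # Clean up whatever setup() created.
        self._client = None

with beam.Pipeline() as p:
    _ = (p
         | beam.Create([1, 2, 3])
         | beam.CombineGlobally(SumWithClient()))
```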
Breaking Changes
- BigQuery's DATETIME type now maps to Beam logical type org.apache.beam.sdk.schemas.logicaltypes.SqlTypes.DATETIME
- Pandas 1.x is now required for dataframe operations.
List of Contributors
According to git shortlog, the following people contributed to the 2.26.0 release. Thank you to all contributors!
Abhishek Yadav, AbhiY98, Ahmet Altay, Alan Myrvold, Alex Amato, Alexey Romanenko,
Andrew Pilloud, Ankur Goenka, Boyuan Zhang, Brian Hulette, Chad Dombrova,
Chamikara Jayalath, Curtis "Fjord" Hawthorne, Damon Douglas, dandy10, Daniel Oliveira,
David Cavazos, dennis, Derrick Qin, dpcollins-google, Dylan Hercher, emily, Esun Kim,
Gleb Kanterov, Heejong Lee, Ismaël Mejía, Jan Lukavský, Jean-Baptiste Onofré, Jing,
Jozef Vilcek, Justin White, Kamil Wasilewski, Kenneth Knowles, kileys, Kyle Weaver,
lostluck, Luke Cwik, Mark, Maximilian Michels, Milan Cermak, Mohammad Hossein Sekhavat,
Nelson Osacky, Neville Li, Ning Kang, pabloem, Pablo Estrada, pawelpasterz,
Pawel Pasterz, Piotr Szuberski, PoojaChandak, purbanow, rarokni, Ravi Magham,
Reuben van Ammers, Reuven Lax, Reza Rokni, Robert Bradshaw, Robert Burke,
Romain Manni-Bucau, Rui Wang, rworley-monster, Sam Rohde, Sam Whittle, shollyman,
Simone Primarosa, Siyuan Chen, Steve Niemitz, Steven van Rossum, sychen, Teodor Spæren,
Tim Clemons, Tim Robertson, Tobiasz Kędzierski, tszerszen, Tudor Marian, tvalentyn,
Tyson Hamilton, Udi Meiri, Vasu Gupta, xasm83, Yichi Zhang, yichuan66, Yifan Mai,
yoshiki.obata, Yueyang Qiu, yukihira1992
Beam 2.25.0 release
We are happy to present the new 2.25.0 release of Apache Beam. This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.25.0, check out the
detailed release notes.
Highlights
- Splittable DoFn is now the default for executing the Read transform for Java based runners (Direct, Flink, Jet, Samza, Twister2). The expected output of the Read transform is unchanged. Users can opt out using --experiments=use_deprecated_read. The Apache Beam community is looking for feedback on this change as the community is planning to make this change permanent with no opt-out. If you run into an issue requiring the opt-out, please send an e-mail to user@beam.apache.org specifically referencing BEAM-10670 in the subject line and explaining why you needed to opt out. (Java) (BEAM-10670)
I/Os
- Added cross-language support to Java's KinesisIO, now available in the Python module apache_beam.io.kinesis (BEAM-10138, BEAM-10137).
- Update Snowflake JDBC dependency for SnowflakeIO (BEAM-10864)
- Added cross-language support to Java's SnowflakeIO.Write, now available in the Python module apache_beam.io.snowflake (BEAM-9898).
- Added delete function to Java's ElasticsearchIO#Write. Now, Java's ElasticsearchIO can be used to selectively delete documents using the withIsDeleteFn function (BEAM-5757).
- Java SDK: Added new IO connector for InfluxDB - InfluxDbIO (BEAM-2546).
New Features / Improvements
- Support for repeatable fields in JSON decoder for ReadFromBigQuery added. (Python) (BEAM-10524)
- Added an opt-in, performance-driven runtime type checking system for the Python SDK (BEAM-10549). More details will be in an upcoming blog post.
- Added support for Python 3 type annotations on PTransforms using typed PCollections (BEAM-10258). More details will be in an upcoming blog post.
- Improved the Interactive Beam API where recording streaming jobs now start a long running background recording job. Running ib.show() or ib.collect() samples from the recording (BEAM-10603).
- In Interactive Beam, ib.show() and ib.collect() now have "n" and "duration" as parameters. These mean read only up to "n" elements and up to "duration" seconds of data read from the recording (BEAM-10603). See the sketch after this list.
- Initial preview of Dataframes support. See also the example at apache_beam/examples/wordcount_dataframe.py
- Fixed support for type hints on @ptransform_fn decorators in the Python SDK. (BEAM-4091) This is not enabled by default to preserve backwards compatibility; use the --type_check_additional=ptransform_fn flag to enable. It may be enabled by default in future versions of Beam.
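A minimal sketch of the new parameters, assuming an interactive (notebook-style) environment and treating duration as a number of seconds per the note above:

```python
# A minimal sketch of ib.show()/ib.collect() with "n" and "duration".
import apache_beam as beam
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

p = beam.Pipeline(InteractiveRunner())
words = p | beam.Create(['to', 'be', 'or', 'not', 'to', 'be'])

ib.show(words, n=3)             # read at most 3 elements from the recording
ib.collect(words, duration=60)  # read up to 60 seconds of recorded data
```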
Breaking Changes
- Python 2 and Python 3.5 support dropped (BEAM-10644, BEAM-9372).
- Pandas 1.x allowed. Older versions of Pandas may still be used, but may not be as well tested.
Deprecations
- Python transform ReadFromSnowflake has been moved from apache_beam.io.external.snowflake to apache_beam.io.snowflake. The previous path will be removed in future versions.
Known Issues
- Dataflow streaming timers are once again not strictly time ordered when set earlier mid-bundle, as the fix for BEAM-8543 introduced more severe bugs and has been rolled back.
- Default compressor change breaks Dataflow Python streaming job update compatibility. Please use Python SDK version <= 2.23.0 or > 2.25.0 if job update is critical. (BEAM-11113)
List of Contributors
According to git shortlog, the following people contributed to the 2.25.0 release. Thank you to all contributors!
Ahmet Altay, Alan Myrvold, Aldair Coronel Ruiz, Alexey Romanenko, Andrew Pilloud, Ankur Goenka,
Ayoub ENNASSIRI, Bipin Upadhyaya, Boyuan Zhang, Brian Hulette, Brian Michalski, Chad Dombrova,
Chamikara Jayalath, Damon Douglas, Daniel Oliveira, David Cavazos, David Janicek, Doug Roeper, Eric
Roshan-Eisner, Etta Rapp, Eugene Kirpichov, Filipe Regadas, Heejong Lee, Ihor Indyk, Irvi Firqotul
Aini, Ismaël Mejía, Jan Lukavský, Jayendra, Jiadai Xia, Jithin Sukumar, Jozsef Bartok, Kamil
Gałuszka, Kamil Wasilewski, Kasia Kucharczyk, Kenneth Jung, Kenneth Knowles, Kevin Puthusseri, Kevin
Sijo Puthusseri, KevinGG, Kyle Weaver, Leiyi Zhang, Lourens Naudé, Luke Cwik, Matthew Ouyang,
Maximilian Michels, Michal Walenia, Milan Cermak, Monica Song, Nelson Osacky, Neville Li, Ning Kang,
Pablo Estrada, Piotr Szuberski, Qihang, Rehman, Reuven Lax, Robert Bradshaw, Robert Burke, Rui Wang,
Saavan Nanavati, Sam Bourne, Sam Rohde, Sam Whittle, Sergiy Kolesnikov, Sindy Li, Siyuan Chen, Steve
Niemitz, Terry Xian, Thomas Weise, Tobiasz Kędzierski, Truc Le, Tyson Hamilton, Udi Meiri, Valentyn
Tymofieiev, Yichi Zhang, Yifan Mai, Yueyang Qiu, annaqin418, danielxjd, dennis, dp, fuyuwei,
lostluck, nehsyc, odeshpande, odidev, pulasthi, purbanow, rworley-monster, sclukas77, terryxian78,
tvalentyn, yoshiki.obata
Beam 2.24.0 release
We are happy to present the new 2.24.0 release of Apache Beam. This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.24.0, check out the
detailed release notes.
Highlights
- Apache Beam 2.24.0 is the last release with Python 2 and Python 3.5
support.
I/Os
- New overloads for BigtableIO.Read.withKeyRange() and BigtableIO.Read.withRowFilter() methods that take ValueProvider as a parameter (Java) (BEAM-10283).
- The WriteToBigQuery transform (Python) in Dataflow Batch no longer relies on BigQuerySink by default. It relies on a new, fully-featured transform based on file loads into BigQuery. To revert the behavior to the old implementation, you may use --experiments=use_legacy_bq_sink.
- Add cross-language support to Java's JdbcIO, now available in the Python module apache_beam.io.jdbc (BEAM-10135, BEAM-10136).
- Add support of AWS SDK v2 for KinesisIO.Read (Java) (BEAM-9702).
- Add streaming support to SnowflakeIO in Java SDK (BEAM-9896)
- Support reading and writing to Google Healthcare DICOM APIs in Python SDK (BEAM-10601)
- Add dispositions for SnowflakeIO.write (BEAM-10343)
- Add cross-language support to SnowflakeIO.Read, now available in the Python module apache_beam.io.external.snowflake (BEAM-9897).
New Features / Improvements
- Shared library for simplifying management of large shared objects added to Python SDK. An example use case is sharing a large TF model object across threads (BEAM-10417). See the sketch after this list.
- Dataflow streaming timers are not strictly time ordered when set earlier mid-bundle (BEAM-8543).
- OnTimerContext should not create a new one when processing each element/timer in FnApiDoFnRunner (BEAM-9839)
- Key should be available in @ontimer methods (Spark Runner) (BEAM-9850)
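Relating to the shared-library item above, a minimal sketch; the loaded "model" is a hypothetical stand-in for a large object such as a TF model:

```python
# A minimal sketch of apache_beam.utils.shared for sharing a large object
# across threads on a worker. The model-loading function is hypothetical.
import apache_beam as beam
from apache_beam.utils.shared import Shared

def load_model():
    return {'weights': [0.0] * 1000}  # hypothetical expensive load

class Predict(beam.DoFn):
    def __init__(self, shared_handle):
        self._shared_handle = shared_handle

    def setup(self):
        # acquire() returns one shared instance per worker, built on first use.
        self._model = self._shared_handle.acquire(load_model)

    def process(self, element):
        yield element, len(self._model['weights'])

with beam.Pipeline() as p:
    _ = (p
         | beam.Create([1, 2, 3])
         | beam.ParDo(Predict(Shared())))
```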
Breaking Changes
- WriteToBigQuery transforms now require a GCS location to be provided through either custom_gcs_temp_location in the constructor of WriteToBigQuery or the fallback option --temp_location, or pass method="STREAMING_INSERTS" to WriteToBigQuery (BEAM-6928). See the sketch after this list.
- Python SDK now understands typing.FrozenSet type hints, which are not interchangeable with typing.Set. You may need to update your pipelines if type checking fails. (BEAM-10197)
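Relating to the WriteToBigQuery change above, a minimal sketch; the table, schema, and bucket names are hypothetical:

```python
# A minimal sketch of satisfying the new GCS temp location requirement.
# Table, schema, and bucket names are hypothetical placeholders.
import apache_beam as beam

with beam.Pipeline() as p:
    _ = (p
         | beam.Create([{'word': 'beam', 'count': 1}])
         | beam.io.WriteToBigQuery(
             table='my_project:my_dataset.word_counts',
             schema='word:STRING,count:INTEGER',
             custom_gcs_temp_location='gs://my-bucket/bq-temp'))
    # Alternatively, set --temp_location as a pipeline option or pass
    # method="STREAMING_INSERTS" to WriteToBigQuery.
```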
List of Contributors
According to git shortlog, the following people contributed to the 2.24.0 release. Thank you to all contributors!
adesormi, Ahmet Altay, Alex Amato, Alexey Romanenko, Andrew Pilloud, Ashwin Ramaswami, Borzoo,
Boyuan Zhang, Brian Hulette, Brian M, Bu Sun Kim, Chamikara Jayalath, Colm O hEigeartaigh,
Corvin Deboeser, Damian Gadomski, Damon Douglas, Daniel Oliveira, Dariusz Aniszewski,
davidak09, David Cavazos, David Moravek, David Yan, dhodun, Doug Roeper, Emil Hessman, Emily Ye,
Etienne Chauchot, Etta Rapp, Eugene Kirpichov, fuyuwei, Gleb Kanterov,
Harrison Green, Heejong Lee, Henry Suryawirawan, InigoSJ, Ismaël Mejía, Israel Herraiz,
Jacob Ferriero, Jan Lukavský, Jayendra, jfarr, jhnmora000, Jiadai Xia, JIahao wu, Jie Fan,
Jiyong Jung, Julius Almeida, Kamil Gałuszka, Kamil Wasilewski, Kasia Kucharczyk, Kenneth Knowles,
Kevin Puthusseri, Kyle Weaver, Łukasz Gajowy, Luke Cwik, Mark-Zeng, Maximilian Michels,
Michal Walenia, Niel Markwick, Ning Kang, Pablo Estrada, pawel.urbanowicz, Piotr Szuberski,
Rafi Kamal, rarokni, Rehman Murad Ali, Reuben van Ammers, Reuven Lax, Ricardo Bordon,
Robert Bradshaw, Robert Burke, Robin Qiu, Rui Wang, Saavan Nanavati, sabhyankar, Sam Rohde,
Scott Lukas, Siddhartha Thota, Simone Primarosa, Sławomir Andrian,
Steve Niemitz, Tobiasz Kędzierski, Tomo Suzuki, Tyson Hamilton, Udi Meiri,
Valentyn Tymofieiev, viktorjonsson, Xinyu Liu, Yichi Zhang, Yixing Zhang, yoshiki.obata,
Yueyang Qiu, zijiesong