Releases: facebook/rocksdb
RocksDB 5.18.4
Special release for ARM. (Note: the originally tagged commit for this release was wrong, but the tag has been updated a couple of times. You might need to delete your copy of the tag with git tag -d v5.18.4 to get the new one. See https://git-scm.com/docs/git-tag#_on_re_tagging)
RocksDB v6.6.3
Rocksdb Change Log
6.6.3 (2020-01-24)
Bug Fixes
- Fix a bug that can cause write threads to hang when a slowdown/stall happens and there is a mix of writers with WriteOptions::no_slowdown set/unset.
6.6.2 (2020-01-13)
Bug Fixes
- Fixed a bug where non-L0 compaction input files were not considered to compute the creation_time of new compaction outputs.
6.6.1 (2020-01-02)
Bug Fixes
- Fix a bug in WriteBatchWithIndex::MultiGetFromBatchAndDB, which is called by Transaction::MultiGet, that arises from stale pointer access when the number of keys is > 32.
- Fixed two performance issues related to memtable history trimming. First, a new SuperVersion is now created only if some memtables were actually trimmed. Second, trimming is only scheduled if there is at least one flushed memtable that is kept in memory for the purposes of transaction conflict checking.
- BlobDB no longer updates the SST to blob file mapping upon failed compactions.
- Fix a bug in which a snapshot read through an iterator could be affected by a DeleteRange after the snapshot (#6062).
- Fixed a bug where BlobDB was comparing the ColumnFamilyHandle pointers themselves instead of only the column family IDs when checking whether an API call uses the default column family or not.
- Delete superversions in BackgroundCallPurge.
- Fix use-after-free and double-deleting files in BackgroundCallPurge().
6.6.0 (2019-11-25)
Bug Fixes
- Fix data corruption caused by the output of intra-L0 compaction on an ingested file not being placed in the correct order in L0.
- Fix a data race between Version::GetColumnFamilyMetaData() and Compaction::MarkFilesBeingCompacted() for access to being_compacted (#6056). The current fix acquires the db mutex during Version::GetColumnFamilyMetaData(), which may cause regression.
- Fix a bug in DBIter where the is_blob_ state isn't updated when iterating backward using seek.
- Fix a bug when format_version=3, partitioned filters, and prefix search are used in conjunction. The bug could result in Seek(prefix) returning NotFound for an existing prefix.
- Revert the feature "Merging iterator to avoid child iterator reseek for some cases (#5286)" since it might cause strange results when reseek happens with a different iterator upper bound.
- Fix a bug causing a crash during ingest external file when background compaction causes a severe error (file not found).
- Fix a bug when partitioned filters and prefix search are used in conjunction, ::SeekForPrev could return invalid for an existing prefix. ::SeekForPrev might be called by the user, or internally on ::Prev, or within ::Seek if the return value involves Delete or a Merge operand.
- Fix OnFlushCompleted being fired before the flush result is persisted in the MANIFEST when there is a concurrent flush job. The bug has existed since OnFlushCompleted was introduced in RocksDB 3.8.
- Fixed an sst_dump crash on some plain table SST files.
- Fixed a memory leak in some error cases of opening plain table SST files.
- Fix a bug when a crash happens while calling WriteLevel0TableForRecovery for multiple column families, leading to a column family's log number greater than the first corrupted log number when the DB is being opened in PointInTime recovery mode during the next recovery attempt (#5856).
New Features
- Universal compaction to support options.periodic_compaction_seconds. A full compaction will be triggered if any file is over the threshold.
- GetLiveFilesMetaData and GetColumnFamilyMetaData now expose the file number of SST files as well as the oldest blob file referenced by each SST.
- A batched MultiGet API (DB::MultiGet()) that supports retrieving keys from multiple column families.
- Full and partitioned filters in the block-based table use an improved Bloom filter implementation, enabled with format_version 5 (or above) because previous releases cannot read this filter. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single (full) filter. For example, the new Bloom filter has the same false positive rate at 9.55 bits per key as the old one at 10 bits per key, and a lower false positive rate at 16 bits per key than the old one at 100 bits per key.
- Added AVX2 instructions to USE_SSE builds to accelerate the new Bloom filter and XXH3-based hash function on compatible x86_64 platforms (Haswell and later, ~2014).
- Support options.ttl or options.periodic_compaction_seconds with options.max_open_files = -1. A file's oldest ancestor time and file creation time will be written to the manifest. If it is available, this information will be used instead of creation_time and file_creation_time in table properties.
- Setting options.ttl for universal compaction now has the same meaning as setting periodic_compaction_seconds.
- SstFileMetaData also returns file creation time and oldest ancestor time.
- The sst_dump command line tool recompress command now displays how many blocks were compressed and how many were not, in particular how many were not compressed because the compression ratio was not met (12.5% threshold for GoodCompressionRatio), as seen in the number.block.not_compressed counter stat since version 6.0.0.
- The block cache usage now takes into account the overhead of metadata per each entry. This results in more accurate management of memory. A side effect of this feature is that fewer items fit into a block cache of the same size, which would result in higher cache miss rates. This can be remedied by increasing the block cache size or passing kDontChargeCacheMetadata to its constructor to restore the old behavior.
- When using BlobDB, a mapping is maintained and persisted in the MANIFEST between each SST file and the oldest non-TTL blob file it references.
- db_bench now supports and by default issues non-TTL Puts to BlobDB. TTL Puts can be enabled by specifying a non-zero value for the blob_db_max_ttl_range command line parameter explicitly.
- sst_dump now supports printing BlobDB blob indexes in a human-readable format. This can be enabled by specifying the decode_blob_index flag on the command line.
- A number of new information elements are now exposed through the EventListener interface. For flushes, the file numbers of the new SST file and the oldest blob file referenced by the SST are propagated. For compactions, the level, file number, and the oldest blob file referenced are passed to the client for each compaction input and output file.
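The 12.5% threshold mentioned for the sst_dump recompress output can be pictured with a small standalone check. This is only a sketch under stated assumptions, not RocksDB source: the helper name MeetsCompressionRatio and its integer arithmetic are hypothetical, and it treats a block as worth storing compressed only when compression saves more than one eighth of the raw size.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a RocksDB API): returns true when the compressed
// block is smaller than the raw block by more than 12.5% (one eighth).
bool MeetsCompressionRatio(std::size_t compressed_size, std::size_t raw_size) {
  // raw_size / 8 is the 12.5% saving that must be exceeded.
  return compressed_size < raw_size - (raw_size / 8u);
}
```

Under this model, a 1000-byte block compressed to 874 bytes passes, while one compressed to 875 bytes does not and would be counted as not compressed.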
Public API Change
- RocksDB release 4.1 or older will not be able to open DB generated by the new release. 4.2 was released on Feb 23, 2016.
- TTL Compactions in Level compaction style now initiate successive cascading compactions on a key range so that it reaches the bottom level quickly on TTL expiry. The creation_time table property for compaction output files is now set to the minimum of the creation times of all compaction inputs.
- With FIFO compaction style, options.periodic_compaction_seconds will have the same meaning as options.ttl. Whichever is stricter will be used. When options.periodic_compaction_seconds is left at its default value and options.ttl is at its default of 0, RocksDB will give a default of 30 days.
- Added an API GetCreationTimeOfOldestFile(uint64_t* creation_time) to get the file_creation_time of the oldest SST file in the DB.
- FilterPolicy now exposes additional API to make it possible to choose filter configurations based on context, such as table level and compaction style. See LevelAndStyleCustomFilterPolicy in db_bloom_filter_test.cc. While most existing custom implementations of FilterPolicy should continue to work as before, those wrapping the return of NewBloomFilterPolicy will require overriding the new function GetBuilderWithContext(), because calling GetFilterBitsBuilder() on the FilterPolicy returned by NewBloomFilterPolicy is no longer supported.
- An unlikely usage of FilterPolicy is no longer supported. Calling GetFilterBitsBuilder() on the FilterPolicy returned by NewBloomFilterPolicy will now cause an assertion violation in debug builds, because RocksDB has internally migrated to a more elaborate interface that is expected to evolve further. Custom implementations of FilterPolicy should work as before, except those wrapping the return of NewBloomFilterPolicy, which will require a new override of a protected function in FilterPolicy.
- NewBloomFilterPolicy now takes bits_per_key as a double instead of an int. This permits finer control over the memory vs. accuracy trade-off in the new Bloom filter implementation and should not change source code compatibility.
- The option BackupableDBOptions::max_valid_backups_to_open is now only used when opening BackupEngineReadOnly. When opening a read/write BackupEngine, anything but the default value logs a warning and is treated as the default. This change ensures that backup deletion has proper accounting of shared files to ensure they are deleted when no longer referenced by a backup.
- Deprecate snap_refresh_nanos option.
- Added DisableManualCompaction/EnableManualCompaction to stop and resume manual compaction.
- Add TryCatchUpWithPrimary() to StackableDB in non-LITE mode.
- Add a new Env::LoadEnv() overloaded function to return a shared_ptr to Env.
- Flush sets file name to "(nil)" for OnTableFileCreationCompleted() if the flush does not produce any L0. This can happen if the file is empty and thus deleted by RocksDB.
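The creation_time rule in the list above (a compaction output takes the minimum creation time of its inputs) can be sketched in a few lines. The helper name OutputCreationTime is hypothetical, and returning 0 for an empty input set is an assumption of this sketch rather than documented RocksDB behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a RocksDB API): creation_time for a compaction
// output, taken as the minimum creation time over all compaction inputs.
// Returns 0 when there are no inputs (an assumption of this sketch).
std::uint64_t OutputCreationTime(
    const std::vector<std::uint64_t>& input_creation_times) {
  if (input_creation_times.empty()) {
    return 0;
  }
  return *std::min_element(input_creation_times.begin(),
                           input_creation_times.end());
}
```

Taking the minimum means the output file is treated as being as old as its oldest input, so TTL and periodic compaction do not reset the age of data every time it is rewritten.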
Default Option Changes
- Changed the default value of periodic_compaction_seconds to UINT64_MAX - 1 which allows RocksDB to auto-tune periodic compaction scheduling. When using the default value, periodic compactions are now auto-enabled if a compaction filter is used. A value of 0 will turn off the feature completely.
- Changed the default value of ttl to UINT64_MAX - 1 which allows RocksDB to auto-tune ttl value. When using the default value, TTL will be auto-enabled to 30 days, when the fe...
RocksDB v6.5.3
6.5.3 (2020-01-10)
Bug Fixes
- Fixed two performance issues related to memtable history trimming. First, a new SuperVersion is now created only if some memtables were actually trimmed. Second, trimming is only scheduled if there is at least one flushed memtable that is kept in memory for the purposes of transaction conflict checking.
RocksDB v6.5.2
6.5.2 (2019-11-15)
Bug Fixes
- Fix an assertion failure in MultiGet() when BlockBasedTableOptions::no_block_cache is true and there is no compressed block cache.
- Fix a buffer overrun problem in BlockBasedTable::MultiGet() when compression is enabled and no compressed block cache is configured.
- If a call to BackupEngine::PurgeOldBackups or BackupEngine::DeleteBackup suffered a crash, power failure, or I/O error, files could be left over from old backups that could only be purged with a call to GarbageCollect. Any call to PurgeOldBackups, DeleteBackup, or GarbageCollect should now suffice to purge such files.
6.5.1 (2019-10-16)
Bug Fixes
- Revert the feature "Merging iterator to avoid child iterator reseek for some cases (#5286)" since it might cause strange results when reseek happens with a different iterator upper bound.
- Fix a bug in BlockBasedTableIterator that might return incorrect results when reseek happens with a different iterator upper bound.
- Fix a bug when partitioned filters and prefix search are used in conjunction, ::SeekForPrev could return invalid for an existing prefix. ::SeekForPrev might be called by the user, or internally on ::Prev, or within ::Seek if the return value involves Delete or a Merge operand.
6.5.0 (2019-09-13)
Bug Fixes
- Fixed a number of data races in BlobDB.
- Fix a bug where the compaction snapshot refresh feature is not disabled as advertised when snap_refresh_nanos is set to 0.
- Fix bloom filter lookups by the MultiGet batching API when BlockBasedTableOptions::whole_key_filtering is false, by checking that a key is in the prefix_extractor domain and extracting the prefix before looking up.
- Fix a bug in file ingestion caused by incorrect file number allocation when the number of column families involved in the ingestion exceeds 2.
New Features
- Introduced DBOptions::max_write_batch_group_size_bytes to configure the maximum number of bytes written in a single batch of WAL or memtable writes. The limit is followed when the leader write size is larger than 1/8 of this limit.
- VerifyChecksum() by default will issue readahead. Allow ReadOptions to be passed in to those functions to override the readahead size. For checksum verification before external SST file ingestion, a new option, IngestExternalFileOptions.verify_checksums_readahead_size, is added for this readahead setting.
- When the user sets options.force_consistency_check in RocksDB, a failed check no longer crashes the process; the error is now passed back to the user instead.
- Add an option memtable_insert_hint_per_batch to WriteOptions. If it is true, each WriteBatch will maintain its own insert hints for each memtable in concurrent write. See include/rocksdb/options.h for more details.
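The 1/8 rule stated for max_write_batch_group_size_bytes in the list above can be modeled with a small standalone function. This is a hypothetical sketch, not the RocksDB implementation: the name GroupSizeCap is invented for illustration, and the behavior for small leaders (a tighter cap of the leader size plus 1/8 of the limit) is an assumption, since the note above only states when the configured limit applies.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the write group size cap (not a RocksDB API).
// When the leader's write is larger than 1/8 of the configured limit, the
// configured limit applies. For smaller leaders this sketch ASSUMES a tighter
// cap of leader_size + limit / 8, so a small write is not delayed behind a
// very large group.
std::uint64_t GroupSizeCap(std::uint64_t leader_size,
                           std::uint64_t max_group_bytes) {
  const std::uint64_t min_batch = max_group_bytes / 8;
  if (leader_size <= min_batch) {
    return leader_size + min_batch;  // assumed behavior for small leaders
  }
  return max_group_bytes;  // rule stated in the release note
}
```

With a 1 MiB limit (1048576 bytes), a 200000-byte leader gets the full limit, while a 1000-byte leader is capped at 132072 bytes in this model.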
Public API Change
- Added max_write_buffer_size_to_maintain option to better control memory usage of immutable memtables.
- Added a lightweight API GetCurrentWalFile() to get last live WAL filename and size. Meant to be used as a helper for backup/restore tooling in a larger ecosystem such as MySQL with a MyRocks storage engine.
- The MemTable Bloom filter, when enabled, now always uses cache locality. Options::bloom_locality now only affects the PlainTable SST format.
Performance Improvements
- Improve the speed of the MemTable Bloom filter, reducing the write overhead of enabling it by 1/3 to 1/2, with similar benefit to read performance.
RocksDB v6.4.6
6.4.6 (2019-10-16)
Bug Fixes
- Fix a bug when partitioned filters and prefix search are used in conjunction, ::SeekForPrev could return invalid for an existing prefix. ::SeekForPrev might be called by the user, or internally on ::Prev, or within ::Seek if the return value involves Delete or a Merge operand.
6.4.5 (2019-10-01)
Bug Fixes
- Revert the feature "Merging iterator to avoid child iterator reseek for some cases (#5286)" since it might cause strange results when reseek happens with a different iterator upper bound.
- Fix a bug in BlockBasedTableIterator that might return incorrect results when reseek happens with a different iterator upper bound.
6.4.4 (2019-09-17)
- Fix a bug introduced in 6.3 which could cause wrong results in a corner case when prefix bloom filter is used and the iterator is reseeked.
6.4.2 (2019-09-03)
Bug Fixes
- Fix a bug in file ingestion caused by incorrect file number allocation when the number of column families involved in the ingestion exceeds 2.
6.4.1 (2019-08-20)
Bug Fixes
- Fix a bug where the compaction snapshot refresh feature is not disabled as advertised when snap_refresh_nanos is set to 0.
6.4.0 (2019-07-30)
Default Option Change
- LRUCacheOptions.high_pri_pool_ratio is set to 0.5 (previously 0.0) by default, which means that by default midpoint insertion is enabled. The same change is made for the default value of the high_pri_pool_ratio argument in NewLRUCache(). When the block cache is not explicitly created, the small block cache created by BlockBasedTable will still have this option set to 0.0.
- Change BlockBasedTableOptions.cache_index_and_filter_blocks_with_high_priority's default value from false to true.
Public API Change
- Filter and compression dictionary blocks are now handled similarly to data blocks with regards to the block cache: instead of storing objects in the cache, only the blocks themselves are cached. In addition, filter and compression dictionary blocks (as well as filter partitions) no longer get evicted from the cache when a table is closed.
- Due to the above refactoring, block cache eviction statistics for filter and compression dictionary blocks are temporarily broken. We plan to reintroduce them in a later phase.
- The semantics of the per-block-type block read counts in the performance context now match those of the generic block_read_count.
- Errors related to the retrieval of the compression dictionary are now propagated to the user.
- db_bench adds a "benchmark" stats_history, which prints out the whole stats history.
- Overload GetAllKeyVersions() to support non-default column family.
- Added new APIs ExportColumnFamily() and CreateColumnFamilyWithImport() to support export and import of a Column Family. #3469
- ldb sometimes uses a string-append merge operator if no merge operator is passed in. This is to allow users to print keys from a DB with a merge operator.
- Replaces the old Registrar with ObjectRegistry to allow users to create custom objects from a string. Also adds LoadEnv() to Env.
- Added new overload of GetApproximateSizes which gets SizeApproximationOptions object and returns a Status. The older overloads are redirecting their calls to this new method and no longer assert if the include_flags doesn't have either of INCLUDE_MEMTABLES or INCLUDE_FILES bits set. It's recommended to use the new method only, as it is more type safe and returns a meaningful status in case of errors.
New Features
- Add argument --secondary_path to ldb to open the database as the secondary instance. This would keep the original DB intact.
- Compression dictionary blocks are now prefetched and pinned in the cache (based on the customer's settings) the same way as index and filter blocks.
- Added DBOptions::log_readahead_size which specifies the number of bytes to prefetch when reading the log. This is mostly useful for reading a remotely located log, as it can save the number of round-trips. If 0 (default), then the prefetching is disabled.
- Support loading custom objects in unit tests. In the affected unit tests, RocksDB will create custom Env objects based on environment variable TEST_ENV_URI. Users need to make sure custom object types are properly registered. For example, a static library should expose a RegisterCustomObjects function. By linking the unit test binary with the static library, the unit test can execute this function.
Performance Improvements
- Reduce iterator key comparison for upper/lower bound check.
- Improve performance of row_cache: make reads with newer snapshots than data in an SST file share the same cache key, except in some transaction cases.
- The compression dictionary is no longer copied to a new object upon retrieval.
Bug Fixes
- Fix ingested file and directory not being fsynced.
- Return TryAgain status in place of Corruption when new tail is not visible to TransactionLogIterator.
- Fixed a regression where the fill_cache read option also affected index blocks.
- Fixed an issue where using cache_index_and_filter_blocks==false affected partitions of partitioned indexes/filters as well.
RocksDB v6.3.6
6.3.6 (2019-10-01)
- Revert the feature "Merging iterator to avoid child iterator reseek for some cases (#5286)" since it might cause strange results when reseek happens with a different iterator upper bound.
6.3.5 (2019-09-17)
- Fix a bug introduced in 6.3 which could cause wrong results in a corner case when prefix bloom filter is used and the iterator is reseeked.
6.3.4 (2019-09-03)
Bug Fixes
- Fix a bug in file ingestion caused by incorrect file number allocation when the number of column families involved in the ingestion exceeds 2.
6.3.3 (2019-08-20)
Bug Fixes
- Fix a bug where the compaction snapshot refresh feature is not disabled as advertised when snap_refresh_nanos is set to 0.
6.3.2 (2019-08-15)
Public API Change
- The semantics of the per-block-type block read counts in the performance context now match those of the generic block_read_count.
Bug Fixes
- Fixed a regression where the fill_cache read option also affected index blocks.
- Fixed an issue where using cache_index_and_filter_blocks==false affected partitions of partitioned indexes as well.
6.3.1 (2019-07-24)
Bug Fixes
- Fix auto rolling bug introduced in 6.3.0, which causes segfault if log file creation fails.
6.3.0 (2019-06-18)
Public API Change
- Now DB::Close() will return Aborted() error when there is unreleased snapshot. Users can retry after all snapshots are released.
- Index blocks are now handled similarly to data blocks with regards to the block cache: instead of storing objects in the cache, only the blocks themselves are cached. In addition, index blocks no longer get evicted from the cache when a table is closed, can now use the compressed block cache (if any), and can be shared among multiple table readers.
- Partitions of partitioned indexes no longer affect the read amplification statistics.
- Due to the above refactoring, block cache eviction statistics for indexes are temporarily broken. We plan to reintroduce them in a later phase.
- options.keep_log_file_num will be enforced strictly all the time. File names of all log files will be tracked, which may take a significant amount of memory if options.keep_log_file_num is large and either of options.max_log_file_size or options.log_file_time_to_roll is set.
- Add initial support for Get/Put with user timestamps. Users can specify timestamps via ReadOptions and WriteOptions when calling DB::Get and DB::Put.
- Accessing a partition of a partitioned filter or index through a pinned reference is no longer considered a cache hit.
- Add C bindings for secondary instance, i.e. DBImplSecondary.
- Rate limited deletion of WALs is only enabled if DBOptions::wal_dir is not set, or explicitly set to db_name passed to DB::Open and DBOptions::db_paths is empty, or same as db_paths[0].path
New Features
- Add an option snap_refresh_nanos (default to 0) to periodically refresh the snapshot list in compaction jobs. Assign to 0 to disable the feature.
- Add an option unordered_write which trades snapshot guarantees for higher write throughput. When used with WRITE_PREPARED transactions with two_write_queues=true, it offers higher throughput with no compromise on guarantees.
- Allow DBImplSecondary to remove memtables with obsolete data after replaying MANIFEST and WAL.
- Add an option failed_move_fall_back_to_copy (default is true) for external SST ingestion. When move_files is true and hard link fails, ingestion falls back to copy if failed_move_fall_back_to_copy is true. Otherwise, ingestion reports an error.
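The failed_move_fall_back_to_copy behavior described above amounts to a small decision table, sketched here as a standalone function. The enum IngestOutcome and the function DecideIngest are hypothetical names for illustration and are not part of RocksDB; the sketch also assumes that ingestion copies the file when move_files is false.

```cpp
#include <cassert>

// Hypothetical outcome type (not a RocksDB type).
enum class IngestOutcome { kLinked, kCopied, kError };

// Hypothetical decision helper mirroring the option description above.
IngestOutcome DecideIngest(bool move_files, bool hard_link_ok,
                           bool failed_move_fall_back_to_copy) {
  if (!move_files) {
    return IngestOutcome::kCopied;  // assumed: no move requested, so copy
  }
  if (hard_link_ok) {
    return IngestOutcome::kLinked;  // move via hard link succeeded
  }
  // Hard link failed: copy only if the fallback option allows it.
  return failed_move_fall_back_to_copy ? IngestOutcome::kCopied
                                       : IngestOutcome::kError;
}
```

Setting the option to false is useful when a silent copy would be unacceptable, for example when the caller relies on move semantics to avoid doubling disk usage.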
Performance Improvements
- Reduce binary search when the iterator reseeks into the same data block.
- DBIter::Next() can skip user key checking if previous entry's seqnum is 0.
- Merging iterator to avoid child iterator reseek for some cases
- Log Writer will flush after finishing the whole record, rather than a fragment.
- Lower MultiGet batching API latency by reading data blocks from disk in parallel
General Improvements
- Added new status code kColumnFamilyDropped to distinguish between Column Family Dropped and DB Shutdown in progress.
- Improve ColumnFamilyOptions validation when creating a new column family.
Bug Fixes
- Fix a bug in WAL replay of secondary instance by skipping write batches with older sequence numbers than the current last sequence number.
- Fix flush's/compaction's merge processing logic which allowed Puts covered by range tombstones to reappear. Note Puts may exist even if the user only ever called Merge() due to an internal conversion during compaction to the bottommost level.
- Fix/improve memtable earliest sequence assignment and WAL replay so that WAL entries of unflushed column families will not be skipped after replaying the MANIFEST and increasing db sequence due to another flushed/compacted column family.
- Fix a bug caused by secondary not skipping the beginning of new MANIFEST.
- On DB open, delete WAL trash files left behind in wal_dir
RocksDB v6.2.4
6.2.4 (9/18/2019)
Bug Fixes
- Disable snap_refresh_nanos by default. The feature is to be deprecated in the next release.
6.2.3 (9/3/2019)
Bug Fixes
- Fix a bug in file ingestion caused by incorrect file number allocation when the number of column families involved in the ingestion exceeds 2.
RocksDB v6.2.2
6.2.2 (6/7/2019)
Bug Fixes
- Disable dynamic extension support by default for CMake.
6.2.1 (6/4/2019)
Bug Fixes
- Fix flush's/compaction's merge processing logic which allowed Puts covered by range tombstones to reappear. Note Puts may exist even if the user only ever called Merge() due to an internal conversion during compaction to the bottommost level.
6.2.0 (4/30/2019)
New Features
- Add an option strict_bytes_per_sync that causes a file-writing thread to block rather than exceed the limit on bytes pending writeback specified by bytes_per_sync or wal_bytes_per_sync.
- Improve range scan performance by avoiding per-key upper bound check in BlockBasedTableIterator.
- Introduce Periodic Compaction for Level style compaction. Files are re-compacted periodically and put in the same level.
- Block-based table index now contains exact highest key in the file, rather than an upper bound. This may improve Get() and iterator Seek() performance in some situations, especially when direct IO is enabled and block cache is disabled. A setting BlockBasedTableOptions::index_shortening is introduced to control this behavior. Set it to kShortenSeparatorsAndSuccessor to get the old behavior.
- When reading from option file/string/map, customized envs can be filled according to object registry.
- Add an option snap_refresh_nanos (default to 0.5s) to periodically refresh the snapshot list in compaction jobs. Assign to 0 to disable the feature.
- Improve range scan performance when using explicit user readahead by not creating new table readers for every iterator.
Public API Change
- Change the behavior of OptimizeForPointLookup(): move away from hash-based block-based-table index, and use whole key memtable filtering.
- Change the behavior of OptimizeForSmallDb(): use a 16MB block cache, put index and filter blocks into it, and cost the memtable size to it. DBOptions.OptimizeForSmallDb() and ColumnFamilyOptions.OptimizeForSmallDb() start to take an optional cache object.
- Added BottommostLevelCompaction::kForceOptimized to avoid double compacting newly compacted files in the bottommost level compaction of manual compaction. Note this option may prevent the manual compaction from producing a single file in the bottommost level.
Bug Fixes
- Adjust WriteBufferManager's dummy entry size to block cache from 1MB to 256KB.
- Fix a race condition between WritePrepared::Get and ::Put with duplicate keys.
- Fix crash when memtable prefix bloom is enabled and read/write a key out of domain of prefix extractor.
- Close a WAL file before another thread deletes it.
- Fix an assertion failure IsFlushPending() == true caused by one bg thread releasing the db mutex in ~ColumnFamilyData and another thread clearing flush_requested_ flag.
RocksDB v6.1.2
6.1.2 (6/4/2019)
Bug Fixes
- Fix flush's/compaction's merge processing logic which allowed Puts covered by range tombstones to reappear. Note Puts may exist even if the user only ever called Merge() due to an internal conversion during compaction to the bottommost level.