Releases: facebook/rocksdb
Releases · facebook/rocksdb
RocksDB 6.29.3
Rocksdb Change Log
6.29.3 (2022-02-17)
Bug Fixes
- Fix a data loss bug for 2PC write-committed transaction caused by concurrent transaction commit and memtable switch (#9571).
6.29.2 (2022-02-15)
Performance Improvements
- DisableManualCompaction() doesn't have to wait scheduled manual compaction to be executed in thread-pool to cancel the job.
6.29.1 (2022-01-31)
Bug Fixes
- Fixed a major bug in which batched MultiGet could return old values for keys deleted by DeleteRange when memtable Bloom filter is enabled (memtable_prefix_bloom_size_ratio > 0). (The fix includes a substantial MultiGet performance improvement in the unusual case of both memtable_whole_key_filtering and prefix_extractor.)
6.29.0 (2022-01-21)
Note: The next release will be major release 7.0. See #9390 for more info.
Public API change
- Added values to
TraceFilterType
:kTraceFilterIteratorSeek
,kTraceFilterIteratorSeekForPrev
, andkTraceFilterMultiGet
. They can be set inTraceOptions
to filter out the operation types after which they are named. - Added
TraceOptions::preserve_write_order
. When enabled it guarantees write records are traced in the same order they are logged to WAL and applied to the DB. By default it is disabled (false) to match the legacy behavior and prevent regression. - Made the Env class extend the Customizable class. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
Options::OldDefaults
is marked deprecated, as it is no longer maintained.- Add ObjectLibrary::AddFactory and ObjectLibrary::PatternEntry classes. This method and associated class are the preferred mechanism for registering factories with the ObjectLibrary going forward. The ObjectLibrary::Register method, which uses regular expressions and may be problematic, is deprecated and will be in a future release.
- Changed
BlockBasedTableOptions::block_size
fromsize_t
touint64_t
. - Added API warning against using
Iterator::Refresh()
together withDB::DeleteRange()
, which are incompatible and have always risked causing the refreshed iterator to return incorrect results.
Behavior Changes
DB::DestroyColumnFamilyHandle()
will return Status::InvalidArgument() if called withDB::DefaultColumnFamily()
.- On 32-bit platforms, mmap reads are no longer quietly disabled, just discouraged.
New Features
- Added
Options::DisableExtraChecks()
that can be used to improve peak write performance by disabling checks that should not be necessary in the absence of software logic errors or CPU+memory hardware errors. (Default options are slowly moving toward some performance overheads for extra correctness checking.)
Performance Improvements
- Improved read performance when a prefix extractor is used (Seek, Get, MultiGet), even compared to version 6.25 baseline (see bug fix below), by optimizing the common case of prefix extractor compatible with table file and unchanging.
Bug Fixes
- Fix a bug that FlushMemTable may return ok even flush not succeed.
- Fixed a bug of Sync() and Fsync() not using
fcntl(F_FULLFSYNC)
on OS X and iOS. - Fixed a significant performance regression in version 6.26 when a prefix extractor is used on the read path (Seek, Get, MultiGet). (Excessive time was spent in SliceTransform::AsString().)
New Features
- Added RocksJava support for MacOS universal binary (ARM+x86)
RocksDB 6.28.2
6.28.2 (2022-01-31)
Bug Fixes
- Fixed a major bug in which batched MultiGet could return old values for keys deleted by DeleteRange when memtable Bloom filter is enabled (memtable_prefix_bloom_size_ratio > 0). (The fix includes a substantial MultiGet performance improvement in the unusual case of both memtable_whole_key_filtering and prefix_extractor.)
6.28.1 (2022-01-10)
Bug Fixes
- Fixed compilation errors on newer compiler, e.g. clang-12
6.28.0 (2021-12-17)
New Features
- Introduced 'CommitWithTimestamp' as a new tag. Currently, there is no API for user to trigger a write with this tag to the WAL. This is part of the efforts to support write-commited transactions with user-defined timestamps.
Bug Fixes
- Fixed a bug in rocksdb automatic implicit prefetching which got broken because of new feature adaptive_readahead and internal prefetching got disabled when iterator moves from one file to next.
- Fixed a bug in TableOptions.prepopulate_block_cache which causes segmentation fault when used with TableOptions.partition_filters = true and TableOptions.cache_index_and_filter_blocks = true.
- Fixed a bug affecting custom memtable factories which are not registered with the
ObjectRegistry
. The bug could result in failure to save the OPTIONS file. - Fixed a bug causing two duplicate entries to be appended to a file opened in non-direct mode and tracked by
FaultInjectionTestFS
. - Fixed a bug in TableOptions.prepopulate_block_cache to support block-based filters also.
- Block cache keys no longer use
FSRandomAccessFile::GetUniqueId()
(previously used when available), so a filesystem recycling unique ids can no longer lead to incorrect result or crash (#7405). For files generated by RocksDB >= 6.24, the cache keys are stable across DB::Open and DB directory move / copy / import / export / migration, etc. Although collisions are still theoretically possible, they are (a) impossible in many common cases, (b) not dependent on environmental factors, and (c) much less likely than a CPU miscalculation while executing RocksDB.
Behavior Changes
- MemTableList::TrimHistory now use allocated bytes when max_write_buffer_size_to_maintain > 0(default in TrasactionDB, introduced in PR#5022) Fix #8371.
Public API change
- Extend WriteBatch::AssignTimestamp and AssignTimestamps API so that both functions can accept an optional
checker
argument that performs additional checking on timestamp sizes. - Introduce a new EventListener callback that will be called upon the end of automatic error recovery.
Performance Improvements
- Replaced map property
TableProperties::properties_offsets
with uint64_t propertyexternal_sst_file_global_seqno_offset
to save table properties's memory. - Block cache accesses are faster by RocksDB using cache keys of fixed size (16 bytes).
Java API Changes
- Removed Java API
TableProperties.getPropertiesOffsets()
as it exposed internal details to external users.
RocksDB 6.27.3
6.27.3 (2021-12-10)
Bug Fixes
- Fixed a bug in TableOptions.prepopulate_block_cache which causes segmentation fault when used with TableOptions.partition_filters = true and TableOptions.cache_index_and_filter_blocks = true.
- Fixed a bug affecting custom memtable factories which are not registered with the
ObjectRegistry
. The bug could result in failure to save the OPTIONS file.
6.27.2 (2021-12-01)
Bug Fixes
- Fixed a bug in rocksdb automatic implicit prefetching which got broken because of new feature adaptive_readahead and internal prefetching got disabled when iterator moves from one file to next.
6.27.1 (2021-11-29)
Bug Fixes
- Fixed a bug that could, with WAL enabled, cause backups, checkpoints, and
GetSortedWalFiles()
to fail randomly with an error likeIO error: 001234.log: No such file or directory
6.27.0 (2021-11-19)
New Features
- Added new ChecksumType kXXH3 which is faster than kCRC32c on almost all x86_64 hardware.
- Added a new online consistency check for BlobDB which validates that the number/total size of garbage blobs does not exceed the number/total size of all blobs in any given blob file.
- Provided support for tracking per-sst user-defined timestamp information in MANIFEST.
- Added new option "adaptive_readahead" in ReadOptions. For iterators, RocksDB does auto-readahead on noticing sequential reads and by enabling this option, readahead_size of current file (if reads are sequential) will be carried forward to next file instead of starting from the scratch at each level (except L0 level files). If reads are not sequential it will fall back to 8KB. This option is applicable only for RocksDB internal prefetch buffer and isn't supported with underlying file system prefetching.
- Added the read count and read bytes related stats to Statistics for tiered storage hot, warm, and cold file reads.
- Added an option to dynamically charge an updating estimated memory usage of block-based table building to block cache if block cache available. It currently only includes charging memory usage of constructing (new) Bloom Filter and Ribbon Filter to block cache. To enable this feature, set
BlockBasedTableOptions::reserve_table_builder_memory = true
. - Add a new API OnIOError in listener.h that notifies listeners when an IO error occurs during FileSystem operation along with filename, status etc.
- Added compaction readahead support for blob files to the integrated BlobDB implementation, which can improve compaction performance when the database resides on higher-latency storage like HDDs or remote filesystems. Readahead can be configured using the column family option
blob_compaction_readahead_size
.
Bug Fixes
- Prevent a
CompactRange()
withCompactRangeOptions::change_level == true
from possibly causing corruption to the LSM state (overlapping files within a level) when run in parallel with another manual compaction. Note that settingforce_consistency_checks == true
(the default) would cause the DB to enter read-only mode in this scenario and returnStatus::Corruption
, rather than committing any corruption. - Fixed a bug in CompactionIterator when write-prepared transaction is used. A released earliest write conflict snapshot may cause assertion failure in dbg mode and unexpected key in opt mode.
- Fix ticker WRITE_WITH_WAL("rocksdb.write.wal"), this bug is caused by a bad extra
RecordTick(stats_, WRITE_WITH_WAL)
(at 2 place), this fix remove the extraRecordTick
s and fix the corresponding test case. - EventListener::OnTableFileCreated was previously called with OK status and file_size==0 in cases of no SST file contents written (because there was no content to add) and the empty file deleted before calling the listener. Now the status is Aborted.
- Fixed a bug in CompactionIterator when write-preared transaction is used. Releasing earliest_snapshot during compaction may cause a SingleDelete to be output after a PUT of the same user key whose seq has been zeroed.
- Added input sanitization on negative bytes passed into
GenericRateLimiter::Request
. - Fixed an assertion failure in CompactionIterator when write-prepared transaction is used. We prove that certain operations can lead to a Delete being followed by a SingleDelete (same user key). We can drop the SingleDelete.
- Fixed a bug of timestamp-based GC which can cause all versions of a key under full_history_ts_low to be dropped. This bug will be triggered when some of the ikeys' timestamps are lower than full_history_ts_low, while others are newer.
- In some cases outside of the DB read and compaction paths, SST block checksums are now checked where they were not before.
- Explicitly check for and disallow the
BlockBasedTableOptions
if insertion into one of {block_cache
,block_cache_compressed
,persistent_cache
} can show up in another of these. (RocksDB expects to be able to use the same key for different physical data among tiers.) - Users who configured a dedicated thread pool for bottommost compactions by explicitly adding threads to the
Env::Priority::BOTTOM
pool will no longer see RocksDB schedule automatic compactions exceeding the DB's compaction concurrency limit. For details on per-DB compaction concurrency limit, see API docs ofmax_background_compactions
andmax_background_jobs
. - Fixed a bug of background flush thread picking more memtables to flush and prematurely advancing column family's log_number.
- Fixed an assertion failure in ManifestTailer.
Behavior Changes
NUM_FILES_IN_SINGLE_COMPACTION
was only counting the first input level files, now it's including all input files.TransactionUtil::CheckKeyForConflicts
can also perform conflict-checking based on user-defined timestamps in addition to sequence numbers.- Removed
GenericRateLimiter
's minimum refill bytes per period previously enforced.
Public API change
- When options.ttl is used with leveled compaction with compactinon priority kMinOverlappingRatio, files exceeding half of TTL value will be prioritized more, so that by the time TTL is reached, fewer extra compactions will be scheduled to clear them up. At the same time, when compacting files with data older than half of TTL, output files may be cut off based on those files' boundaries, in order for the early TTL compaction to work properly.
- Made FileSystem extend the Customizable class and added a CreateFromString method. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
- Clarified in API comments that RocksDB is not exception safe for callbacks and custom extensions. An exception propagating into RocksDB can lead to undefined behavior, including data loss, unreported corruption, deadlocks, and more.
- Marked
WriteBufferManager
asfinal
because it is not intended for extension. - Removed unimportant implementation details from table_properties.h
- Add API
FSDirectory::FsyncWithDirOptions()
, which provides extra information like directory fsync reason inDirFsyncOptions
. File system like btrfs is using that to skip directory fsync for creating a new file, or when renaming a file, fsync the target file instead of the directory, which improves theDB::Open()
speed by ~20%. DB::Open()
is not going be blocked by obsolete file purge ifDBOptions::avoid_unnecessary_blocking_io
is set to true.- In builds where glibc provides
gettid()
, info log ("LOG" file) lines now print a system-wide thread ID fromgettid()
instead of the process-localpthread_self()
. For all users, the thread ID format is changed from hexadecimal to decimal integer. - In builds where glibc provides
pthread_setname_np()
, the background thread names no longer contain an ID suffix. For example, "rocksdb:bottom7" (and all other threads in theEnv::Priority::BOTTOM
pool) are now named "rocksdb:bottom". Previously large thread pools could breach the name size limit (e.g., naming "rocksdb:bottom10" would fail). - Deprecating
ReadOptions::iter_start_seqnum
andDBOptions::preserve_deletes
, please try using user defined timestamp feature instead. The options will be removed in a future release, currently it logs a warning message when using.
Performance Improvements
- Released some memory related to filter construction earlier in
BlockBasedTableBuilder
forFullFilter
andPartitionedFilter
case (#9070)
RocksDB 6.26.1
6.26.1 (2021-11-18)
Bug Fixes
- Fix builds for some platforms.
6.26.0 (2021-10-20)
Bug Fixes
- Fixes a bug in directed IO mode when calling MultiGet() for blobs in the same blob file. The bug is caused by not sorting the blob read requests by file offsets.
- Fix the incorrect disabling of SST rate limited deletion when the WAL and DB are in different directories. Only WAL rate limited deletion should be disabled if its in a different directory.
- Fix
DisableManualCompaction()
to cancel compactions even when they are waiting on automatic compactions to drain due toCompactRangeOptions::exclusive_manual_compactions == true
. - Fix contract of
Env::ReopenWritableFile()
andFileSystem::ReopenWritableFile()
to specify any existing file must not be deleted or truncated. - Fixed bug in calls to
IngestExternalFiles()
with files for multiple column families. The bug could have introduced a delay in ingested file keys becoming visible afterIngestExternalFiles()
returned. Furthermore, mutations to ingested file keys while they were invisible could have been dropped (not necessarily immediately). - Fixed a possible race condition impacting users of
WriteBufferManager
who constructed it withallow_stall == true
. The race condition led to undefined behavior (in our experience, typically a process crash). - Fixed a bug where stalled writes would remain stalled forever after the user calls
WriteBufferManager::SetBufferSize()
withnew_size == 0
to dynamically disable memory limiting. - Make
DB::close()
thread-safe. - Fix a bug in atomic flush where one bg flush thread will wait forever for a preceding bg flush thread to commit its result to MANIFEST but encounters an error which is mapped to a soft error (DB not stopped).
New Features
- Print information about blob files when using "ldb list_live_files_metadata"
- Provided support for SingleDelete with user defined timestamp.
- Experimental new function DB::GetLiveFilesStorageInfo offers essentially a unified version of other functions like GetLiveFiles, GetLiveFilesChecksumInfo, and GetSortedWalFiles. Checkpoints and backups could show small behavioral changes and/or improved performance as they now use this new API.
- Add remote compaction read/write bytes statistics:
REMOTE_COMPACT_READ_BYTES
,REMOTE_COMPACT_WRITE_BYTES
. - Introduce an experimental feature to dump out the blocks from block cache and insert them to the secondary cache to reduce the cache warmup time (e.g., used while migrating DB instance). More information are in
class CacheDumper
andCacheDumpedLoader
atrocksdb/utilities/cache_dump_load.h
Note that, this feature is subject to the potential change in the future, it is still experimental. - Introduced a new BlobDB configuration option
blob_garbage_collection_force_threshold
, which can be used to trigger compactions targeting the SST files which reference the oldest blob files when the ratio of garbage in those blob files meets or exceeds the specified threshold. This can reduce space amplification with skewed workloads where the affected SST files might not otherwise get picked up for compaction. - Added EXPERIMENTAL support for table file (SST) unique identifiers that are stable and universally unique, available with new function
GetUniqueIdFromTableProperties
. Only SST files from RocksDB >= 6.24 support unique IDs. - Added
GetMapProperty()
support for "rocksdb.dbstats" (DB::Properties::kDBStats
). As a map property, it includes DB-level internal stats accumulated over the DB's lifetime, such as user write related stats and uptime.
Public API change
- Made SystemClock extend the Customizable class and added a CreateFromString method. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
- Made SliceTransform extend the Customizable class and added a CreateFromString method. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method. The Capped and Prefixed transform classes return a short name (no length); use GetId for the fully qualified name.
- Made FileChecksumGenFactory, SstPartitionerFactory, TablePropertiesCollectorFactory, and WalFilter extend the Customizable class and added a CreateFromString method.
- Some fields of SstFileMetaData are deprecated for compatibility with new base class FileStorageInfo.
- Add
file_temperature
toIngestExternalFileArg
such that when ingesting SST files, we are able to indicate the temperature of the this batch of files. - If
DB::Close()
failed with a non aborted status, callingDB::Close()
again will return the original status instead of Status::OK. - Add CacheTier to advanced_options.h to describe the cache tier we used. Add a
lowest_used_cache_tier
option toDBOptions
(immutable) and pass it to BlockBasedTableReader. By default it isCacheTier::kNonVolatileBlockTier
, which means, we always use both block cache (kVolatileTier) and secondary cache (kNonVolatileBlockTier). By set it toCacheTier::kVolatileTier
, the DB will not use the secondary cache. - Even when options.max_compaction_bytes is hit, compaction output files are only cut when it aligns with grandparent files' boundaries. options.max_compaction_bytes could be slightly violated with the change, but the violation is no more than one target SST file size, which is usually much smaller.
Performance Improvements
Java API Changes
- Add Java API bindings for new integrated BlobDB options
keyMayExist()
supports ByteBuffer.- Fix multiget throwing Null Pointer Exception for num of keys > 70k (#8039).
RocksDB 6.26.0
6.26.0 (2021-10-20)
Bug Fixes
- Fixes a bug in directed IO mode when calling MultiGet() for blobs in the same blob file. The bug is caused by not sorting the blob read requests by file offsets.
- Fix the incorrect disabling of SST rate limited deletion when the WAL and DB are in different directories. Only WAL rate limited deletion should be disabled if its in a different directory.
- Fix
DisableManualCompaction()
to cancel compactions even when they are waiting on automatic compactions to drain due toCompactRangeOptions::exclusive_manual_compactions == true
. - Fix contract of
Env::ReopenWritableFile()
andFileSystem::ReopenWritableFile()
to specify any existing file must not be deleted or truncated. - Fixed bug in calls to
IngestExternalFiles()
with files for multiple column families. The bug could have introduced a delay in ingested file keys becoming visible afterIngestExternalFiles()
returned. Furthermore, mutations to ingested file keys while they were invisible could have been dropped (not necessarily immediately). - Fixed a possible race condition impacting users of
WriteBufferManager
who constructed it withallow_stall == true
. The race condition led to undefined behavior (in our experience, typically a process crash). - Fixed a bug where stalled writes would remain stalled forever after the user calls
WriteBufferManager::SetBufferSize()
withnew_size == 0
to dynamically disable memory limiting. - Make
DB::close()
thread-safe. - Fix a bug in atomic flush where one bg flush thread will wait forever for a preceding bg flush thread to commit its result to MANIFEST but encounters an error which is mapped to a soft error (DB not stopped).
New Features
- Print information about blob files when using "ldb list_live_files_metadata"
- Provided support for SingleDelete with user defined timestamp.
- Experimental new function DB::GetLiveFilesStorageInfo offers essentially a unified version of other functions like GetLiveFiles, GetLiveFilesChecksumInfo, and GetSortedWalFiles. Checkpoints and backups could show small behavioral changes and/or improved performance as they now use this new API.
- Add remote compaction read/write bytes statistics:
REMOTE_COMPACT_READ_BYTES
,REMOTE_COMPACT_WRITE_BYTES
. - Introduce an experimental feature to dump out the blocks from block cache and insert them to the secondary cache to reduce the cache warmup time (e.g., used while migrating DB instance). More information are in
class CacheDumper
andCacheDumpedLoader
atrocksdb/utilities/cache_dump_load.h
Note that, this feature is subject to the potential change in the future, it is still experimental. - Introduced a new BlobDB configuration option
blob_garbage_collection_force_threshold
, which can be used to trigger compactions targeting the SST files which reference the oldest blob files when the ratio of garbage in those blob files meets or exceeds the specified threshold. This can reduce space amplification with skewed workloads where the affected SST files might not otherwise get picked up for compaction. - Added EXPERIMENTAL support for table file (SST) unique identifiers that are stable and universally unique, available with new function
GetUniqueIdFromTableProperties
. Only SST files from RocksDB >= 6.24 support unique IDs. - Added
GetMapProperty()
support for "rocksdb.dbstats" (DB::Properties::kDBStats
). As a map property, it includes DB-level internal stats accumulated over the DB's lifetime, such as user write related stats and uptime.
Public API change
- Made SystemClock extend the Customizable class and added a CreateFromString method. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
- Made SliceTransform extend the Customizable class and added a CreateFromString method. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method. The Capped and Prefixed transform classes return a short name (no length); use GetId for the fully qualified name.
- Made FileChecksumGenFactory, SstPartitionerFactory, TablePropertiesCollectorFactory, and WalFilter extend the Customizable class and added a CreateFromString method.
- Some fields of SstFileMetaData are deprecated for compatibility with new base class FileStorageInfo.
- Add
file_temperature
toIngestExternalFileArg
such that when ingesting SST files, we are able to indicate the temperature of the this batch of files. - If
DB::Close()
failed with a non aborted status, callingDB::Close()
again will return the original status instead of Status::OK. - Add CacheTier to advanced_options.h to describe the cache tier we used. Add a
lowest_used_cache_tier
option toDBOptions
(immutable) and pass it to BlockBasedTableReader. By default it isCacheTier::kNonVolatileBlockTier
, which means, we always use both block cache (kVolatileTier) and secondary cache (kNonVolatileBlockTier). By set it toCacheTier::kVolatileTier
, the DB will not use the secondary cache. - Even when options.max_compaction_bytes is hit, compaction output files are only cut when it aligns with grandparent files' boundaries. options.max_compaction_bytes could be slightly violated with the change, but the violation is no more than one target SST file size, which is usually much smaller.
Performance Improvements
Java API Changes
- Add Java API bindings for new integrated BlobDB options
keyMayExist()
supports ByteBuffer.- Fix multiget throwing Null Pointer Exception for num of keys > 70k (#8039).
RocksDB 6.25.3
6.25.3 (2021-10-14)
Bug Fixes
- Fixed bug in calls to
IngestExternalFiles()
with files for multiple column families. The bug could have introduced a delay in ingested file keys becoming visible afterIngestExternalFiles()
returned. Furthermore, mutations to ingested file keys while they were invisible could have been dropped (not necessarily immediately). - Fixed a possible race condition impacting users of
WriteBufferManager
who constructed it withallow_stall == true
. The race condition led to undefined behavior (in our experience, typically a process crash). - Fixed a bug where stalled writes would remain stalled forever after the user calls
WriteBufferManager::SetBufferSize()
withnew_size == 0
to dynamically disable memory limiting.
6.25.2 (2021-10-11)
Bug Fixes
- Fix
DisableManualCompaction()
to cancel compactions even when they are waiting on automatic compactions to drain due toCompactRangeOptions::exclusive_manual_compactions == true
. - Fix contract of
Env::ReopenWritableFile()
andFileSystem::ReopenWritableFile()
to specify any existing file must not be deleted or truncated.
RocksDB 6.25.1
6.25.1 (2021-09-28)
Bug Fixes
- Fixes a bug in directed IO mode when calling MultiGet() for blobs in the same blob file. The bug is caused by not sorting the blob read requests by file offsets.
6.25.0 (2021-09-20)
Bug Fixes
- Allow secondary instance to refresh iterator. Assign read seq after referencing SuperVersion.
- Fixed a bug of secondary instance's last_sequence going backward, and reads on the secondary fail to see recent updates from the primary.
- Fixed a bug that could lead to duplicate DB ID or DB session ID in POSIX environments without /proc/sys/kernel/random/uuid.
- Fix a race in DumpStats() with column family destruction due to not taking a Ref on each entry while iterating the ColumnFamilySet.
- Fix a race in item ref counting in LRUCache when promoting an item from the SecondaryCache.
- Fix a race in BackupEngine if RateLimiter is reconfigured during concurrent Restore operations.
- Fix a bug on POSIX in which failure to create a lock file (e.g. out of space) can prevent future LockFile attempts in the same process on the same file from succeeding.
- Fix a bug that backup_rate_limiter and restore_rate_limiter in BackupEngine could not limit read rates.
- Fix the implementation of
prepopulate_block_cache = kFlushOnly
to only apply to flushes rather than to all generated files. - Fix WAL log data corruption when using DBOptions.manual_wal_flush(true) and WriteOptions.sync(true) together. The sync WAL should work with locked log_write_mutex_.
- Add checks for validity of the IO uring completion queue entries, and fail the BlockBasedTableReader MultiGet sub-batch if there's an invalid completion
- Add an interface RocksDbIOUringEnable() that, if defined by the user, will allow them to enable/disable the use of IO uring by RocksDB
- Fix the bug that when direct I/O is used and MultiRead() returns a short result, RandomAccessFileReader::MultiRead() still returns full size buffer, with returned short value together with some data in original buffer. This bug is unlikely cause incorrect results, because (1) since FileSystem layer is expected to retry on short result, returning short results is only possible when asking more bytes in the end of the file, which RocksDB doesn't do when using MultiRead(); (2) checksum is unlikely to match.
New Features
- RemoteCompaction's interface now includes
db_name
,db_id
,session_id
, which could help the user uniquely identify compaction job between db instances and sessions. - Added a ticker statistic, "rocksdb.verify_checksum.read.bytes", reporting how many bytes were read from file to serve
VerifyChecksum()
andVerifyFileChecksums()
queries. - Added ticker statistics, "rocksdb.backup.read.bytes" and "rocksdb.backup.write.bytes", reporting how many bytes were read and written during backup.
- Added properties for BlobDB:
rocksdb.num-blob-files
,rocksdb.blob-stats
,rocksdb.total-blob-file-size
, androcksdb.live-blob-file-size
. The existing propertyrocksdb.estimate_live-data-size
was also extended to include live bytes residing in blob files. - Added two new RateLimiter IOPriorities:
Env::IO_USER
,Env::IO_MID
.Env::IO_USER
will have superior priority over all other RateLimiter IOPriorities without being subject to fair scheduling constraint. SstFileWriter
now supportsPut
s andDelete
s with user-defined timestamps. Note that the ingestion logic itself is not timestamp-aware yet.- Allow a single write batch to include keys from multiple column families whose timestamps' formats can differ. For example, some column families may disable timestamp, while others enable timestamp.
- Add compaction priority information in RemoteCompaction, which can be used to schedule high priority job first.
- Added new callback APIs
OnBlobFileCreationStarted
,OnBlobFileCreated
andOnBlobFileDeleted
inEventListener
class of listener.h. It notifies listeners during creation/deletion of individual blob files in Integrated BlobDB. It also log blob file creation finished event and deletion event in LOG file. - Batch blob read requests for
DB::MultiGet
usingMultiRead
. - Add support for fallback to local compaction, the user can return
CompactionServiceJobStatus::kUseLocal
to instruct RocksDB to run the compaction locally instead of waiting for the remote compaction result. - Add built-in rate limiter's implementation of
RateLimiter::GetTotalPendingRequest(int64_t* total_pending_requests, const Env::IOPriority pri)
for the total number of requests that are pending for bytes in the rate limiter. - Charge memory usage during data buffering, from which training samples are gathered for dictionary compression, to block cache. Unbuffering data can now be triggered if the block cache becomes full and
strict_capacity_limit=true
for the block cache, in addition to existing conditions that can trigger unbuffering.
Public API change
- Remove obsolete implementation details FullKey and ParseFullKey from public API
- Change
SstFileMetaData::size
fromsize_t
touint64_t
. - Made Statistics extend the Customizable class and added a CreateFromString method. Implementations of Statistics need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
- Extended
FlushJobInfo
andCompactionJobInfo
in listener.h to provide information about the blob files generated by a flush/compaction and garbage collected during compaction in Integrated BlobDB. Added struct membersblob_file_addition_infos
andblob_file_garbage_infos
that contain this information. - Extended parameter
output_file_names
ofCompactFiles
API to also include paths of the blob files generated by the compaction in Integrated BlobDB. - Most
BackupEngine
functions now returnIOStatus
instead ofStatus
. Most existing code should be compatible with this change but some calls might need to be updated.
RocksDB 6.24.2
6.24.2 (2021-09-16)
Bug Fixes
- Add checks for validity of the IO uring completion queue entries, and fail the BlockBasedTableReader MultiGet sub-batch if there's an invalid completion
6.24.1 (2021-08-31)
Bug Fixes
- Fix a race in item ref counting in LRUCache when promoting an item from the SecondaryCache.
6.24.0 (2021-08-20)
Bug Fixes
- If the primary's CURRENT file is missing or inaccessible, the secondary instance should not hang repeatedly trying to switch to a new MANIFEST. It should instead return the error code encountered while accessing the file.
- Restoring backups with BackupEngine is now a logically atomic operation, so that if a restore operation is interrupted, DB::Open on it will fail. Using BackupEngineOptions::sync (default) ensures atomicity even in case of power loss or OS crash.
- Fixed a race related to the destruction of
ColumnFamilyData
objects. The earlier logic unlocked the DB mutex before destroying the thread-localSuperVersion
pointers, which could result in a process crash if another thread managed to get a reference to theColumnFamilyData
object. - Removed a call to
RenameFile()
on a non-existent info log file ("LOG") when opening a new DB. Such a call was guaranteed to fail though did not impact applications since we swallowed the error. Now we also stopped swallowing errors in renaming "LOG" file. - Fixed an issue where
OnFlushCompleted
was not called for atomic flush. - Fixed a bug affecting the batched
MultiGet
API when used with keys spanning multiple column families andsorted_input == false
. - Fixed a potential incorrect result in opt mode and assertion failures caused by releasing snapshot(s) during compaction.
- Fixed passing of BlobFileCompletionCallback to Compaction job and Atomic flush job which was default paramter (nullptr). BlobFileCompletitionCallback is internal callback that manages addition of blob files to SSTFileManager.
- Fixed MultiGet not updating the block_read_count and block_read_byte PerfContext counters
New Features
- Made the EventListener extend the Customizable class.
- EventListeners that have a non-empty Name() and that are registered with the ObjectRegistry can now be serialized to/from the OPTIONS file.
- Insert warm blocks (data blocks, uncompressed dict blocks, index and filter blocks) in Block cache during flush under option BlockBasedTableOptions.prepopulate_block_cache. Previously it was enabled for only data blocks.
- BlockBasedTableOptions.prepopulate_block_cache can be dynamically configured using DB::SetOptions.
- Add CompactionOptionsFIFO.age_for_warm, which allows RocksDB to move old files to warm tier in FIFO compactions. Note that file temperature is still an experimental feature.
- Add a comment to suggest btrfs user to disable file preallocation by setting
options.allow_fallocate=false
. - Fast forward option in Trace replay changed to double type to allow replaying at a lower speed, by settings the value between 0 and 1. This option can be set via
ReplayOptions
inReplayer::Replay()
, or via--trace_replay_fast_forward
in db_bench. - Add property
LiveSstFilesSizeAtTemperature
to retrieve sst file size at different temperature. - Added a stat rocksdb.secondary.cache.hits
- Added a PerfContext counter secondary_cache_hit_count
- The integrated BlobDB implementation now supports the tickers
BLOB_DB_BLOB_FILE_BYTES_READ
,BLOB_DB_GC_NUM_KEYS_RELOCATED
, andBLOB_DB_GC_BYTES_RELOCATED
, as well as the histogramsBLOB_DB_COMPRESSION_MICROS
andBLOB_DB_DECOMPRESSION_MICROS
. - Added hybrid configuration of Ribbon filter and Bloom filter where some LSM levels use Ribbon for memory space efficiency and some use Bloom for speed. See NewRibbonFilterPolicy. This also changes the default behavior of NewRibbonFilterPolicy to use Bloom for flushes under Leveled and Universal compaction and Ribbon otherwise. The C API function
rocksdb_filterpolicy_create_ribbon
is unchanged but adds newrocksdb_filterpolicy_create_ribbon_hybrid
.
Public API change
- Added APIs to decode and replay trace file via Replayer class. Added
DB::NewDefaultReplayer()
to create a default Replayer instance. AddedTraceReader::Reset()
to restart reading a trace file. Created trace_record.h, trace_record_result.h and utilities/replayer.h files to access the decoded Trace records, replay them, and query the actual operation results. - Added Configurable::GetOptionsMap to the public API for use in creating new Customizable classes.
- Generalized bits_per_key parameters in C API from int to double for greater configurability. Although this is a compatible change for existing C source code, anything depending on C API signatures, such as foreign function interfaces, will need to be updated.
Performance Improvements
- Try to avoid updating DBOptions if
SetDBOptions()
does not change any option value.
Behavior Changes
StringAppendOperator
additionally accepts a string as the delimiter.- BackupEngineOptions::sync (default true) now applies to restoring backups in addition to creating backups. This could slow down restores, but ensures they are fully persisted before returning OK. (Consider increasing max_background_operations to improve performance.)
RocksDB 6.23.3
6.23.3 (2021-08-09)
Bug Fixes
- Removed a call to
RenameFile()
on a non-existent info log file ("LOG") when opening a new DB. Such a call was guaranteed to fail though did not impact applications since we swallowed the error. Now we also stopped swallowing errors in renaming "LOG" file. - Fixed a bug affecting the batched
MultiGet
API when used with keys spanning multiple column families andsorted_input == false
.
6.23.2 (2021-08-04)
Bug Fixes
- Fixed a race related to the destruction of
ColumnFamilyData
objects. The earlier logic unlocked the DB mutex before destroying the thread-localSuperVersion
pointers, which could result in a process crash if another thread managed to get a reference to theColumnFamilyData
object. - Fixed an issue where
OnFlushCompleted
was not called for atomic flush.
6.23.1 (2021-07-22)
Bug Fixes
- Fix a race condition during multiple DB instances opening.
6.23.0 (2021-07-16)
Behavior Changes
- Obsolete keys in the bottommost level that were preserved for a snapshot will now be cleaned upon snapshot release in all cases. This form of compaction (snapshot release triggered compaction) previously had an artificial limitation that multiple tombstones needed to be present.
Bug Fixes
- Blob file checksums are now printed in hexadecimal format when using the
manifest_dump
ldb
command. GetLiveFilesMetaData()
now populates thetemperature
,oldest_ancester_time
, andfile_creation_time
fields of itsLiveFileMetaData
results when the information is available. Previously these fields always contained zero indicating unknown.- Fix mismatches of OnCompaction{Begin,Completed} in case of DisableManualCompaction().
- Fix continuous logging of an existing background error on every user write
- Fix a bug that
Get()
return Status::OK() and an empty value for non-existent key whenread_options.read_tier = kBlockCacheTier
. - Fix a bug that stat in
get_context
didn't accumulate to statistics when query is failed.
New Features
- ldb has a new feature,
list_live_files_metadata
, that shows the live SST files, as well as their LSM storage level and the column family they belong to. - The new BlobDB implementation now tracks the amount of garbage in each blob file in the MANIFEST.
- Integrated BlobDB now supports Merge with base values (Put/Delete etc.).
- RemoteCompaction supports sub-compaction, the job_id in the user interface is changed from
int
touint64_t
to support sub-compaction id. - Expose statistics option in RemoteCompaction worker.
Public API change
- Added APIs to the Customizable class to allow developers to create their own Customizable classes. Created the utilities/customizable_util.h file to contain helper methods for developing new Customizable classes.
- Change signature of SecondaryCache::Name(). Make SecondaryCache customizable and add SecondaryCache::CreateFromString method.
RocksDB 6.22.1
6.22.1 (2021-06-25)
Bug Fixes
GetLiveFilesMetaData()
now populates thetemperature
,oldest_ancester_time
, andfile_creation_time
fields of itsLiveFileMetaData
results when the information is available. Previously these fields always contained zero indicating unknown.
6.22.0 (2021-06-18)
Behavior Changes
- Added two additional tickers, MEMTABLE_PAYLOAD_BYTES_AT_FLUSH and MEMTABLE_GARBAGE_BYTES_AT_FLUSH. These stats can be used to estimate the ratio of "garbage" (outdated) bytes in the memtable that are discarded at flush time.
- Added API comments clarifying safe usage of Disable/EnableManualCompaction and EventListener callbacks for compaction.
Bug Fixes
- fs_posix.cc GetFreeSpace() always report disk space available to root even when running as non-root. Linux defaults often have disk mounts with 5 to 10 percent of total space reserved only for root. Out of space could result for non-root users.
- Subcompactions are now disabled when user-defined timestamps are used, since the subcompaction boundary picking logic is currently not timestamp-aware, which could lead to incorrect results when different subcompactions process keys that only differ by timestamp.
- Fix an issue that
DeleteFilesInRange()
may cause ongoing compaction reports corruption exception, or ASSERT for debug build. There's no actual data loss or corruption that we find. - Fixed confusingly duplicated output in LOG for periodic stats ("DUMPING STATS"), including "Compaction Stats" and "File Read Latency Histogram By Level".
- Fixed performance bugs in background gathering of block cache entry statistics, that could consume a lot of CPU when there are many column families with a shared block cache.
New Features
- Marked the Ribbon filter and optimize_filters_for_memory features as production-ready, each enabling memory savings for Bloom-like filters. Use
NewRibbonFilterPolicy
in place ofNewBloomFilterPolicy
to use Ribbon filters instead of Bloom, orribbonfilter
in place ofbloomfilter
in configuration string. - Allow
DBWithTTL
to useDeleteRange
api just like other DBs.DeleteRangeCF()
which executesWriteBatchInternal::DeleteRange()
has been added to the handler inDBWithTTLImpl::Write()
to implement it. - Add BlockBasedTableOptions.prepopulate_block_cache. If enabled, it prepopulate warm/hot data blocks which are already in memory into block cache at the time of flush. On a flush, the data block that is in memory (in memtables) get flushed to the device. If using Direct IO, additional IO is incurred to read this data back into memory again, which is avoided by enabling this option and it also helps with Distributed FileSystem. More details in include/rocksdb/table.h.
- Added a
cancel
field toCompactRangeOptions
, allowing individual in-process manual range compactions to be cancelled.