Object Chunking and Garbage Collection
The large file material here should be valid for Riak CS 1.0, and the GC implementation described is currently under development, to be released in Riak CS 1.1.
The simple goal is to support objects up to 5TB in size, with the same API and behavioral characteristics as S3. S3 offers eventually consistent access to objects, with no concurrency control other than last-write-wins. This means that one "version" of a file might be being read on one machine while another machine is accepting a write for the same file, and yet another is deleting the file (and the deleting machine's view may be yet another version). All of this has to happen while also garbage collecting overwritten and failed-upload files. To accomplish this, we make heavy use of sibling resolution and Riak's eventual consistency properties.
Riak CS splits files up into small blocks and stores each block as a separate Riak object. Files smaller than one block are simply stored as a single block. There is a separate manifest object that is used to record metadata about the file, such as the key, MD5, and how many blocks have been written so far. Manifests and individual blocks are stored in different buckets. A user bucket called "foo", for example, corresponds to two buckets in Riak: one for the manifests and one for the blocks.
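As a rough illustration of the chunking step (a sketch only, not the actual Riak CS code; the block size corresponds to the `lfs_block_size` setting), splitting an object binary into fixed-size blocks might look like this:

```erlang
%% Illustrative sketch: split an object binary into fixed-size blocks,
%% each of which Riak CS would store as a separate Riak object.
-module(chunk_sketch).
-export([split_blocks/2]).

-spec split_blocks(binary(), pos_integer()) -> [binary()].
split_blocks(Bin, BlockSize) when byte_size(Bin) =< BlockSize ->
    [Bin];                                   %% small objects are a single block
split_blocks(Bin, BlockSize) ->
    <<Block:BlockSize/binary, Rest/binary>> = Bin,
    [Block | split_blocks(Rest, BlockSize)].
```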
There is a manifest for each PUT operation on a particular key, each with a different UUID. Because of things like garbage collection and concurrent PUTs, the object stored at the manifest {Bucket, Key} is a collection of manifests (an orddict), not just a single manifest. Furthermore, `{allow_mult, true}` is set on this bucket, so there may be several siblings, each of which is a collection of manifests. Depending on the operation, a piece of code may examine all of the manifests, or only a particular one. The Erlang record definition for manifests is well commented here.
Individual blocks are immutable, as they include the UUID as part of their key. They can, however, be deleted.
Manifests can be in one of four different states:
- `writing`: A manifest is in this state when it is first created, and while the blocks are being written to Riak.
- `active`: A manifest is in this state once all of the blocks have been written to Riak. This is the only state in which a manifest can serve GET requests.
- `pending_delete`: When a user deletes a file or a file is overwritten, the manifest is first put in the `pending_delete` state, before the move to the GC bucket has succeeded. Once the manifest has been successfully moved, it goes into the `scheduled_delete` state.
- `scheduled_delete`: A manifest is in this state when its information has been written to the `riak-cs-gc` bucket, thus scheduling the file blocks for deletion. The manifest stays around for as long as `leeway_seconds`, and is pruned lazily.
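As a hedged sketch of the lifecycle above (the states come from the prose; the transition table is illustrative, not taken from the Riak CS source):

```erlang
%% Sketch of the manifest lifecycle described above.
-module(manifest_state_sketch).
-export([valid_transition/2]).

-type manifest_state() :: writing | active | pending_delete | scheduled_delete.

-spec valid_transition(manifest_state(), manifest_state()) -> boolean().
valid_transition(writing, active)                  -> true;  %% all blocks written
valid_transition(writing, pending_delete)          -> true;  %% overwritten or failed upload
valid_transition(active, pending_delete)           -> true;  %% deleted or overwritten
valid_transition(pending_delete, scheduled_delete) -> true;  %% recorded in riak-cs-gc
valid_transition(_From, _To)                       -> false.
```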
When two manifests with the same UUID are in conflict (because they came from manifest-collection siblings), they are resolved with this code.
- `GET`:
  - Retrieve manifests.
  - Resolve siblings.
  - Select all `active` manifests and choose the one with the most recent `write_start_time` (a property of the manifest). If there are no `active` manifests, return `404`.

Note, we specifically don't write the resolved manifests back to Riak, as this can create a situation where two actors both resolve siblings, thereby creating siblings again. This follows the general rule of thumb: don't make a write to Riak just to resolve siblings.
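A minimal sketch of the selection rule (manifests are represented here as simple `{UUID, State, WriteStartTime}` tuples for clarity; the real code works on the `?MANIFEST` record):

```erlang
%% Keep only active manifests and pick the one with the greatest
%% write_start_time; no active manifest maps to a 404.
-module(get_select_sketch).
-export([latest_active/1]).

latest_active(Manifests) ->
    case [M || {_UUID, active, _Time} = M <- Manifests] of
        [] ->
            {error, notfound};
        Active ->
            [Best | _] = lists:sort(fun({_, _, T1}, {_, _, T2}) -> T1 >= T2 end,
                                    Active),
            {ok, Best}
    end.
```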
- `PUT`:
  - Create a new manifest (with a fresh UUID) in the `writing` state. Once all of the blocks have been written, change the state to `active`.
  - Follow the same steps as in `DELETE` to delete any manifests that this overwrites, with the following exception: manifests found in the `writing` state are not marked as `pending_delete` unless their `last_block_written_time` is more than `leeway_seconds` in the past.

Note, each time the manifest is written back to Riak, we resolve siblings and write back with the correct vector clock.
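A sketch of the "write back with the correct vector clock" rule: rather than creating a fresh object, the fetched Riak object is updated in place so its vector clock is carried along (sibling resolution itself is elided here, and the value encoding is an assumption):

```erlang
%% Illustrative: carry the fetched object's vector clock on write-back.
-module(manifest_writeback_sketch).
-export([write_back/3]).

write_back(Pid, FetchedObj, ResolvedManifestDict) ->
    Updated = riakc_obj:update_value(FetchedObj,
                                     term_to_binary(ResolvedManifestDict)),
    riakc_pb_socket:put(Pid, Updated).
```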
- `PUT Copy`: actually a combination of `GET` and `PUT`.
  - Spawn a PUT fsm and a GET fsm.
  - Deliver block data fetched from the GET fsm to the PUT fsm.
  - On every block, check the socket by reading 0 bytes to detect disconnected clients and avoid unnecessary copying.
- `DELETE`:
  - Retrieve manifests.
  - Resolve siblings.
  - Select all `active` and `writing` manifests, and mark them as `pending_delete`.
  - "Move" these manifests to the GC bucket, or delete blocks on the fly if the object size is below `active_delete_threshold`. See Active-Deletion-of-Small-Objects for details.
  - Mark them as `scheduled_delete`.
Note: on any of these paths, before retrieving manifests there is a phase of:

- Authentication: fetching the user in `moss.users` whose key is given in the `Authorization` header of the HTTP request.
- Owner identification: fetching the bucket record in `moss.buckets` by bucket name, which has the owner's access key in its contents, and then fetching the owner's record from `moss.users` if the owner is not the same as the accessing user.
- Authorization: extracting the ACLs and policies defined in the bucket record metadata and evaluating all of them against the accessor, owner, and other information.

Only after all of these checks pass may the get/put/delete code run; otherwise the whole request fails.
There are three types of processes involved in get/put/delete. Their responsibilities are detailed below.

There is a coordinator FSM for each of get/put/delete. This FSM is responsible for keeping track of which blocks remain to be retrieved/written/deleted by a group of gen_servers, and for communicating back to the original caller. Its goals are similar to the Riak get/put/delete FSMs. This FSM only concerns itself with the "active manifest", and should not have knowledge of siblings or manifests with other UUIDs. The coordinator FSM is also responsible for launching the other two types of processes.

The manifest FSM is responsible for getting/putting/deleting manifests from Riak (or coordinating a gen_server to do this), and for keeping track of and acting on manifest siblings and "other" UUIDs. It does this so that the coordinator FSM doesn't have to concern itself with these things. This FSM will do things like spawn GC when it comes across manifests that haven't been deleted or whose deletion processes appear to have died, and it makes sure that things written back to Riak always include the correct vector clock.
The blocks gen_server is a simple wrapper around `riakc_pb_socket` that knows how to reply to the coordinator FSM and how to turn block identifiers into Riak BKeys. There may be multiple of these processes to handle blocks concurrently (configured by `(fetch|put|delete)_concurrency`). The fallback and error handling logic is complicated, involving quorums, replication fallbacks and timeouts; this subsection tries to describe it.
GET blocks: As blocks are all write-once (as well as delete-once) and immutable, there is little need to perform an `n=all` get request. In most failure-free cases the factor of N is just unnecessary network load in a large object storage use case. Thus CS fetches blocks with the `n=one` get option, with sloppy quorum disabled. It first tries an `n=one` local get; if that fails with `notfound`, it then tries to fetch with `n=all`. If even that does not work, it tries to perform `proxy_get` in the Enterprise edition. There is some retry logic at each step, and this is repeated for all blocks in an object. Sibling resolution is nasty, see `riak_cs_block_server:resolve_block_object/2`. After sibling resolution the fetched block is delivered to the GET fsm.
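A hedged sketch of this block-fetch fallback (not the actual `riak_cs_block_server` code; it assumes a `riakc_pb_socket` connection, that the cluster honours the `n_val` and `sloppy_quorum` get options, and it omits `proxy_get` and per-step retries):

```erlang
-module(block_get_sketch).
-export([get_block/3]).

get_block(Pid, Bucket, Key) ->
    %% first attempt: cheap "n=one" local read with sloppy quorum disabled
    LocalOpts = [{r, 1}, {n_val, 1}, {sloppy_quorum, false}],
    case riakc_pb_socket:get(Pid, Bucket, Key, LocalOpts) of
        {ok, Obj} ->
            {ok, Obj};
        {error, notfound} ->
            %% second attempt: ask all replicas ("n=all")
            case riakc_pb_socket:get(Pid, Bucket, Key, [{r, all}]) of
                {ok, Obj}          -> {ok, Obj};
                {error, notfound}  -> {error, notfound};  %% proxy_get would go here (Enterprise)
                {error, _} = Error -> Error
            end;
        {error, _} = Error ->
            Error
    end.
```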
PUT blocks: Blocks are put to Riak with `n=all, pr=1` to keep them visible to reads.
DELETE blocks: This might be performed by a GC-spawned delete fsm, or by the put fsm where `active_delete_threshold` is enabled. To prevent block leaks (a block leak is an inconsistent state where a block remains stored but every manifest reference to it has been lost), a HEAD operation to learn the vector clock is performed with `n=all, pr=quorum`. If nothing is found there, the block is treated as already deleted. The first delete attempt uses `n=all, pr=all, pw=all`. If that fails, the second attempt uses `n=all` with all other options left at Riak's defaults. Operators must make sure that few handoffs are running or waiting (the actual number depends upon N) to prevent such block leaks.
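A hedged sketch of this delete path, fetching the vector clock first and then retrying with Riak's defaults (the option lists are illustrative, not the exact ones Riak CS uses):

```erlang
-module(block_delete_sketch).
-export([delete_block/3]).

delete_block(Pid, Bucket, Key) ->
    %% HEAD-style read to learn the vector clock
    case riakc_pb_socket:get(Pid, Bucket, Key, [{pr, quorum}]) of
        {error, notfound} ->
            ok;  %% nothing stored: treat as already deleted
        {ok, Obj} ->
            VClock = riakc_obj:vclock(Obj),
            %% strict first attempt
            case riakc_pb_socket:delete_vclock(Pid, Bucket, Key, VClock,
                                               [{pr, all}, {pw, all}]) of
                ok ->
                    ok;
                {error, _} ->
                    %% relaxed retry with Riak's default options
                    riakc_pb_socket:delete_vclock(Pid, Bucket, Key, VClock, [])
            end;
        {error, _} = Error ->
            Error
    end.
```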
Multipart uploads: There is one additional detail here. There is effectively a new manifest state (call it a "substate") during the transition from the `writing` state to the `active` state. It is possible for an upload ID to have 5 parts (call them 1-5), but the completed object might only use parts 1, 4, and 5. In such a case, we must somehow garbage collect parts 2 and 3.
In the case when we must garbage collect some parts as we move to the `active` state, there is an additional property added to the `?MANIFEST.props` property list. The properties are:

- `multipart`: The value for the `multipart` key is a `?MULTIPART_MANIFEST` record. This record contains all of the extra metadata required for multipart upload support.
- `multipart_clean`: When this property is present, the GC daemon has been notified of all parts that require garbage collection. The presence (or absence) of this property defines a substate-like difference in the `active` state.
If the "upload finished" command uses all parts, then the manifest
will be written with both state=active
and also the
multipart_clean
property present in the ?MANIFEST.props
property
list.
If the "upload finished" command does not use all parts, then the
manifest will first be written with state=active
but the
multipart_clean
property will be absent. After the GC has been
successfully scheduled, the multipart_clean
property will be added
to the props
list, and the manifest will be updated.
When an object is fetched and it is `state=active` but does not have the `multipart_clean` property, then GC for the unused parts is rescheduled.
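A small sketch of that check (the `Props` argument stands in for the `?MANIFEST.props` proplist; the function name is hypothetical):

```erlang
%% An active multipart manifest without the multipart_clean property still
%% needs its unused parts scheduled for garbage collection.
-module(multipart_gc_sketch).
-export([needs_part_gc/2]).

needs_part_gc(active, Props) ->
    proplists:is_defined(multipart, Props)
        andalso not proplists:is_defined(multipart_clean, Props);
needs_part_gc(_State, _Props) ->
    false.
```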
In the deep mists of time, there were a couple of designs discussed. They are listed at https://github.com/basho/riak_cs/issues/300 ... they were "Reid's Strawman" and "Scott's Strawman".
In the end, Scott mostly implemented Reid's proposal.
The `?MANIFEST` record remains unchanged: the `#lfs_manifest_v3` record was not changed. All multipart information is stored in the `?MANIFEST.props` property list. The roles of the `multipart` and `multipart_clean` properties are explained above.
A `?MULTIPART_MANIFEST` record contains the basic extra metadata for multipart uploads. It also holds three lists of parts, each stored in its own field of the `?MULTIPART_MANIFEST` record:
- `?MULTIPART_MANIFEST.parts` -- Info about all parts.
- `?MULTIPART_MANIFEST.done_parts` -- All parts that have finished successfully.
- `?MULTIPART_MANIFEST.cleanup_parts` -- All unused parts, after an "upload finished" operation.
The blocks for each part are stored using the same scheme as a single-part file:
- Each part has a UUID.
- The data for the part is divided into blocks/chunks of `lfs_block_size` (an `app.config` setting, default = 1 MByte).
- Each block/chunk has a `{UUID, BlockId}` name, where `BlockId` is an integer >= 0.
It was then a small change to make the `riak_cs_block_server` module use the `{UUID, BlockId}` naming scheme everywhere.
First, a reminder that for a given named object, multiple internal versions of that object may be stored in the system at one time. Each version of the object is accessible by an object manifest that includes a UUID identifying that particular version. There is at most one active manifest for a named object at any one time and the active version is the only one that is externally available to a Riak CS user.
Garbage collection of an object version involves several different actions. These actions can be divided into synchronous actions that occur while the user is waiting for notification of successful command completion and asynchronous actions that are not directly tied to user actions and occur in the background. These two action groups are described in more detail in the following sections.
There are two direct actions a user may take to initiate the garbage collection of an object version: overwriting the object with a new version or deleting the object.
When an object version is overwritten, a new object manifest is written with the state set to `active`, and this new version becomes what is available to the user; in the case of a delete, the object is no longer externally available.
Also, as part of the overwrite or delete action, a set of eligible manifest versions is determined, the state of each eligible manifest is changed to `pending_delete`, and the `delete_marked_time` field is set to a time value representing the current time.
The method for compiling the list of eligible manifests is dependent on the operation.
For object overwrites, the previously `active` manifest version is selected, along with any manifest versions in the `writing` state whose `last_block_written_time` field (or `write_start_time` if `last_block_written_time` is undefined) represents a time more than `leeway_seconds` seconds ago. If a manifest version remains in the `writing` state for longer than `leeway_seconds` seconds, it is assumed that that manifest version represents a failed upload attempt and therefore it is acceptable to reap any object blocks that may have been written. Manifest versions in the `writing` state whose `last_block_written_time` has not exceeded the `leeway_seconds` threshold are not deemed eligible because they could represent an object version that is still in the process of writing its blocks.
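A hedged sketch of that overwrite-eligibility rule (field names follow the prose, not the actual `#lfs_manifest_v3{}` record; times are assumed to be epoch seconds):

```erlang
-module(overwrite_gc_sketch).
-export([eligible/2]).

%% Manifests are represented as maps with state, write_start_time and
%% last_block_written_time keys for this illustration.
eligible(#{state := active}, _LeewaySeconds) ->
    true;
eligible(#{state := writing} = M, LeewaySeconds) ->
    LastActivity = case maps:get(last_block_written_time, M, undefined) of
                       undefined -> maps:get(write_start_time, M);
                       T         -> T
                   end,
    erlang:system_time(second) - LastActivity > LeewaySeconds;
eligible(_Manifest, _LeewaySeconds) ->
    false.
```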
Object deletes are more straightforward. Since no object is externally available to the user after the delete operation, any manifest versions in the `active` or `writing` state are eligible to be cleaned up. There is no concern about reaping the object version that is currently being written to become the next `active` version.
Once the states of the eligible manifests have been updated to `pending_delete`, the manifest information for any `pending_delete` manifest versions is collected into a CRDT set, and the set is written as a value to the `riak-cs-gc` bucket, keyed by a time value representing the current epoch time plus the leeway interval (i.e. the `leeway_seconds` configuration option). If that write is successful, the state for each manifest in the set is updated to `scheduled_delete`. This indicates that the blocks of the object have been scheduled for deletion by the garbage collection daemon and prevents other manifest resolution processes for the object from scheduling unnecessary deletions.
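A minimal sketch of the scheduling key (it assumes, for illustration, that keys in the `riak-cs-gc` bucket are epoch-seconds strings):

```erlang
-module(gc_schedule_sketch).
-export([gc_bucket_key/1]).

%% Key under which the set of pending_delete manifests would be written:
%% current epoch time plus the leeway interval.
gc_bucket_key(LeewaySeconds) ->
    integer_to_binary(erlang:system_time(second) + LeewaySeconds).
```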
Once the manifest enters the `scheduled_delete` state it remains as a tombstone for a minimum of `leeway_seconds`.
After these actions have been attempted, the synchronous portion of the garbage collection process is concluded and a response is returned to the user who issued the request.
The asynchronous portion of the garbage collection process is orchestrated by the garbage collection daemon (`riak_cs_gc_manager`), which wakes up at specific intervals and spawns a batch process (`riak_cs_gc_batch`) with multiple workers underneath. The batch process checks the `riak-cs-gc` bucket for any scheduled entries that are eligible for reaping and assigns keys to each worker. When a worker gets GC bucket keys assigned by the batch process, it performs block deletion and manifest reaping.
```
+----------+             +--------+
|gc_manager+--(spawn)--->|gc_batch| ..(fetch keys by 2i query)
+----------+             +---+----+
      \                    / |   +-----------+
       <--report finish---/  +---+ gc_worker | ..(process keys assigned by batch)
                             |   +-----------+
                             |   +-----------+
                             +---+ gc_worker |
                             |   +-----------+
                             .....
                             |   +-----------+
                             +---+ gc_worker |
                                 +-----------+
```
The batch process then enters a running state and begins to delete the object blocks associated with the eligible keys, continuing until all keys have been processed. The duration of this running state varies depending on the number of keys involved and the size of the objects they represent. The daemon checks for messages after processing each object so that the work interval may be manually interrupted if needed.
Deletion eligibility is determined using the key values in the `riak-cs-gc` bucket, which are time values. If the current time according to the daemon, minus the leeway period, is later than the time represented by a key, the blocks for the object manifests stored at that key are eligible for deletion and the daemon attempts to delete them.
The batch process gathers the eligible keys for deletion by performing a secondary index range query on the `$key` index with a lower bound of time 0 and an upper bound of the current time minus the leeway period. This allows the batch to collect all the keys that are eligible for deletion while having some way of accounting for clock skew.
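A hedged sketch of that query (it assumes the GC bucket is named `riak-cs-gc` and that its keys are epoch-seconds strings, as above; note that `$key` range comparison is lexicographic):

```erlang
-module(gc_query_sketch).
-export([eligible_gc_keys/2]).

%% Range query over the $key index from time 0 up to "now minus leeway".
eligible_gc_keys(Pid, LeewaySeconds) ->
    UpperBound = erlang:system_time(second) - LeewaySeconds,
    riakc_pb_socket:get_index_range(Pid, <<"riak-cs-gc">>, <<"$key">>,
                                    <<"0">>, integer_to_binary(UpperBound)).
```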
Once the object blocks represented by a key in the `riak-cs-gc` bucket have all been deleted, the key is deleted from the `riak-cs-gc` bucket.
Manifest versions are retrieved and updated by the `riak_cs_manifest_fsm` module with very few exceptions. This module encapsulates the logic needed to retrieve the manifests, resolve any conflicts due to siblings, and write updated manifest versions back to Riak.
The actual deletion of the blocks of an object is managed by the `riak_cs_delete_fsm` module. It starts up a number of delete workers (based on the configured delete concurrency) and passes object block information to those workers, who in turn carry out the actual delete operation for each block. The delete workers are instances of the `riak_cs_block_server` module.
Once a worker deletes a block, it notifies the delete fsm and waits for notification about another block to delete. Once all blocks of an object are deleted, the delete fsm starts an instance of the manifest fsm to handle deleting the manifest version from the object manifest data structure and, if there are no remaining manifest versions, deleting the entire object manifest data structure. The goal of this final step is to avoid the cost of scanning through empty manifest keys that could linger indefinitely.
- A slow reader may have blocks GC'd while it is reading an object if the read exceeds the leeway interval.
- There is some reliance on system clocks, so clock skew could lead to object blocks being deleted earlier or later than their intended eligibility window dictates.
- A network partition (or machine failure) lasting longer than `leeway_seconds` could cause a manifest to "come back to life" and appear active; it would then continually serve requests whose blocks could not be found.