forked from radii/undup
-
Notifications
You must be signed in to change notification settings - Fork 0
/
SPEC
141 lines (107 loc) · 4.47 KB
/
SPEC
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
The undup stream format
-----------------------
Andrew Isaacson <adi@hexapodia.org>
February 3, 2012
version 0.1
OVERVIEW
--------
Undup transforms an input file, consisting of a sequence of bytes arranged
into 512-byte blocks, into a undup stream consisting of data blocks
interleaved with metadata blocks. Backrefs in the metadata can refer to
previous offsets in the stream, thereby allowing a stream with repetitive
content (such as an uncompressed tar archive containing many slightly
different copies of a source tree) to be represented without storing the
repeated data multiple times in the undup stream.
This compression phase ideally uses a significant amount of RAM, on the order
of 8% of the input stream size, but can be bounded (with some loss of
efficiency) to any chosen RAM footprint. The input to the compression phase
can be a stream or a seekable file; the output can be a stream or a seekable
file. The expansion phase requires that one of the input or the output must
be seekable.
The unduplication algorithm uses SHA256 to find duplicate blocks. If two
blocks A and A' with different content generate the same SHA256, the undup
stream will silently lose data during the compression phase. If this happens,
the expansion phase will fail to validate the final, full-stream SHA256 and
will noisily warn that this has happened. If A and A' also result in the same
transformation in the full-stream SHA256 state, then silent data loss will
occur because the output stream will contain A everywhere it should contain
A'. This is considered unlikely.
An undup stream consists of a header, followed by zero or more frames,
followed by a special frame known as a trailer.
All values are stored big endian, naturally aligned. All data and metadata
blocks are multiples of 512 bytes.
Undup uses SHA256 in many places. The SHA256 is sometimes truncated from its
full 32 bytes down to 12 bytes. Truncation is performed by discarding the
trailing bytes of the SHA256.
HEADER
------
The header is 512 bytes. It starts with the following structure and is
padded with bytes of value 0x00.
Magic Version
+-------------+-------------+
| 75 6e 64 75 | 01 00 00 00 |
+-------------+-------------+
TRAILER
-------
The trailer is 512 bytes. It starts with the following structure and is
padded with bytes of value 0x00.
Op Length Hash
0 1 .. 7 8 .. 39
+----+---------+------------------+
| ff | 8 bytes | SHA256, 32 bytes |
+----+---------+------------------+
The Length field denotes the total length, in bytes, of the input data stream.
The maximum length of a stream is 2^56-1 = 64 petabytes (minus one).
The Hash field gives the SHA256 of the input data stream.
FRAME
-----
A frame consists of a metadata block, followed by zero or more data blocks as
described in the metadata block.
A metadata block consists of 32 cells of 16 bytes each.
CELL
----
A cell is 16 bytes long. All cells start with a one-byte opcode. The format
of the rest of the cell depends on the opcode.
DATA CELL
---------
Op Len Hash
0 1 2 3 4 ... 16
+----+----------+-----------------+
| 01 | xx yy zz | hh hh hh ... hh |
+----+----------+-----------------+
op_data = 0x01
union {
u8 opcode;
u32 enclength;
}
length = enclength >> 8;
A data cell denotes that `length' payload blocks are appended, in natual order,
after the current metadata block. A single data cell can specify anywhere
from zero to 2^24-1 blocks (from zero bytes to (8 GiB - 512 bytes)). All
payload blocks must be complete. Partial blocks are not allowed, except for
the last block in the file, which is zero-filled and truncated according to
the trailer's length field.
The truncated SHA256 of the current data payload fills the rest of the data
cell.
BACKREF CELL
------------
Op Pos Len Zeros
0 1 2 6 7 8 9 10 11 12 13 14 15
+----+-----------------+-------------+-------------+
| 02 | mm nn ... xx yy | ww xx yy zz | 00 00 00 00 |
+----+-----------------+-------------+-------------+
op_backref = 0x02
union {
u8 opcode;
struct {
u64 encpos;
u32 length;
u32 zeros;
};
}
pos = encpos >> 8;
A backref cell denotes that 'length' payload blocks should be read from 'pos'
in the file stream. pos+length must be less than or equal to the current file
position. The range of blocks [pos, pos+length) MUST be represented by data
cells previously present in the dedup stream. Backref cells MUST NOT refer to
previous backref cells.