diff --git a/doc/zstd_compression_format.md b/doc/zstd_compression_format.md
index 0532a846f45..d0253201ac6 100644
--- a/doc/zstd_compression_format.md
+++ b/doc/zstd_compression_format.md
@@ -1076,59 +1076,52 @@ Let's `low4Bits` designate the lowest 4 bits of the first byte :
 Then follows each symbol value, from `0` to last present one.
 The number of bits used by each field is variable.
-It depends on :
-
-- Remaining probabilities + 1 :
- __example__ :
- Presuming an `Accuracy_Log` of 8,
- and presuming 100 probabilities points have already been distributed,
- the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
- Therefore, it may read up to `log2sup(157) == 8` bits, where `log2sup(N)`
- is the smallest integer `T` that satisfies `(1 << T) > N`.
-
-- Value decoded : small values use 1 less bit :
- __example__ :
- Presuming values from 0 to 157 (inclusive) are possible,
- 255-157 = 98 values are remaining in an 8-bits field.
- They are used this way :
- first 98 values (hence from 0 to 97) use only 7 bits,
- values from 98 to 157 use 8 bits.
- This is achieved through this scheme :
-
- | Value read | Value decoded | Number of bits used |
- | ---------- | ------------- | ------------------- |
- | 0 - 97 | 0 - 97 | 7 |
- | 98 - 127 | 98 - 127 | 8 |
- | 128 - 225 | 0 - 97 | 7 |
- | 226 - 255 | 128 - 157 | 8 |
-
-Symbols probabilities are read one by one, in order.
-
-Probability is obtained from Value decoded by following formula :
-`Proba = value - 1`
-
-It means value `0` becomes negative probability `-1`.
-`-1` is a special probability, which means "less than 1".
-Its effect on distribution table is described in the [next section].
-For the purpose of calculating total allocated probability points, it counts as one.
-[next section]:#from-normalized-distribution-to-decoding-tables
+For each encoded symbol, minimum bit usage must be determined.
+To do this, first compute
+`Cumulative_Prob = Sum of Effective_Prob for all previously decoded symbols`
+and `Max_Prob_Value = (1 << Accuracy_Log) - Cumulative_Prob + 1`.
+
+Then, compute `Prob_Base_Size` as the position of the highest `1` bit in
+`Max_Prob_Value` (that is, the largest integer `T` that satisfies
+`(1 << T) <= Max_Prob_Value`), and then compute
+`Large_Prob_Range = Max_Prob_Value - (1 << Prob_Base_Size) + 1` and
+`Large_Prob_Start = (1 << Prob_Base_Size) - Large_Prob_Range`.
+
+To decode a symbol probability, read `Prob_Base_Size` bits from the bitstream
+as `Initial_Prob_Value`. If `Initial_Prob_Value >= Large_Prob_Start`, then
+the decoder must read one additional bit from the bitstream. If that
+bit is `1`, then compute
+`Prob_Value = Initial_Prob_Value + Large_Prob_Range`.
+If the extra bit is `0`, or there is no extra bit, then
+`Prob_Value = Initial_Prob_Value`.
+
+If `Prob_Value` is `0`, then the probability is a special "less than 1" value.
+The effect of the "less than 1" probability on the distribution table is
+described in the [next section]. In this case, `Effective_Prob` for the
+symbol is `1`.
+
+If `Prob_Value` is non-zero, then the probability is computed as
+`Probability = Prob_Value - 1`, and `Effective_Prob = Probability`.
+
+If the `Probability` is zero, then it is followed by at least one repeat
+count in the bitstream. Each repeat count is `2` bits. Each time a repeat
+count of `3` is encountered, another repeat count is read until a repeat
+count other than `3` is encountered. Once all repeat counts have been read,
+compute `Symbol_Repeat_Count = Sum of all repeat counts for the symbol`.
+The probability of the next `Symbol_Repeat_Count` symbols after the decoded
+symbol is zero, and decoding of the probability values of those symbols is
+skipped. The `Effective_Prob` of all skipped symbols is also `0`.
+
+If, after decoding a probability, the sum of all decoded `Effective_Prob`
+equals `1 << Accuracy_Log`, then decoding of symbols is completed and the
+probability of all remaining symbols is `0`.
+
+After decoding probabilities, further decoding resumes at the byte after
+the last byte used or partially used by probability decoding. If the last
+byte was partially used, then any unused bits are ignored.
-When a symbol has a __probability__ of `zero`,
-it is followed by a 2-bits repeat flag.
-This repeat flag tells how many probabilities of zeroes follow the current one.
-It provides a number ranging from 0 to 3.
-If it is a 3, another 2-bits repeat flag follows, and so on.
-
-When last symbol reaches cumulated total of `1 << Accuracy_Log`,
-decoding is complete.
-If the last symbol makes cumulated total go above `1 << Accuracy_Log`,
-distribution is considered corrupted.
-
-Then the decoder can tell how many bytes were used in this process,
-and how many symbols are present.
-The bitstream consumes a round number of bytes.
-Any remaining bit within the last byte is just unused.
+[next section]:#from-normalized-distribution-to-decoding-tables
 #### From normalized distribution to decoding tables
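
Reviewer note: the procedure described by the added lines can be checked against the removed example with a small sketch. The Python below is illustrative only and is not taken from the zstd sources. The names `BitReader`, `field_params` and `decode_probabilities` are hypothetical. It assumes the reader is already positioned just past the `Accuracy_Log` field, and that bits are consumed starting from the least significant bit of each byte.

```python
class BitReader:
    """Hypothetical helper: reads bits LSB-first within each byte (assumption)."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position, counted in bits

    def read(self, nbits: int) -> int:
        value = 0
        for i in range(nbits):
            byte_index, bit_index = divmod(self.pos, 8)
            if byte_index >= len(self.data):
                raise ValueError("bitstream exhausted")
            value |= ((self.data[byte_index] >> bit_index) & 1) << i
            self.pos += 1
        return value

    def bytes_consumed(self) -> int:
        # A partially used last byte counts as used; its spare bits are ignored.
        return (self.pos + 7) // 8


def field_params(accuracy_log: int, cumulative_prob: int):
    """Return (Prob_Base_Size, Large_Prob_Range, Large_Prob_Start)."""
    max_prob_value = (1 << accuracy_log) - cumulative_prob + 1
    # Position of the highest 1 bit: largest T with (1 << T) <= max_prob_value.
    prob_base_size = max_prob_value.bit_length() - 1
    large_prob_range = max_prob_value - (1 << prob_base_size) + 1
    large_prob_start = (1 << prob_base_size) - large_prob_range
    return prob_base_size, large_prob_range, large_prob_start


def decode_probabilities(data: bytes, accuracy_log: int):
    """Decode probabilities; -1 stands for the special "less than 1" value."""
    reader = BitReader(data)
    total = 1 << accuracy_log
    cumulative = 0  # sum of Effective_Prob so far
    probs = []
    while cumulative < total:
        base_size, large_range, large_start = field_params(accuracy_log, cumulative)
        value = reader.read(base_size)
        # Large values carry one extra bit; a set bit shifts them upward.
        if value >= large_start and reader.read(1) == 1:
            value += large_range
        if value == 0:
            probs.append(-1)   # "less than 1": Effective_Prob is 1
            cumulative += 1
            continue
        probability = value - 1
        probs.append(probability)
        cumulative += probability
        if probability == 0:
            # 2-bit repeat counts; a count of 3 means another count follows.
            repeat = 0
            while True:
                count = reader.read(2)
                repeat += count
                if count != 3:
                    break
            probs.extend([0] * repeat)  # skipped symbols, Effective_Prob 0
    return probs, reader.bytes_consumed()
```

With `Accuracy_Log = 8` and 100 points already distributed, `field_params(8, 100)` gives `(7, 30, 98)`. This matches the removed table: values below 98 take 7 bits, and the remaining values take 8.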