Commit

Redo FSE Table Description section to replace examples with a description of the algorithm
elasota committed Nov 3, 2023
1 parent 1518570 commit da4a607
Showing 1 changed file with 44 additions and 51 deletions.
95 changes: 44 additions & 51 deletions doc/zstd_compression_format.md
@@ -1076,59 +1076,52 @@ Let's `low4Bits` designate the lowest 4 bits of the first byte :

Then follows each symbol value, from `0` to last present one.
The number of bits used by each field is variable.
It depends on :

- Remaining probabilities + 1 :
__example__ :
Presuming an `Accuracy_Log` of 8,
and presuming 100 probabilities points have already been distributed,
the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
Therefore, it may read up to `log2sup(157) == 8` bits, where `log2sup(N)`
is the smallest integer `T` that satisfies `(1 << T) > N`.

- Value decoded : small values use 1 fewer bit :
__example__ :
Presuming values from 0 to 157 (inclusive) are possible,
255-157 = 98 values are remaining in an 8-bit field.
They are used this way :
first 98 values (hence from 0 to 97) use only 7 bits,
values from 98 to 157 use 8 bits.
This is achieved through this scheme :

| Value read | Value decoded | Number of bits used |
| ---------- | ------------- | ------------------- |
| 0 - 97 | 0 - 97 | 7 |
| 98 - 127 | 98 - 127 | 8 |
| 128 - 225 | 0 - 97 | 7 |
| 226 - 255 | 128 - 157 | 8 |
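
The mapping in the table above can be reproduced with a few lines of arithmetic. The following is only a sketch derived from the example: the function name `decode_field` is hypothetical, and it assumes that "Value read" is the full 8-bit field whose low 7 bits are examined first.

```python
def decode_field(value_read, max_value):
    """Map an n-bit field to (value_decoded, bits_used), as in the example table.

    Hypothetical helper, not part of any zstd API.
    """
    n_bits = max_value.bit_length()          # log2sup(max_value); 8 for 157
    threshold = 1 << (n_bits - 1)            # 128 in the example
    spare = (1 << n_bits) - 1 - max_value    # 255 - 157 = 98 in the example
    low = value_read & (threshold - 1)       # low n-1 bits of the field
    if low < spare:
        # Small values only need n-1 bits; the top bit is not consumed.
        return low, n_bits - 1
    if value_read < threshold:
        return value_read, n_bits
    # Upper part of the range: shift down past the "spare" short codes.
    return value_read - spare, n_bits
```

With `max_value = 157`, this yields the four rows of the table: for instance, a field of `128` decodes to `0` using 7 bits, and a field of `226` decodes to `128` using 8 bits.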

Symbols probabilities are read one by one, in order.

Probability is obtained from Value decoded by the following formula :
`Proba = value - 1`

It means value `0` becomes negative probability `-1`.
`-1` is a special probability, which means "less than 1".
Its effect on distribution table is described in the [next section].
For the purpose of calculating total allocated probability points, it counts as one.

[next section]:#from-normalized-distribution-to-decoding-tables
For each encoded symbol, minimum bit usage must be determined.
To do this, first compute
`Cumulative_Prob = Sum of Effective_Prob for all previously decoded symbols`
and `Max_Prob_Value = (1 << Accuracy_Log) - Cumulative_Prob + 1`.

Then, compute `Prob_Base_Size` as the position of the highest `1` bit in
`Max_Prob_Value` (that is, the largest integer `T` that satisfies
`(1 << T) <= Max_Prob_Value`), and then compute
`Large_Prob_Range = Max_Prob_Value - (1 << Prob_Base_Size) + 1` and
`Large_Prob_Start = (1 << Prob_Base_Size) - Large_Prob_Range`.

To decode a symbol probability, read `Prob_Base_Size` bits from the bitstream
as `Initial_Prob_Value`. If `Initial_Prob_Value >= Large_Prob_Start`, then
the decoder must read one additional bit from the bitstream. If that
bit is `1`, then compute
`Prob_Value = Initial_Prob_Value + Large_Prob_Range`
If the extra bit is `0`, or there is no extra bit, then
`Prob_Value = Initial_Prob_Value`.
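
As an illustration, the two steps above might be sketched as follows. The function names are hypothetical, and `read_bits(n)` stands for any callable that returns the next `n` bits of the bitstream as an integer; how bits are packed into bytes is outside the scope of this sketch.

```python
def prob_field_params(accuracy_log, cumulative_prob):
    """Return (Prob_Base_Size, Large_Prob_Range, Large_Prob_Start)."""
    max_prob_value = (1 << accuracy_log) - cumulative_prob + 1
    # Position of the highest set bit: largest T with (1 << T) <= max_prob_value.
    prob_base_size = max_prob_value.bit_length() - 1
    large_prob_range = max_prob_value - (1 << prob_base_size) + 1
    large_prob_start = (1 << prob_base_size) - large_prob_range
    return prob_base_size, large_prob_range, large_prob_start


def decode_prob_value(read_bits, accuracy_log, cumulative_prob):
    """Read one Prob_Value; read_bits is an assumed bit-reading callable."""
    base_size, large_range, large_start = prob_field_params(
        accuracy_log, cumulative_prob)
    value = read_bits(base_size)
    # The extra bit is only read when the initial value is in the large range.
    if value >= large_start and read_bits(1) == 1:
        value += large_range
    return value
```

For an `Accuracy_Log` of 8 with 100 points already distributed, `prob_field_params(8, 100)` gives `(7, 30, 98)`: 7 base bits, with initial values of 98 and above taking one extra bit.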

If `Prob_Value` is `0`, then the probability is a special "less than 1" value.
The effect of the "less than 1" probability on the distribution table is
described in the [next section]. In this case, `Effective_Prob` for the
symbol is `1`.

If `Prob_Value` is non-zero, then the probability is computed as
`Probability = Prob_Value - 1`, and `Effective_Prob = Probability`.

If the `Probability` is zero, then it is followed by at least one repeat
count in the bitstream. Each repeat count is `2` bits. Each time a repeat
count of `3` is encountered, another repeat count is read until a repeat
count other than `3` is encountered. Once all repeat counts have been read,
compute `Symbol_Repeat_Count = Sum of all repeat counts for the symbol`.
The probability of the next `Symbol_Repeat_Count` symbols after the decoded
symbol is zero, and decoding of the probability values of those symbols is
skipped. The `Effective_Prob` of all skipped symbols is also `0`.
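
The repeat-count rule above can be sketched as a short loop. The name `read_symbol_repeat_count` is hypothetical, and `read_bits(n)` is again an assumed callable that returns the next `n` bits as an integer.

```python
def read_symbol_repeat_count(read_bits):
    """Sum 2-bit repeat counts until one other than 3 is read."""
    total = 0
    while True:
        repeat = read_bits(2)
        total += repeat
        if repeat != 3:
            # A count of 0, 1 or 2 terminates the sequence.
            return total
```

For example, the repeat counts `3, 3, 1` give a `Symbol_Repeat_Count` of 7, so the next 7 symbols have probability zero and are not read from the bitstream.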

If, after decoding a probability, the sum of all decoded `Effective_Prob`
equals `1 << Accuracy_Log`, then decoding of symbols is completed and the
probability of all remaining symbols is `0`.

After decoding probabilities, further decoding resumes at the byte after
the last byte used or partially used by probability decoding. If the last
byte was partially used, then any unused bits are ignored.
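
Putting the steps together, a decoder loop might look like the sketch below. It is not the reference implementation. Several details are assumptions made for illustration: the `BitReader` class, the least-significant-bit-first order within each byte, the `start_bit` and `max_symbols` parameters, and the use of `-1` to mark a "less than 1" probability.

```python
class BitReader:
    """Forward bit reader (assumption: bits are taken LSB-first in each byte)."""

    def __init__(self, data, bit_pos=0):
        self.data = data
        self.bit_pos = bit_pos

    def read(self, n):
        value = 0
        for i in range(n):
            byte_index, bit_index = divmod(self.bit_pos, 8)
            if byte_index >= len(self.data):
                raise ValueError("ran out of input")
            value |= ((self.data[byte_index] >> bit_index) & 1) << i
            self.bit_pos += 1
        return value


def decode_probabilities(data, accuracy_log, start_bit=0, max_symbols=256):
    """Return (probabilities, index of the next byte after the table)."""
    reader = BitReader(data, start_bit)
    total = 1 << accuracy_log
    cumulative = 0
    probs = []
    while cumulative < total:
        if len(probs) >= max_symbols:
            raise ValueError("too many symbols")
        max_value = total - cumulative + 1
        base_size = max_value.bit_length() - 1
        large_range = max_value - (1 << base_size) + 1
        large_start = (1 << base_size) - large_range
        value = reader.read(base_size)
        if value >= large_start and reader.read(1) == 1:
            value += large_range
        if value == 0:
            probs.append(-1)      # "less than 1"; Effective_Prob is 1
            cumulative += 1
            continue
        prob = value - 1
        probs.append(prob)
        cumulative += prob
        if prob == 0:
            # Zero probability: read 2-bit repeat counts until one is not 3.
            while True:
                repeat = reader.read(2)
                probs.extend([0] * repeat)
                if repeat != 3:
                    break
    # Decoding resumes at the byte after the last (partially) used byte.
    return probs, (reader.bit_pos + 7) // 8
```

Because `Prob_Value` can never exceed `Max_Prob_Value`, the cumulative total in this sketch cannot go past `1 << Accuracy_Log`, so the loop ends exactly when the total is reached.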

When a symbol has a __probability__ of `zero`,
it is followed by a 2-bit repeat flag.
This repeat flag tells how many zero probabilities follow the current one.
It provides a number ranging from 0 to 3.
If it is a 3, another 2-bit repeat flag follows, and so on.

When the last symbol brings the cumulative total to `1 << Accuracy_Log`,
decoding is complete.
If the last symbol makes the cumulative total exceed `1 << Accuracy_Log`,
the distribution is considered corrupted.

Then the decoder can tell how many bytes were used in this process,
and how many symbols are present.
The bitstream consumes a whole number of bytes.
Any remaining bits within the last byte are simply unused.
[next section]:#from-normalized-distribution-to-decoding-tables

#### From normalized distribution to decoding tables
