Redo FSE Table Description section #3810

Closed
wants to merge 2 commits

Conversation

@elasota (Contributor) commented Nov 4, 2023

This rewrites the FSE Table Description section in a more descriptive and exhaustive style that I think will be clearer to implementers.

Main changes:

  • Removed the examples; instead, the section specifies exactly how all intermediate values are calculated and used.
  • Removed the part about the decoder knowing the total byte consumption. (The decoder doesn't need to track that information.)
  • Bit reads are described as only occurring if the bits are needed. I think this makes it clearer that this step doesn't require any lookahead.
  • Removed the failure condition for (1 << Accuracy_Log) being exceeded. I believe this condition is impossible, because the largest decodable probability value is the one that brings the cumulative probability to exactly (1 << Accuracy_Log), at which point the loop terminates. (See the sketch after this list.)
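
For illustration, here is a rough sketch in C of the decoding loop as the rewritten section describes it. This is just a sketch, not proposed spec text: read_bits is a hypothetical LSB-first bit reader, log2sup is as defined in the spec, and error handling, Accuracy_Log validation, and the final byte-alignment handling are all omitted.

```c
#include <stdint.h>

/* Hypothetical helper: returns the next n bits of the table description,
 * least-significant bit first. How it is implemented is not the point here. */
extern unsigned read_bits(unsigned n);

/* log2sup(N): the smallest T such that (1 << T) > N. */
static unsigned log2sup(unsigned n) {
    unsigned t = 0;
    while ((1u << t) <= n) t++;
    return t;
}

/* Reads one FSE table description into probs[]; returns the number of symbols read. */
unsigned read_fse_table_description(int16_t probs[], unsigned max_symbols) {
    unsigned accuracy_log = read_bits(4) + 5;      /* first 4 bits, plus 5 */
    unsigned table_size   = 1u << accuracy_log;
    unsigned cumulative   = 0;                     /* probability points assigned so far (a -1 counts as 1) */
    unsigned symbol       = 0;

    while (cumulative < table_size && symbol < max_symbols) {
        unsigned max_value   = table_size + 1 - cumulative;          /* largest decodable value */
        unsigned full_bits   = log2sup(max_value);                   /* bits needed for a "large" value */
        unsigned small_count = ((1u << full_bits) - 1) - max_value;  /* values that fit in one bit fewer */

        unsigned value = read_bits(full_bits - 1);
        if (value >= small_count) {                                  /* only now is the extra bit needed */
            value += read_bits(1) << (full_bits - 1);
            if (value >= (1u << (full_bits - 1)))
                value -= small_count;                                /* fold the duplicated low range */
        }

        int prob = (int)value - 1;                                   /* -1 encodes a "less than 1" probability */
        probs[symbol++] = (int16_t)prob;
        cumulative += (prob < 0) ? 1u : (unsigned)prob;

        if (prob == 0) {                                             /* a zero is followed by 2-bit repeat flags */
            unsigned repeat;
            do {
                repeat = read_bits(2);
                for (unsigned i = 0; i < repeat && symbol < max_symbols; i++)
                    probs[symbol++] = 0;
            } while (repeat == 3);
        }
    }
    return symbol;  /* the loop ends exactly when cumulative == (1 << Accuracy_Log) */
}
```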

@Cyan4973 (Contributor) commented Mar 3, 2024

Looking into this now.

While adding a more detailed explanation of the decoding process is welcome and a great idea, I'm not sure I understand how removing the examples helps.

Also: this PR is now in conflict with dev, likely as a consequence of accepting another spec change to the same or a neighboring paragraph.

@elasota (Contributor, Author) commented Mar 15, 2024

I will add the examples back and fix the merge conflict this weekend.

@elasota force-pushed the fmt-clarifications branch from da4a607 to 01c97ab on March 22, 2024 at 23:18
@elasota (Contributor, Author) commented Mar 22, 2024

I've updated this to add back the examples, but rephrased them to use the nomenclature in the new description.

I've also updated the upper-bound description for offset codes. Rejecting out-of-bounds values is well defined for Huffman weights, literal lengths, and match lengths, since those have fixed upper bounds, but the upper bound for offset codes is decoder-defined. So I've added some verbiage stating that decoders are allowed to reject any stream that encodes a non-zero probability for a value larger than their limit, and indicated that encoders should not include non-zero probabilities for an offset code larger than the largest offset code present in the stream.
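
To illustrate the intent, the decoder-side rule amounts to something like the check below. This is only a sketch: MAX_SUPPORTED_OFFSET_CODE and the function name are made up for illustration, and the value 31 is just an example limit, not mandated by anything.

```c
#include <stdint.h>

/* Example decoder-defined limit; the spec does not mandate a particular value. */
#define MAX_SUPPORTED_OFFSET_CODE 31

/* A decoder may reject an offset-code table that assigns a non-zero probability
 * (including -1) to a code larger than the largest one it supports. */
int offset_table_is_acceptable(const int16_t probs[], unsigned num_symbols) {
    for (unsigned sym = MAX_SUPPORTED_OFFSET_CODE + 1; sym < num_symbols; sym++) {
        if (probs[sym] != 0)
            return 0;  /* reject: an unsupported offset code carries probability */
    }
    return 1;
}
```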

@Cyan4973 (Contributor) commented Apr 1, 2024

I've read the newly proposed version, and while I have no error to report, I unfortunately also can't state that the new version is an obvious improvement (i.e. clearer) compared to the older one. They both look difficult to read to me.

It's unfortunate that the FSE table allocation process is so complex to describe.
It's certainly not an easy task, and there might be no great solution.

I understand that the new version also contains some clarifications which are mildly different from the original version. There might be some merit here, but I can't really separate this part from the rest of the PR.

If you believe some clarifications are worthwhile on their own, it might be easier to review them one by one.
You might be right about several of them, but it's also important to properly distinguish the format specification from the implementation details, as these details may change and must be allowed to change in the future. It's a subtle distinction, which deserves some dedicated attention.

@elasota (Contributor, Author) commented Apr 1, 2024

> You might be right about several of them, but it's also important to properly distinguish the format specification from the implementation details, as these details may change and must be allowed to change in the future. It's a subtle distinction, which deserves some dedicated attention.

I'm not sure what this part is in reference to, but I have been trying to avoid bringing implementation details into this.

Normally the expectation is that if a spec describes an algorithm, an implementation is free to implement any equivalent algorithm. So whether you track cumulative probability or remaining probability, for instance, isn't really important, and a lot of the intermediates could be described in different ways; it doesn't really matter as long as it produces the same results.

What I am mainly trying to improve, for example, is this type of description:

> 255-157 = 98 values are remaining in an 8-bits field.

That requires some deduction to tell that the 255 is supposed to be (1<<8)-1 and that the 8 is supposed to come from the log2sup call above (even though 8 also happens to be the Accuracy_Log in the example). I think it is better to describe exactly what each value is supposed to represent and how to compute it, and to state all of the necessary calculations instead of implying them.
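
Spelled out, the numbers in that sentence come from the following calculations. This is just a restatement of where the example's values come from, with log2sup defined as the spec intends; nothing here is proposed spec text.

```c
#include <stdio.h>

/* log2sup(N): the smallest T such that (1 << T) > N. */
static unsigned log2sup(unsigned n) {
    unsigned t = 0;
    while ((1u << t) <= n) t++;
    return t;
}

int main(void) {
    /* In the spec's example, the largest value the decoder may read at this point is 157. */
    unsigned max_value  = 157;
    unsigned field_bits = log2sup(max_value);      /* smallest T with (1 << T) > 157, i.e. 8 */
    unsigned field_max  = (1u << field_bits) - 1;  /* the 255: largest value an 8-bit field holds */
    unsigned remaining  = field_max - max_value;   /* 255 - 157 = 98 */

    printf("%u-bit field, %u - %u = %u values remaining\n",
           field_bits, field_max, max_value, remaining);
    return 0;
}
```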

Anyway, I will split this up and close out this PR and open some new ones.

@Cyan4973 (Contributor) commented Oct 1, 2024

Several specification updates have been integrated since this PR was first submitted.
I presume all relevant ideas have already been extracted and merged at this point.
As a consequence, I guess it's time to close this PR.

@Cyan4973 closed this on Dec 9, 2024