@kuozhang brought up in #6101 that FP8 with TP should `all_reduce` a global amax history.
However, based on my reading of the code that creates the amax history, it appears to only create and update local scaling factors, with `num_scale=1` apparently meaning one factor over all features. This seems equivalent to computing block-wise amax using `tp_size` blocks, as in QLoRA, and should be more accurate.
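To illustrate the equivalence claimed above, here is a minimal NumPy sketch (my own, not the project's code) of why tracking amax locally on each TP shard amounts to block-wise amax with `tp_size` blocks:

```python
import numpy as np

# Hypothetical illustration: under tensor parallelism each rank sees only its
# shard of the tensor. Tracking amax locally (no all_reduce) yields one scale
# per shard -- i.e. block-wise amax with tp_size blocks along the split dim.
np.random.seed(0)
tp_size = 4
full = np.random.randn(8, 16).astype(np.float32)
shards = np.split(full, tp_size, axis=-1)  # column-parallel split

local_amax = [float(np.abs(s).max()) for s in shards]  # one amax per rank
global_amax = max(local_amax)  # what an all_reduce(MAX) would produce

# Each local amax lower-bounds the global one, so per-shard scales waste
# less of the FP8 dynamic range on shards with smaller magnitudes.
assert all(a <= global_amax for a in local_amax)
```

The same comparison is why block-wise schemes tend to be more accurate: the global scale is determined by the single largest value anywhere in the tensor.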
In any case, I feel NVIDIA's method for tracking amax stats is quite coarse: it only tracks the amax over all features during a history window, and precision is not compared against other methods. In the future we could test:

- Computing block-wise amax instead of a single amax over all features.
- Other amax-history tracking methods, such as an exponential moving average.
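As a concrete starting point, here is a hedged sketch (names and parameters are my own, not Transformer Engine's API) contrasting the rolling-window max over an amax history with an exponential moving average:

```python
import numpy as np

def history_amax(history, new_amax, window=16):
    """Append the new amax and return the max over the last `window` steps
    (roughly the delayed-scaling style of tracking)."""
    history.append(new_amax)
    del history[:-window]  # keep only the most recent `window` entries
    return max(history)

def ema_amax(prev, new_amax, decay=0.9):
    """Exponential moving average of amax: a single outlier decays
    geometrically instead of dominating the scale for a full window."""
    return decay * prev + (1.0 - decay) * new_amax

np.random.seed(1)
hist, ema = [], 0.0
for step in range(100):
    a = float(np.abs(np.random.randn(256)).max())  # per-step amax
    h = history_amax(hist, a)
    ema = ema_amax(ema, a)
```

The trade-off to measure: the window max is conservative (no overflow as long as the window covers the true peak), while the EMA tracks the typical magnitude more tightly but may underestimate rare spikes.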
Feel free to submit PRs or correct my opinions if the community/team has time :)
Self-service
I'd be willing to do some initial work on this proposal myself.