
[PROPOSAL]: FP8 with block-wise amax #6105

Edenzzzz opened this issue Oct 28, 2024 · 0 comments
Labels: enhancement (New feature or request)

Edenzzzz commented Oct 28, 2024

Proposal

@kuozhang brought up in #6101 that FP8 with TP should all_reduce a global amax history.
However, based on my understanding of the code for creating the amax history, it seems to only create and update local scaling factors, with num_scale=1 meaning one factor over all features? This seems equivalent to computing block-wise amax using tp_size blocks, as in QLoRA, and should be more accurate.
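To make the distinction concrete, here is a minimal sketch in plain PyTorch (not ColossalAI's actual scaling code; the function names and the choice of num_blocks are illustrative) contrasting a single amax over all features with block-wise amax using tp_size blocks:

```python
import torch

def per_tensor_amax(x: torch.Tensor) -> torch.Tensor:
    # num_scale=1 case: one amax (hence one scaling factor) over all features.
    return x.abs().max()

def block_wise_amax(x: torch.Tensor, num_blocks: int) -> torch.Tensor:
    # One amax per block of features, e.g. num_blocks = tp_size.
    # Assumes the last dimension is divisible by num_blocks.
    blocks = x.reshape(*x.shape[:-1], num_blocks, -1)
    return blocks.abs().amax(dim=-1)

x = torch.randn(4, 16)
print(per_tensor_amax(x))     # scalar: one amax for the whole tensor
print(block_wise_amax(x, 4))  # shape (4, 4): one amax per block of 4 features
```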
In any case, I feel NVIDIA's method for tracking amax stats is quite coarse: they only track the amax over all features during a history window, and don't test precision against other methods. In the future we could test:

  1. Computing block-wise amax instead of amax over all features.
  2. Other amax history tracking methods, such as an exponential moving average (a rough sketch follows this list).

Feel free to submit PRs or correct my opinions if the community/team has time :)
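As a starting point for item 2, here is a rough sketch of EMA-based amax tracking and the scale derived from it. The function names and decay value are hypothetical, not part of any existing API; 448 is the largest value representable in FP8 E4M3.

```python
import torch

def update_amax_ema(amax_ema: torch.Tensor, new_amax: torch.Tensor,
                    decay: float = 0.9) -> torch.Tensor:
    # Smooth the amax with an exponential moving average instead of
    # taking the max over a fixed-length history window.
    return decay * amax_ema + (1.0 - decay) * new_amax

def fp8_scale_from_amax(amax: torch.Tensor, fp8_max: float = 448.0) -> torch.Tensor:
    # Derive the FP8 scaling factor from the tracked amax;
    # clamp avoids division by zero for all-zero tensors.
    return fp8_max / torch.clamp(amax, min=1e-12)

# Usage: start from the current amax and keep updating per step.
amax_ema = torch.tensor(1.0)
for step_amax in (torch.tensor(2.0), torch.tensor(0.5)):
    amax_ema = update_amax_ema(amax_ema, step_amax)
scale = fp8_scale_from_amax(amax_ema)
```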

Self-service

  • I'd be willing to do some initial work on this proposal myself.