gemm fp8 e4m3 #185

Open
wants to merge 31 commits into
base: main
bb89933
gemm fp8 e4m3
AndreSlavescu Aug 31, 2024
60f7ffd
update to benchmark
AndreSlavescu Aug 31, 2024
e11c22b
faster fwd performance with tl.multiple_of
AndreSlavescu Aug 31, 2024
e68f7f1
add stricter check for compute capability + exception handling
AndreSlavescu Aug 31, 2024
fafdfbe
Merge branch 'main' into matmulfp8
AndreSlavescu Aug 31, 2024
91bf3dd
perf improvement
AndreSlavescu Sep 3, 2024
9d467bf
remove discrete functional api
AndreSlavescu Sep 3, 2024
8b45800
make compute capability check a decorator
AndreSlavescu Sep 3, 2024
2319fc7
format
AndreSlavescu Sep 3, 2024
7418433
implement backward kernel as well
AndreSlavescu Sep 3, 2024
c39a7aa
add more benchmarks + diff utils
AndreSlavescu Sep 3, 2024
f8829e5
Merge branch 'main' of https://github.com/AndreSlavescu/Liger-Kernel …
AndreSlavescu Sep 4, 2024
032c4d9
update utils to include mma_v3 for H100
AndreSlavescu Sep 5, 2024
464cdd2
Merge branch 'main' into matmulfp8
lancerts Sep 6, 2024
c8dba40
update test.
AndreSlavescu Sep 7, 2024
ce2aee5
Merge branch 'matmulfp8' of https://github.com/AndreSlavescu/Liger-Ke…
AndreSlavescu Sep 7, 2024
cedf3de
Merge branch 'main' into matmulfp8
AndreSlavescu Sep 7, 2024
bb2f725
format
AndreSlavescu Sep 7, 2024
b3195da
Merge branch 'main' into matmulfp8
lancerts Sep 8, 2024
744642b
Merge branch 'main' into matmulfp8
lancerts Sep 8, 2024
7a8043a
Merge branch 'main' into matmulfp8
AndreSlavescu Sep 10, 2024
9e60b0a
compute types
AndreSlavescu Sep 12, 2024
98b7abf
modify benchmark to be up to date
AndreSlavescu Sep 12, 2024
a709616
format
AndreSlavescu Sep 12, 2024
c9cbc3a
Merge branch 'main' into matmulfp8
lancerts Sep 12, 2024
569b4eb
fix mem bounds
AndreSlavescu Sep 12, 2024
acc228b
Merge branch 'matmulfp8' of https://github.com/AndreSlavescu/Liger-Ke…
AndreSlavescu Sep 12, 2024
d31244a
docstring for fp8 gemm design
AndreSlavescu Sep 12, 2024
0f36098
format
AndreSlavescu Sep 12, 2024
edc4ebc
remove old benchmark format
AndreSlavescu Sep 12, 2024
618c858
Merge branch 'main' into matmulfp8
AndreSlavescu Sep 15, 2024
72 changes: 72 additions & 0 deletions benchmark/data/all_benchmark_data.csv
@@ -445,3 +445,75 @@ kl_div,torch,full,speed,ms,V,vocab size,16384,11.124671936035156,11.122162818908
kl_div,torch,full,speed,ms,V,vocab size,32768,23.052032470703125,23.050334930419922,23.052589416503906,"{""B"": 8, ""T"": 2048}",NVIDIA H100 PCIe,2024-09-04 12:59:48,0.2.1
kl_div,torch,full,speed,ms,V,vocab size,65536,46.063167572021484,46.05990219116211,46.06643295288086,"{""B"": 8, ""T"": 2048}",NVIDIA H100 PCIe,2024-09-04 12:59:48,0.2.1
kl_div,torch,full,speed,ms,V,vocab size,131072,92.06393432617188,92.06393432617188,92.06393432617188,"{""B"": 8, ""T"": 2048}",NVIDIA H100 PCIe,2024-09-04 12:59:48,0.2.1
gemm_split_k_fp8,liger,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 64, 64)",0.011264000087976456,0.011264000087976456,0.01228800043463707,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:53,0.2.1
gemm_split_k_fp8,liger,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 256, 256)",0.014336000196635723,0.013311999849975109,0.014336000196635723,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:53,0.2.1
gemm_split_k_fp8,liger,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 512, 512)",0.01740800030529499,0.01740800030529499,0.018432000651955605,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:53,0.2.1
gemm_split_k_fp8,liger,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 1024, 1024)",0.03481600061058998,0.03481600061058998,0.03686400130391121,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:53,0.2.1
gemm_split_k_fp8,liger,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 128, 64)",0.011264000087976456,0.011264000087976456,0.01228800043463707,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:53,0.2.1
gemm_split_k_fp8,liger,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 512, 256)",0.015359999611973763,0.015359999611973763,0.016383999958634377,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:53,0.2.1
gemm_split_k_fp8,liger,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 1024, 512)",0.021503999829292297,0.021503999829292297,0.02252800017595291,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:53,0.2.1
gemm_split_k_fp8,liger,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 2048, 1024)",0.048128001391887665,0.048128001391887665,0.04915200173854828,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:53,0.2.1
gemm_split_k_fp8,torch,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 64, 64)",0.009216000325977802,0.009216000325977802,0.010239999741315842,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:54,0.2.1
gemm_split_k_fp8,torch,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 256, 256)",0.011247999966144562,0.010239999741315842,0.011264000087976456,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:54,0.2.1
gemm_split_k_fp8,torch,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 512, 512)",0.021503999829292297,0.020479999482631683,0.021503999829292297,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:54,0.2.1
gemm_split_k_fp8,torch,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 1024, 1024)",0.04915200173854828,0.04915200173854828,0.050175998359918594,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:54,0.2.1
gemm_split_k_fp8,torch,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 128, 64)",0.009216000325977802,0.009216000325977802,0.010239999741315842,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:54,0.2.1
gemm_split_k_fp8,torch,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 512, 256)",0.013344000093638897,0.013311999849975109,0.014336000196635723,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:54,0.2.1
gemm_split_k_fp8,torch,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 1024, 512)",0.037935998290777206,0.03788800165057182,0.03891199827194214,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:54,0.2.1
gemm_split_k_fp8,torch,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 2048, 1024)",0.11673600226640701,0.11468800157308578,0.11776000261306763,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:54,0.2.1
gemm_split_k_fp8,torch_compile,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 64, 64)",0.010239999741315842,0.010239999741315842,0.010239999741315842,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:58,0.2.1
gemm_split_k_fp8,torch_compile,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 256, 256)",0.028672000393271446,0.02457600086927414,0.030719999223947525,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:58,0.2.1
gemm_split_k_fp8,torch_compile,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 512, 512)",0.03276799991726875,0.03174399957060814,0.03276799991726875,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:58,0.2.1
gemm_split_k_fp8,torch_compile,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 1024, 1024)",0.06451199948787689,0.06348799914121628,0.07065600156784058,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:58,0.2.1
gemm_split_k_fp8,torch_compile,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 128, 64)",0.026623999699950218,0.020479999482631683,0.026623999699950218,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:58,0.2.1
gemm_split_k_fp8,torch_compile,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 512, 256)",0.013439999893307686,0.013311999849975109,0.014336000196635723,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:58,0.2.1
gemm_split_k_fp8,torch_compile,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 1024, 512)",0.050175998359918594,0.04915200173854828,0.055296000093221664,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:58,0.2.1
gemm_split_k_fp8,torch_compile,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 2048, 1024)",0.13209599256515503,0.12697599828243256,0.1372160017490387,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:58,0.2.1
gemm_split_k_fp8,liger,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 64, 64)",0.14745600521564484,0.14336000382900238,0.15769599378108978,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:59,0.2.1
gemm_split_k_fp8,liger,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 256, 256)",0.14899200201034546,0.1454080045223236,0.15769599378108978,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:59,0.2.1
gemm_split_k_fp8,liger,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 512, 512)",0.1525759994983673,0.14950400590896606,0.16486400365829468,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:59,0.2.1
gemm_split_k_fp8,liger,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 1024, 1024)",0.16486400365829468,0.16281600296497345,0.17203199863433838,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:59,0.2.1
gemm_split_k_fp8,liger,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 128, 64)",0.14745600521564484,0.1443839967250824,0.158720001578331,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:59,0.2.1
gemm_split_k_fp8,liger,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 512, 256)",0.1515520066022873,0.14847999811172485,0.16256000101566315,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:59,0.2.1
gemm_split_k_fp8,liger,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 1024, 512)",0.16383999586105347,0.159743994474411,0.1726464033126831,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:59,0.2.1
gemm_split_k_fp8,liger,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 2048, 1024)",0.2252800017595291,0.22220799326896667,0.22835199534893036,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:59,0.2.1
gemm_split_k_fp8,torch,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 64, 64)",0.18943999707698822,0.1884160041809082,0.19968000054359436,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:00,0.2.1
gemm_split_k_fp8,torch,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 256, 256)",0.3901439905166626,0.33689600229263306,0.40816640853881836,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:00,0.2.1
gemm_split_k_fp8,torch,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 512, 512)",0.3906559944152832,0.3840000033378601,0.4065279960632324,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:00,0.2.1
gemm_split_k_fp8,torch,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 1024, 1024)",0.42905598878860474,0.42188799381256104,0.4433920085430145,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:00,0.2.1
gemm_split_k_fp8,torch,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 128, 64)",0.38604798913002014,0.0741376057267189,0.40816640853881836,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:00,0.2.1
gemm_split_k_fp8,torch,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 512, 256)",0.15462400019168854,0.04403200000524521,0.35696640610694885,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:00,0.2.1
gemm_split_k_fp8,torch,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 1024, 512)",0.10547199845314026,0.10444799810647964,0.10547199845314026,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:00,0.2.1
gemm_split_k_fp8,torch,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 2048, 1024)",0.3840000033378601,0.38092800974845886,0.3891200125217438,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:00,0.2.1
gemm_split_k_fp8,torch_compile,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 64, 64)",0.03174399957060814,0.03174399957060814,2.2890496253967285,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,torch_compile,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 256, 256)",0.03788800165057182,0.0377856008708477,2.2958080768585205,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,torch_compile,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 512, 512)",0.06252799928188324,0.062463998794555664,2.322432041168213,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,torch_compile,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 1024, 1024)",0.2181120067834854,0.2170879989862442,0.2252800017595291,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,torch_compile,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 128, 64)",0.030719999223947525,0.030719999223947525,2.287616014480591,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,torch_compile,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 512, 256)",0.043007999658584595,0.042080000042915344,0.043007999658584595,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,torch_compile,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 1024, 512)",0.10547199845314026,0.10444799810647964,2.362368106842041,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,torch_compile,full,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 2048, 1024)",0.3901280164718628,0.38618239760398865,0.3936256170272827,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,liger,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 64, 64)",16.3125,16.3125,16.3125,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,liger,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 256, 256)",17.25,17.25,17.25,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,liger,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 512, 512)",20.25,20.25,20.25,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,liger,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 1024, 1024)",32.25,32.25,32.25,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,liger,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 128, 64)",16.36328125,16.36328125,16.36328125,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,liger,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 512, 256)",18.0625,18.0625,18.0625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,liger,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 1024, 512)",23.5,23.5,23.5,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,liger,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 2048, 1024)",45.25,45.25,45.25,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,torch,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 64, 64)",16.3837890625,16.3837890625,16.3837890625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,torch,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 256, 256)",18.3759765625,18.3759765625,18.3759765625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,torch,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 512, 512)",24.7509765625,24.7509765625,24.7509765625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,torch,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 1024, 1024)",50.2509765625,50.2509765625,50.2509765625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,torch,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 128, 64)",16.4853515625,16.4853515625,16.4853515625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,torch,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 512, 256)",20.0009765625,20.0009765625,20.0009765625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,torch,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 1024, 512)",31.2509765625,31.2509765625,31.2509765625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,torch,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 2048, 1024)",76.2509765625,76.2509765625,76.2509765625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,torch_compile,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 64, 64)",16.3681640625,16.3681640625,16.3681640625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:05,0.2.1
gemm_split_k_fp8,torch_compile,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 256, 256)",18.1259765625,18.1259765625,18.1259765625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:05,0.2.1
gemm_split_k_fp8,torch_compile,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 512, 512)",23.7509765625,23.7509765625,23.7509765625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:05,0.2.1
gemm_split_k_fp8,torch_compile,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 1024, 1024)",46.2509765625,46.2509765625,46.2509765625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:05,0.2.1
gemm_split_k_fp8,torch_compile,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(64, 128, 64)",16.4697265625,16.4697265625,16.4697265625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:05,0.2.1
gemm_split_k_fp8,torch_compile,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(256, 512, 256)",19.7509765625,19.7509765625,19.7509765625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:05,0.2.1
gemm_split_k_fp8,torch_compile,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(512, 1024, 512)",30.2509765625,30.2509765625,30.2509765625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:05,0.2.1
gemm_split_k_fp8,torch_compile,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 2048, 1024)",72.2509765625,72.2509765625,72.2509765625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:05,0.2.1
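
For anyone reading the added rows, here is a minimal sketch of how they could be compared across providers. The CSV header row is not part of this diff, so the column names in `COLUMNS` are an assumption, as is the reading of the three numeric fields as median, 20th and 80th percentile. The helper `ratio` is hypothetical and not part of the PR. The four sample rows are copied verbatim from the diff above.

```python
import csv
import io

# Assumed column order: the header row is not shown in this diff.
COLUMNS = [
    "kernel_name", "kernel_provider", "kernel_operation_mode",
    "metric_name", "metric_unit", "x_name", "x_label", "x_value",
    "y_value_50", "y_value_20", "y_value_80",
    "extra_benchmark_config_str", "gpu_name", "timestamp", "liger_version",
]

# Four rows copied from the diff, all for size (1024, 2048, 1024).
SAMPLE = '''gemm_split_k_fp8,liger,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 2048, 1024)",0.048128001391887665,0.048128001391887665,0.04915200173854828,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:53,0.2.1
gemm_split_k_fp8,torch,forward,speed,ms,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 2048, 1024)",0.11673600226640701,0.11468800157308578,0.11776000261306763,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:03:54,0.2.1
gemm_split_k_fp8,liger,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 2048, 1024)",45.25,45.25,45.25,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
gemm_split_k_fp8,torch,full,memory,MB,Matrix Size (m x k x n),Matrix Size (m x k x n),"(1024, 2048, 1024)",76.2509765625,76.2509765625,76.2509765625,"{""dtype"": ""torch.float32""}",NVIDIA GeForce RTX 4090,2024-09-12 03:04:02,0.2.1
'''


def load_rows(text):
    """Parse header-less benchmark rows into dicts keyed by COLUMNS."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(COLUMNS, row)) for row in reader if row]


def ratio(rows, metric, mode, size, baseline="torch", candidate="liger"):
    """Return baseline / candidate for the assumed median column (y_value_50)."""
    def median(provider):
        for r in rows:
            key = (r["kernel_provider"], r["metric_name"],
                   r["kernel_operation_mode"], r["x_value"])
            if key == (provider, metric, mode, size):
                return float(r["y_value_50"])
        raise KeyError((provider, metric, mode, size))
    return median(baseline) / median(candidate)


rows = load_rows(SAMPLE)
size = "(1024, 2048, 1024)"
print("forward time, torch / liger:", ratio(rows, "speed", "forward", size))
print("full memory, torch / liger:", ratio(rows, "memory", "full", size))
```

A ratio above 1 means the `liger` row is lower (faster, or using less memory) than the `torch` row for that size. These figures come from a single RTX 4090 run with timings in the tens of microseconds, so small differences are best read with caution.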