From NVIDIA Megatron-LM for visibility #18

Open: wants to merge 3,242 commits into base: multi-query-attention
08e245d
ADLR/megatron-lm!1968 - Optimize broadcasted data during parallel load
mikolajblaz Sep 5, 2024
5b73de7
Merge branch 'mblaz/fast-load-broadcast' into 'main'
ko3n1g Sep 5, 2024
6701e08
ADLR/megatron-lm!1951 - Fix description of distributed optimizer work…
deepakn94 Sep 5, 2024
3396356
Merge branch 'dnarayanan/distributed_optimizer_readme_fixes' into 'main'
ko3n1g Sep 5, 2024
033d8b0
ADLR/megatron-lm!1669 - Add native-fp8
kunlunl Sep 5, 2024
01945b9
Merge branch 'kunlunl/native_fp8_2' into 'main'
ko3n1g Sep 5, 2024
f0161d2
ADLR/megatron-lm!2039 - Restore the actual PyT 2.4 fix from !1970
mikolajblaz Sep 5, 2024
7580748
Merge branch 'mblaz/dist-ckpt-pyt2.4' into 'main'
jaredcasper Sep 5, 2024
a61150d
ADLR/megatron-lm!2044 - tests: Skip flaky mamba test
ko3n1g Sep 5, 2024
2169674
Merge branch 'ko3n1g/tests/disable-mamba-test' into 'main'
ko3n1g Sep 5, 2024
cb979cf
ADLR/megatron-lm!2048 - ci: Bump reference sha
ko3n1g Sep 5, 2024
38873f5
Merge branch 'ko3n1g/ci/bump-sha' into 'main'
ko3n1g Sep 5, 2024
7ef8b3f
ADLR/megatron-lm!2029 - Add model config files for Mixtral-8x7B and M…
xxuwenc Sep 5, 2024
5ec1e29
Merge branch 'xuwenc/release_moe_benchmarking' into 'main'
ko3n1g Sep 5, 2024
fa8bb59
ADLR/megatron-lm!1881 - Uneven Pipeline Parallelism
wdykas Sep 5, 2024
60d03fd
Merge branch 'uneven-pipeline' into 'main'
ericharper Sep 5, 2024
86df799
ADLR/megatron-lm!1912 - Add support for pytorch tensorboard profiler
jon-barker Sep 5, 2024
cb4ce23
Merge branch 'jbarker/pt-profiler' into 'main'
jaredcasper Sep 5, 2024
dd876ba
ADLR/megatron-lm!2050 - ci: Pass `LOAD_PATH` into training
ko3n1g Sep 5, 2024
4a756e2
Merge branch 'ko3n1g/tests/release-training-load-path' into 'main'
ko3n1g Sep 5, 2024
8f19bcd
ADLR/megatron-lm!1958 - Update check_param_hashes_across_dp_replicas …
akoumpa Sep 6, 2024
732a689
Merge branch 'akoumparouli/check_param_hashes_across_dp_replicas_fix'…
ericharper Sep 6, 2024
43ee4b8
ADLR/megatron-lm!1796 - Per layer cudagraph support for GPT training …
jiemingz Sep 6, 2024
9366f3c
Merge branch 'auto_cudagraph' into 'main'
ko3n1g Sep 6, 2024
8499f26
ADLR/megatron-lm!2053 - Update model config files for Mixtral-8x7B an…
xxuwenc Sep 6, 2024
3728c67
Merge branch 'xuwenc/release_moe_benchmarking' into 'main'
ko3n1g Sep 6, 2024
98abe37
ADLR/megatron-lm!1971 - Revert "ADLR/megatron-lm!1747 - Use TP-CP gro…
erhoo82 Sep 6, 2024
a2b6ee4
Merge branch 'amax_red' into 'main'
ko3n1g Sep 6, 2024
8f331e8
ADLR/megatron-lm!1089 - FP8 support for MoE with conservative recipe
Victarry Sep 6, 2024
cc16182
Merge branch 'denliu/fp8_moe' into 'main'
ko3n1g Sep 6, 2024
9a0e78d
ADLR/megatron-lm!2042 - Fix `zarr` deprecation notice
mikolajblaz Sep 6, 2024
7a113e7
Merge branch 'mblaz/fix-deprecation-notice' into 'main'
ericharper Sep 6, 2024
3fb5c51
ADLR/megatron-lm!1859 - Skierat/fully parallel local
skierat Sep 7, 2024
8252432
Merge branch 'skierat/fully-parallel-local' into 'main'
ericharper Sep 7, 2024
6c3ada7
ADLR/megatron-lm!1630 - Runtime upcycling support for MoE
RayWang96 Sep 7, 2024
f5667db
Merge branch 'runtime-upcycling' into 'main'
jaredcasper Sep 7, 2024
80e3863
ADLR/megatron-lm!2052 - updates import for fault_tolerance package to…
srogawski-nvidia Sep 7, 2024
5019bb4
Merge branch 'nvidia_resiliency_ext' into 'main'
jaredcasper Sep 7, 2024
c14d987
ADLR/megatron-lm!2056 - tests: Move mixtral locations
ko3n1g Sep 7, 2024
79b448a
Merge branch 'ko3n1g/tests/fix-mixtral-tests' into 'main'
ko3n1g Sep 7, 2024
7053e64
ADLR/megatron-lm!2055 - ci: Bump sha
ko3n1g Sep 7, 2024
c2d7e2f
Merge branch 'ko3n1g/ci/bump-sha-2' into 'main'
ko3n1g Sep 7, 2024
759d787
ADLR/megatron-lm!1926 - Adding T5 release test
huvunvidia Sep 7, 2024
6c49616
Merge branch 'huvu/mcore_t5_release_test' into 'main'
ko3n1g Sep 7, 2024
ab5624b
ADLR/megatron-lm!1990 - Mitigate slow loops in set_is_first_minibatch…
jon-barker Sep 7, 2024
cb42680
Merge branch 'jbarker/remove_the_bad_loops' into 'main'
jon-barker Sep 7, 2024
7adc86e
ADLR/megatron-lm!1882 - Fix bug in docstrings in `megatron/core/paral…
deepakn94 Sep 7, 2024
5747146
Merge branch 'dnarayanan-main-patch-45127' into 'main'
ericharper Sep 7, 2024
655a663
ADLR/megatron-lm!1975 - Refactor distributed optimizer communication …
deepakn94 Sep 7, 2024
6ac4db0
Merge branch 'dnarayanan/dist_optimizer_refactor' into 'main'
deepakn94 Sep 7, 2024
8d62160
ADLR/megatron-lm!2046 - ci: Automated cherry-picking
ko3n1g Sep 8, 2024
8b0a9b3
Merge branch 'ko3n1g/ci/maybe-cherry-pick-commit' into 'main'
ko3n1g Sep 8, 2024
56ddf9a
ADLR/megatron-lm!2060 - ci: Bump sha
ko3n1g Sep 8, 2024
8e21350
Merge branch 'bump-sha-3' into 'main'
ko3n1g Sep 8, 2024
a604c95
ADLR/megatron-lm!2061 - ci: Allow skipping unit tests
ko3n1g Sep 8, 2024
46b850f
Merge branch 'ko3n1g/ci/allow-skipping-unittests' into 'main'
ko3n1g Sep 8, 2024
4a47180
ADLR/megatron-lm!2062 - ci: Automate cut-off of release branch
ko3n1g Sep 8, 2024
dccb6df
Merge branch 'ko3n1g/ci/automate-release-branch' into 'main'
ko3n1g Sep 8, 2024
eb7418f
ADLR/megatron-lm!2064 - ci: Fixes for mirroring and cherry picking
ko3n1g Sep 8, 2024
8307fcd
Merge branch 'ko3n1g/ci/fix-mirroring' into 'main'
ko3n1g Sep 8, 2024
0b5bc5e
ADLR/megatron-lm!2066 - ci: Use PAT for mirroring
ko3n1g Sep 9, 2024
27c3737
Merge branch 'ko3n1g/ci/fix-mirroring-2' into 'main'
ko3n1g Sep 9, 2024
6dade5f
ADLR/megatron-lm!2068 - ci: Skip cherry-pick on empty label
ko3n1g Sep 9, 2024
90cd925
Merge branch 'ko3n1g/ci/skip-cherrypicking-on-empty-labels' into 'main'
ko3n1g Sep 9, 2024
bef7771
ADLR/megatron-lm!2051 - Fix lint errors in prepartion for other MRs
tals Sep 10, 2024
d0f5aa9
Merge branch 'tshiri/format_before_real_change' into 'main'
ko3n1g Sep 10, 2024
aae7237
ADLR/megatron-lm!2079 - ci: Repeat unit tests 5 times
ko3n1g Sep 10, 2024
b28d445
Merge branch 'ko3n1g/ci/repeat-unittests' into 'main'
ko3n1g Sep 10, 2024
c290133
ADLR/megatron-lm!2081 - Skip the upcycling UT.
yanring Sep 10, 2024
522d8a3
Merge branch 'zijiey/skip_upcycling_ut' into 'main'
ko3n1g Sep 10, 2024
f03af48
ADLR/megatron-lm!2067 - Update Golden Values for MoE Nightly Tests
yaox12 Sep 10, 2024
bbecd08
Merge branch 'xiny/fix_moe_nightly_test' into 'main'
ko3n1g Sep 10, 2024
6a89bc7
ADLR/megatron-lm!2083 - ci: Cherry-pick into the right project
ko3n1g Sep 10, 2024
3bdae05
Merge branch 'ko3n1g/ci/cherry-pick-project' into 'main'
ko3n1g Sep 10, 2024
e93d566
ADLR/megatron-lm!2084 - expanding pyproject.toml definitions for uv
pstjohn Sep 10, 2024
b6887d3
Merge branch 'pstjohn/pyproject.toml' into 'main'
ko3n1g Sep 10, 2024
1ea3918
ADLR/megatron-lm!1931 - copyedits try 3 : pure doc changes
megnvidia Sep 10, 2024
db0fc33
Merge branch 'edits-rebased' into 'main'
jaredcasper Sep 10, 2024
f218582
ADLR/megatron-lm!2086 - Add Encoder-Decoder Parallelism Documentation
Sep 11, 2024
fe1640a
Merge branch 'mike/add_encoder_doc' into 'main'
ko3n1g Sep 11, 2024
1fa9464
ADLR/megatron-lm!1699 - MoE Shared Expert support
hxbai Sep 11, 2024
fec11a7
Merge branch 'hongxiaob/shared_expert' into 'main'
ko3n1g Sep 11, 2024
6e4e9df
ADLR/megatron-lm!2088 - Add MoE interface tests and move other tests …
yanring Sep 11, 2024
8fc7553
Merge branch 'zijiey/moe_interface_tests' into 'main'
ko3n1g Sep 11, 2024
2130890
ADLR/megatron-lm!2092 - ci: Bump reference sha
ko3n1g Sep 11, 2024
6664dc6
Merge branch 'ko3n1g/ci/bump-sha-3' into 'main'
ko3n1g Sep 11, 2024
32949f2
ADLR/megatron-lm!2093 - ci: Disable broken test
ko3n1g Sep 11, 2024
df1418a
Merge branch 'ko3n1g/ci/disable-broken-test' into 'main'
ko3n1g Sep 11, 2024
f8b7c3f
ADLR/megatron-lm!1985 - Multimodal sequence length optimizations
trintamaki Sep 11, 2024
6151709
Merge branch 'trintamaki/multi-image-multi-tile-dataloader-seq-len' i…
jaredcasper Sep 11, 2024
3005d02
ADLR/megatron-lm!2094 - tests: Disable flaky test
ko3n1g Sep 11, 2024
9ec2337
Merge branch 'ko3n1g/tests/flaky-test-2' into 'main'
ko3n1g Sep 11, 2024
e5fb1fa
ADLR/megatron-lm!2004 - tests: Repeat MRs 5 times
ko3n1g Sep 12, 2024
028b777
Merge branch 'ko3n1g/ci/repeat-mrs' into 'main'
ko3n1g Sep 12, 2024
dcc6634
ADLR/megatron-lm!2091 - Don't pass device_id to torch.distributed.ini…
szmigacz Sep 12, 2024
76f9f48
Merge branch 'no_dist_device_id' into 'main'
ko3n1g Sep 12, 2024
bf7b978
ADLR/megatron-lm!2059 - ci: Add release tests for 0.9
ko3n1g Sep 14, 2024
21924d8
Merge branch 'ko3n1g/ci/release-tests' into 'main'
ko3n1g Sep 14, 2024
e6f1d81
ADLR/megatron-lm!2106 - fix: allow merge request CI for non-protected…
terrykong Sep 17, 2024
6562666
Merge branch 'terryk/ci-can-fail-on-unprotected-targets' into 'main'
ko3n1g Sep 17, 2024
0902af0
ADLR/megatron-lm!2107 - chore: Fix autoformatter for release branches
ko3n1g Sep 17, 2024
72008a0
Merge branch 'ko3n1g/chore/formatting-on-release-branch' into 'main'
ko3n1g Sep 17, 2024
2a8d8af
ADLR/megatron-lm!2104 - Fixing broken links
Sep 17, 2024
3f10ff6
Merge branch 'docFix' into 'main'
Sep 17, 2024
71d8ce7
ADLR/megatron-lm!2072 - Add video handling into multimodal mcore
Sep 17, 2024
0bda578
Merge branch 'add-video-handling' into 'main'
jon-barker Sep 17, 2024
ab7f706
ADLR/megatron-lm!1715 - Enable optional kwargs with CUDA graph
vasunvidia Sep 18, 2024
77b4bfe
Merge branch 'lora_cg' into 'main'
ko3n1g Sep 18, 2024
0cffc6b
ADLR/megatron-lm!2077 - Resolve "Fix TE version in TELinear"
Victarry Sep 18, 2024
461b06c
Merge branch '318-fix-te-version-in-telinear' into 'main'
ericharper Sep 18, 2024
6b78cb1
ADLR/megatron-lm!2112 - Update path to MMMU to use new repos structure
Sep 18, 2024
d350231
Merge branch 'fix_mmmu_mmodal' into 'main'
jon-barker Sep 18, 2024
cedd415
ADLR/megatron-lm!1880 - Removing env variable NVTE_ALLOW_NONDETERMINI…
Sep 18, 2024
6b35ca8
Merge branch 'bertflash' into 'main'
Sep 18, 2024
63be779
ADLR/megatron-lm!2033 - Online eval
trintamaki Sep 19, 2024
835af44
Merge branch 'trintamaki/online-eval' into 'main'
ericharper Sep 19, 2024
2c9bcac
ADLR/megatron-lm!1973 - MMMU multi-image support
trintamaki Sep 19, 2024
905de33
Merge branch 'trintamaki/multi-image-mmmu' into 'main'
jon-barker Sep 19, 2024
5c0697c
ADLR/megatron-lm!2113 - build: Use multi-stage for parallel builds
ko3n1g Sep 20, 2024
c394f78
Merge branch 'ko3n1g/build/pip' into 'main'
ko3n1g Sep 20, 2024
cf596b9
ADLR/megatron-lm!2126 - Only print warning when relevant
deepakn94 Sep 21, 2024
640e62f
Merge branch 'dnarayanan/warning_fix' into 'main'
jaredcasper Sep 21, 2024
3eeb932
ADLR/megatron-lm!2124 - tests: Fix location of megatron
ko3n1g Sep 21, 2024
205f946
Merge branch 'ko3n1g/tests/fix-location-of-megatron' into 'main'
ko3n1g Sep 21, 2024
d210eb0
ADLR/megatron-lm!2127 - ci: Bump sha
ko3n1g Sep 21, 2024
811a26a
Merge branch 'ko3n1g/chore/bump-sha' into 'main'
ko3n1g Sep 21, 2024
405135a
ADLR/megatron-lm!2128 - ci: Improve cherry pick workflow
ko3n1g Sep 22, 2024
fba615f
Merge branch 'ko3n1g/ci/improve-cherry-pick-workflow' into 'main'
ko3n1g Sep 22, 2024
95be3cb
ADLR/megatron-lm!2034 - ci: Introduce JET Python SDK
ko3n1g Sep 22, 2024
e79808c
Merge branch 'ko3n1g/ci/convergence-tests-with-jet' into 'main'
ko3n1g Sep 22, 2024
e10a9f4
ADLR/megatron-lm!2130 - ci: Improve cherry pick MR description
ko3n1g Sep 22, 2024
8e69382
Merge branch 'ko3n1g/ci/improve-cherry-pick-workflow' into 'main'
ko3n1g Sep 22, 2024
e35818d
ADLR/megatron-lm!2119 - Huvu/t5 te10 fix nemoci pr482
huvunvidia Sep 23, 2024
dbd2d18
Merge branch 'huvu/t5_TE10_fix_nemoci_PR482' into 'main'
ericharper Sep 23, 2024
8c666c2
ADLR/megatron-lm!2134 - ci: Set author and milestone for cherry-picks
ko3n1g Sep 23, 2024
6d8dc80
Merge branch 'ko3n1g/ci/cherry-pick-authro' into 'main'
ko3n1g Sep 23, 2024
c45f951
ADLR/megatron-lm!2135 - ci: Send alerts on unit-tests-extended
ko3n1g Sep 23, 2024
08e80b0
Merge branch 'ko3n1g/ci/notify-ut' into 'main'
ko3n1g Sep 23, 2024
643e60a
ADLR/megatron-lm!2133 - tests: Minor improvements to JET
ko3n1g Sep 23, 2024
8ec4617
Merge branch 'ko3n1g/ci/fixes-to-jet' into 'main'
ko3n1g Sep 23, 2024
5ade91a
ADLR/megatron-lm!2136 - tests: Fix GPT test
ko3n1g Sep 23, 2024
1f2d556
Merge branch 'ko3n1g/tests/fix-gpt-release-samples' into 'main'
ko3n1g Sep 23, 2024
e464e94
ADLR/megatron-lm!2139 - ci: Fix cherry-pick strings
ko3n1g Sep 23, 2024
0fd4617
Merge branch 'ko3n1g/ci/cherry-pick-strip-chars' into 'main'
ko3n1g Sep 23, 2024
ede39b8
ADLR/megatron-lm!2110 - Use torch dataloader in multimodal evaluation
trintamaki Sep 23, 2024
2065c35
Merge branch 'trintamaki/multimodal-eval-dataset' into 'main'
jon-barker Sep 23, 2024
697ea61
ADLR/megatron-lm!2137 - ci: Enable dev container for new features
ko3n1g Sep 23, 2024
075c727
Merge branch 'ko3n1g/ci/dev-container' into 'main'
ko3n1g Sep 23, 2024
5e23e72
ADLR/megatron-lm!2005 - Fix performance regression brought by torch.b…
xxuwenc Sep 24, 2024
884b087
Merge branch 'revert_bincount' into 'main'
ko3n1g Sep 24, 2024
ad38459
ADLR/megatron-lm!2073 - Multimodal batched bug fix
trintamaki Sep 24, 2024
162b82d
Merge branch 'trintamaki/multimodal_batch_bugfix' into 'main'
jon-barker Sep 24, 2024
32eac88
ADLR/megatron-lm!1581 - Add MLA support into MCore
BoxiangW Sep 24, 2024
dcf9e77
Merge branch 'boxiangw/mla' into 'main'
jaredcasper Sep 24, 2024
d207755
ADLR/megatron-lm!1995 - Add freeze options to pretrain_vlm
trintamaki Sep 25, 2024
891b8f9
Merge branch 'trintamaki/pretrain_vlm_freeze_option' into 'main'
jon-barker Sep 25, 2024
31c23f5
ADLR/megatron-lm!2145 - Improve logging when decreasing batch size
deepakn94 Sep 25, 2024
78bef1c
Merge branch 'dnarayanan/improve_logging' into 'main'
ericharper Sep 25, 2024
5aceacb
ADLR/megatron-lm!2148 - Add model.eval() to run_text_generation_serve…
mathemakitten Sep 25, 2024
4158084
Merge branch 'hn-set-model-eval-mode' into 'main'
jaredcasper Sep 25, 2024
368f561
ADLR/megatron-lm!2111 - Mcore llama3.1 support
jon-barker Sep 26, 2024
c1c19d1
Merge branch 'jbarker/llama3.1' into 'main'
ericharper Sep 26, 2024
1265399
ADLR/megatron-lm!2151 - ci: Run experimental UTs on dev image
ko3n1g Sep 26, 2024
c025cec
Merge branch 'ko3n1g/ci/uts-on-dev' into 'main'
ko3n1g Sep 26, 2024
f0d7120
ADLR/megatron-lm!1953 - Mcore export to export models to TRTLLM (GPU …
Sep 26, 2024
45bf4c1
Merge branch 'final_export' into 'main'
Sep 26, 2024
f5171f2
ADLR/megatron-lm!2154 - ci: Prune docker cache of `mcore-docker-node-…
ko3n1g Sep 26, 2024
e38d92a
Merge branch 'ko3n1g/ci/prune-container-cache-mcore-docker-node-jet' …
ko3n1g Sep 26, 2024
c31452c
ADLR/megatron-lm!2155 - Resolve release test failure caused by Groupe…
xxuwenc Sep 26, 2024
d55d61a
Merge branch 'xuwenc/release_perf_bugfix' into 'main'
ko3n1g Sep 26, 2024
3beefb5
ADLR/megatron-lm!2156 - tests: Set better name for Wandb logging
ko3n1g Sep 26, 2024
5553fc1
Merge branch 'ko3n1g/tests/better-logging-to-wandb' into 'main'
ko3n1g Sep 26, 2024
0976661
ADLR/megatron-lm!1950 - Remove pkg_resources package
ksivaman Sep 27, 2024
1585be2
Merge branch 'fix_version_checks' into 'main'
ko3n1g Sep 27, 2024
2bad957
ADLR/megatron-lm!2142 - ci: Onboard CW
ko3n1g Sep 27, 2024
12c2696
Merge branch 'ko3n1g/ci/onboard-cw' into 'main'
ko3n1g Sep 27, 2024
3428cd9
ADLR/megatron-lm!2158 - Small changes to export
Sep 28, 2024
b3375a0
Merge branch 'new_export' into 'main'
ericharper Sep 28, 2024
5b7374a
ADLR/megatron-lm!2152 - Fix rope backward compatibility
BoxiangW Sep 30, 2024
6ad11b0
Merge branch 'boxiangw/mla_backwards_comp' into 'main'
jaredcasper Sep 30, 2024
ca6d170
ADLR/megatron-lm!2140 - [Bug fix] Don't trace graphs during inference
jiemingz Oct 1, 2024
dddecd1
Merge branch 'auto_cudagraph_val_fix' into 'main'
ericharper Oct 1, 2024
5ab659b
ADLR/megatron-lm!2109 - Adding more MR tests for T5 (e.g., transforme…
huvunvidia Oct 1, 2024
3efa8c2
Merge branch 'huvu/t5_dist_checkpoint_mrtests' into 'main'
ko3n1g Oct 1, 2024
f07581b
ADLR/megatron-lm!2164 - ci: Download artifacts
ko3n1g Oct 1, 2024
85cd99b
Merge branch 'ko3n1g/ci/artifacts' into 'main'
ko3n1g Oct 1, 2024
858694f
ADLR/megatron-lm!2165 - ci: Bump version
ko3n1g Oct 2, 2024
065260b
Merge branch 'ko3n1g/ci/backwards-tag' into 'main'
jaredcasper Oct 2, 2024
f76b465
ADLR/megatron-lm!2153 - Add the interface to set TP communication boo…
erhoo82 Oct 3, 2024
25f7da2
Merge branch 'tp_bootstrap_backend' into 'main'
ericharper Oct 3, 2024
50042ff
ADLR/megatron-lm!2095 - Add support for SigLIP vision encoder to mult…
Oct 3, 2024
4d5f94d
Merge branch 'convert_siglip_model' into 'main'
jaredcasper Oct 3, 2024
2aaf85d
ADLR/megatron-lm!2175 - adding cu_seqlens_padded support in MCore
Oct 4, 2024
c02b335
Merge branch 'add_cu_seqlens_padded_support' into 'main'
ericharper Oct 4, 2024
ee9dba2
ADLR/megatron-lm!2181 - Fixing attention mask dimenions to support TE…
Oct 4, 2024
fde8bb1
Merge branch 'fixattnmask' into 'main'
ericharper Oct 4, 2024
843a22e
ADLR/megatron-lm!2180 - rotary_scaling fix for llama3.1 and 3.2
yueshen2016 Oct 4, 2024
b98ec86
Merge branch 'yueshen/rotary_scaling_fix_llama3_1' into 'main'
ericharper Oct 4, 2024
827d5b6
ADLR/megatron-lm!2185 - chore: Improve generator for launch scripts
ko3n1g Oct 4, 2024
31fe61a
Merge branch 'ko3n1g/ci/fix-launch-script-generator' into 'main'
ko3n1g Oct 4, 2024
e2a1c52
ADLR/megatron-lm!2160 - Adding Inference pipeline for T5
huvunvidia Oct 5, 2024
0acda93
Merge branch 'huvu/t5_generate' into 'main'
ericharper Oct 5, 2024
2f9ac3c
ADLR/megatron-lm!2182 - ci: Group runs by model
ko3n1g Oct 5, 2024
edb51fc
Merge branch 'ko3n1g/ci/group-runs' into 'main'
ko3n1g Oct 5, 2024
cf0d855
ADLR/megatron-lm!1862 - Cpu init te
wdykas Oct 5, 2024
0e6bef1
Merge branch 'cpu-init-te' into 'main'
ko3n1g Oct 5, 2024
6939737
ADLR/megatron-lm!2186 - ci: Run script after export
ko3n1g Oct 5, 2024
73e7b58
Merge branch 'ko3n1g/ci/run-script-after-export' into 'main'
ko3n1g Oct 5, 2024
6ca379e
ADLR/megatron-lm!2089 - Fix upcycling issues.
RayWang96 Oct 7, 2024
ff5cee9
Merge branch 'runtime-upcycling' into 'main'
ericharper Oct 7, 2024
a559ec1
ADLR/megatron-lm!2189 - tests: Fix ENV export
ko3n1g Oct 7, 2024
3f90b98
Merge branch 'ko3n1g/ci/fix-env-export' into 'main'
ko3n1g Oct 7, 2024
e108535
ADLR/megatron-lm!2194 - tests: Fix ENV export
ko3n1g Oct 9, 2024
3f43927
Merge branch 'ko3n1g/ci/fix-env-export' into 'main'
ko3n1g Oct 9, 2024
fbdc916
ADLR/megatron-lm!1790 - GroupedMLP DistOpt Resharding and add UTs to …
hxbai Oct 9, 2024
b1218b9
Merge branch 'hongxiaob/moe_dist_ckpt' into 'main'
ko3n1g Oct 9, 2024
5776d06
ADLR/megatron-lm!2197 - ci: Always upload artifacts
ko3n1g Oct 9, 2024
bf74129
Merge branch 'ko3n1g/ci/always-artifacts' into 'main'
ko3n1g Oct 9, 2024
0e3eaa5
ADLR/megatron-lm!2141 - Data parallel inference
trintamaki Oct 9, 2024
fcdbf90
Merge branch 'trintamaki/data-parallel-inference' into 'main'
jon-barker Oct 9, 2024
37a2116
ADLR/megatron-lm!2199 - Remove CUDA requirement from cpu test.
Oct 9, 2024
228dc20
Merge branch 'vitalyk/testfix' into 'main'
ko3n1g Oct 9, 2024
f462160
ADLR/megatron-lm!2096 - Support padding between subsequences of Packe…
parthmannan Oct 10, 2024
7e90ec0
Merge branch 'packed_seq_padded_support' into 'main'
ericharper Oct 10, 2024
566d9cd
ADLR/megatron-lm!2206 - Revert "Merge branch 'vitalyk/testfix' into '…
ko3n1g Oct 10, 2024
b60f5d0
Merge branch 'revert-228dc204' into 'main'
ko3n1g Oct 10, 2024
13c39ac
ADLR/megatron-lm!1909 - Standard interface for getting offsets from t…
Oct 11, 2024
47bb8d1
Merge branch 'sasatheesh/tokenizer_offsets' into 'main'
ericharper Oct 11, 2024
8c018ca
ADLR/megatron-lm!2208 - tests: Use flaky instead of skip marker
ko3n1g Oct 11, 2024
772faca
Merge branch 'ko3n1g/ci/flaky-marker' into 'main'
ko3n1g Oct 11, 2024
831d64d
ADLR/megatron-lm!2017 - chore: Bump Pytorch container
ko3n1g Oct 16, 2024
4876ee1
Merge branch 'ko3n1g/chore/bump-pyt' into 'main'
ko3n1g Oct 16, 2024
bc4874c
ADLR/megatron-lm!2214 - Add siglip converter to multimodal example
Oct 16, 2024
6bafe92
Merge branch 'add_siglip_converter' into 'main'
jon-barker Oct 16, 2024
a30d63b
ADLR/megatron-lm!2226 - Add missing import to megatron/training/initi…
deepakn94 Oct 16, 2024
0d89fc4
Merge branch 'dnarayanan/fix_import' into 'main'
deepakn94 Oct 16, 2024
33d2f45
ADLR/megatron-lm!2223 - ci(refactor): Facelift gitlab-ci
ko3n1g Oct 18, 2024
55622ff
Merge branch 'ko3n1g/ci/refactor-jobs' into 'main'
ko3n1g Oct 18, 2024
cba8bdc
ADLR/megatron-lm!2234 - ci: Set stronger dependencies
ko3n1g Oct 18, 2024
ecf0dbe
Merge branch 'ko3n1g/ci/test-dependencies' into 'main'
ko3n1g Oct 18, 2024
839dff2
ADLR/megatron-lm!2075 - Triton cache fix
duncanriach Oct 19, 2024
b7814bb
Merge branch 'duncan/triton-cache-fix' into 'main'
ericharper Oct 19, 2024
a9c16c5
ADLR/megatron-lm!1939 - fix an issue when using `multi_tensor_scale` …
BestJuly Oct 19, 2024
02d1762
Merge branch 'lit/fix_multi_tensor_scale' into 'main'
ericharper Oct 19, 2024
6adf0bd
ADLR/megatron-lm!1927 - Improved missing key exception for errors dur…
jstjohn Oct 19, 2024
db6cb4e
Merge branch 'jstjohn/improved_missing_key_exception' into 'main'
ericharper Oct 19, 2024
2c950a5
ADLR/megatron-lm!2038 - LLaVA Multimodal SP support
parthmannan Oct 19, 2024
739177e
Merge branch 'pmannan/llava_debug' into 'main'
ericharper Oct 19, 2024
d28e26e
ADLR/megatron-lm!2227 - qwen2.5 conversion
Oct 19, 2024
db7d37b
Merge branch 'qwen25_conversion' into 'main'
jon-barker Oct 19, 2024
4 changes: 4 additions & 0 deletions .flake8
@@ -0,0 +1,4 @@
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
32 changes: 32 additions & 0 deletions .github/ISSUE_TEMPLATE/bug.md
@@ -0,0 +1,32 @@
---
name: BUG
about: Report a bug that needs attention
title: "[BUG]"
labels: ''
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce the faster it will get maintainer attention.

**Expected behavior**
A clear and concise description of what you expected to happen.

**Stack trace/logs**
If applicable, add the stack trace or logs from the time of the error.

**Environment (please complete the following information):**
- Megatron-LM commit ID
- PyTorch version
- CUDA version
- NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue state it here or link to a PR.

**Additional context**
Add any other context about the problem here.
23 changes: 23 additions & 0 deletions .github/ISSUE_TEMPLATE/enhancement.md
@@ -0,0 +1,23 @@
---
name: ENHANCEMENT
about: Suggest an idea to improve this project
title: "[ENHANCEMENT]"
labels: ''
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Proposed implementation**
If you have a proposed implementation for the feature state it here or link to a PR.

**Additional context**
Add any other context or screenshots about the feature request here.
12 changes: 12 additions & 0 deletions .github/ISSUE_TEMPLATE/question.md
@@ -0,0 +1,12 @@
---
name: QUESTION
about: Ask a question about Megatron-LM that is not a bug, regression or enhancement
  request
title: "[QUESTION]"
labels: ''
assignees: ''

---

**Your question**
Ask a clear and concise question about Megatron-LM.
39 changes: 39 additions & 0 deletions .github/ISSUE_TEMPLATE/regression.md
@@ -0,0 +1,39 @@
---
name: REGRESSION
about: Report a regression in speed or accuracy due to a Megatron-LM update
title: "[REGRESSION]"
labels: ''
assignees: ''

---

**Describe the regression**
A clear and concise description of what the regression is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce the faster it will get maintainer attention.

**Previous performance**
What speed or accuracy did you previously see.

**New performance**
What speed or accuracy do you see after the update.

**Stack trace/logs**
If applicable, add the stack trace or logs related to the regression.

**Environment (please complete the following information):**
- Previous Megatron-LM commit ID
- New Megatron-LM commit ID
- Previous PyTorch version
- New PyTorch version
- Previous CUDA version
- New CUDA version
- Previous NCCL version
- New NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue state it here or link to a PR.

**Additional context**
Add any other context about the problem here.
31 changes: 31 additions & 0 deletions .github/workflows/stale.yml
@@ -0,0 +1,31 @@
# This workflow warns and then closes issues and PRs that have had no activity for a specified amount of time.
#
# You can adjust the behavior by modifying this file.
# For more information, see:
# https://github.com/actions/stale
name: Mark stale issues and pull requests

on:
  schedule:
  - cron: '15 18 * * *'

jobs:
  stale:

    runs-on: ubuntu-latest
    permissions:
      issues: write
      pull-requests: write

    steps:
    - uses: actions/stale@v5
      with:
        repo-token: ${{ secrets.GITHUB_TOKEN }}
        days-before-stale: 60
        stale-issue-message: 'Marking as stale. No activity in 60 days.'
        stale-pr-message: 'Marking as stale. No activity in 60 days.'
        stale-issue-label: 'stale'
        stale-pr-label: 'stale'
        remove-stale-when-updated: true
        operations-per-run: 1000
        days-before-close: -1
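
Note on the workflow above: with `days-before-stale: 60` and `days-before-close: -1`, items should be labeled stale after 60 idle days but never closed automatically, since a negative `days-before-close` disables closing in `actions/stale`. The sketch below is a simplified model of that decision, not the action's actual implementation; the function name `stale_action` and its return strings are hypothetical, and it measures both thresholds from the last activity date for simplicity.

```python
from datetime import datetime, timedelta


def stale_action(last_activity, now, days_before_stale=60, days_before_close=-1):
    """Return a hypothetical action label for an issue or PR.

    Simplified model: both thresholds are counted from the last activity.
    A negative days_before_close means the item is never closed.
    """
    idle_days = (now - last_activity).days
    if idle_days < days_before_stale:
        return "active"
    if days_before_close >= 0 and idle_days >= days_before_stale + days_before_close:
        return "close"
    return "mark-stale"


if __name__ == "__main__":
    now = datetime(2024, 10, 19)
    for idle in (10, 60, 400):
        print(idle, stale_action(now - timedelta(days=idle), now))
```

Under these settings an item that has been idle for a year is still only labeled, which matches the `-1` value in the config even though the file's header comment mentions closing.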
3 changes: 3 additions & 0 deletions .gitignore
@@ -6,3 +6,6 @@ build
*~
slurm*
logs
.vscode
local/
.gitmodules