From NVIDIA Megatron-LM for visibility #18

Open
Wants to merge 3,242 commits into base: multi-query-attention
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Sep 5, 2024

  1. 08e245d
  2. Merge branch 'mblaz/fast-load-broadcast' into 'main'

    Optimize broadcasted data during parallel load
    
    See merge request ADLR/megatron-lm!1968
    ko3n1g committed Sep 5, 2024
    5b73de7
  3. 6701e08
  4. Merge branch 'dnarayanan/distributed_optimizer_readme_fixes' into 'main'

    Fix description of distributed optimizer workflow
    
    See merge request ADLR/megatron-lm!1951
    ko3n1g committed Sep 5, 2024
    3396356
  5. ADLR/megatron-lm!1669 - Add native-fp8

    kunlunl authored and ko3n1g committed Sep 5, 2024
    033d8b0
  6. Merge branch 'kunlunl/native_fp8_2' into 'main'

    Add native-fp8
    
    See merge request ADLR/megatron-lm!1669
    ko3n1g committed Sep 5, 2024
    01945b9
  7. f0161d2
  8. Merge branch 'mblaz/dist-ckpt-pyt2.4' into 'main'

    Restore the actual PyT 2.4 fix from !1970
    
    See merge request ADLR/megatron-lm!2039
    jaredcasper committed Sep 5, 2024
    7580748
  9. a61150d
  10. Merge branch 'ko3n1g/tests/disable-mamba-test' into 'main'

    tests: Skip flaky mamba test
    
    See merge request ADLR/megatron-lm!2044
    ko3n1g committed Sep 5, 2024
    2169674
  11. cb979cf
  12. Merge branch 'ko3n1g/ci/bump-sha' into 'main'

    ci: Bump reference sha
    
    See merge request ADLR/megatron-lm!2048
    ko3n1g committed Sep 5, 2024
    38873f5
  13. ADLR/megatron-lm!2029 - Add model config files for Mixtral-8x7B and Mixtral-8x22B performance benchmarking

    xxuwenc authored and ko3n1g committed Sep 5, 2024
    7ef8b3f
  14. Merge branch 'xuwenc/release_moe_benchmarking' into 'main'

    Add model config files for Mixtral-8x7B and Mixtral-8x22B performance benchmarking
    
    See merge request ADLR/megatron-lm!2029
    ko3n1g committed Sep 5, 2024
    5ec1e29
  15. ADLR/megatron-lm!1881 - Uneven Pipeline Parallelism

    Co-authored-by: William Dykas <wdykas@cw-dfw-cs-001-dc-02.cm.cluster>
    Co-authored-by: William Dykas <wdykas@cw-dfw-cs-001-dc-01.cm.cluster>
    Co-authored-by: William Dykas <wdykas@cs-cw-dfw-login-01.cm.cluster>
    Co-authored-by: William Dykas <wdykas@cs-cw-dfw-dc-02.cm.cluster>
    5 people authored and ericharper committed Sep 5, 2024
    fa8bb59
  16. Merge branch 'uneven-pipeline' into 'main'

    Uneven Pipeline Parallelism
    
    See merge request ADLR/megatron-lm!1881
    ericharper committed Sep 5, 2024
    60d03fd
  17. ADLR/megatron-lm!1912 - Add support for pytorch tensorboard profiler

    Co-authored-by: Jon Barker <jbarker@draco-oci-dc-01.cm.cluster>
    2 people authored and jaredcasper committed Sep 5, 2024
    86df799
  18. Merge branch 'jbarker/pt-profiler' into 'main'

    Add support for pytorch tensorboard profiler
    
    See merge request ADLR/megatron-lm!1912
    jaredcasper committed Sep 5, 2024
    cb4ce23
  19. dd876ba
  20. Merge branch 'ko3n1g/tests/release-training-load-path' into 'main'

    ci: Pass `LOAD_PATH` into training
    
    See merge request ADLR/megatron-lm!2050
    ko3n1g committed Sep 5, 2024
    4a756e2

Commits on Sep 6, 2024

  1. ADLR/megatron-lm!1958 - Update check_param_hashes_across_dp_replicas to return true if hashes across all DP ranks match.

    akoumpa authored and ericharper committed Sep 6, 2024
    8f19bcd
  2. Merge branch 'akoumparouli/check_param_hashes_across_dp_replicas_fix' into 'main'
    
    Update check_param_hashes_across_dp_replicas to return true if hashes across all DP ranks match.
    
    See merge request ADLR/megatron-lm!1958
    ericharper committed Sep 6, 2024
    732a689
  3. ADLR/megatron-lm!1796 - Per layer cudagraph support for GPT training with Transformer Engine modules

    jiemingz authored and ko3n1g committed Sep 6, 2024
    43ee4b8
  4. Merge branch 'auto_cudagraph' into 'main'

    Per layer cudagraph support for GPT training with Transformer Engine modules
    
    See merge request ADLR/megatron-lm!1796
    ko3n1g committed Sep 6, 2024
    9366f3c
  5. ADLR/megatron-lm!2053 - Update model config files for Mixtral-8x7B and Mixtral-8x22B performance benchmarking

    xxuwenc authored and ko3n1g committed Sep 6, 2024
    8499f26
  6. Merge branch 'xuwenc/release_moe_benchmarking' into 'main'

    Update model config files for Mixtral-8x7B and Mixtral-8x22B performance benchmarking
    
    See merge request ADLR/megatron-lm!2053
    ko3n1g committed Sep 6, 2024
    3728c67
  7. ADLR/megatron-lm!1971 - Revert "ADLR/megatron-lm!1747 - Use TP-CP group for fp8 amax reduction"

    erhoo82 authored and ko3n1g committed Sep 6, 2024
    98abe37
  8. Merge branch 'amax_red' into 'main'

    Revert "ADLR/megatron-lm!1747 - Use TP-CP group for fp8 amax reduction"
    
    See merge request ADLR/megatron-lm!1971
    ko3n1g committed Sep 6, 2024
    a2b6ee4
  9. 8f331e8
  10. Merge branch 'denliu/fp8_moe' into 'main'

    FP8 support for MoE with conservative recipe
    
    Closes #43
    
    See merge request ADLR/megatron-lm!1089
    ko3n1g committed Sep 6, 2024
    cc16182
  11. 9a0e78d
  12. Merge branch 'mblaz/fix-deprecation-notice' into 'main'

    Fix `zarr` deprecation notice
    
    See merge request ADLR/megatron-lm!2042
    ericharper committed Sep 6, 2024
    7a113e7

Commits on Sep 7, 2024

  1. ADLR/megatron-lm!1859 - Skierat/fully parallel local

    Co-authored-by: Mikołaj Błaż <mblaz@nvidia.com>
    Co-authored-by: Slawek Kierat <skierat@skierat-mlt.client.nvidia.com>
    Co-authored-by: Jakub Szulc <jszulc@nvidia.com>
    Co-authored-by: Slawomir Kierat <skierat@dgx1v-loki-25.nvidia.com>
    5 people authored and ericharper committed Sep 7, 2024
    3fb5c51
  2. Merge branch 'skierat/fully-parallel-local' into 'main'

    Skierat/fully parallel local
    
    See merge request ADLR/megatron-lm!1859
    ericharper committed Sep 7, 2024
    8252432
  3. ADLR/megatron-lm!1630 - Runtime upcycling support for MoE

    Co-authored-by: Zijie Yan <zijiey@nvidia.com>
    Co-authored-by: Abhinav Khattar <akhattar@nvidia.com>
    Co-authored-by: Ethan He <yihuih@nvidia.com>
    4 people authored and jaredcasper committed Sep 7, 2024
    6c3ada7
  4. Merge branch 'runtime-upcycling' into 'main'

    Runtime upcycling support for MoE
    
    See merge request ADLR/megatron-lm!1630
    jaredcasper committed Sep 7, 2024
    f5667db
  5. ADLR/megatron-lm!2052 - updates import for fault_tolerance package to nvidia_resiliency_ext.fault_tolerance

    srogawski-nvidia authored and jaredcasper committed Sep 7, 2024
    80e3863
  6. Merge branch 'nvidia_resiliency_ext' into 'main'

    updates import for fault_tolerance package to nvidia_resiliency_ext.fault_tolerance
    
    See merge request ADLR/megatron-lm!2052
    jaredcasper committed Sep 7, 2024
    5019bb4
  7. c14d987
  8. Merge branch 'ko3n1g/tests/fix-mixtral-tests' into 'main'

    tests: Move mixtral locations
    
    See merge request ADLR/megatron-lm!2056
    ko3n1g committed Sep 7, 2024
    79b448a
  9. 7053e64
  10. Merge branch 'ko3n1g/ci/bump-sha-2' into 'main'

    ci: Bump sha
    
    See merge request ADLR/megatron-lm!2055
    ko3n1g committed Sep 7, 2024
    c2d7e2f
  11. ADLR/megatron-lm!1926 - Adding T5 release test

    Co-authored-by: Huy Vu <huvu@cs-oci-ord-login-01.cm.cluster>
    Co-authored-by: Huy Vu2 <huvu@login-eos01.eos.clusters.nvidia.com>
    3 people authored and ko3n1g committed Sep 7, 2024
    759d787
  12. Merge branch 'huvu/mcore_t5_release_test' into 'main'

    Adding T5 release test
    
    See merge request ADLR/megatron-lm!1926
    ko3n1g committed Sep 7, 2024
    6c49616
  13. ADLR/megatron-lm!1990 - Mitigate slow loops in set_is_first_minibatch and zero_grad_buffers
    
    Co-authored-by: Jon Barker <jbarker@draco-oci-dc-01.cm.cluster>
    jon-barker and Jon Barker committed Sep 7, 2024
    ab5624b
  14. Merge branch 'jbarker/remove_the_bad_loops' into 'main'

    Mitigate slow loops in set_is_first_minibatch and zero_grad_buffers
    
    See merge request ADLR/megatron-lm!1990
    jon-barker committed Sep 7, 2024
    cb42680
  15. 7adc86e
  16. Merge branch 'dnarayanan-main-patch-45127' into 'main'

    Fix bug in docstrings in `megatron/core/parallel_state.py`
    
    See merge request ADLR/megatron-lm!1882
    ericharper committed Sep 7, 2024
    5747146
  17. ADLR/megatron-lm!1975 - Refactor distributed optimizer communication code into megatron/core/distributed

    deepakn94 committed Sep 7, 2024
    655a663
  18. Merge branch 'dnarayanan/dist_optimizer_refactor' into 'main'

    Refactor distributed optimizer communication code into megatron/core/distributed
    
    See merge request ADLR/megatron-lm!1975
    deepakn94 committed Sep 7, 2024
    6ac4db0

Commits on Sep 8, 2024

  1. 8d62160
  2. Merge branch 'ko3n1g/ci/maybe-cherry-pick-commit' into 'main'

    ci: Automated cherry-picking
    
    See merge request ADLR/megatron-lm!2046
    ko3n1g committed Sep 8, 2024
    8b0a9b3
  3. 56ddf9a
  4. Merge branch 'bump-sha-3' into 'main'

    ci: Bump sha
    
    See merge request ADLR/megatron-lm!2060
    ko3n1g committed Sep 8, 2024
    8e21350
  5. a604c95
  6. Merge branch 'ko3n1g/ci/allow-skipping-unittests' into 'main'

    ci: Allow skipping unit tests
    
    See merge request ADLR/megatron-lm!2061
    ko3n1g committed Sep 8, 2024
    46b850f
  7. 4a47180
  8. Merge branch 'ko3n1g/ci/automate-release-branch' into 'main'

    ci: Automate cut-off of release branch
    
    See merge request ADLR/megatron-lm!2062
    ko3n1g committed Sep 8, 2024
    dccb6df
  9. eb7418f
  10. Merge branch 'ko3n1g/ci/fix-mirroring' into 'main'

    ci: Fixes for mirroring and cherry picking
    
    See merge request ADLR/megatron-lm!2064
    ko3n1g committed Sep 8, 2024
    8307fcd

Commits on Sep 9, 2024

  1. 0b5bc5e
  2. Merge branch 'ko3n1g/ci/fix-mirroring-2' into 'main'

    ci: Use PAT for mirroring
    
    See merge request ADLR/megatron-lm!2066
    ko3n1g committed Sep 9, 2024
    27c3737
  3. 6dade5f
  4. Merge branch 'ko3n1g/ci/skip-cherrypicking-on-empty-labels' into 'main'

    ci: Skip cherry-pick on empty label
    
    See merge request ADLR/megatron-lm!2068
    ko3n1g committed Sep 9, 2024
    90cd925

Commits on Sep 10, 2024

  1. bef7771
  2. Merge branch 'tshiri/format_before_real_change' into 'main'

    Fix lint errors in preparation for other MRs
    
    See merge request ADLR/megatron-lm!2051
    ko3n1g committed Sep 10, 2024
    d0f5aa9
  3. aae7237
  4. Merge branch 'ko3n1g/ci/repeat-unittests' into 'main'

    ci: Repeat unit tests 5 times
    
    See merge request ADLR/megatron-lm!2079
    ko3n1g committed Sep 10, 2024
    b28d445
  5. c290133
  6. Merge branch 'zijiey/skip_upcycling_ut' into 'main'

    Skip the upcycling UT.
    
    See merge request ADLR/megatron-lm!2081
    ko3n1g committed Sep 10, 2024
    522d8a3
  7. f03af48
  8. Merge branch 'xiny/fix_moe_nightly_test' into 'main'

    Update Golden Values for MoE Nightly Tests
    
    See merge request ADLR/megatron-lm!2067
    ko3n1g committed Sep 10, 2024
    bbecd08
  9. 6a89bc7
  10. Merge branch 'ko3n1g/ci/cherry-pick-project' into 'main'

    ci: Cherry-pick into the right project
    
    See merge request ADLR/megatron-lm!2083
    ko3n1g committed Sep 10, 2024
    3bdae05
  11. e93d566
  12. Merge branch 'pstjohn/pyproject.toml' into 'main'

    expanding pyproject.toml definitions for uv
    
    See merge request ADLR/megatron-lm!2084
    ko3n1g committed Sep 10, 2024
    b6887d3
  13. 1ea3918
  14. Merge branch 'edits-rebased' into 'main'

    copyedits try 3 : pure doc changes
    
    See merge request ADLR/megatron-lm!1931
    jaredcasper committed Sep 10, 2024
    db0fc33

Commits on Sep 11, 2024

  1. ADLR/megatron-lm!2086 - Add Encoder-Decoder Parallelism Documentation

    Co-authored-by: Mike Chrzanowski <mchrzanowski@draco-oci-dc-01.cm.cluster>
    2 people authored and ko3n1g committed Sep 11, 2024
    f218582
  2. Merge branch 'mike/add_encoder_doc' into 'main'

    Add Encoder-Decoder Parallelism Documentation
    
    See merge request ADLR/megatron-lm!2086
    ko3n1g committed Sep 11, 2024
    fe1640a
  3. ADLR/megatron-lm!1699 - MoE Shared Expert support

    Co-authored-by: Zijie Yan <zijiey@nvidia.com>
    Co-authored-by: tongliu <tongliu@nvidia.com>
    Co-authored-by: Dennis Liu <denliu@nvidia.com>
    4 people authored and ko3n1g committed Sep 11, 2024
    1fa9464
  4. Merge branch 'hongxiaob/shared_expert' into 'main'

    MoE Shared Expert support
    
    Closes #134
    
    See merge request ADLR/megatron-lm!1699
    ko3n1g committed Sep 11, 2024
    fec11a7
  5. 6e4e9df
  6. Merge branch 'zijiey/moe_interface_tests' into 'main'

    Add MoE interface tests and move other tests to internal
    
    See merge request ADLR/megatron-lm!2088
    ko3n1g committed Sep 11, 2024
    8fc7553
  7. 2130890
  8. Merge branch 'ko3n1g/ci/bump-sha-3' into 'main'

    ci: Bump reference sha
    
    See merge request ADLR/megatron-lm!2092
    ko3n1g committed Sep 11, 2024
    6664dc6
  9. 32949f2
  10. Merge branch 'ko3n1g/ci/disable-broken-test' into 'main'

    ci: Disable broken test
    
    See merge request ADLR/megatron-lm!2093
    ko3n1g committed Sep 11, 2024
    df1418a
  11. f8b7c3f
  12. Merge branch 'trintamaki/multi-image-multi-tile-dataloader-seq-len' into 'main'
    
    Multimodal sequence length optimizations
    
    See merge request ADLR/megatron-lm!1985
    jaredcasper committed Sep 11, 2024
    6151709
  13. 3005d02
  14. Merge branch 'ko3n1g/tests/flaky-test-2' into 'main'

    tests: Disable flaky test
    
    See merge request ADLR/megatron-lm!2094
    ko3n1g committed Sep 11, 2024
    9ec2337

Commits on Sep 12, 2024

  1. e5fb1fa
  2. Merge branch 'ko3n1g/ci/repeat-mrs' into 'main'

    tests: Repeat MRs 5 times
    
    See merge request ADLR/megatron-lm!2004
    ko3n1g committed Sep 12, 2024
    028b777
  3. ADLR/megatron-lm!2091 - Don't pass device_id to torch.distributed.init_process_group, it causes hangs
    
    Co-authored-by: Szymon Migacz <1934379+szmigacz@users.noreply.github.com>
    2 people authored and ko3n1g committed Sep 12, 2024
    dcc6634
  4. Merge branch 'no_dist_device_id' into 'main'

    Don't pass device_id to torch.distributed.init_process_group, it causes hangs
    
    See merge request ADLR/megatron-lm!2091
    ko3n1g committed Sep 12, 2024
    76f9f48

Commits on Sep 14, 2024

  1. bf7b978
  2. Merge branch 'ko3n1g/ci/release-tests' into 'main'

    ci: Add release tests for 0.9
    
    See merge request ADLR/megatron-lm!2059
    ko3n1g committed Sep 14, 2024
    21924d8

Commits on Sep 17, 2024

  1. ADLR/megatron-lm!2106 - fix: allow merge request CI for non-protected branches to fail

    terrykong authored and ko3n1g committed Sep 17, 2024
    e6f1d81
  2. Merge branch 'terryk/ci-can-fail-on-unprotected-targets' into 'main'

    fix: allow merge request CI for non-protected branches to fail
    
    See merge request ADLR/megatron-lm!2106
    ko3n1g committed Sep 17, 2024
    6562666
  3. 0902af0
  4. Merge branch 'ko3n1g/chore/formatting-on-release-branch' into 'main'

    chore: Fix autoformatter for release branches
    
    See merge request ADLR/megatron-lm!2107
    ko3n1g committed Sep 17, 2024
    72008a0
  5. ADLR/megatron-lm!2104 - Fixing broken links

    Co-authored-by: Shanmugam Ramasamy <shanmugamr@shanmugamr-mlt.client.nvidia.com>
    Shanmugam Ramasamy and Shanmugam Ramasamy committed Sep 17, 2024
    2a8d8af
  6. Merge branch 'docFix' into 'main'

    Fixing broken links
    
    See merge request ADLR/megatron-lm!2104
    Shanmugam Ramasamy committed Sep 17, 2024
    3f10ff6
  7. ADLR/megatron-lm!2072 - Add video handling into multimodal mcore

    Matthieu Le authored and jon-barker committed Sep 17, 2024
    71d8ce7
  8. Merge branch 'add-video-handling' into 'main'

    Add video handling into multimodal mcore
    
    See merge request ADLR/megatron-lm!2072
    jon-barker committed Sep 17, 2024
    0bda578

Commits on Sep 18, 2024

  1. ab7f706
  2. Merge branch 'lora_cg' into 'main'

    Enable optional kwargs with CUDA graph
    
    See merge request ADLR/megatron-lm!1715
    ko3n1g committed Sep 18, 2024
    77b4bfe
  3. 0cffc6b
  4. Merge branch '318-fix-te-version-in-telinear' into 'main'

    Resolve "Fix TE version in TELinear"
    
    Closes #318
    
    See merge request ADLR/megatron-lm!2077
    ericharper committed Sep 18, 2024
    461b06c
  5. 6b78cb1
  6. Merge branch 'fix_mmmu_mmodal' into 'main'

    Update path to MMMU to use new repos structure
    
    See merge request ADLR/megatron-lm!2112
    jon-barker committed Sep 18, 2024
    d350231
  7. ADLR/megatron-lm!1880 - Removing env variable NVTE_ALLOW_NONDETERMINISTIC_ALGO
    
    Co-authored-by: Shanmugam Ramasamy <shanmugamr@shanmugamr-mlt.client.nvidia.com>
    Shanmugam Ramasamy and Shanmugam Ramasamy committed Sep 18, 2024
    cedd415
  8. Merge branch 'bertflash' into 'main'

    Removing env variable NVTE_ALLOW_NONDETERMINISTIC_ALGO
    
    See merge request ADLR/megatron-lm!1880
    Shanmugam Ramasamy committed Sep 18, 2024
    6b35ca8

Commits on Sep 19, 2024

  1. ADLR/megatron-lm!2033 - Online eval

    trintamaki authored and ericharper committed Sep 19, 2024
    63be779
  2. Merge branch 'trintamaki/online-eval' into 'main'

    Online eval
    
    See merge request ADLR/megatron-lm!2033
    ericharper committed Sep 19, 2024
    835af44
  3. 2c9bcac
  4. Merge branch 'trintamaki/multi-image-mmmu' into 'main'

    MMMU multi-image support
    
    See merge request ADLR/megatron-lm!1973
    jon-barker committed Sep 19, 2024
    905de33

Commits on Sep 20, 2024

  1. 5c0697c
  2. Merge branch 'ko3n1g/build/pip' into 'main'

    build: Use multi-stage for parallel builds
    
    See merge request ADLR/megatron-lm!2113
    ko3n1g committed Sep 20, 2024
    c394f78

Commits on Sep 21, 2024

  1. cf596b9
  2. Merge branch 'dnarayanan/warning_fix' into 'main'

    Only print warning when relevant
    
    See merge request ADLR/megatron-lm!2126
    jaredcasper committed Sep 21, 2024
    640e62f
  3. 3eeb932
  4. Merge branch 'ko3n1g/tests/fix-location-of-megatron' into 'main'

    tests: Fix location of megatron
    
    See merge request ADLR/megatron-lm!2124
    ko3n1g committed Sep 21, 2024
    205f946
  5. d210eb0
  6. Merge branch 'ko3n1g/chore/bump-sha' into 'main'

    ci: Bump sha
    
    See merge request ADLR/megatron-lm!2127
    ko3n1g committed Sep 21, 2024
    811a26a

Commits on Sep 22, 2024

  1. 405135a
  2. Merge branch 'ko3n1g/ci/improve-cherry-pick-workflow' into 'main'

    ci: Improve cherry pick workflow
    
    See merge request ADLR/megatron-lm!2128
    ko3n1g committed Sep 22, 2024
    fba615f
  3. 95be3cb
  4. Merge branch 'ko3n1g/ci/convergence-tests-with-jet' into 'main'

    ci: Introduce JET Python SDK
    
    See merge request ADLR/megatron-lm!2034
    ko3n1g committed Sep 22, 2024
    e79808c
  5. e10a9f4
  6. Merge branch 'ko3n1g/ci/improve-cherry-pick-workflow' into 'main'

    ci: Improve cherry pick MR description
    
    See merge request ADLR/megatron-lm!2130
    ko3n1g committed Sep 22, 2024
    8e69382

Commits on Sep 23, 2024

  1. ADLR/megatron-lm!2119 - Huvu/t5 te10 fix nemoci pr482

    Co-authored-by: Huy Vu2 <huvu@login-eos01.eos.clusters.nvidia.com>
    2 people authored and ericharper committed Sep 23, 2024
    e35818d
  2. Merge branch 'huvu/t5_TE10_fix_nemoci_PR482' into 'main'

    Huvu/t5 te10 fix nemoci pr482
    
    See merge request ADLR/megatron-lm!2119
    ericharper committed Sep 23, 2024
    dbd2d18
  3. 8c666c2
  4. Merge branch 'ko3n1g/ci/cherry-pick-authro' into 'main'

    ci: Set author and milestone for cherry-picks
    
    See merge request ADLR/megatron-lm!2134
    ko3n1g committed Sep 23, 2024
    6d8dc80
  5. c45f951
  6. Merge branch 'ko3n1g/ci/notify-ut' into 'main'

    ci: Send alerts on unit-tests-extended
    
    See merge request ADLR/megatron-lm!2135
    ko3n1g committed Sep 23, 2024
    08e80b0
  7. 643e60a
  8. Merge branch 'ko3n1g/ci/fixes-to-jet' into 'main'

    tests: Minor improvements to JET
    
    See merge request ADLR/megatron-lm!2133
    ko3n1g committed Sep 23, 2024
    8ec4617
  9. 5ade91a
  10. Merge branch 'ko3n1g/tests/fix-gpt-release-samples' into 'main'

    tests: Fix GPT test
    
    See merge request ADLR/megatron-lm!2136
    ko3n1g committed Sep 23, 2024
    1f2d556
  11. e464e94
  12. Merge branch 'ko3n1g/ci/cherry-pick-strip-chars' into 'main'

    ci: Fix cherry-pick strings
    
    See merge request ADLR/megatron-lm!2139
    ko3n1g committed Sep 23, 2024
    0fd4617
  13. ede39b8
  14. Merge branch 'trintamaki/multimodal-eval-dataset' into 'main'

    Use torch dataloader in multimodal evaluation
    
    See merge request ADLR/megatron-lm!2110
    jon-barker committed Sep 23, 2024
    2065c35
  15. 697ea61
  16. Merge branch 'ko3n1g/ci/dev-container' into 'main'

    ci: Enable dev container for new features
    
    See merge request ADLR/megatron-lm!2137
    ko3n1g committed Sep 23, 2024
    075c727

Commits on Sep 24, 2024

  1. 5e23e72
  2. Merge branch 'revert_bincount' into 'main'

    Fix performance regression brought by torch.bincount
    
    Closes #263
    
    See merge request ADLR/megatron-lm!2005
    ko3n1g committed Sep 24, 2024
    884b087
  3. ad38459
  4. Merge branch 'trintamaki/multimodal_batch_bugfix' into 'main'

    Multimodal batched bug fix
    
    See merge request ADLR/megatron-lm!2073
    jon-barker committed Sep 24, 2024
    162b82d
  5. ADLR/megatron-lm!1581 - Add MLA support into MCore

    Co-authored-by: Shunkang <shunkangz@nvidia.com>
    Co-authored-by: BoxiangW <bwang1@fas.harvard.edu>
    3 people authored and jaredcasper committed Sep 24, 2024
    32eac88
  6. Merge branch 'boxiangw/mla' into 'main'

    Add MLA support into MCore
    
    See merge request ADLR/megatron-lm!1581
    jaredcasper committed Sep 24, 2024
    dcf9e77

Commits on Sep 25, 2024

  1. d207755
  2. Merge branch 'trintamaki/pretrain_vlm_freeze_option' into 'main'

    Add freeze options to pretrain_vlm
    
    See merge request ADLR/megatron-lm!1995
    jon-barker committed Sep 25, 2024
    891b8f9
  3. 31c23f5
  4. Merge branch 'dnarayanan/improve_logging' into 'main'

    Improve logging when decreasing batch size
    
    See merge request ADLR/megatron-lm!2145
    ericharper committed Sep 25, 2024
    78bef1c
  5. 5aceacb
  6. Merge branch 'hn-set-model-eval-mode' into 'main'

    Add model.eval() to run_text_generation_server.py
    
    See merge request ADLR/megatron-lm!2148
    jaredcasper committed Sep 25, 2024
    4158084

Commits on Sep 26, 2024

  1. ADLR/megatron-lm!2111 - Mcore llama3.1 support

    Co-authored-by: Jon Barker <jbarker@draco-oci-dc-01.cm.cluster>
    2 people authored and ericharper committed Sep 26, 2024
    368f561
  2. Merge branch 'jbarker/llama3.1' into 'main'

    Mcore llama3.1 support
    
    See merge request ADLR/megatron-lm!2111
    ericharper committed Sep 26, 2024
    c1c19d1
  3. 1265399
  4. Merge branch 'ko3n1g/ci/uts-on-dev' into 'main'

    ci: Run experimental UTs on dev image
    
    See merge request ADLR/megatron-lm!2151
    ko3n1g committed Sep 26, 2024
    c025cec
  5. ADLR/megatron-lm!1953 - Mcore export to export models to TRTLLM (GPU and CPU version)
    
    Co-authored-by: Shanmugam Ramasamy <shanmugamr@shanmugamr-mlt.client.nvidia.com>
    Co-authored-by: Shanmugam Ramasamy <shanmugamr@login-eos01.eos.clusters.nvidia.com>
    3 people committed Sep 26, 2024
    f0d7120
  6. Merge branch 'final_export' into 'main'

    Mcore export to export models to TRTLLM (GPU and CPU version)
    
    See merge request ADLR/megatron-lm!1953
    Shanmugam Ramasamy committed Sep 26, 2024
    45bf4c1
  7. f5171f2
  8. Merge branch 'ko3n1g/ci/prune-container-cache-mcore-docker-node-jet' into 'main'
    
    ci: Prune docker cache of `mcore-docker-node-jet`
    
    See merge request ADLR/megatron-lm!2154
    ko3n1g committed Sep 26, 2024
    e38d92a
  9. ADLR/megatron-lm!2155 - Resolve release test failure caused by GroupedMLP distributed checkpointing

    xxuwenc authored and ko3n1g committed Sep 26, 2024
    c31452c
  10. Merge branch 'xuwenc/release_perf_bugfix' into 'main'

    Resolve release test failure caused by GroupedMLP distributed checkpointing
    
    See merge request ADLR/megatron-lm!2155
    ko3n1g committed Sep 26, 2024
    d55d61a
  11. 3beefb5
  12. Merge branch 'ko3n1g/tests/better-logging-to-wandb' into 'main'

    tests: Set better name for Wandb logging
    
    See merge request ADLR/megatron-lm!2156
    ko3n1g committed Sep 26, 2024
    5553fc1

Commits on Sep 27, 2024

  1. ADLR/megatron-lm!1950 - Remove pkg_resources package

    Co-authored-by: Xin Yao <xiny@nvidia.com>
    Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>
    3 people authored and ko3n1g committed Sep 27, 2024
    0976661
  2. Merge branch 'fix_version_checks' into 'main'

    Remove pkg_resources package
    
    See merge request ADLR/megatron-lm!1950
    ko3n1g committed Sep 27, 2024
    1585be2
  3. 2bad957
  4. Merge branch 'ko3n1g/ci/onboard-cw' into 'main'

    ci: Onboard CW
    
    See merge request ADLR/megatron-lm!2142
    ko3n1g committed Sep 27, 2024
    12c2696

Commits on Sep 28, 2024

  1. ADLR/megatron-lm!2158 - Small changes to export

    Co-authored-by: Shanmugam Ramasamy <shanmugamr@shanmugamr-mlt.client.nvidia.com>
    Co-authored-by: Shanmugam Ramasamy <shanmugamr@login-eos01.eos.clusters.nvidia.com>
    3 people authored and ericharper committed Sep 28, 2024
    3428cd9
  2. Merge branch 'new_export' into 'main'

    Small changes to export
    
    See merge request ADLR/megatron-lm!2158
    ericharper committed Sep 28, 2024
    b3375a0

Commits on Sep 30, 2024

  1. 5b7374a
  2. Merge branch 'boxiangw/mla_backwards_comp' into 'main'

    Fix rope backward compatibility
    
    See merge request ADLR/megatron-lm!2152
    jaredcasper committed Sep 30, 2024
    6ad11b0

Commits on Oct 1, 2024

  1. ca6d170
  2. Merge branch 'auto_cudagraph_val_fix' into 'main'

    [Bug fix] Don't trace graphs during inference
    
    See merge request ADLR/megatron-lm!2140
    ericharper committed Oct 1, 2024
    dddecd1
  3. ADLR/megatron-lm!2109 - Adding more MR tests for T5 (e.g., transformer_engine, distributed_checkpoint)
    
    Co-authored-by: Huy Vu2 <huvu@login-eos01.eos.clusters.nvidia.com>
    2 people authored and ko3n1g committed Oct 1, 2024
    5ab659b
  4. Merge branch 'huvu/t5_dist_checkpoint_mrtests' into 'main'

    Adding more MR tests for T5 (e.g., transformer_engine, distributed_checkpoint)
    
    See merge request ADLR/megatron-lm!2109
    ko3n1g committed Oct 1, 2024
    3efa8c2
  5. f07581b
  6. Merge branch 'ko3n1g/ci/artifacts' into 'main'

    ci: Download artifacts
    
    See merge request ADLR/megatron-lm!2164
    ko3n1g committed Oct 1, 2024
    85cd99b

Commits on Oct 2, 2024

  1. 858694f
  2. Merge branch 'ko3n1g/ci/backwards-tag' into 'main'

    ci: Bump version
    
    See merge request ADLR/megatron-lm!2165
    jaredcasper committed Oct 2, 2024
    065260b

Commits on Oct 3, 2024

  1. ADLR/megatron-lm!2153 - Add the interface to set TP communication bootstrap backend
    
    Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>
    2 people authored and ericharper committed Oct 3, 2024
    f76b465
  2. Merge branch 'tp_bootstrap_backend' into 'main'

    Add the interface to set TP communication bootstrap backend
    
    See merge request ADLR/megatron-lm!2153
    ericharper committed Oct 3, 2024
    25f7da2
  3. ADLR/megatron-lm!2095 - Add support for SigLIP vision encoder to multimodal mcore

    Matthieu Le authored and jaredcasper committed Oct 3, 2024
    50042ff
  4. Merge branch 'convert_siglip_model' into 'main'

    Add support for SigLIP vision encoder to multimodal mcore
    
    See merge request ADLR/megatron-lm!2095
    jaredcasper committed Oct 3, 2024
    4d5f94d

Commits on Oct 4, 2024

  1. ADLR/megatron-lm!2175 - adding cu_seqlens_padded support in MCore

    Co-authored-by: root <root@cw-dfw-h100-002-248-012.cm.cluster>
    Co-authored-by: Lifu Zhang <tomzhanglf@gmail.com>
    3 people authored and ericharper committed Oct 4, 2024
    2aaf85d
  2. Merge branch 'add_cu_seqlens_padded_support' into 'main'

    adding cu_seqlens_padded support in MCore
    
    See merge request ADLR/megatron-lm!2175
    ericharper committed Oct 4, 2024
    c02b335
  3. ADLR/megatron-lm!2181 - Fixing attention mask dimensions to support TE versions > 1.9
    
    Co-authored-by: Shanmugam Ramasamy <shanmugamr@shanmugamr-mlt.client.nvidia.com>
    2 people authored and ericharper committed Oct 4, 2024
    ee9dba2
  4. Merge branch 'fixattnmask' into 'main'

    Fixing attention mask dimensions to support TE versions > 1.9
    
    See merge request ADLR/megatron-lm!2181
    ericharper committed Oct 4, 2024
    fde8bb1
  5. 843a22e
  6. Merge branch 'yueshen/rotary_scaling_fix_llama3_1' into 'main'

    rotary_scaling fix for llama3.1 and 3.2
    
    See merge request ADLR/megatron-lm!2180
    ericharper committed Oct 4, 2024
    b98ec86
  7. 827d5b6
  8. Merge branch 'ko3n1g/ci/fix-launch-script-generator' into 'main'

    chore: Improve generator for launch scripts
    
    See merge request ADLR/megatron-lm!2185
    ko3n1g committed Oct 4, 2024
    31fe61a

Commits on Oct 5, 2024

  1. ADLR/megatron-lm!2160 - Adding Inference pipeline for T5

    Co-authored-by: Eric Harper <eharper@nvidia.com>
    Co-authored-by: Huy Vu2 <huvu@login-eos01.eos.clusters.nvidia.com>
    3 people committed Oct 5, 2024
    e2a1c52
  2. Merge branch 'huvu/t5_generate' into 'main'

    Adding Inference pipeline for T5
    
    See merge request ADLR/megatron-lm!2160
    ericharper committed Oct 5, 2024
    0acda93
  3. 2f9ac3c
  4. Merge branch 'ko3n1g/ci/group-runs' into 'main'

    ci: Group runs by model
    
    See merge request ADLR/megatron-lm!2182
    ko3n1g committed Oct 5, 2024
    edb51fc
  5. ADLR/megatron-lm!1862 - Cpu init te

    Co-authored-by: William Dykas <wdykas@cw-dfw-cs-001-dc-02.cm.cluster>
    Co-authored-by: root <root@cw-dfw-h100-001-097-026.cm.cluster>
    Co-authored-by: William Dykas <wdykas@cs-cw-dfw-login-01.cm.cluster>
    4 people authored and ko3n1g committed Oct 5, 2024
    cf0d855
  6. Merge branch 'cpu-init-te' into 'main'

    Cpu init te
    
    See merge request ADLR/megatron-lm!1862
    ko3n1g committed Oct 5, 2024
    0e6bef1
  7. 6939737
  8. Merge branch 'ko3n1g/ci/run-script-after-export' into 'main'

    ci: Run script after export
    
    See merge request ADLR/megatron-lm!2186
    ko3n1g committed Oct 5, 2024
    73e7b58

Commits on Oct 7, 2024

  1. 6ca379e
  2. Merge branch 'runtime-upcycling' into 'main'

    Fix upcycling issues.
    
    See merge request ADLR/megatron-lm!2089
    ericharper committed Oct 7, 2024
    ff5cee9
  3. a559ec1
  4. Merge branch 'ko3n1g/ci/fix-env-export' into 'main'

    tests: Fix ENV export
    
    See merge request ADLR/megatron-lm!2189
    ko3n1g committed Oct 7, 2024
    3f90b98

Commits on Oct 9, 2024

  1. e108535
  2. Merge branch 'ko3n1g/ci/fix-env-export' into 'main'

    tests: Fix ENV export
    
    See merge request ADLR/megatron-lm!2194
    ko3n1g committed Oct 9, 2024
    3f43927
  3. ADLR/megatron-lm!1790 - GroupedMLP DistOpt Resharding and add UTs to ChainedOptimizer Support for distributed checkpointing

    hxbai authored and ko3n1g committed Oct 9, 2024
    fbdc916
  4. Merge branch 'hongxiaob/moe_dist_ckpt' into 'main'

    GroupedMLP DistOpt Resharding and add UTs to ChainedOptimizer Support for distributed checkpointing
    
    See merge request ADLR/megatron-lm!1790
    ko3n1g committed Oct 9, 2024
    b1218b9
  5. 5776d06
  6. Merge branch 'ko3n1g/ci/always-artifacts' into 'main'

    ci: Always upload artifacts
    
    See merge request ADLR/megatron-lm!2197
    ko3n1g committed Oct 9, 2024
    bf74129
  7. 0e3eaa5
  8. Merge branch 'trintamaki/data-parallel-inference' into 'main'

    Data parallel inference
    
    See merge request ADLR/megatron-lm!2141
    jon-barker committed Oct 9, 2024
    fcdbf90
  9. ADLR/megatron-lm!2199 - Remove CUDA requirement from cpu test.

    Vitaly Kurin authored and ko3n1g committed Oct 9, 2024
    37a2116
  10. Merge branch 'vitalyk/testfix' into 'main'

    Remove CUDA requirement from cpu test.
    
    See merge request ADLR/megatron-lm!2199
    ko3n1g committed Oct 9, 2024
    228dc20

Commits on Oct 10, 2024

  1. f462160
  2. Merge branch 'packed_seq_padded_support' into 'main'

    Support padding between subsequences of Packed Sequence
    
    See merge request ADLR/megatron-lm!2096
    ericharper committed Oct 10, 2024
    7e90ec0
  3. 566d9cd
  4. Merge branch 'revert-228dc204' into 'main'

    Revert "Merge branch 'vitalyk/testfix' into 'main'"
    
    See merge request ADLR/megatron-lm!2206
    ko3n1g committed Oct 10, 2024
    b60f5d0

Commits on Oct 11, 2024

  1. ADLR/megatron-lm!1909 - Standard interface for getting offsets from tokenizers

    Sanjeev Satheesh authored and ericharper committed Oct 11, 2024
    13c39ac
  2. Merge branch 'sasatheesh/tokenizer_offsets' into 'main'

    Standard interface for getting offsets from tokenizers
    
    See merge request ADLR/megatron-lm!1909
    ericharper committed Oct 11, 2024
    47bb8d1
  3. 8c018ca
  4. Merge branch 'ko3n1g/ci/flaky-marker' into 'main'

    tests: Use flaky instead of skip marker
    
    See merge request ADLR/megatron-lm!2208
    ko3n1g committed Oct 11, 2024
    772faca

Commits on Oct 16, 2024

  1. 831d64d
  2. Merge branch 'ko3n1g/chore/bump-pyt' into 'main'

    chore: Bump Pytorch container
    
    See merge request ADLR/megatron-lm!2017
    ko3n1g committed Oct 16, 2024
    4876ee1
  3. bc4874c
  4. Merge branch 'add_siglip_converter' into 'main'

    Add siglip converter to multimodal example
    
    See merge request ADLR/megatron-lm!2214
    jon-barker committed Oct 16, 2024
    6bafe92
  5. a30d63b
  6. Merge branch 'dnarayanan/fix_import' into 'main'

    Add missing import to megatron/training/initialize.py
    
    See merge request ADLR/megatron-lm!2226
    deepakn94 committed Oct 16, 2024
    0d89fc4

Commits on Oct 18, 2024

  1. 33d2f45
  2. Merge branch 'ko3n1g/ci/refactor-jobs' into 'main'

    ci(refactor): Facelift gitlab-ci
    
    See merge request ADLR/megatron-lm!2223
    ko3n1g committed Oct 18, 2024
    55622ff
  3. cba8bdc
  4. Merge branch 'ko3n1g/ci/test-dependencies' into 'main'

    ci: Set stronger dependencies
    
    See merge request ADLR/megatron-lm!2234
    ko3n1g committed Oct 18, 2024
    ecf0dbe

Commits on Oct 19, 2024

  1. 839dff2
  2. Merge branch 'duncan/triton-cache-fix' into 'main'

    Triton cache fix
    
    See merge request ADLR/megatron-lm!2075
    ericharper committed Oct 19, 2024
    b7814bb
  3. a9c16c5
  4. Merge branch 'lit/fix_multi_tensor_scale' into 'main'

    fix an issue when using `multi_tensor_scale` from TE
    
    See merge request ADLR/megatron-lm!1939
    ericharper committed Oct 19, 2024
    02d1762
  5. ADLR/megatron-lm!1927 - Improved missing key exception for errors during checkpoint io

    jstjohn authored and ericharper committed Oct 19, 2024
    6adf0bd
  6. Merge branch 'jstjohn/improved_missing_key_exception' into 'main'

    Improved missing key exception for errors during checkpoint io
    
    See merge request ADLR/megatron-lm!1927
    ericharper committed Oct 19, 2024
    db6cb4e
  7. 2c950a5
  8. Merge branch 'pmannan/llava_debug' into 'main'

    LLaVA Multimodal SP support
    
    See merge request ADLR/megatron-lm!2038
    ericharper committed Oct 19, 2024
    739177e
  9. ADLR/megatron-lm!2227 - qwen2.5 conversion

    Tyler Poon authored and jon-barker committed Oct 19, 2024
    d28e26e
  10. Merge branch 'qwen25_conversion' into 'main'

    qwen2.5 conversion
    
    See merge request ADLR/megatron-lm!2227
    jon-barker committed Oct 19, 2024
    db7d37b