05 Dec 20:59

snarayan21

83f12ee

v0.15.1 Latest

What's Changed

Bump version 0.16.0.dev0 by @j316chuck in #1667
Update mlflow requirement from <2.18,>=2.14.1 to >=2.14.1,<2.19 by @dependabot in #1673
Speed up embedding tests by @dakinggg in #1668
Add mcli yaml version bump by @j316chuck in #1674
Bump Openai version by @snarayan21 in #1684
Bump Streaming to v0.10.0 by @snarayan21 in #1685
Bugfix auto packing with streams + no remote path by @mattyding in #1679
Bump Composer to v0.28.0 by @snarayan21 in #1687
Expose DistributedSampler RNG seed argument by @janEbert in #1677
Add llama3 ft example yamls by @j316chuck in #1686

New Contributors

@janEbert made their first contribution in #1677

Full Changelog: v0.15.0...v0.15.1

Contributors

janEbert, j316chuck, and 4 other contributors

23 Nov 02:13

j316chuck

v0.15.0

8982b2c

v0.15.0

New Features

Open Source Embedding + Contrastive Code (#1615)

LLM Foundry now supports finetuning embedding models with contrastive loss. Negative passages for the contrastive loss can be either randomly selected or pre-defined. For more information, please see the readme.
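
As a rough illustration of the idea (this is not LLM Foundry's implementation), the sketch below computes an InfoNCE-style contrastive loss for one query, one positive passage, and a set of negatives that are either pre-defined or randomly sampled from a pool. The names `pick_negatives` and `info_nce_loss`, the temperature value, and the toy vectors are all assumptions made for this example.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def pick_negatives(pool, k, predefined=None, seed=0):
    """Hypothetical helper: use pre-defined negatives if given, else sample k at random."""
    if predefined is not None:
        return list(predefined)
    return random.Random(seed).sample(pool, k)


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Cross-entropy over cosine similarities, with the positive at index 0."""
    q = normalize(query)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(q, c) / temperature for c in candidates]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


# Toy 2-D "embeddings" (assumed values, for illustration only).
query = [1.0, 0.0]
positive = [0.9, 0.1]
pool = [[0.0, 1.0], [-1.0, 0.0], [0.1, -1.0]]

random_negatives = pick_negatives(pool, k=2)
fixed_negatives = pick_negatives(pool, k=2, predefined=[[0.0, 1.0], [-1.0, 0.0]])
print(info_nce_loss(query, positive, random_negatives))
print(info_nce_loss(query, positive, fixed_negatives))
```

The loss is small when the query is much closer to the positive than to any negative, and grows as negatives become more similar to the query.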

PyTorch 2.5.1 (#1665)

This release updates LLM Foundry to the PyTorch 2.5.1 release, bringing with it support for the new features and optimizations in PyTorch 2.5.1.

Improved error messages (#1657, #1660, #1623, #1625)

Improved several error messages to make user errors easier to debug.

What's Changed

Update mcli examples to use 0.14.0 by @irenedea in #1624
Open Source Embedding + Contrastive Code by @KuuCi in #1615
Catch delta table not found error by @milocress in #1625
Add Mlflow 403 PL UserError by @mattyding in #1623
Catches when data prep cluster fails to start by @milocress in #1628
Bump mlflow max version by @dakinggg in #1629
add another cluster connection failure wrapper by @milocress in #1630
Add MLflow log_model option by @nancyhung in #1544
Move loss generating token counting to the dataloader by @dakinggg in #1632
Bump databricks-connect from 14.1.0 to 15.4.3 by @dependabot in #1636
Fix dataset download location by @dakinggg in #1639
Revert "Bump databricks-connect from 14.1.0 to 15.4.3" by @XiaohanZhangCMU in #1640
Bump transformers version by @dakinggg in #1631
Fix gpu tests test_tp_train and test_huggingface_conversion_callback_interval by @irenedea in #1642
Update datasets requirement from <2.20,>=2.19 to >=2.20.0,<2.21 by @dependabot in #1330
Add max shard size to transformers save_pretrained by @b-chu in #1648
Update huggingface-hub requirement from <0.25,>=0.19.0 to >=0.19.0,<0.27 by @dependabot in #1652
Update accelerate requirement from <0.34,>=0.25 to >=0.25,<1.2 by @dependabot in #1633
Catch Delta Table Not Found by @KuuCi in #1653
Add Exception for missing UC column by @milocress in #1654
Infer step size for Embeddings by @KuuCi in #1647
Pin FAv2 by @mvpatel2000 in #1656
Retry catching BlockingIOError by @KuuCi in #1657
Catch bad data prep by @milocress in #1644
Update pytest-cov requirement from <6,>=4 to >=4,<7 by @dependabot in #1663
Bump coverage[toml] from 7.6.1 to 7.6.4 by @dependabot in #1650
Move transform_model_pre_registration in hf_checkpointer by @irenedea in #1664
Catch Cluster Permissions Error by @KuuCi in #1660
Mosaicml version bump by @j316chuck in #1661
Changes for removing unused terms in CE loss fn by @gupta-abhay in #1643
Update setuptools requirement from <68.0.0 to <76.0.0 by @dependabot in #1662
Bump docker version to torch 2.5.1 by @j316chuck in #1665
Bump ubuntu 22.04 + torch 2.5.1 by @KuuCi in #1666

New Contributors

@mattyding made their first contribution in #1623

Full Changelog: v0.14.5...v0.15.0

Contributors

gupta-abhay, j316chuck, and 10 other contributors

18 Nov 17:15

irenedea

v0.14.5

aa5e1b9

v0.14.5

Move transform_model_pre_registration in hf_checkpointer (#1664)

Full Changelog: v0.14.4...v0.14.5

07 Nov 20:42

b-chu

v0.14.4

284c31b

v0.14.4

Add max shard size to transformers save_pretrained by @b-chu in #1648

Full Changelog: v0.14.3...v0.14.4

Contributors

b-chu

05 Nov 15:41

b-chu

v0.14.3

83c1afd

v0.14.3

What's Changed

Fix dataset download location by @dakinggg in #1639

Full Changelog: v0.14.2...v0.14.3

Contributors

dakinggg

04 Nov 02:14

dakinggg

v0.14.2

b320210

v0.14.2

Bug Fixes

Move loss generating token counting to the dataloader (#1632)

Fixes a throughput regression caused by #1610, which was released in v0.14.0.

What's Changed

Move loss generating token counting to the dataloader by @dakinggg in #1632

Full Changelog: v0.14.1...v0.14.2

Contributors

dakinggg

01 Nov 23:55

dakinggg

v0.14.1

fe69619

v0.14.1

New Features

Use log_model for registering models (#1544)

Instead of calling the MLflow register API directly, we now use the intended log_model API, which both logs the model to the MLflow run artifacts and registers it to Unity Catalog.

What's Changed

Catch delta table not found error by @milocress in #1625
Add Mlflow 403 PL UserError by @dakinggg in #1623
Catches when data prep cluster fails to start by @milocress in #1628
add another cluster connection failure wrapper by @milocress in #1630
Use log_model API to register the model by @nancyhung and @dakinggg in #1544

Full Changelog: v0.14.0...v0.14.1

Contributors

nancyhung, milocress, and dakinggg

28 Oct 22:41

irenedea

v0.14.0

8047c85

v0.14.0

New Features

Load Checkpoint Callback (#1570)

We added support for Composer's LoadCheckpoint callback, which loads a checkpoint at a specified event. This enables use cases like loading model base weights with peft.

callbacks:
    load_checkpoint:
        load_path: /path/to/your/weights

Breaking Changes

Accumulate over tokens in a Batch for Training Loss (#1618, #1610, #1595)

We added a new flag, accumulate_train_batch_on_tokens, which specifies whether training loss is accumulated over the number of tokens in a batch rather than the number of samples. It is true by default. This slightly changes loss curves for models trained with padding, because samples then contribute different numbers of tokens. The old behavior can be recovered by explicitly setting the flag to false.
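
To see why the loss curve shifts, here is a simplified sketch (not Composer's actual code) of combining per-microbatch mean losses using either token counts or sample counts as weights. The function name `accumulate_loss`, the tuple layout, and the numbers are assumptions chosen for illustration.

```python
def accumulate_loss(microbatches, on_tokens=True):
    """Combine per-microbatch mean losses into one batch loss.

    Each microbatch is (mean_loss, num_samples, num_loss_tokens).
    on_tokens=True weights by token count; False weights by sample count.
    """
    if on_tokens:
        weights = [tokens for _, _, tokens in microbatches]
    else:
        weights = [samples for _, samples, _ in microbatches]
    total = sum(weights)
    return sum(loss * w / total for (loss, _, _), w in zip(microbatches, weights))


# Two microbatches with the same number of samples but, because of padding,
# very different numbers of loss-generating tokens.
microbatches = [(2.0, 2, 100), (4.0, 2, 300)]

print(accumulate_loss(microbatches, on_tokens=True))   # token-weighted
print(accumulate_loss(microbatches, on_tokens=False))  # sample-weighted
```

When every microbatch has the same number of loss-generating tokens, the two weightings coincide, which is why the change mainly affects runs with padding.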

Default Run Name (#1611)

If no run name is provided, we now default to Composer's randomly generated run names. (Previously, the default run name was "llm".)

What's Changed

Update mcli examples to use 0.13.0 by @irenedea in #1594
Pass accumulate_train_batch_on_tokens through to composer by @dakinggg in #1595
Loosen MegaBlocks version pin by @mvpatel2000 in #1597
Add configurability for hf checkpointer register timeout by @dakinggg in #1599
Loosen MegaBlocks to <1.0 by @mvpatel2000 in #1598
Finetuning dataloader validation tweaks by @mvpatel2000 in #1600
Bump onnx from 1.16.2 to 1.17.0 by @dependabot in #1604
Remove TE from dockerfile and instead add as optional dependency by @snarayan21 in #1605
Data prep on multiple GPUs by @eitanturok in #1576
Add env var for configuring the maximum number of processes to use for dataset processing by @irenedea in #1606
Updated error message for cluster check by @nancyhung in #1602
Use fun default composer run names by @irenedea in #1611
Ensure log messages are properly formatted again by @snarayan21 in #1614
Add UC not enabled error for delta to json conversion by @irenedea in #1613
Use a temporary directory for downloading finetuning dataset files by @irenedea in #1608
Bump composer version to 0.26.0 by @irenedea in #1616
Add loss generating token counts by @dakinggg in #1610
Change accumulate_train_batch_on_tokens default to True by @dakinggg in #1618
Bump version to 0.15.0.dev0 by @irenedea in #1621
Add load checkpoint callback by @irenedea in #1570

Full Changelog: v0.13.0...v0.14.0

Contributors

irenedea, nancyhung, and 5 other contributors

18 Oct 16:50

dakinggg

v0.13.1

0354f5f

v0.13.1

🚀 LLM Foundry v0.13.1

What's Changed

Add configurability to HF checkpointer timeout by @dakinggg in #1599

Full Changelog: v0.13.0...v0.13.1

Contributors

dakinggg

15 Oct 06:23

irenedea

v0.13.0

18b0a6d

v0.13.0

🚀 LLM Foundry v0.13.0

🛠️ Bug Fixes & Cleanup

PyTorch 2.4 Checkpointing (#1569, #1581, #1583)

Resolved issues related to checkpointing for Curriculum Learning (CL) callbacks.

🔧 Dependency Updates

Bumped tiktoken from 0.4.0 to 0.8.0 (#1572)
Updated onnxruntime from 1.19.0 to 1.19.2 (#1590)

What's Changed

Update mcli yamls by @dakinggg in #1552
Use allenai/c4 instead of c4 dataset by @eitanturok in #1554
Tensor Parallelism by @eitanturok in #1521
Insufficient Permissions Error when trying to access table by @KuuCi in #1555
Add NoOp optimizer by @snarayan21 in #1560
Deterministic GCRP Errors by @KuuCi in #1559
Simplify CL API by @b-chu in #1510
Reapply #1389 by @dakinggg in #1561
Add dataset swap callback by @b-chu in #1536
Add error to catch more unknown example types by @milocress in #1562
Add FileExtensionNotFoundError by @b-chu in #1564
Add InvalidConversationError by @b-chu in #1565
Release docker img by @KuuCi in #1547
Revert FT dataloader changes from #1561, keep #1564 by @snarayan21 in #1566
Cleanup TP by @eitanturok in #1556
Changes for dataset swap callback by @gupta-abhay in #1569
Do not consider run_name when auto-detecting autoresume by @irenedea in #1571
Allow parameters with requires_grad=False in meta init by @sashaDoubov in #1567
Bump tiktoken from 0.4.0 to 0.8.0 by @dependabot in #1572
Add extensions to FinetuningFileNotFoundError by @b-chu in #1578
Handle long file names in convert text to mds by @irenedea in #1579
Set streaming log level by @mvpatel2000 in #1582
Fix pytorch checkpointing for CL callback by @b-chu in #1581
Fix pytorch checkpointing for CL callback by @b-chu in #1583
Error if filtered dataset contains 0 examples by @irenedea in #1585
Change cluster errors from NetworkError to UserError by @irenedea in #1586
Do not autoresume if a default name is set, only on user defined ones by @irenedea in #1588
Bump onnxruntime from 1.19.0 to 1.19.2 by @dependabot in #1590
Make FinetuningStreamingDataset parameters more flexible by @XiaohanZhangCMU in #1580
Add build callback tests by @irenedea in #1577
Bump version to 0.14.0.dev0 by @irenedea in #1587
Fix typo in eval code by using 'fsdp' instead of 'fsdp_config' by @irenedea in #1593

Full Changelog: v0.12.0...v0.13.0

Contributors

sashaDoubov, gupta-abhay, and 10 other contributors

Releases: mosaicml/llm-foundry
