Releases: bghira/SimpleTuner
v1.2.1 - free lunch edition
Features
This release will speed up all validations without any config changes.
- SageAttention (NVIDIA-only; must be installed manually for now)
  - By default, only speeds up inference. SDXL more than Flux due to differences in their respective bottlenecks.
  - Use `--attention_mechanism=sageattention` to enable this, and `--sageattention_usage=training+inference` to enable it for training as well as validations. This will probably make your model worse or collapse though.
- Optimised `--gradient_checkpointing` implementation
  - No longer applies during validations, so even without SageAttention we get a speedup (on a 4090+5800X3D) from 29 seconds for a Flux image to 15 seconds (SDXL goes from 15 seconds to 6 seconds)
- Added `--gradient_checkpointing_interval` which you can use to speed up Flux training at the cost of some additional VRAM.
  - Makes NF4 even more attractive for a 4090, where you can then use the SOAP optimiser in a meaningful way.
  - See the options guide for more information.
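As a rough illustration of the speed/VRAM trade-off behind a checkpointing interval, the sketch below lists which blocks would be checkpointed if an interval of n means "checkpoint every n-th block". That interpretation, the function name `checkpointed_blocks`, and the block count are assumptions made for illustration; this is not SimpleTuner's implementation, and the options guide remains the reference for the actual semantics.

```python
def checkpointed_blocks(num_blocks: int, interval: int) -> list[int]:
    """Return the indices of blocks that would be checkpointed.

    Assumption: an interval of n checkpoints every n-th block, so 1 means
    every block, and larger values checkpoint fewer blocks (less
    recomputation in the backward pass, more activations held in VRAM).
    """
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return [i for i in range(num_blocks) if i % interval == 0]


if __name__ == "__main__":
    # Hypothetical 8-block model; not a real Flux layer count.
    for n in (1, 2, 4):
        print(f"interval={n}: checkpoint blocks {checkpointed_blocks(8, n)}")
```

Under this assumption, a larger interval means fewer blocks are recomputed during the backward pass, which is where the speed-up would come from, while more activations stay resident in memory.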
What's Changed
- Add SageAttention for substantial training speed-up by @bghira in #1182
- SageAttention: make it inference-only by default by @bghira in #1183
- gradient checkpointing speed-up by @bghira in #1184
- add gradient checkpointing option to docs by @bghira in #1185
- merge by @bghira in #1186
Full Changelog: v1.2...v1.2.1
v1.2 - EMA for LoRA/Lycoris training
Features
- EMA is reworked. Previous training runs using EMA should not update to this release. Your checkpoints will not load the EMA weights correctly.
- EMA now works fully for PEFT Standard LoRA and Lycoris adapters (tested LoKr only)
- When EMA is enabled, side-by-side comparisons are now done by default (can be disabled with `--ema_validation=ema_only` or `none`)
Example; the starting model benchmark is on the left as before, the centre is the training Lycoris adapter, and the right side is the EMA weights. (SD3.5 Medium)
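For readers unfamiliar with EMA, the following is a minimal sketch of the generic exponential-moving-average update applied to a flat list of adapter weights. It is not SimpleTuner's code: the function name `ema_update`, the toy weights, and the decay of 0.5 are all chosen for illustration (real runs typically use a decay much closer to 1).

```python
def ema_update(ema: list[float], weights: list[float], decay: float) -> list[float]:
    """Blend the current weights into the running average.

    Implements ema = decay * ema + (1 - decay) * weights, element by element.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


if __name__ == "__main__":
    ema = [0.0, 0.0]               # EMA copy of two toy adapter weights
    trained = [1.0, 2.0]           # pretend the trained weights stay fixed
    for _ in range(3):
        ema = ema_update(ema, trained, decay=0.5)
    print(ema)                     # the EMA copy trails the trained weights
```

The EMA copy lags behind the live weights and smooths out step-to-step noise, which is why it is worth validating it side by side with the adapter being trained.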
Bugfixes
- Text encoders are now properly quantised if the parameter is given; previously they stayed in bf16
- Updated doc reference link to caption filter example
What's Changed
- quantise text encoders upon request correctly by @bghira in #1167
- merge minor follow-up fixes by @bghira in #1168
- (experimental) Allow EMA on LoRA/Lycoris networks by @bghira in #1170
- Update `caption_filter_list.txt.example` reference by @emmanuel-ferdman in #1178
- merge EMA LoRA/Lycoris support by @bghira in #1176
New Contributors
- @emmanuel-ferdman made their first contribution in #1178
Full Changelog: v1.1.5...v1.2
v1.1.5 - better validations for SD3.5M and Lycoris users
Features
- Flow-matching models like SD3 and Flux can use uniform schedule sampling again, mirroring the v0.9.x release cycle from early August
- More model card details for Hugging Face Hub
- SD3.5 Medium: skip-layer guidance for validation outputs to more closely match usual workflow results
- SD3.x: Allow configuring T5 and CLIP padding values (default to empty string)
- Added `--vae_enable_tiling` for reducing VAE overhead on 2048px training for SD3.5 Medium on smaller GPUs
- CLIP score tracking for validations by adding `--evaluation_type=clip` to your config
- LyCORIS training can now have a specific strength set during validations using `--validation_lycoris_strength` to mirror the typical workflows found in ComfyUI etc. A recommended value is 1.0 (default) or 1.3. Using a value lower than 1.0 can help to avoid seeing a model "blow up" when you intend on using it at a lower weight later, anyway.
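To make the idea of an adapter "strength" concrete, here is a small sketch that treats the strength as a multiplier on the adapter's contribution to a weight. This mirrors how strength sliders commonly behave in inference tools, but it is an assumption about the general technique rather than a description of SimpleTuner's or LyCORIS's internals; `apply_adapter` and the toy numbers are hypothetical.

```python
def apply_adapter(base: list[float], delta: list[float], strength: float = 1.0) -> list[float]:
    """Return base + strength * delta, element by element.

    `base` stands in for frozen model weights and `delta` for the change
    the trained adapter would add at full strength.
    """
    return [b + strength * d for b, d in zip(base, delta)]


if __name__ == "__main__":
    base = [1.0, 2.0]
    delta = [0.5, -1.0]
    for strength in (0.5, 1.0):
        print(strength, apply_adapter(base, delta, strength))
```

Validating at the strength you intend to use later gives a more honest preview than always validating at 1.0.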
Bugfixes
- Torch compile for validation fixed, now works (it did nothing before)
- Torch compile disabled for LyCORIS models
- Better SD3 quantisation performance via quanto by excluding layers from the quantisation
- Flux: default shift value to 3 instead of 1
- SD1.5 LoRA save fixed
- Quanto typo for FP8 fixed
- Multi-caption parquet backend crashing fixed
- File-locking issue with concurrent text embed writes on multi-GPU systems fixed
Pull requests
- experimental: remove some layers from quanto by @bghira in #1085
- merge by @bghira in #1086
- flux: modify the quanto default excluded layers to be different from sd3 by @bghira in #1087
- sd3: allow configuring clip and t5 uncond values by @bghira in #1088
- merge by @bghira in #1089
- fix SD3 text embed creation; downgrade to pytorch 2.4.1 by @bghira in #1093
- update docs and sd3 parameter defaults by @bghira in #1094
- Small link update in TUTORIAL by @rootonchair in #1095
- (#1097) resolve sd15 lora save error by @bghira in #1102
- fix(typo): correct arg name in warning by @Jannchie in #1099
- merge by @bghira in #1103
- Add deduplication of captions by @mhirki in #1104
- Throw an error if both --flux_schedule_auto_shift and --flux_schedule_shift are enabled. by @mhirki in #1106
- Fix unit test failure after PR #1106 by @mhirki in #1107
- Fix gO variable name by @samedii in #1108
- disable caption deduplication as it prevents multigpu caching; add warning for sd3 using wrong VAE; cleanly terminate and restart batch text embed writing thread by @bghira in #1111
- merge by @bghira in #1112
- sd3: revert enforcement of sd35 flow_matching_loss values by @bghira in #1115
- Updating Flux Quickstart Doc with Pre-Trained Model Info by @riffmaster-2001 in #1116
- merge by @bghira in #1123
- Fix missing docker dependencies by @Putzzmunta in #1126
- Fix multi-caption parquets crashing in multiple locations (Closes #1092) by @AmericanPresidentJimmyCarter in #1109
- sd3: add skip layer guidance by @bghira in #1125
- sd3: model card detail expansion by @bghira in #1130
- flux and sd3 could use uniform sampling instead of beta or sigmoid by @bghira in #1129
- Fix random validation errors for good (and restore torch.compile for the validation pipeline at the same time) by @mhirki in #1131
- merge by @bghira in #1132
- revamp model card to work by default and provide quanto hints by @bghira in #1133
- validation: disable compile for lycoris by @bghira in #1136
- add --vae_enable_tiling to encode large res images with less vram used by @bghira in #1141
- s3: when file does not exist, handle generic 404 error for headobject by @bghira in #1142
- trainer: enable vae tiling when enabled by @bghira in #1143
- validation: fix error when torch compile is disabled for lycoris by @bghira in #1144
- add clip score tracking by @bghira in #1146
- add documentation updates by @bghira in #1150
- merge by @bghira in #1151
- metadata: add more ddpm related schedule info to the model card by @bghira in #1152
- local data backend should have file locking for writes and reads by @bghira in #1160
- chore: ignore rmtree errors by @bghira in #1162
- validation: allow setting a non-default strength for validation with lycoris by @bghira in #1161
- add more info to model card, refine contents by @bghira in #1163
- error out when cache dir path is not found by @bghira in #1164
- merge by @bghira in #1165
New Contributors
- @rootonchair made their first contribution in #1095
- @Jannchie made their first contribution in #1099
- @samedii made their first contribution in #1108
- @Putzzmunta made their first contribution in #1126
Full Changelog: v1.1.4...v1.1.5
v1.1.4
Support for SD 3.5 fine-tuning.
Stability AI has provided a tutorial on using SimpleTuner for this task, and SimpleTuner provides its own SD3 quickstart.
What's Changed
- update to diffusers v0.31 for SD3.5 by @bghira in #1082
- merge masked loss + reg image fixes by @bghira in #1080
- update rocm, mps and nvidia to torch 2.5 by @bghira in #1081
- merge by @bghira in #1083
Full Changelog: v1.1.3...v1.1.4
v1.1.3
- Nested subdir datasets will now have caches also nested in subdirectories, which unfortunately will most likely require regenerating these entries. Sorry - it was not feasible to keep the old structure working in parallel.
- FlashAttention3 fixes for H100 nodes by downgrading default torch version to 2.4.1
- Resume fixes for multi-gpu/multi-node state/epoch tracking
- Other misc bugfixes
What's Changed
- fix flux attn masked transformer modeling code by @bghira in #1055
- merge by @bghira in #1056
- fix rope function for FA3 by @bghira in #1057
- merge by @bghira in #1058
- lokr: resume by default training state if not found by @bghira in #1060
- merge by @bghira in #1061
- Restore init_lokr_norm functionality by @imit8ed in #1065
- refactor how masks are retrieved by @bghira in #1066
- nvidia dependency update for pytorch-triton / aiohappyeyeballs by @bghira in #1062
- downgrade cuda to pt241 by default by @bghira in #1067
- add nightly build for pt26 by @bghira in #1068
- Add recropping script for image JSON metadata backends by @AmericanPresidentJimmyCarter in #1063
- merge by @bghira in #1069
- bugfix: restore sampler state on rank 0 correctly by @bghira in #1071
- merge by @bghira in #1072
- fix vae cache dir creation for subdirs by @bghira in #1076
- fix for nested image subdirs w/ duplicated filenames across subdirs by @bghira in #1078
New Contributors
Full Changelog: v1.1.2...v1.1.3
v1.1.2 - masked loss and strong prior preservation
New stuff
- New `is_regularisation_data` option for datasets, works great
- H100 or greater now has better torch compile support
- SDXL ControlNet training is back, now with quantised base model (int8)
- Multi-node training works now, with a guide to deploy it easily
- configure.py can now generate a very rudimentary user prompt library for you if you are in a hurry
- Flux model cards now have more useful information about your Flux training setup
- Masked loss training & a demo script in the toolkit dir for generating a folder of image masks
What's Changed
- quanto: improve support for SDXL training by @bghira in #1027
- Fix attention masking transformer for flux by @AmericanPresidentJimmyCarter in #1032
- merge by @bghira in #1036
- H100/H200/B200 FlashAttention3 for Flux + TorchAO improvements by @bghira in #1033
- utf8 fix for emojis in dataset configs by @bghira in #1037
- fix venv instructions and edge case for aspect crop bucket list by @bghira in #1038
- merge by @bghira in #1039
- multi-node training fixes for state tracker by @bghira in #1040
- merge bugfixes by @bghira in #1041
- configure.py can configure caption strategy by @bghira in #1042
- regression by @bghira in #1043
- fix multinode state resumption by @bghira in #1044
- merge by @bghira in #1045
- validations can crash when sending updates to wandb by @bghira in #1046
- aws: do not give up on fatal errors during exists() by @bghira in #1047
- merge by @bghira in #1048
- add prompt expander based on 1B Llama model by @bghira in #1049
- implement regularisation dataset parent-student loss for LyCORIS training by @bghira in #1050
- metadata: add more flux model card details by @bghira in #1051
- merge by @bghira in #1052
- fix controlnet training for sdxl and introduce masked loss preconditioning by @bghira in #1053
- merge by @bghira in #1054
Full Changelog: v1.1.1...v1.1.2
v1.1.1 - bring on the potato models
Trained with NF4 via PagedLion8Bit.
- New custom timestep distribution for Flux via `--flux_use_beta_schedule`, `--flux_beta_schedule_alpha`, `--flux_beta_schedule_beta` (#1023)
- The trendy AdEMAMix, its 8bit and paged counterparts are all now available as `bnb-ademamix`, `bnb-ademamix-8bit`, and `bnb-ademamix8bit-paged`
- All low-bit optimisers from Bits n Bytes are now included for NVIDIA and ROCm systems
- NF4 training on NVIDIA systems down to 9090M total using Lion8Bit and 512px training at 1.5 sec/iter on a 4090
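The Beta timestep schedule mentioned above can be pictured with a short standard-library sketch: instead of drawing training timesteps uniformly from [0, 1], draw them from a Beta(alpha, beta) distribution so that some regions are visited more often. The function name, the seed, and the values alpha = beta = 2.0 are illustrative assumptions, not SimpleTuner's defaults or code.

```python
import random


def sample_timesteps(n: int, alpha: float, beta: float, seed: int = 0) -> list[float]:
    """Draw n timesteps in [0, 1] from a Beta(alpha, beta) distribution."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.betavariate(alpha, beta) for _ in range(n)]


if __name__ == "__main__":
    ts = sample_timesteps(10_000, alpha=2.0, beta=2.0)
    middle = sum(0.25 <= t <= 0.75 for t in ts) / len(ts)
    # A uniform schedule would put about half of the samples in this band.
    print(f"fraction of samples in [0.25, 0.75]: {middle:.2f}")
```

With alpha and beta both above 1 the samples concentrate around the middle of the range; unequal values skew the distribution toward one end.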
What's Changed
- int8-quanto followup fixes (batch size > 1) by @bghira in #1016
- merge by @bghira in #1018
- update doc by @bghira in #1019
- update docs by @bghira in #1025
- Add the ability to use a Beta schedule to select Flux timesteps by @AmericanPresidentJimmyCarter in #1023
- AdEMAMix, 8bit Adam/AdamW/Lion/Adagrad, Paged optimisers by @bghira in #1026
- Bits n Bytes NF4 training by @bghira in #1028
- merge by @bghira in #1029
Full Changelog: v1.1...v1.1.1
v1.1 - API-friendly edition
Features
Performance
- Improved launch speed for large datasets (>1M samples)
- Improved speed for quantising on CPU
- Optional support for directly quantising on GPU near-instantly (`--quantize_via`)
Compatibility
- SDXL, SD1.5 and SD2.x compatibility with LyCORIS training
- Updated documentation to make multiGPU configuration a bit more obvious.
- Improved support for `torch.compile()`, including automatically disabling it when e.g. `fp8-quanto` is enabled
  - Enable via `accelerate config` or `config/config.env` via `TRAINER_DYNAMO_BACKEND=inductor`
- TorchAO for quantisation as an alternative to Optimum Quanto for int8 weight-only quantisation (`int8-torchao`)
- `f8uz-quanto`, a compatibility level for AMD users to experiment with FP8 training dynamics
- Support for multigpu PEFT LoRA training with Quanto enabled (not `fp8-quanto`)
  - Previously, only LyCORIS would reliably work with quantised multigpu training sessions.
- Ability to quantise models when full-finetuning, without warning or error. Previously, this configuration was blocked. Your mileage may vary, it's an experimental configuration.
Integrations
- Images now get logged to tensorboard (thanks @anhi)
- FastAPI endpoints for integrations (undocumented)
- "raw" webhook type that sends a large number of HTTP requests containing events, useful for push notification type service
Optims
- SOAP optimiser support
- uses fp32 gradients, nice and accurate but uses more memory than other optims, by default slows down every 10 steps as it preconditions
- New 8bit and 4bit optimiser options from TorchAO (`ao-adamw8bit`, `ao-adamw4bit` etc)
Pull Requests
- Fix flux cfg sampling bug by @AmericanPresidentJimmyCarter in #981
- merge by @bghira in #982
- FastAPI endpoints for managing trainer as a service by @bghira in #969
- constant lr resume fix for optimi-stableadamw by @bghira in #984
- clear data backends before configuring new ones by @bghira in #992
- update to latest quanto main by @bghira in #994
- log images in tensorboard by @anhi in #998
- merge by @bghira in #999
- torchao: add int8; quanto: add NF4; torch compile fixes + ability to compile optim by @bghira in #986
- update flux quickstart by @bghira in #1000
- compile optimiser by @bghira in #1001
- optimizer compile step only by @bghira in #1002
- remove optimiser compilation arg by @bghira in #1003
- remove optim compiler from options by @bghira in #1004
- remove optim compiler from options by @bghira in #1005
- SOAP optimiser; int4 fixes for 4090 by @bghira in #1006
- torchao: install 0.5.0 from pytorch source by @bghira in #1007
- update safety check warning with guidance toward cache clear interval for OOM issues by @bghira in #1008
- fix webhook contents for discord by @bghira in #1011
- fp8-quanto fixes, unblocking of PEFT multigpu LoRA training for other precision levels by @bghira in #1013
- quanto: activations sledgehammer by @bghira in #1014
- 1.1 merge window by @bghira in #1010
Full Changelog: v1.0.1...v1.1
v1.0.1
This is a maintenance release with not many new features.
What's Changed
- fix reference error to use_dora by @bghira in #929
- fix merge error by @bghira in #930
- fix use of --num_train_epochs by @bghira in #932
- merge fixes by @bghira in #934
- documentation updates, deepspeed config reference error fix by @bghira in #935
- Fix caption_with_cogvlm.py for cogvlm2 + textfile strategy by @burgalon in #936
- dependency updates, cogvlm fixes, peft/lycoris resume fix by @bghira in #939
- feature: zero embed padding for t5 on request by @bghira in #941
- merge by @bghira in #942
- comet_ml validation images by @burgalon in #944
- Allow users to init their LoKr with perturbed normal w2 by @AmericanPresidentJimmyCarter in #943
- merge by @bghira in #948
- fix typo in PR by @bghira in #949
- update arg name for norm init by @bghira in #950
- configure script should not set dropout by default by @bghira in #955
- VAECache: improve startup speed for large sets by @bghira in #956
- Update FLUX.md by @anae-git in #957
- mild bugfixes by @bghira in #963
- fix bucket worker not waiting for all queue worker to finish by @burgalon in #967
- merge by @bghira in #968
- fix DDP for PEFT LoRA & minor exit error by @bghira in #974
New Contributors
Full Changelog: v1.0...v1.0.1
v1.0 the total recall edition
Everything has changed! And yet, nothing has. Some defaults may. No, will - be different. It's hard to know which ones.
For those who can do so, it's recommended to use `configure.py` to reconfigure your environment on this new release.
It should go without saying, but for those in the middle of a training run, do not upgrade to this release until you finish.
Refactoring and Enhancements:
- Refactor `train.py` into a Trainer Class:
  - The core logic of `train.py` has been restructured into a `Trainer` class, improving modularity and maintainability.
  - Exposes an SDK for reuse elsewhere.
- Model Family Unification:
  - References to specific model types (`--sd3`, `--flux`, etc.) have been replaced with a unified `--model_family` argument, streamlining model specification and reducing clutter in configurations.
- Configuration System Overhaul:
  - Switched from `.env` configuration files to JSON (`config.json`), with multiple backends supporting JSON configuration loading. This allows more flexible and readable configuration management.
  - Updated the configuration loader to auto-detect the best backend when launching.
- Enhanced Argument Handling:
  - Deprecated old argument references and moved argument parsing to `helpers/configuration/cmd_args.py` for better organization.
  - Introduced support for new arguments such as `--model_card_safe_for_work`, `--flux_schedule_shift`, and `--disable_bucket_pruning`.
- Improved Hugging Face Integration:
  - Modified `configure.py` to avoid asking for Hugging Face model name details unless required.
  - Added the ability to pass the SFW (safe-for-work) argument into the training script.
- Optimizations and Bug Fixes:
  - Fixed several references to learning rate (lr) initialization and corrected `--optimizer` usage.
  - Addressed issues with attention masks swapping and fixed the persistence of text encoders in RAM after refactoring.
- Training and Validation Enhancements:
  - Added better dataset examples with support for multiple resolutions and mixed configurations.
  - Configured training scripts to disable gradient accumulation steps by default and provided better control over training options via the updated config files.
- Enhanced Logging and Monitoring:
  - Improved the handling of Weights & Biases (wandb) logs and updated tracker argument references.
- Documentation Updates:
  - Revised documentation to reflect changes in model family handling, argument updates, and configuration management.
  - Added guidance on setting up the new configuration files and examples for multi-resolution datasets.
- Miscellaneous Improvements:
  - Support for NSFW tags in model cards is now enabled by default.
  - Updated `train.sh` to minimal requirements, reducing complexity and streamlining the training process.
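For anyone migrating by hand rather than with `configure.py`, the two changes in the list above (per-model flags replaced by `--model_family`, and `.env` replaced by `config.json`) can be sketched as a small conversion helper. This is a hypothetical illustration only: the helper name, the family names `sd3` and `flux`, and the assumption that `config.json` uses the flag names (with leading dashes) as keys are not confirmed by these notes, so compare against the example config shipped with the release before relying on it.

```python
import json

# Only the two legacy flags named in the release notes are mapped here;
# the family names on the right-hand side are assumed.
LEGACY_FAMILY_FLAGS = {"--sd3": "sd3", "--flux": "flux"}


def migrate_args(legacy_args: list[str]) -> dict[str, object]:
    """Convert a list of old-style CLI flags into a config dictionary."""
    config: dict[str, object] = {}
    for arg in legacy_args:
        if arg in LEGACY_FAMILY_FLAGS:
            # Per-model switches collapse into one unified argument.
            config["--model_family"] = LEGACY_FAMILY_FLAGS[arg]
        elif "=" in arg:
            key, value = arg.split("=", 1)
            config[key] = value
        else:
            # Bare flags are treated as boolean switches.
            config[arg] = True
    return config


if __name__ == "__main__":
    old = ["--flux", "--flux_schedule_shift=3", "--disable_bucket_pruning"]
    print(json.dumps(migrate_args(old), indent=2))
```

Values are kept as strings here; a real migration would also need to decide how numbers and booleans should be typed in the JSON file.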
More detailed change log
- lycoris model card updates by @bghira in #820
- Generate and store attention masks for T5 for flux by @AmericanPresidentJimmyCarter in #821
- Fix validation by @AmericanPresidentJimmyCarter in #822
- backwards-compatible flux embedding cache masks by @bghira in #823
- merge by @bghira in #824
- parquet add width and height columns by @frankchieng in #825
- quanto: remove warnings about int8/fp8 confusion as it happened so long ago now; add warning about int4 by @bghira in #826
- remove clip warning by @bghira in #827
- update lycoris to dev branch, 3.0.1dev3 by @bghira in #828
- Fix caption_with_blip3.py on CUDA by @anhi in #833
- fix quanto resuming by @bghira in #834
- lycoris: resume should use less vram now by @bghira in #835
- (#644) temporarily block training on multi-gpu setup with quanto + PEFT, inform user to go with lycoris instead by @bghira in #837
- quanto + deepspeed minor fixes for multigpu training by @bghira in #839
- deepspeed sharding by @bghira in #840
- fix: only run save full model on main process by @ErwannMillon in #838
- merge by @bghira in #841
- clean-up by @bghira in #842
- follow-up fixes for quanto limitation on multigpu by @bghira in #846
- merge by @bghira in #850
- (#851) remove shard merge code on load hook by @bghira in #853
- csv backend updates by @williamzhuk in #645
- csv fixes by @bghira in #856
- add schedulefree optim w/ kahan summation by @bghira in #857
- merge by @bghira in #858
- merge by @bghira in #861
- schedulefree: return to previous stable settings and add a new preset for aggressive training by @bghira in #862
- fix validation image filename only using resolution from first img, and, unreadable/untypeable parenthesis by @bghira in #863
- (#519) add side by side comparison with base model by @bghira in #865
- merge fixes by @bghira in #870
- (#864) add flux final export for full tune by @bghira in #871
- wandb gallery mode by @bghira in #872
- sdxl: dtype inference followup fix by @bghira in #873
- merge by @bghira in #878
- combine the vae cache clear logic with bucket rebuild logic by @bghira in #879
- flux: mobius-style training via augmented guidance scale by @bghira in #880
- track flux cfg via wandb by @bghira in #881
- multigpu VAE cache rebuild fixes; random crop auto-rebuild; mobius flux; json backend now renamed to discovery ; wandb guidance tracking by @bghira in #888
- fixing typo in flux document for preserve_data_backend_cache key by @riffmaster-2001 in #882
- reintroduce timestep dependent shift as an option during flux training for dev and schnell, disabled by default by @bghira in #892
- adding SD3 timestep-dependent shift for Flux training by @bghira in #894
- fix: set optimizer details to empty dict w/ deepspeed by @ErwannMillon in #895
- fix: make sure wandb_logs is always defined by @ErwannMillon in #896
- merge by @bghira in #900
- Dataloader Docs - Correct caption strategy for instance prompt by @barakyo in #902
- refactor train.py into Trainer class by @bghira in #899
- Update TRAINER_EXTRA_ARGS for model_family by @barakyo in #903
- Fix text encoder nuking regression by @mhirki in #906
- added lokr lycoris init_lora by @flotos in #907
- Fix Flux schedule shift and add resolution-dependent schedule shift by @mhirki in #905
- Swap the attention mask location, because Flux swapped text and image… by @AmericanPresidentJimmyCarter in #908
- support toml, json, env config backend, and multiple config environments by @bghira in #909
- Add `"none"` to --report_to argument by @twri in #911
- Add support for tiny PEFT-based Flux LoRA based on TheLastBen's post on Reddit by @mhirki in #912
- Update lycoris_config.json.example with working defaults by @mhirki in #918
- fix constant_with_warmup not being so constant or warming up by @bghira in #919
- follow-up fix for setting last_epoch by @bghira in #920
- fix multigpu schedule issue with LR on resume by @bghira in #921
- multiply the resume state step by the number of GPUs in an attempt to overcome accelerate v0.33 issue by @bghira in #922
- default to json/toml before the env file in case multigpu is configured by @bghira in #923
- fix json/toml configs str bool values by @bghira in #924
- bypass some "helpful" diffusers logic that makes random decisions to run on CPU by @bghira in #925
- v1.0 merge by @bghira in #910
*...