Issues: microsoft/DeepSpeed
[BUG]NCCL operation timeout when training with deepspeed_zero3_offload or deepspeed_zero3 on RTX4090
bug
Something isn't working
training
#6756
opened Nov 18, 2024 by
MLS2021
Some Demos on How to config to offload tensors to nvme device
#6752
opened Nov 15, 2024 by
niebowen666
Model Checkpoint docs are incorrectly rendered on deepspeed.readthedocs.io
bug
Something isn't working
documentation
Improvements or additions to documentation
#6747
opened Nov 12, 2024 by
akeshet
Whether Deepspeed-Domino is compatible with other parallel strategy?
#6744
opened Nov 12, 2024 by
Andy666G
[BUG] max_grad_norm not effect
bug
Something isn't working
compression
#6743
opened Nov 12, 2024 by
yiyepiaoling0715
GPU mem doesn't release after delete tensors in optimizer.bit16groups
#6729
opened Nov 8, 2024 by
wheresmyhair
[BUG] any clue for MFU drop?
bug
Something isn't working
training
#6727
opened Nov 8, 2024 by
SeunghyunSEO
[BUG] [ROCm] Fine-tuning DeepSeek-Coder-V2-Lite-Instruct with 8 MI300X GPUs results in c10::DistBackendError
bug
Something isn't working
rocm
AMD/ROCm/HIP issues
training
#6725
opened Nov 8, 2024 by
nikhil-tensorwave
[BUG] Zero3 for torch.compile with compiled_autograd when running LayerNorm
bug
Something isn't working
training
#6719
opened Nov 6, 2024 by
yitingw1
[BUG] DeepSpeed accuracy issue for torch.compile if activation checkpoint function not compiler disabled
bug
Something isn't working
training
#6718
opened Nov 6, 2024 by
jerrychenhf
[BUG]Issue with Zero Optimization for Llama-2-7b Fine-Tuning on Intel GPUs
bug
Something isn't working
training
#6713
opened Nov 5, 2024 by
molang66
"__nv_bfloat162" has already been defined
install
Installation and package dependencies
windows
#6709
opened Nov 4, 2024 by
wolfljj
[REQUEST] Some questions about deepspeed sequence parallel
enhancement
New feature or request
#6708
opened Nov 4, 2024 by
yingtongxiong
[REQUEST] Non-element-wise Optimizer Compatibility
enhancement
New feature or request
#6701
opened Nov 2, 2024 by
Triang-jyed-driung
How could I convert ZeRO-0 deepspeed weights into fp32 model checkpoint?
enhancement
New feature or request
#6699
opened Nov 1, 2024 by
liming-ai
[BUG] Universal Checkpoint Conversion: Resumed Training Behaves as If Model Initialized from Scratch
bug
Something isn't working
training
#6691
opened Oct 30, 2024 by
purefall
DeepSpeed windows install errors
install
Installation and package dependencies
windows
#6673
opened Oct 27, 2024 by
xiezhipeng-git
Error when parsing GPUs on a node when only specifying node name --include=node3 vs --include=node3:1,2,4
#6671
opened Oct 26, 2024 by
stephen-nju
[BUG] ZeRO++ sharding small parameter raise IndexError
bug
Something isn't working
training
#6659
opened Oct 23, 2024 by
wuxibin89