microsoft / DeepSpeed Public

Issues: microsoft/DeepSpeed

953 Open 1,872 Closed

Issues list

[BUG]NCCL operation timeout when training with deepspeed_zero3_offload or deepspeed_zero3 on RTX4090

Labels: bug, training

#6756 opened Nov 18, 2024 by MLS2021

Some Demos on How to config to offload tensors to nvme device

#6752 opened Nov 15, 2024 by niebowen666

Model Checkpoint docs are incorrectly rendered on deepspeed.readthedocs.io

Labels: bug, documentation

#6747 opened Nov 12, 2024 by akeshet

Whether Deepspeed-Domino is compatible with other parallel strategy?

#6744 opened Nov 12, 2024 by Andy666G

[BUG] max_grad_norm not effect

Labels: bug, compression

#6743 opened Nov 12, 2024 by yiyepiaoling0715

About offload stage3 source code learning problems

#6735 opened Nov 9, 2024 by lzy-edu

GPU mem doesn't release after delete tensors in optimizer.bit16groups

#6729 opened Nov 8, 2024 by wheresmyhair

[BUG] any clue for MFU drop?

Labels: bug, training

#6727 opened Nov 8, 2024 by SeunghyunSEO

[BUG] [ROCm] Fine-tuning DeepSeek-Coder-V2-Lite-Instruct with 8 MI300X GPUs results in c10::DistBackendError

Labels: bug, rocm, training

#6725 opened Nov 8, 2024 by nikhil-tensorwave

ZeRO-3 + MP8 Universal Checkpoint

#6724 opened Nov 7, 2024 by jeromeku

[BUG] Zero3 for torch.compile with compiled_autograd when running LayerNorm

Labels: bug, training

#6719 opened Nov 6, 2024 by yitingw1

[BUG] DeepSpeed accuracy issue for torch.compile if activation checkpoint function not compiler disabled

Labels: bug, training

#6718 opened Nov 6, 2024 by jerrychenhf

[BUG] pipeline parallelism+fp16+moe isn't working

#6714 opened Nov 5, 2024 by NeferpitouS3

[BUG]Issue with Zero Optimization for Llama-2-7b Fine-Tuning on Intel GPUs

Labels: bug, training

#6713 opened Nov 5, 2024 by molang66

"__nv_bfloat162" has already been defined

Labels: install, windows

#6709 opened Nov 4, 2024 by wolfljj

[REQUEST] Some questions about deepspeed sequence parallel

Labels: enhancement

#6708 opened Nov 4, 2024 by yingtongxiong

[REQUEST] Non-element-wise Optimizer Compatibility

Labels: enhancement

#6701 opened Nov 2, 2024 by Triang-jyed-driung

How could I convert ZeRO-0 deepspeed weights into fp32 model checkpoint?

Labels: enhancement

#6699 opened Nov 1, 2024 by liming-ai

[BUG] Universal Checkpoint Conversion: Resumed Training Behaves as If Model Initialized from Scratch

Labels: bug, training

#6691 opened Oct 30, 2024 by purefall

how deepspeed can avoid doing all_reduce?

#6690 opened Oct 30, 2024 by luuck

How to use both deepspeed framework and tutel framework?

#6684 opened Oct 29, 2024 by luuck

Discuss about compile config

#6683 opened Oct 29, 2024 by oraluben

DeepSpeed windows install errors

Labels: install, windows

#6673 opened Oct 27, 2024 by xiezhipeng-git

Error when parsing GPUs on a node when only specifying node name --include=node3 vs --include=node3:1,2,4

#6671 opened Oct 26, 2024 by stephen-nju

[BUG] ZeRO++ sharding small parameter raise IndexError

Labels: bug, training

#6659 opened Oct 23, 2024 by wuxibin89
