Issues: kubeflow/training-operator
Issues list
pytorchjob didn't create worker pod, seems hang
kind/bug
lifecycle/needs-triage
#2327
opened Nov 15, 2024 by
Twilighter9527
Custom Volcano Queues not working with MPIJob V1
kind/bug
lifecycle/needs-triage
#2325
opened Nov 9, 2024 by
ameya-parab
KEP-2170: Support hundreds and thousands worker nodes for a single training Job
kind/feature
#2318
opened Nov 4, 2024 by
tenzen-y
Kubeflow Training Operator Logo
kind/discussion
kind/feature
#2314
opened Oct 29, 2024 by
andreyvelich
Use Debian images for Python components in the Training Operator V2
good first issue
help wanted
kind/feature
#2311
opened Oct 28, 2024 by
andreyvelich
KEP-2170: Add unit and E2E tests for model and dataset initializers
kind/feature
#2305
opened Oct 23, 2024 by
andreyvelich
Pytorch job running with pod exception unable to recover after retry
kind/bug
lifecycle/needs-triage
#2300
opened Oct 22, 2024 by
shaoqingyang
KEP-2170: Replace UPSERT operation for the objects with SSA PATCH
kind/feature
#2297
opened Oct 20, 2024 by
tenzen-y
Support Kubernetes v1.29 - v1.31 or v1.28 - v1.31
kind/feature
#2291
opened Oct 18, 2024 by
tenzen-y
KEP-2170: Implement Job Pipeline Framework plugins
kind/feature
#2290
opened Oct 18, 2024 by
tenzen-y
4 of 5 tasks
Add environment variables to containers
kind/feature
lifecycle/needs-triage
#2284
opened Oct 16, 2024 by
tarekabouzeid
KEP-2170: Migrate the container resource calculation mechanism to k/k library
kind/cleanup
kind/feature
#2280
opened Oct 10, 2024 by
tenzen-y
Document the spec.managedBy field and its use for MultiKueue
area/docs
kind/feature
#2279
opened Oct 9, 2024 by
mimowo
PET_NNODES env var for PyTorchJobs is incorrect when elasticPolicy is set
kind/bug
lifecycle/needs-triage
#2277
opened Oct 8, 2024 by
alenawang
Create Slurm runtime for model training using V2 APIs
area/runtime
kind/feature
#2249
opened Sep 5, 2024 by
andreyvelich
[Test] E2e Tests for Notebook Examples
area/testing
good first issue
help wanted
kind/feature
#2246
opened Sep 2, 2024 by
Electronic-Waste
KEP-2170: Create model exporter for checkpointing and training output
area/storage
#2245
opened Aug 30, 2024 by
andreyvelich
Include multiple files for TrainingClient().create_job()
area/sdk
kind/feature
#2233
opened Aug 23, 2024 by
u66u