- resolve failing unit test
- avoid parsing stderr as JSON
- temporarily hardcode neuron cores for trn2
- Added p5 as a supported NCCL instance
- Add support for py39 and py310
- typo in the run unit tests command
- run unit tests sequentially for the release process as well, to prevent coverage conflicts
- chore: removing unnecessary logging information
- update the boto deps to use latest boto
- bypass DNS check for studio local exec
- use smddprun only if it is installed
- Add NCCL_PROTO=simple environment variable to handle the out-of-order data delivery from EFA
- toolkit build failure
- support codeartifact for installing requirements.txt packages
- removed unused import statement
- forgot to run black on torch_distributed.py after updating my comments from last commit
- Modified my comment on lines 98-103 in torch_distributed.py to comply with formatting standard.
- Revert "Ran black on entire sagemaker-training-toolkit directory"
- Ran black on entire sagemaker-training-toolkit directory
- Ran Black (python formatter) on the files with my code updates (torch_distributed.py and test_torch_distributed.py)
- Added test for neuron_parallel_compile in test_torch_distributed.py
- Updated comment syntax based on feedback in pull request as well as added full example of the neuron_parallel_compile command as it would appear in the command line
- added unit test for neuron_parallel_compile code change
- Updated torch_distributed.py
- add smddp exception classes in mpi distribution
- add NCCL_PROTO, NCCL_ALGO environment variables for modelparallel jobs
- unpin sagemaker version as the credential issue is fixed
- increase worker waiting time for ORTE proc
- upgrade protobuf version for tensorflow 2.12
- Revert SMDDP collectives feature from smdataparallel runner
- Update libraries for SMDDP collectives validation
- Upgrade protobuf to prevent conflicts with smdebugger.
- Add support for p4de instances; set the FI_EFA_USE_DEVICE_RDMA flag only on p4d{e} instances.
- integrate SMDDP collectives into smdataparallel runner
- add general exception to filter
- integrate upcoming dataparallel change to modelparallel
- add unit tests for torchrun launcher and collections package deprecationWarning
- Add torch_distributed support for Trainium instances in SageMaker
- feature: Add neuron cores support (#21)
- Add SageMaker Debugger exceptions
- improve worker node wait logic and update EFA flags
- Enable PT XLA distributed training on homogeneous clusters
- relax exception type
- update num_processes_per_host for smdataparallel runner
- Removed version hardcoding for sagemaker test dependency
- update distribution_instance_group for pytorch ddp
- specify flake8 config explicitly
- handle utf-8 decoding exceptions while processing stdout and stderr streams
- Heterogeneous cluster changes
- update: protobuf version to overlap with TF requirements
- Fix none exception class issue for mpi
- Use framework provided error class and stack trace as error message
- fix flaky issue with incorrect rc being given
- missing args when shell script is used
- add back FI_EFA_USE_DEVICE_RDMA=1 flag, revert 2936f22
- Add py38, dropped py36 and py2 support. Bump pypi to 4.0.0 (changes from PR #108)
- Added `py38`, removed `py36` and `py27` support
- Use asyncio to read stdout and stderr streams in realtime
- Fix delayed logging issues
- Convey an informative message to the user if a process gets OOM-killed
- Filter out stderr to look for error messages and report
- Report Exit code on training job failures
- Prepend tags to MPI logs to enable easy filtering in CloudWatch
- All the changes are from PR #108
- Update SM doc urls
- Update Amazon Licensing
- Install libssl1.1 and openssl packages in Dockerfiles
- Added `asyncio` package
- Updated tests to use `asyncio` package
- Reverted -x FI_EFA_USE_DEVICE_RDMA=1 to fix a crash on PyTorch Dataloaders for Distributed training
- [smdataparallel] better messages to establish the SSH connection between workers
- smdataparallel enable EFA RDMA flag
- smdataparallel custom mpi options support
- Update Dockerfile to accommodate Rust dependency.
- set btl_vader_single_copy_mechanism to none to avoid Read -1 Warning messages
- set btl_vader_single_copy_mechanism to none
- decode binary stderr string before dumping it out
- add data parallelism support (#3)
- update tox to use sagemaker 2.18.0 for tests
- use format in place of f-strings and use comment style type annotations
- workaround to print stderr when capturing
- use ECR-hosted image for ubuntu:16.04
- fix typo in ENVIRONMENT_VARIABLES.md
- propagate log level to aws services
- check for script entry point even if setup.py is present
- pin sagemaker<2 in test dependencies
- remove unofficially-supported environment variable
- use '-bind-to none' flag to improve performance.
- persist env vars in /etc/environment for MPI processes
- clarify feature request issue template
- run Python script entry point as script and install from requirements.txt
- clean up README usage examples
- Remove typing
- Test against Python 3.7 in PR builds
- Add Python 3.7 support
- Remove unused config files
- clean up README and other documentation
- Remove etc directory
- Add requirements.txt integration test in dummy container
- Remove modules.download_and_install
- Refactor env
- Refactor entry_point
- Update and add docstrings
- Update GitHub issue and pull request templates
- Refactor modules and entry_point (first pass)
- Revert "change: stream stderr even when capture_error is True"
- Use shlex.quote to construct bash command
- Relax dependencies version requirements
- Extract module to correct location in download_and_install
- Upgrade psutil
- Fix cleanup with requirements.txt functional tests
- create `__init__.py` file for Python 2 import of protobuf during tests (#260)
- Mark intermediate_output functional tests as xfail if not run on Linux
- Remove serving CLI entry point
- Pin inotify-simple version
- Remove legacy serving stack
- Support specifying S3 endpoint URL
- Fix memory leak in gethostname and adapt len semantics to Posix
- Remove beta directory
- rename package from sagemaker_containers to sagemaker_training_toolkit
- modify download_and_install to work with local tarball
- change scipy version pin to lower bound
- Add `scipy` to required packages
- bug-fix: array_to_recordio_protobuf should return byte buffer instead of Stream
- bug-fix: Typo in the execution-parameters routing rule
- adding support for execution_parameters endpoint for serving
- Adding support for encoding to recordio
- stream stderr even when capture_error is True
- use built-in csv library in csv encoding/decoding for correct quoted string handling.
- Patch os.path.exists for sshd
- Mark gethostname tests as xfail if run locally
- Add Pylint to development process
- Use copy when installing user module from local path
- Integrate black into development process
- Update setup.py
- install user module before GUnicorn starts
- include /opt/ml/code to GUnicorn PYTHONPATH
- ensure exit code is an int
- pin flake and werkzeug versions
- add GPU default for MPI processes per host
- fix env var in readme
- Added execution-parameters to nginx.conf.template
- entrypoint run waits for hostname resolution
- fix path for training script location
- Detailed documentation for SageMaker Containers - training
- download_and_extract local tar file
- add test for network isolation mode training
- remove unnecessary name argument from download and extract function
- use mpi4py in MPI command for Python executables
- allow MPI options to be passed through entry_point.run
- add commit message format to CONTRIBUTING.md and PR template
- update for automated releases
- bug-fix: use specified args, entry point, and env vars when creating a runner
- doc-fix: Convert README to RST
- doc-fix: Update README with newer frameworks using SageMaker Containers
- Specify `long_description_content_type` in setup
- bug-fix: correctly set NGINX_PROXY_READ_TIMEOUT to match model_server_timeout.
- enhancement: remove numpy version restriction.
- bug-fix: Fix recursive directory navigation in intermediate output.
- bug-fix: Rename libchangehostname to gethostname to match POSIX function name
- feature: C extension reads hostname from resourceconfig instead of env var.
- feature: Generic OpenMPI support
- bug-fix: Fix response content_type handling
- bug-fix: Accept header ANY (`*/*`) fallback to default accept
- feature: Add intermediate output to S3 during training
- bug-fix: reintroduce `_modules.s3_download` and `_modules.download_and_install` for backward compatibility
- feature: add capture_error flag to process.check_error and process.create, and to all functions that run a process: modules.run, modules.run_module, and entry_point.run
- bug-fix: reintroduce _modules.prepare to import_module
- bug-fix: reintroduce _modules.prepare for backwards compatibility
- [breaking change] remove `_modules.prepare` and `_modules.download_and_install`
- [breaking change] move `_modules.s3_download` to `_files.s3_download`
- feature: support for Bash commands and Python scripts
- feature: Allow for dynamic nginx.conf creation
- feature: Provide support for additional environment variables. (http_port, safe_port_range and accept)
- feature: Making pip install less noisy
- bug-fix: Stream stderr instead of capturing it when running user script
- feature: Make it optional for run_module method to wait for the subprocess to exit
- feature: Allow additional sagemaker hyperparameters to be stored in TrainingEnv
- feature: Transformer: support user-supplied `transform_fn`
- bug-fix: remove request size limit correctly
- enhancement: remove request size limit
- bug-fix: Fix choosing region for S3 client
- bug-fix: Use regional endpoint for S3 clients
- [breaking change] Remove `status_codes` module and use `six.moves.http_client` instead
- [breaking change] Move `UnsupportedFormatError` from `encoders` module to `errors` module
- Return 4XX status codes for `UnsupportedFormatError` from default input/output handlers
- Allow for local modules to work with AWS SageMaker framework containers.
- Support for training outside of AWS SageMaker Training.
- Fix output_data_dir to reference an existing directory.
- Fix error message.
- Make pip install verbose.
- Fix error class for user script errors.
- Adding Readme.
- Improve logging
- Support for hyperparameters with JSON-serialized and non-serialized keys together
- Training Environment transforms to env vars
- Created beta framework entrypoint
- Filter SageMaker provided hyperparameters and user provided hyperparameters
- Script mode
- Cache module installation
- Support for requirements.txt
- Decoder/Encoder support for numpy, JSON, and CSV
- bug: Configuration: Change module names to string in `__all__`
- bug: Environment: handle hyperparameter injected by tuning jobs
- bug: Training: Move processing of requirements file out to the specific container.
- feature: TrainingEnvironment: read new environment variable for job name
- feature: Documentation: add descriptive README
- Initial commit