
TensorFlow-CPU quickstart issues #369

Open
fuglede opened this issue Sep 16, 2021 · 0 comments

fuglede commented Sep 16, 2021

Following the TensorFlow-CPU quickstart, I run into a couple of issues:

  1. When creating the pool, I get the following error:

RuntimeError: Could not find an Azure Batch Node Agent Sku for this offer=ubuntuserver publisher=canonical sku=16.04-lts. You can list the valid and available Marketplace images with the command: account images

From a look at the Azure Portal, it appears that only 18.04 is currently available; indeed, changing pool.yaml to use 18.04-LTS instead (see the sketch after the listing below) is enough to get rid of this issue. This probably affects many of the bundled recipes:

batch-shipyard/recipes$ grep -R 16.04 .
./Caffe-CPU/config/pool.yaml:      sku: 16.04-LTS
./Caffe-GPU/config/pool.yaml:      sku: 16.04-LTS
./Caffe2-CPU/config/pool.yaml:      sku: 16.04-LTS
./Caffe2-GPU/config/pool.yaml:      sku: 16.04-LTS
./Chainer-CPU/config/pool.yaml:      sku: 16.04-LTS
./Chainer-GPU/config/pool.yaml:      sku: 16.04-LTS
./CNTK-CPU-Infiniband-IntelMPI/docker/Dockerfile:FROM ubuntu:16.04
./CNTK-CPU-OpenMPI/config/multinode/pool.yaml:      sku: 16.04-LTS
./CNTK-CPU-OpenMPI/config/singlenode/pool.yaml:      sku: 16.04-LTS
./CNTK-GPU-Infiniband-IntelMPI/docker/Dockerfile:FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
./CNTK-GPU-OpenMPI/config/multinode-multigpu/pool.yaml:      sku: 16.04-LTS
./CNTK-GPU-OpenMPI/config/singlenode-multigpu/pool.yaml:      sku: 16.04-LTS
./CNTK-GPU-OpenMPI/config/singlenode-singlegpu/pool.yaml:      sku: 16.04-LTS
./FFmpeg-GPU/config/pool.yaml:      sku: 16.04-LTS
./HPMLA-CPU-OpenMPI/config/pool.yaml:      sku: 16.04-LTS
./HPMLA-CPU-OpenMPI/Data-Shredding/README.md:* [TPN Ubuntu Container](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/TPN_Ubuntu%20Container_16-04-FINAL.txt)
./HPMLA-CPU-OpenMPI/docker/Dockerfile:FROM ubuntu:16.04
./HPMLA-CPU-OpenMPI/docker/README.md:* [TPN Ubuntu Container](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/TPN_Ubuntu%20Container_16-04-FINAL.txt)
./HPMLA-CPU-OpenMPI/README.md:* [TPN Ubuntu Container](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/TPN_Ubuntu%20Container_16-04-FINAL.txt)
./Keras+Theano-CPU/config/pool.yaml:      sku: 16.04-LTS
./Keras+Theano-GPU/config/pool.yaml:      sku: 16.04-LTS
./MXNet-CPU/config/multinode/pool.yaml:      sku: 16.04-LTS
./MXNet-CPU/config/singlenode/pool.yaml:      sku: 16.04-LTS
./MXNet-CPU/docker/Dockerfile:FROM ubuntu:16.04
./MXNet-GPU/config/multinode/pool.yaml:      sku: 16.04-LTS
./MXNet-GPU/config/singlenode/pool.yaml:      sku: 16.04-LTS
./NAMD-GPU/config/pool.yaml:      sku: 16.04-LTS
./NAMD-TCP/config/pool.yaml:      sku: 16.04-LTS
./RemoteFS-GlusterFS+BatchPool/config/pool.yaml:      sku: 16.04-LTS
./TensorFlow-CPU/config/pool.yaml:      sku: 16.04-LTS
./TensorFlow-Distributed/config/cpu/pool.yaml:      sku: 16.04-LTS
./TensorFlow-Distributed/config/gpu/pool.yaml:      sku: 16.04-LTS
./TensorFlow-GPU/config/docker/pool.yaml:      sku: 16.04-LTS
./TensorFlow-GPU/config/singularity/pool.yaml:      sku: 16.04-LTS
./Torch-CPU/config/pool.yaml:      sku: 16.04-LTS
./Torch-CPU/docker/Dockerfile:FROM ubuntu:16.04
./Torch-GPU/config/pool.yaml:      sku: 16.04-LTS
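
For reference, here is a minimal sketch of what the platform_image block in a recipe's pool.yaml looks like after the workaround; the surrounding keys (id, vm_size, vm_count) are placeholders here and will differ per recipe:

```yaml
pool_specification:
  id: tensorflow-cpu          # placeholder pool id
  vm_size: STANDARD_D2_V2     # placeholder VM size
  vm_count:
    dedicated: 1
  vm_configuration:
    platform_image:
      publisher: Canonical
      offer: UbuntuServer
      sku: 18.04-LTS          # was 16.04-LTS; that image no longer resolves to a node agent SKU
```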
  2. After the pool is created and I try to add the included job, I get another error:
$ ../shipyard jobs add --tail stdout.txt
2021-09-16 10:16:30.581 INFO - Adding job tensorflowjob to pool tensorflow-cpu
2021-09-16 10:16:30.673 DEBUG - constructing 1 task specifications for submission to job tensorflowjob
2021-09-16 10:16:30.738 DEBUG - submitting 1 task specifications to job tensorflowjob
2021-09-16 10:16:30.741 DEBUG - submitting 1 tasks (0 -> 0) to job tensorflowjob
2021-09-16 10:16:30.971 INFO - submitted all 1 tasks to job tensorflowjob
2021-09-16 10:16:30.971 DEBUG - attempting to stream file stdout.txt from job=tensorflowjob task=task-00000
Traceback (most recent call last):
  File "/mnt/c/Users/username/repos/batch-shipyard/shipyard.py", line 3136, in <module>
    cli()
  File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/mnt/c/Users/username/repos/batch-shipyard/shipyard.py", line 1968, in jobs_add
    convoy.fleet.action_jobs_add(
  File "/mnt/c/Users/username/repos/batch-shipyard/convoy/fleet.py", line 4065, in action_jobs_add
    batch.add_jobs(
  File "/mnt/c/Users/username/repos/batch-shipyard/convoy/batch.py", line 5892, in add_jobs
    stream_file_and_wait_for_task(
  File "/mnt/c/Users/username/repos/batch-shipyard/convoy/batch.py", line 3309, in stream_file_and_wait_for_task
    tfp = batch_client.file.get_properties_from_task(
  File "/mnt/c/Users/username/repos/batch-shipyard/.shipyard/lib/python3.8/site-packages/azure/batch/operations/_file_operations.py", line 328, in get_properties_from_task
    raise models.BatchErrorException(self._deserialize, response)
azure.batch.models._models_py3.BatchErrorException: Request encountered an exception.
Code: None
Message: None

Removing the resource_files section is enough to take care of the issue; this is probably unsurprising, as the given blob_source (https://raw.githubusercontent.com/tensorflow/models/master/tutorials/image/mnist/convolutional.py) now returns a 404.
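
For illustration, a rough sketch of the shape of that workaround in the recipe's jobs.yaml; the docker_image and command values below are placeholders rather than the recipe's actual entries:

```yaml
job_specifications:
- id: tensorflowjob
  tasks:
  - docker_image: some/tensorflow-cpu-image   # placeholder; keep the image the recipe ships with
    # resource_files removed as a workaround: the original blob_source
    # (convolutional.py from tensorflow/models master) now 404s, which
    # surfaces as the opaque BatchErrorException above when tailing stdout.txt
    command: python -u convolutional.py       # placeholder; point at a script that actually exists in the image
```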
