For dependencies managed via Anaconda or the app-packages mechanism, simply import them in the normal way. PNDA has already set up the necessary paths and dependency caches, so nothing further is required.
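As a minimal sketch, assuming a package such as numpy has been installed through the Anaconda distribution (the package name is illustrative), a job script can use it directly on both the driver and the executors:

```python
# Minimal sketch: packages managed by Anaconda or app-packages are imported
# directly; PNDA has already configured the interpreter and dependency caches.
# numpy is an illustrative package name, not guaranteed to be installed.
import numpy as np

from pyspark import SparkContext

sc = SparkContext(appName='anaconda-deps-example')

# The same import resolves inside functions shipped to the executors,
# because the executors run the same Anaconda interpreter.
rdd = sc.parallelize([1.0, 2.0, 3.0])
print(rdd.map(lambda x: float(np.sqrt(x))).collect())

sc.stop()
```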
For dependencies to be delivered at runtime, call addPyFile() on the SparkContext in your code. For example:

```python
sc.addPyFile('hdfs:///pnda/deployment/app_packages/sharedroutines-0.1.egg')
```
At present, PNDA does not set these up automatically in the same way as above.
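For context, a sketch of how this fits into a job script; the egg path is the one from the example above, and the module name sharedroutines is assumed to match the egg's contents:

```python
from pyspark import SparkContext

sc = SparkContext(appName='runtime-deps-example')

# Ship the egg to the driver and every executor; addPyFile() also adds it
# to sys.path so it can be imported.
sc.addPyFile('hdfs:///pnda/deployment/app_packages/sharedroutines-0.1.egg')

# Import only after addPyFile() so the module can be resolved.
import sharedroutines
```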
For jobs to be scheduled via coordinators and workflows, do the following.
Add the following to the `<spark-opts>` section in the workflow action:

```
--conf spark.executorEnv.PYSPARK_PYTHON=/opt/pnda/anaconda/bin/python
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/opt/pnda/anaconda/bin/python
```
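For illustration, a sketch of an Oozie Spark action carrying these options; the job tracker, name node, action name, and script path are placeholders:

```xml
<!-- Sketch of a Spark action in an Oozie workflow. The ${jobTracker},
     ${nameNode}, action name, and script path are placeholders. -->
<action name="spark-job">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn-cluster</master>
        <name>example-job</name>
        <jar>hdfs:///path/to/job.py</jar>
        <spark-opts>--conf spark.executorEnv.PYSPARK_PYTHON=/opt/pnda/anaconda/bin/python --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/opt/pnda/anaconda/bin/python</spark-opts>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>
```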
Due to an unresolved issue in Spark, spark.yarn.appMasterEnv.PYTHONPATH isn't handled properly. Until this is resolved:
Add the following to your code, before importing any dependencies:

```python
import sys

sys.path.insert(0, '/opt/pnda/app-packages/lib/python2.7/site-packages')
```
For dependencies to be delivered at runtime, use sc.addPyFile() in the normal way, as described above.
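Putting the pieces together, a sketch of the top of a job script launched from a workflow; the paths are those shown above, and sharedroutines is an assumed module name:

```python
# Sketch of a workflow-scheduled job script, combining the PYTHONPATH
# workaround with runtime delivery of an egg.
import sys

# Workaround: spark.yarn.appMasterEnv.PYTHONPATH is not applied, so add the
# app-packages site-packages directory manually before any imports.
sys.path.insert(0, '/opt/pnda/app-packages/lib/python2.7/site-packages')

from pyspark import SparkContext

sc = SparkContext(appName='scheduled-job')

# Deliver the shared egg to the driver and executors at runtime.
sc.addPyFile('hdfs:///pnda/deployment/app_packages/sharedroutines-0.1.egg')

# Import only after the path fix and addPyFile() above.
import sharedroutines
```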