The Jobdef
A jobdef file describes your EMR cluster and zero or more "steps". A step is Amazon's name for a task or job submitted to the cluster. lemur reads your jobdef, which defines a set of options inside (defcluster) and (defstep) blocks. Finally, at the end of your jobdef, you execute (fire! ...) to make things happen. Also keep in mind that the jobdef is an interpreted clj file, so you can insert arbitrary Clojure code to be executed anywhere in the file (but see Jobdef Hooks for a better way).
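The overall shape of a jobdef can be sketched as follows. This is only an illustration: the cluster and step option names and values shown here are assumptions, not the authoritative list (for that, see examples/sample-jobdef.clj, and start from examples/minimal-sample-jobdef.clj).

```clojure
;; Hypothetical minimal jobdef -- option names below are illustrative
;; assumptions; consult examples/sample-jobdef.clj for the real options.

(defcluster my-cluster
  :num-instances 1
  :keypair "my-keypair")       ; assumed option name

(defstep my-step
  :main-class "com.example.MyJob")  ; hypothetical main class

;; Nothing happens until fire! is executed at the end of the file.
(fire! my-cluster my-step)
```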
What happens when you call fire! depends on the lemur "command". On the command line, you enter something like
lemur run my-jobdef.clj [args]
or
lemur local my-jobdef.clj [args]
The word right after lemur is the command. The 'run' command will start the EMR cluster and submit your steps. The 'local' command will run your job on your workstation using your local Hadoop installation. Try "lemur help" for more commands.
A good starting point for a new job is to copy 'examples/minimal-sample-jobdef.clj' and then refer to examples/sample-jobdef.clj for a more exhaustive list of features and options with more examples and documentation.
WARNING: examples/sample-jobdef.clj is large. You probably don't need most of it for typical jobs.
There are standard paths (S3 or local) created for job jar, inputs, outputs, logs, etc. These paths are referenced through the keys below. Their default values are also shown.
:base-uri "s3://${bucket}/runs/${run-path}"
In Local Mode, the base-uri will take on the value configured for the :local profile (see Profiles). For example:
:local {:base-uri "/tmp/lemur/${run-path}"}
The other paths are relative to :base-uri, e.g.:
:log-uri "${base-uri}/emr-logs"
:data-uri "${base-uri}/data"
:jar-uri "${base-uri}/jar"
There are several blocks or sections that may appear in the jobdef file (or base files). For the most part, these can appear in any order, since nothing happens until you execute (fire!). One exception: (use-base), followed by (add-profiles), should come first. Each section is optional and can appear 0 or more times (two add-validators blocks, for example, would mean that all validators from both blocks must pass).
See examples/sample-jobdef.clj for more details/examples of usage for each block.
use-base specifies another file that should be included here. The idea is that this file sets some default values and behavior for you (i.e. you inherit some options and functionality).
add-profiles enables or disables 'packages' of options and functionality. See Profiles.
catch-args adds new command line options that lemur should parse and make available to your jobdef. These are distinct from the options which are passed to your Hadoop job (see Step Job Args).
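A sketch of declaring a custom option follows. The exact syntax (vector shape, default handling) is an assumption here; see examples/sample-jobdef.clj for the real form.

```clojure
;; Sketch only: declare a --dataset command line option with a doc
;; string and a default. The [keyword doc default] shape is an
;; assumption -- check examples/sample-jobdef.clj for the real syntax.
(catch-args
  [:dataset "Which dataset to process" "daily"])
```

Once parsed, the value would typically be available for interpolation (e.g. ${dataset}) in other option values.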
add-validators defines rules to validate the configuration before you launch a cluster and/or run the steps. It is recommended that you validate as much as you can locally, before committing to the relatively heavy cluster launch and job run.
More at Validators
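As a sketch, a validator can be thought of as a function of the evaluated options (eopts) that reports a problem. The calling convention shown here (return a string on failure, nil on success) is an assumption; the Validators page documents the real contract and helper functions.

```clojure
;; Sketch only: a validator as a plain fn of eopts. The
;; "string means failure, nil means OK" convention is an assumption
;; -- see the Validators wiki page for the actual contract.
(add-validators
  (fn [eopts]
    (when-not (:dataset eopts)
      "You must supply --dataset")))
```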
add-hooks registers tasks to be done before or after fire!.
More at Jobdef Hooks
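A hook might look like the sketch below. The exact hook signature (what arguments it receives, and how pre- vs post-fire! hooks are distinguished) is an assumption here; see the Jobdef Hooks page for the real interface.

```clojure
;; Sketch only: a hook as a fn of eopts, run around fire!.
;; The signature is an assumption -- see the Jobdef Hooks wiki page.
(add-hooks
  (fn [eopts]
    (println "about to fire with base-uri" (:base-uri eopts))))
```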
defcluster is where you declare options for your cluster. All the options are specified in examples/sample-jobdef.clj. Your jobdef can have more than one defcluster, but only one of them can be passed to a particular call to (fire!).
defstep defines each Step. In EMR terms, a step is usually one Hadoop job. Steps are executed serially. The options are specified in examples/sample-jobdef.clj.
This is where you would define the main-class for your job, and the arguments (Step Job Args) that should be passed to it. Java, Cascading, Cascalog (and presumably Scalding) jobs can all be run with a combination of a jar and a main-class.
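A step with a main class and job arguments might be sketched like this. The :args.* convention and the option names shown are illustrative assumptions; Step Job Args and examples/sample-jobdef.clj document the real mechanism.

```clojure
;; Sketch only: a step pointing at a main class in your job jar.
;; The :args.* keys and path interpolation shown are assumptions --
;; see Step Job Args and examples/sample-jobdef.clj for the details.
(defstep count-step
  :main-class "com.example.WordCount"   ; hypothetical class
  :args.input  "${data-uri}/input"
  :args.output "${data-uri}/output")
```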
Running Hive or Pig jobs: we haven't done this at Climate Corporation, so there is no declarative interface for it yet. But it should be relatively easy. I'll update this information if I have an opportunity to try it (or if someone supplies better docs or a patch).
(fire! ...) is how you kick things off. It will construct the METAJOB, start the cluster, run the hooks and steps, etc.
fire! returns quickly. It does not wait for the cluster to launch or the job to complete. If you want to block, see wait and wait-on-step at the bottom of examples/sample-jobdef.clj. The jobflow-id is saved and can be retrieved with (context-get :jobflow-id).
(fire! cluster|cluster-fn steps*|step-fn)
The first arg is a cluster (created by defcluster) or a 'fn of eopts' that returns a cluster. The second arg is one or more steps, either as a collection or as a variable-length argument list. Alternatively, you can provide a single 'fn of eopts' which returns a step or a collection of steps. Each step is a defstep or a StepConfig.
Examples:
(fire! my-cluster a-step b-step)
(fire! my-cluster step-selector-fn)
fire! returns the metajob, which is a YAML string with all the details for the launch.
This is a clj file, so any valid Clojure code can appear in and around these sections. Usually this would be function definitions, and clojure.core/require statements.
You should refer to examples/sample-jobdef.clj for a list of all options and the details of how to use them.