Running Hive, Pig or Streaming jobs
Running Hive, Pig or Streaming scripts is possible, but since we haven't needed it ourselves, there is no declarative interface for them. The suggestions below are not difficult, but they do assume a fair understanding of Clojure (and Java interoperability).
Steps created with defstep are maps which are eventually used to construct StepConfig objects; the entity defined by defstep is passed to (fire!) to make this happen. But (fire!) can also accept StepConfig instances directly.
So, in brief, to make a StepConfig for Hive or Pig, use the AWS SDK's StepFactory to construct a HadoopJarStepConfig object, then call the StepConfig constructor with a name (a String) and the HadoopJarStepConfig instance.
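For example, here is a minimal sketch for Hive (not part of lemur; it assumes the AWS Java SDK is on your classpath, and the bucket/script names are placeholders):

```clojure
(import '[com.amazonaws.services.elasticmapreduce.util StepFactory]
        '[com.amazonaws.services.elasticmapreduce.model StepConfig])

(def step-factory (StepFactory.))

;; Hive must be installed on the jobflow before a script can run.
(def install-hive
  (StepConfig. "install-hive" (.newInstallHiveStep step-factory)))

;; Run the script; the empty array stands in for optional script arguments.
(def run-hive
  (StepConfig. "run-hive-script"
               (.newRunHiveScriptStep step-factory
                                      "s3://my-bucket/scripts/my-query.q"
                                      (into-array String []))))
```

Both values are StepConfig instances, so they can be passed to (fire!) alongside (or instead of) steps defined with defstep.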
You have several options for a Streaming job:
- Create a StepConfig similar to the Hive/Pig description above, but build the HadoopJarStepConfig with the SDK's StreamingStep helper (its toHadoopJarStepConfig method produces the config); see the sketch after the code below.
- At Climate Corporation, we wrap many of our Streaming jobs in Cascalog queries.
- Here is some Clojure code using the helper com.climate.services.aws.emr/step-config:
```clojure
(emr/step-config
  "stream-step"                                          ; step name
  false
  "/home/hadoop/contrib/streaming/hadoop-streaming.jar"  ; streaming jar preinstalled on EMR nodes
  nil                                                    ; nil main-class: the jar's manifest Main-Class is used
  ["-input"  (format "s3://%s/data/simple.txt" bucket)   ; CLI args for hadoop-streaming;
   "-output" "/out"                                      ; `bucket` holds your S3 bucket name
   "-mapper" (format "s3://%s/scripts/wc.sh" bucket)])
```