
Processing Big Data with Titanoboa


Thanks to its distributed nature, titanoboa is well suited to Big Data processing.

You can also fine-tune how performant and robust your Big Data processing will be based on your job channel configuration: if you are using a job channel that is robust and highly available, so will be your Big Data processing.

If, on the other hand, you are using a job channel that does not persist messages, your setup will probably be more performant (but less robust). And of course you can combine these two approaches: it is perfectly possible to use multiple job channels and core systems in one titanoboa server.

Ultimately, if you use an SQS queue as a job channel, your processing can scale out almost without limit while remaining highly robust: your titanoboa servers can be located across multiple regions and availability zones!
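To make that last point a bit more tangible, here is a purely hypothetical sketch of a single server hosting two core systems with different job channels. The key names below are illustrative placeholders, not titanoboa's actual configuration schema; the point is only that a non-persistent channel favours throughput while an SQS-backed one favours durability and scale.

```clojure
;; Hypothetical sketch only -- these keys are placeholders,
;; not titanoboa's real server configuration format.
{:systems
 {:fast-core    {:job-channel  :in-memory      ;; high throughput, jobs lost if the node dies
                 :worker-count 8}
  :durable-core {:job-channel  :sqs            ;; persisted, can span regions and AZs
                 :queue-name   "titanoboa-jobs"
                 :worker-count 4}}}
```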

But let's not get ahead of ourselves and start from the beginning:

Map & Reduce steps

There are two workflow step supertypes designed exactly for the purpose of processing large(r) datasets:

  • :map - based on the sequence returned by this step's workload function, many separate atomic jobs are created (typically one per item in the sequence)
  • :reduce - performs a reduce function over the results returned by the jobs triggered by a :map step (both step types are sketched below)
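To make the two supertypes concrete before the full example, here is a rough sketch of what a job definition pairing a :map step with a :reduce step could look like. Apart from the :supertype values described above, the key names and function signatures are assumptions based on the general step-map shape used in titanoboa job definitions, so treat this as a conceptual outline rather than the documented format.

```clojure
;; Illustrative sketch only: apart from :supertype :map / :reduce, which this page
;; describes, the key names and fn signatures below are assumptions and may differ
;; from the real job-definition format shown in the example further down.
{:first-step "split"
 :steps [{:id          "split"
          :supertype   :map                ;; fan-out: spawns one atomic job per returned item
          ;; assumed contract: the workload fn returns a sequence, and each element
          ;; becomes the input properties of a separate job
          :workload-fn '(fn [properties]
                          (map (fn [chunk] {:chunk chunk})
                               (partition-all 1000 (:items properties))))
          :next        [["*" "aggregate"]]}
         {:id          "aggregate"
          :supertype   :reduce             ;; fan-in: folds the results of the spawned jobs
          ;; assumed contract: applied reduce-style over the results returned by the
          ;; jobs that the :map step triggered
          :workload-fn '(fn [accumulator result]
                          (merge-with + accumulator result))}]}
```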

Obligatory Hello World: Counting Words in Shakespeare