Goal

Create an "interface" (written in Ruby) between OMA's processor boxes and Amazon's Elastic MapReduce (EMR) service.

The interface should allow callers to:

specify a job flow (collection of related jobs)
provide parameters to the job flow
specify callbacks (on success and on failure)

Additional options (should be taken into account, but not implemented immediately):

ability to monitor active jobs (flows)
ability to shut down active jobs (flows)
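
The requirements above can be sketched as a small plain-Ruby object. This is only an illustration of the intended shape: `JobFlowSpec`, `step`, `on_success`, `on_failure` and `finish` are hypothetical names, not part of OMA or of any EMR client library, and nothing here talks to EMR.

```ruby
# Hypothetical sketch: these names are assumptions made for illustration.
class JobFlowSpec
  attr_reader :name, :params, :steps

  def initialize(name, params = {})
    @name = name        # identifies the kind of job flow
    @params = params    # parameters passed to the whole flow
    @steps = []         # the related jobs that make up the flow
    @on_success = nil
    @on_failure = nil
  end

  # Add one job (step) to the flow.
  def step(step_name, args = {})
    @steps << { name: step_name, args: args }
    self
  end

  # Callbacks are registered as blocks so they can close over local state.
  def on_success(&block)
    @on_success = block
    self
  end

  def on_failure(&block)
    @on_failure = block
    self
  end

  # Invoke the matching callback once the flow reaches a final status.
  def finish(status)
    callback = status == :completed ? @on_success : @on_failure
    callback&.call(self, status)
  end
end

log = []
flow = JobFlowSpec.new("pages_es_index_updater", { domain_id: 42 })
flow.step("export_pages").step("update_index", { batch_size: 500 })
flow.on_success { |f, _status| log << "#{f.name} finished" }
flow.on_failure { |f, status| log << "#{f.name} ended with #{status}" }
flow.finish(:completed)
```

Monitoring and shutting down active flows are deliberately left out, matching the "not implemented immediately" note; they could later be added as further methods on the same object.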

Implementation

ActiveRecord-based implementation (rejected)

Create an ActiveRecord model representing a single job flow instance. Create a model for each kind of flow using AR single table inheritance (STI).

An hourly cron task (in oma-processors) checks active and finished job flow records and invokes the callbacks of the finished ones.
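
One pass of such an hourly check could look roughly like the sketch below. It uses a plain `Struct` instead of an ActiveRecord model so that it runs on its own; `FlowRecord`, `check_job_flows`, the `callbacks_fired` flag and the status symbols are all assumptions for illustration, and `fetch_status` stands in for whatever call would ask EMR for the current state.

```ruby
# Hypothetical stand-in for a job flow row; the real model would be an AR class.
FlowRecord = Struct.new(:id, :status, :callbacks_fired)

# Illustrative status values, not the actual EMR state names.
FINAL_STATUSES = %i[completed failed].freeze

# One pass of the periodic check. `fetch_status` is any callable that returns
# the current status for a flow id.
def check_job_flows(records, fetch_status, on_success:, on_failure:)
  records.each do |record|
    next if record.callbacks_fired # already handled by an earlier run

    record.status = fetch_status.call(record.id)
    next unless FINAL_STATUSES.include?(record.status)

    callback = record.status == :completed ? on_success : on_failure
    callback.call(record)
    record.callbacks_fired = true # make sure callbacks run only once
  end
end
```

Tracking whether callbacks have already fired matters because the task runs repeatedly over the same records; without it, a finished flow would trigger its callback every hour.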

Possible usage:


# oma-models/lib/models/postgres/job_flows/emr_base.rb

module JobFlows
  class EmrBase < ActiveRecord::Base
  ...

# oma-models/lib/models/postgres/job_flows/pages_es_index_updater.rb
module JobFlows
  # STI subclass: inherits from EmrBase rather than directly from ActiveRecord::Base
  class PagesEsIndexUpdater < EmrBase
  ...

# oma-processors/...

job_flow = ::JobFlows::PagesEsIndexUpdater.create!(domain_id: domain.id)
job_flow.run

active_flow = ::JobFlows::PagesEsIndexUpdater.active.first

JobFlows::EmrBase (and its subclasses) use the rslifka/elasticity gem under the hood.

Pros

History. Finished job flows remain stored in Postgres, which preserves their initial arguments, final statuses, and created artifacts (URLs of created files, etc.).

Cons

A new ActiveRecord class pollutes oma-models with processor implementation details. In particular, oma-models would depend on the rslifka/elasticity gem.
Callbacks (on job flow success or failure) are implemented as methods of an AR class, so they cannot take advantage of closures.
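
The difference between the two callback styles can be illustrated with plain Ruby objects. Both classes below are hypothetical and exist only to show the contrast: a method-based callback can only use what the object itself stores, while a block captures variables from the place where the flow is created.

```ruby
# Method-based callback: only sees the attributes stored on the object.
class MethodCallbackFlow
  attr_reader :domain_id, :log

  def initialize(domain_id)
    @domain_id = domain_id
    @log = []
  end

  def on_success
    @log << "indexed domain #{domain_id}"
  end
end

# Closure-based callback: the block keeps access to the caller's local variables.
class ClosureCallbackFlow
  def initialize(&on_success)
    @on_success = on_success
  end

  def succeed
    @on_success.call
  end
end

requested_by = "nightly-reindex" # local context, never stored on the flow
notifications = []
flow = ClosureCallbackFlow.new { notifications << "done for #{requested_by}" }
flow.succeed
```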

S3-based implementation (rejected)

Create a Ruby class (module?) representing a single job flow instance. Use Amazon S3 as the persistence layer: save the list of current (unfinished) job flows as a file on S3 (CSV?). Create a Ruby class for each particular kind of job flow.
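
A minimal sketch of the "list of unfinished flows as a CSV file" idea follows. To keep it runnable without AWS credentials, a hypothetical in-memory `MemoryStore` stands in for the S3 bucket; `ActiveFlowList`, the object key and the column names are likewise assumptions, not an agreed format.

```ruby
require "csv"

# Hypothetical in-memory stand-in for an S3 bucket (key => body).
class MemoryStore
  def initialize
    @objects = {}
  end

  def write(key, body)
    @objects[key] = body
  end

  def read(key)
    @objects[key]
  end
end

# Keeps the list of unfinished job flows in a single CSV object.
class ActiveFlowList
  HEADERS = %w[flow_id kind started_at].freeze # assumed columns

  def initialize(store, key = "job_flows/active.csv")
    @store = store
    @key = key
  end

  # Returns the unfinished flows as an array of hashes with string values.
  def all
    body = @store.read(@key)
    return [] if body.nil?

    CSV.parse(body, headers: true).map(&:to_h)
  end

  def add(flow_id, kind, started_at)
    save(all << { "flow_id" => flow_id, "kind" => kind, "started_at" => started_at })
  end

  def remove(flow_id)
    save(all.reject { |row| row["flow_id"] == flow_id })
  end

  private

  # Rewrites the whole object: header row first, then one row per flow.
  def save(rows)
    body = CSV.generate do |csv|
      csv << HEADERS
      rows.each { |row| csv << row.values_at(*HEADERS) }
    end
    @store.write(@key, body)
  end
end
```

Every change is a read-modify-write of the whole file, so two processes updating the list at the same time could overwrite each other's changes; a real implementation would need to account for that.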

Interface to EMR Hadoop jobs