Goal
Create an "interface" (written in Ruby) between OMA's processor boxes and Amazon's Elastic MapReduce (EMR) service.
The interface should make it possible to:
- specify a job flow (collection of related jobs)
- provide parameters to the job flow
- specify callbacks (on success and on failure)
Additional options (to be taken into account in the design, but not implemented immediately):
- ability to monitor active jobs (flows)
- ability to shut down active jobs (flows)
Implementation
ActiveRecord based implementation (rejected)
Create an ActiveRecord model representing a single job flow instance. Create a model for each kind of flow using AR single table inheritance (STI).
An hourly cron task (in oma-processors) checks the active and finished job flow records and invokes the callbacks for the finished ones.
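One pass of that hourly check could look roughly like the sketch below. It uses a plain Struct in place of the ActiveRecord model so that it runs on its own; the status names, the callbacks_called flag, and the on_success / on_failure methods are assumptions made for illustration, not part of the proposal.

```ruby
# Hypothetical stand-in for a job flow record; in the proposal this would be
# an ActiveRecord model persisted in Postgres.
JobFlowRecord = Struct.new(:id, :status, :callbacks_called) do
  # Assumed status names; the real ones would come from EMR.
  def finished?
    %w[completed failed].include?(status)
  end

  # Hypothetical callbacks, implemented as plain methods on the record.
  def on_success
    "flow #{id} succeeded"
  end

  def on_failure
    "flow #{id} failed"
  end
end

# One pass of the hourly check: call the matching callback for every
# finished flow that has not been handled yet, then mark it as handled.
def process_finished_flows(flows)
  flows.select { |flow| flow.finished? && !flow.callbacks_called }.map do |flow|
    result = flow.status == "completed" ? flow.on_success : flow.on_failure
    flow.callbacks_called = true
    result
  end
end

flows = [
  JobFlowRecord.new(1, "running", false),
  JobFlowRecord.new(2, "completed", false),
  JobFlowRecord.new(3, "failed", false)
]
puts process_finished_flows(flows)
```

Marking a flow as handled keeps the next hourly run from invoking the same callback twice.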
Possible usage:
# oma-models/lib/models/postgres/job_flows/emr_base.rb
module JobFlows
class EmrBase < ActiveRecord::Base
...
# oma-models/lib/models/postgres/job_flows/pages_es_index_updater.rb
module JobFlows
class PagesEsIndexUpdater < EmrBase
...
# oma-processors/...
job_flow = ::JobFlows::PagesEsIndexUpdater.create!(domain_id: domain.id)
job_flow.run
active_flow = ::JobFlows::PagesEsIndexUpdater.active.first
JobFlows::EmrBase (and its subclasses) use the rslifka/elasticity gem under the hood.
Pros
- History. Finished job flows remain stored in Postgres, which preserves their initial arguments, final statuses, and created artifacts (URLs of created files, etc.).
Cons
- The new ActiveRecord classes pollute oma-models with processor implementation details. In particular, oma-models would depend on the rslifka/elasticity gem.
- Callbacks (on job flow success or failure) are implemented as methods of an AR class, so they cannot take advantage of closures.
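To illustrate what the closure point refers to, here is a minimal sketch of callbacks passed as lambdas instead of being defined as model methods. FlowRunner, build_runner, and the status symbols are hypothetical names used only for this example.

```ruby
# Hypothetical runner that accepts callbacks as lambdas rather than as
# methods defined on a model class.
class FlowRunner
  def initialize(on_success:, on_failure:)
    @on_success = on_success
    @on_failure = on_failure
  end

  # Simulates a finished flow; a real runner would learn the status from EMR.
  def finish(status)
    status == :completed ? @on_success.call : @on_failure.call(status)
  end
end

def build_runner(domain_id, log)
  # Both lambdas close over domain_id and log from the calling scope, so no
  # extra columns or instance state are needed to carry that context.
  FlowRunner.new(
    on_success: -> { log << "reindexed domain #{domain_id}" },
    on_failure: ->(status) { log << "domain #{domain_id}: #{status}" }
  )
end

log = []
build_runner(42, log).finish(:completed)
build_runner(7, log).finish(:failed)
puts log
```

With callbacks written as methods of an AR class, the same context would have to be stored on the record and read back when the callback runs.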
S3 based implementation (rejected)
Create a Ruby class (module?) representing a single job flow instance. Use Amazon S3 as the persistence layer. Save the list of current (unfinished) job flows as a file on S3 (CSV?). Create a Ruby class for each particular kind of job flow.
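If CSV were chosen, the list of unfinished job flows could be serialized roughly as in the sketch below. It covers only the CSV round trip with Ruby's csv library; uploading the resulting text to S3 and downloading it again are left out. The column names and the sample values are assumptions made for illustration.

```ruby
require "csv"

# Hypothetical columns for the list of unfinished job flows.
ACTIVE_FLOW_COLUMNS = %w[job_flow_id kind domain_id].freeze

# Serializes an array of hashes (string keys) to CSV text with a header row.
def dump_active_flows(flows)
  CSV.generate do |csv|
    csv << ACTIVE_FLOW_COLUMNS
    flows.each { |flow| csv << ACTIVE_FLOW_COLUMNS.map { |key| flow.fetch(key) } }
  end
end

# Parses the CSV text back into an array of hashes; all values are strings.
def load_active_flows(text)
  CSV.parse(text, headers: true).map(&:to_h)
end

flows = [
  { "job_flow_id" => "j-1", "kind" => "pages_es_index_updater", "domain_id" => "42" }
]
text = dump_active_flows(flows)
puts text
puts load_active_flows(text) == flows
```

Since every value comes back as a string, numeric fields such as the domain id would need converting after loading.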