An overview of data processing technologies and ecosystems that might be interesting for us.
Spark
- Seems to be the current favorite. Most recommendations we have seen prefer it over Hadoop.
- Has a model for both streaming and batch (map-reduce style) processing
- Supports exploratory queries through Spark SQL. Designed to support ML algorithms.
- Supported on Amazon out of the box (Elastic MapReduce)
- Very strong community
- No Ruby API. Scala, Java, or Python, and there doesn't seem to be a plan for JRuby support
- Has very good support for Elasticsearch
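
To make the "streaming and batch" point concrete, here is a minimal sketch in plain Python. It is not Spark's API; the function names are made up for illustration. It only shows the idea that a map-reduce job written for a full dataset can be reused on a sequence of small batches, which is roughly how micro-batch streaming works.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, Iterator, List


def map_phase(line: str) -> Counter:
    """Map step: turn one input line into partial word counts."""
    return Counter(line.lower().split())


def reduce_phase(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial results into one."""
    return left + right


def batch_word_count(lines: Iterable[str]) -> Counter:
    """Batch mode: the whole dataset is available up front."""
    return reduce(reduce_phase, map(map_phase, lines), Counter())


def streaming_word_count(batches: Iterable[List[str]]) -> Iterator[Counter]:
    """Streaming mode: run the batch job on each small batch and
    keep a running total, yielding the state after every batch."""
    running: Counter = Counter()
    for batch in batches:
        running = reduce_phase(running, batch_word_count(batch))
        yield running


if __name__ == "__main__":
    lines = ["spark and hadoop", "spark streaming", "hadoop batch"]
    print(batch_word_count(lines))
    for state in streaming_word_count([lines[:2], lines[2:]]):
        print(state)
```

Once all batches have arrived, the streaming total matches the batch result, which is the appeal of having one model for both.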
Fluentd
- Data collection framework made for logs.
- Can route data to several endpoints, one being HDFS
- In-memory aggregations?
- http://docs.fluentd.org/articles/cep-norikra
- Complex event processing with Norikra, which runs on JRuby, including SQL queries over streams
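
A rough sketch of the two ideas above, again in plain Python and not Fluentd or Norikra code: every event is delivered to several sinks, and one of those sinks keeps per-window counts in memory. The event shape, `fan_out`, and `WindowCounter` are all hypothetical names chosen for illustration. A stream SQL engine would express the aggregation as a grouped count over a time window.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical event shape: (unix timestamp in seconds, tag)
Event = Tuple[float, str]


def fan_out(events: Iterable[Event], sinks: List[Callable[[Event], None]]) -> None:
    """Deliver every event to every sink, e.g. an archive and an aggregator."""
    for event in events:
        for sink in sinks:
            sink(event)


class WindowCounter:
    """Counts events per tag in fixed (tumbling) time windows, held in memory."""

    def __init__(self, window_seconds: int = 60) -> None:
        self.window_seconds = window_seconds
        # Maps window start time (seconds) to per-tag counts.
        self.windows: Dict[int, Counter] = defaultdict(Counter)

    def __call__(self, event: Event) -> None:
        timestamp, tag = event
        window_start = int(timestamp // self.window_seconds) * self.window_seconds
        self.windows[window_start][tag] += 1


if __name__ == "__main__":
    archive: List[Event] = []  # stands in for a long-term store such as HDFS
    counter = WindowCounter(window_seconds=60)
    events = [(0.0, "login"), (30.0, "login"), (61.0, "error"), (90.0, "login")]
    fan_out(events, [archive.append, counter])
    print(len(archive))
    print(dict(counter.windows))
```

The archive sink keeps every raw event while the counter keeps only small aggregates, which is the split we would want between long-term storage and live queries.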
Hadoop
- No concept of streams
- Old. Familiar. Mature.
Tutorials
- https://www.youtube.com/watch?v=Txjp37mR7xw
- Tutorial on big data processing on Google Cloud with Fluentd and Norikra. Skip to about 1 hr 30 min in.