Big Data, an ecosystem

The Big Data ecosystem consists of multiple products, each of them exhibiting some or all of the characteristics evoked in our definitions. Not only do they handle high volumes of data (Volume), they also let you process data quickly (Velocity) and without a fixed schema (Variety). Below is a list of products with a short description of what each one does.

Management layer

Apache Ambari

A Hadoop cluster management server, with a Web GUI to set up HDP or HDF.

Ambari is an open source project from the Apache Foundation. You can download it for free from the main web site, but Ambari is also the core of the Big Data platforms delivered by the company Hortonworks. If you install an Ambari Server instance, whether you get the software from Hortonworks or from the Apache Foundation web site, you are setting up the same thing. Hortonworks participates in the development of Ambari and of all the software installed and managed through it.
With Ambari, you can install a Hadoop cluster, that is, a Hortonworks Data Platform (HDP). You can also use it to set up a Hortonworks Data Flow (HDF), the platform built around Apache NiFi.
Most of the software solutions listed hereafter can be installed and managed by Ambari (HDFS, YARN, ZooKeeper, Sqoop, MapReduce, Mahout, Tez, Pig, Oozie, Hive, HBase, Solr, NiFi, Kafka, Storm, Ranger, Flume, ...)
Note that the company Hortonworks is one of the major contributors to the various Hadoop open source projects.

Data storage layer

Tools that help you store your (big) data.
More information on this layer, especially on the NoSQL and indexing data stores enumerated here, is given in the section about NoSQL data stores.

Apache Hadoop Distributed File System (HDFS)

Core of the Hadoop framework with YARN and MapReduce
A distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
It is one of the four core components of the Hadoop cluster:
  1. Hadoop Common – contains libraries and utilities needed by the other Hadoop modules
  2. HDFS – the distributed file system itself
  3. Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and scheduling users' applications
  4. Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing, running on YARN
All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.
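The MapReduce programming model mentioned above can be sketched in miniature in plain Python; this is an illustrative toy (invented function names and data), not the Hadoop API:

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big cluster", "data flows in the cluster"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

In a real cluster, the map and reduce steps run in parallel on many nodes and the shuffle moves data across the network; the logic per record, however, stays this simple.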

Apache Cassandra

Simple NoSQL database, targeting large amounts of data, scalability and simplicity
  • Column oriented, keys-values
  • Roots in Google’s Bigtable
  • Easy to manage at scale
  • Easy and powerful scale-out capabilities
  • Includes support for replication across multiple data centers
  • Exceptional write performance
  • You have to understand the data model
  • Support for automatic sharding
A column-oriented DBMS is a database management system (DBMS) that stores data tables as sections of columns of data rather than as rows of data (Wikipedia)
A database shard is a horizontal partition of data in a database or search engine. Each individual partition is referred to as a shard or database shard. Each shard is held on a separate database server instance, to spread load. (Wikipedia)
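The sharding idea from the definition above can be illustrated in a few lines of Python; the shard count and keys are invented for the example, and real systems such as Cassandra use consistent hashing over a token ring rather than a simple modulo:

```python
import hashlib

N_SHARDS = 4  # illustrative shard count

def shard_for(key: str) -> int:
    """Deterministically map a row key to one of N_SHARDS partitions."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

# Each shard would live on a separate server; here they are just dicts.
shards = {i: {} for i in range(N_SHARDS)}
for user_id in ("alice", "bob", "carol", "dave"):
    shards[shard_for(user_id)][user_id] = {"visits": 0}
```

Because the mapping is deterministic, any node can compute which shard holds a given key without a central lookup table.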


MongoDB

Scalable NoSQL database designed for OLTP workloads
Much of MongoDB's success is based on its innovation as a data structure store that lets you model the "things" at the heart of your applications more easily and expressively.
Having the same basic data model in the code and in the database is the superior method for most use cases, as it dramatically simplifies application development and eliminates the layers of complex mapping code otherwise required.
  • Easy out-of-the-box experience
  • Can take advantage of Cloud infrastructure (to easily deploy new nodes for scaling out)
  • Document based, not pure keys-values
  • Automatic backup capabilities
  • Replication
  • Designed for OLTP workloads, not for complex transactions
  • Support for automatic sharding
Online Transaction Processing (OLTP) applications are high-throughput and insert- or update-intensive in database management. These applications are used concurrently by hundreds of users. (Wikipedia)
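The document model described above can be sketched as a toy Python stand-in for a collection; the function names and data are invented for the example and this is not the MongoDB driver API:

```python
# A toy document "collection": the structure in code is the structure stored.
collection = []

def insert_one(doc: dict) -> None:
    """Store the document as-is, with no fixed schema to conform to."""
    collection.append(doc)

def find(filter_: dict) -> list:
    """Return documents whose fields match every key/value in the filter."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in filter_.items())]

# Documents in the same collection may have different fields.
insert_one({"name": "Ada", "role": "engineer", "skills": ["python", "spark"]})
insert_one({"name": "Max", "role": "analyst"})
engineers = find({"role": "engineer"})
```

Note how the nested list of skills is stored directly, with no mapping layer between the application object and the database record.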


Elastic (Elasticsearch)

A generic distributed full-text search and indexing engine, with visualization, alerting and notification mechanisms. It provides dedicated tools for log-related search and indexing.
A tool to build a distributed search engine, using Apache Lucene for the indexing, with a REST API and the JSON data format so that it can be used from any programming language.
It is also multi-tenant, near real-time and able to do full-text search on any kind of document.
As such, Elastic itself is also an ecosystem built around the distributed search engine, including:
  • ElasticSearch, the distributed search engine
  • Logstash for collecting and parsing log messages
  • Kibana for visualization
  • Beats for data shipping
  • Watcher for alerting and notifications
  • Shield to implement an RBAC layer on your data
  • ES-hadoop to integrate ElasticSearch and Hadoop
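At the heart of Lucene-based search sits an inverted index, which maps each term to the documents containing it. A minimal Python sketch of the idea (all names and data are invented for the illustration):

```python
from collections import defaultdict

index = defaultdict(set)  # term -> set of document ids containing it

def index_document(doc_id: int, text: str) -> None:
    """Add every term of the document to the inverted index."""
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query: str) -> set:
    """Return ids of documents containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index[terms[0]].copy()
    for term in terms[1:]:
        results &= index[term]   # intersect posting lists
    return results

index_document(1, "error in payment service")
index_document(2, "payment completed")
hits = search("payment error")
```

A real engine adds tokenization, relevance scoring and distribution of the index over shards, but the lookup structure is the same.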

Apache HBase

NoSQL database that can leverage existing SQL expertise while building on a more modern, distributed database
  • Roots in Google’s BigTable
  • Needs HDFS / Hadoop, and can thus be managed through the same cluster management tools (Ambari, ZooKeeper, …)
  • Column oriented, keys-values
  • Enables fast, random reads & writes on top of unstructured HDFS
  • Uses memory to cache heavily, with HDFS as the persistence layer
  • Highly scalable by design (scale-out), can run on multiple nodes
  • Optimal performance when consistency is critical
  • Integration with the Hadoop stack for analytics
  • Support for automatic sharding
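HBase's data model is essentially a sorted, multi-versioned map: row key → column family → qualifier → timestamped values. It can be sketched in Python as nested dicts (a toy illustration with invented names, not the HBase client API):

```python
import time

# table: row key -> column family -> qualifier -> list of (timestamp, value)
table = {}

def put(row: str, family: str, qualifier: str, value: str) -> None:
    """Append a new version of the cell; old versions are kept."""
    versions = (table.setdefault(row, {})
                     .setdefault(family, {})
                     .setdefault(qualifier, []))
    versions.append((time.time(), value))

def get(row: str, family: str, qualifier: str) -> str:
    """Return the most recent version of a cell."""
    return table[row][family][qualifier][-1][1]

put("row1", "info", "city", "Brussels")
put("row1", "info", "city", "Ghent")   # a newer version of the same cell
```

In the real system, rows are kept sorted by key and split into regions (shards) served by different nodes, which is what enables the fast random reads and writes listed above.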


NoSQL database, graph and document oriented, with scalability and distribution
  • Multi-model: document, graph, keys-values and objects
  • Scalable & distributed with a multi-master mode
  • Sharding not (yet) fully functional
  • Strong security profiling
  • May not be ideal for smaller (non-huge) datasets

Analytic layer

Tools that perform what your business applications and users require.


A framework to build, distribute and execute applications performing analytics and visualization, allowing you to run any kind of analysis over multiple data sets across a variety of systems.
It is based on Python, and commercial support is offered.
It handles batch, interactive and real-time processing of huge amounts of data.
It looks more like a framework to develop applications, or bundles of applications, that need to be deployed somewhere to be executed, such as on a Hadoop cluster, and it is filled with many "little" components that can help you build whatever you need.
It includes various packages, based on Python or the R language, and even integrates ElasticSearch.
R Studio can be used, and GIS data can be ingested.
It helps you develop, build, version, deploy and run your code. Because it integrates and includes many other projects, it can be used for almost everything, from the deployment and management of a Hadoop cluster to the execution of applications performing complex data analytics and visualization, with interfaces to dig into the Big Data.

Apache Drill

Retrieve, join and analyse data from Hadoop, NoSQL or Cloud storage using a schema-free SQL query engine
Apache Drill is a low latency distributed query engine for large-scale datasets, including structured and semi-structured/nested data.
For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.
You can thus query different datastores simultaneously and join their results into the same result set:
  • MongoDB
  • HDFS
  • Hive
  • MapR-DB & MapR-FS
  • HBase
  • Amazon S3
  • Microsoft Azure Blob Storage
  • Google Cloud Storage
  • Openstack Swift
  • NAS storage
  • Local files storage
It provides a REST API so that any application or programming language can integrate with it.
On top of Drill, you can put your own business, analyst or data science tool, such as MicroStrategy, SAS, D3 or Excel, to interact with Drill's underlying non-SQL datastores.
It runs as a single application on your laptop, or on a multi-node cluster for a distributed implementation of the engine.
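The kind of cross-datastore join Drill performs can be illustrated in pure Python, with a list of dicts standing in for a MongoDB collection and a CSV text standing in for an event-log file on HDFS (all field names and data are invented for the example):

```python
import csv
import io

# "MongoDB-like" collection of user profiles (documents)
profiles = [
    {"user_id": "u1", "name": "Ada"},
    {"user_id": "u2", "name": "Max"},
]

# "HDFS-like" event log, here a CSV text parsed into rows
event_log = "user_id,event\nu1,login\nu2,login\nu1,purchase\n"
events = list(csv.DictReader(io.StringIO(event_log)))

# Join the two heterogeneous sources on user_id into one result set
by_id = {p["user_id"]: p for p in profiles}
joined = [{"name": by_id[e["user_id"]]["name"], "event": e["event"]}
          for e in events if e["user_id"] in by_id]
```

Drill does the same thing declaratively, from a single SQL statement, and pushes the work down to each datastore where it can.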

Apache Hive

Pseudo-SQL engine to give structure to unstructured content
The Apache Hive ™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.
At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
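The "project structure onto raw data, then query it with SQL" idea can be illustrated with Python's built-in sqlite3 module standing in for Hive; the log format and table name are invented for the example:

```python
import sqlite3

# Raw, unstructured log lines, as they might sit in distributed storage
raw_lines = [
    "2024-01-01 ERROR disk full",
    "2024-01-01 INFO backup done",
    "2024-01-02 ERROR net down",
]

# Project a structure onto the raw text ("schema on read")...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (day TEXT, level TEXT, message TEXT)")
for line in raw_lines:
    day, level, message = line.split(" ", 2)
    conn.execute("INSERT INTO logs VALUES (?, ?, ?)", (day, level, message))

# ...then query it with plain SQL, as HiveQL would over HDFS files
errors = conn.execute(
    "SELECT COUNT(*) FROM logs WHERE level = 'ERROR'").fetchone()[0]
```

The difference in Hive is scale: the table definition is metadata over files in HDFS, and the query compiles down to distributed jobs instead of running in-process.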

Apache Phoenix

A pure SQL layer on top of HBase
Phoenix is an open source SQL skin for HBase. You use the standard JDBC APIs instead of the regular HBase client APIs to create tables, insert data, and query your HBase data.
On top of that, using JDBC and SQL:
  • Are already well-known and widely used technologies that can help spread HBase usage among enterprises
  • Reduces the amount of code users need to write
  • Allows for performance optimizations transparent to the user
  • Opens the door for leveraging and integrating lots of existing tooling

Apache Pig

Script language and framework to create MapReduce tasks on a Hadoop cluster
Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMS systems. Pig Latin can be extended using UDF (User Defined Functions) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.

Processing layer

The tools to bring your data into your storage

Apache Spark

Cluster computing framework allowing multi-stage in-memory processing of data from distributed storage; not Hadoop specific.
Spark is a clustered framework to process data, running:
  • Under its own cluster control mechanism
  • Under YARN on a Hadoop cluster
  • Under Apache Mesos
It interfaces with a wide variety of distributed storage, including:
  • Hadoop Distributed File System (HDFS)
  • Cassandra
  • OpenStack Swift
  • Amazon S3
  • Kudu
  • Other custom solutions
Spark claims to be 100x faster than standard Hadoop MapReduce applications in memory and 10x faster on disk.

Apache TEZ

Framework allowing the creation of applications that run complex sets of tasks for processing data
By allowing projects like Apache Hive and Apache Pig to run a complex DAG of tasks, Tez can process data in a single Tez job where multiple MapReduce jobs were previously needed: what takes three distinct jobs in a pure MapReduce implementation becomes a single job with Tez.
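The gain can be sketched in Python: three MapReduce-style stages chained in memory as one pipelined job, instead of three separate jobs each writing intermediate results back to storage. Stage names and data are invented for the illustration:

```python
# Three MapReduce-style stages; a DAG engine like Tez chains them in one
# job instead of three jobs with intermediate writes to HDFS.
def stage_filter(records):
    """Stage 1: keep only ERROR records."""
    return [r for r in records if r["level"] == "ERROR"]

def stage_extract(records):
    """Stage 2: project out the host field."""
    return [r["host"] for r in records]

def stage_count(hosts):
    """Stage 3: count errors per host."""
    counts = {}
    for h in hosts:
        counts[h] = counts.get(h, 0) + 1
    return counts

records = [
    {"level": "ERROR", "host": "web1"},
    {"level": "INFO",  "host": "web1"},
    {"level": "ERROR", "host": "web2"},
    {"level": "ERROR", "host": "web1"},
]

# A single pipelined "job": each stage's output flows straight to the next
result = stage_count(stage_extract(stage_filter(records)))
```

Skipping the intermediate materialization between stages is precisely where Tez saves time over chained MapReduce jobs.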

Apache Storm

A real-time computational engine on a distributed infrastructure
Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
Storm integrates with the queuing and database technologies you already use. A Storm topology consumes streams of data and processes (transforms, computes, …) those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation as needed.
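A Storm-style topology, a spout feeding a chain of processing bolts, can be mimicked with Python generators; this is a toy single-process illustration with invented names, not the Storm API:

```python
def spout(lines):
    """Source of the stream: emits tuples one at a time."""
    for line in lines:
        yield line

def split_bolt(stream):
    """First processing stage: split each line into words."""
    for line in stream:
        for word in line.split():
            yield word

def count_bolt(stream):
    """Second stage: maintain a running count per word."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

lines = ["storm streams tuples", "storm counts tuples"]
counts = count_bolt(split_bolt(spout(lines)))
```

In Storm, each of these stages runs as many parallel tasks across the cluster, with tuples repartitioned between stages; the per-tuple logic stays this small.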

Apache Sqoop

The link between unstructured datastores (Hadoop) and structured ones (RDBMS)
Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free form SQL query as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase. Exports can be used to put data from Hadoop into a relational database.

Apache Kafka

Publish-subscribe queue (messaging) rethought as a distributed commit log
It is a real-time (low-latency), high-throughput queuing (messaging) platform.
As with any message queuing system, its purpose is to take incoming messages expressed in one protocol and deliver them in the formal protocol expected by the receiver. A commit mechanism is in place to assure the sender that the message has really been delivered.
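The "distributed commit log" idea can be sketched in Python: an append-only log per topic, with each consumer committing the offset of its last read. This is a toy single-process model with invented names, not the Kafka protocol:

```python
class Topic:
    """An append-only log; consumers track their own read offsets."""

    def __init__(self):
        self.log = []        # messages, in arrival order
        self.offsets = {}    # consumer name -> next offset to read

    def publish(self, message: str) -> None:
        self.log.append(message)

    def consume(self, consumer: str) -> list:
        """Return unread messages and commit the new offset."""
        start = self.offsets.get(consumer, 0)
        messages = self.log[start:]
        self.offsets[consumer] = len(self.log)   # commit the offset
        return messages

topic = Topic()
topic.publish("order created")
topic.publish("order paid")
first_batch = topic.consume("billing")    # both messages
topic.publish("order shipped")
second_batch = topic.consume("billing")   # only the new one
```

Because the log itself never changes, many independent consumers can read the same topic at their own pace, each with its own committed offset; Kafka adds partitioning and replication of the log across brokers.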

Apache NiFi

Distributed tool to Extract, Transform and Load data in a flow-based model with a graphical interface
Apache NiFi is a software project which enables the automation of data flow between systems. NiFi helps move and track data. The project is written using flow-based programming and provides a web-based user interface to manage data flows in real time.
Each process you define in a flow is independent of the others. Processes exchange data in the form of so-called flow files, each containing the data sent into the flow plus some attributes (metadata).
Processes can be:
  • Data receivers (syslog, TCP, Kafka consumers, …)
  • Data writers (Kafka “put”, ElasticSearch, HDFS writer, …)
  • Data transformers (change text, mapping, adding information, …)
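The flow-file model can be sketched in Python, with dicts carrying content plus attributes between three toy processes of the kinds listed above; names and data are invented for the example:

```python
def receive(lines):
    """Receiver process: wrap each raw line in a flow file."""
    return [{"attributes": {"source": "syslog"}, "content": line}
            for line in lines]

def transform(flow_files):
    """Transformer process: change the content, add an attribute."""
    out = []
    for ff in flow_files:
        out.append({"attributes": {**ff["attributes"], "transformed": "yes"},
                    "content": ff["content"].upper()})
    return out

def write(flow_files):
    """Writer process: here, simply collect the final contents."""
    return [ff["content"] for ff in flow_files]

result = write(transform(receive(["disk full", "net down"])))
```

In NiFi, each such process is a box on the graphical canvas, the queues between them are visible and configurable, and the attributes travel with the data for routing and provenance tracking.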

Apache Flume

Distributed tool to Extract, Transform and Load data
Apache Flume allows the building of distributed and reliable streams of data flows. It uses a simple, flexible and fault-tolerant architecture.
You configure your:
  • Sources (files, Kafka, syslog, TCP, …)
  • Sinks (Kafka “put”, ElasticSearch, HDFS writer, …)
  • Channels, that link sources and sinks together with interceptors that perform transformation of data (aggregations, modifications, mapping, …)

Visualization layer

Tools that focus more on presenting the data than on allowing complex analysis and transformation of them.


D3.js

Generic visualization of data in the web browser
Javascript library allowing the presentation of data in the form of a web page using HTML, CSS and SVG. It is based on web standards and can use the full capabilities of all modern web browsers.
E.g.: generate an HTML table from an array of numbers.
Its goal: efficient manipulation of documents based on data.
The data has to be loaded into the web page, where D3, running in the browser, uses it to build the representation.


QGIS

Analysis and visualization of geographical data
An open source set of libraries and tools to create, edit, visualise, analyse and publish geospatial information.
  • View data
  • Explore data and compose maps
  • Create, edit, manage and export data
  • Analyse data
  • Publish maps on the Internet
This is a tool supporting functionality extensions through a plugin architecture: create your own plugin if you want to add specific features. Plugins can be written in Python. The core of the application is written in C++ using the Qt framework.
QGIS is thus a desktop tool, with a (nice) GUI to help you create your geographical representations.

Apache Zeppelin

Multi-purpose notebook to discover, ingest, analyse, visualize and collaborate on data insights.
This is an implementation of a notebook interface (also called a computational notebook or data science notebook): a virtual notebook environment for literate programming, which pairs the functionality of word processing software with executable code in the notebook's programming language.
Used for various purposes, like:
  • Visualization
  • Analysis
  • Ingestion

Programming languages

The programming languages often used to develop applications or jobs sitting in any of the above-mentioned layers

R environment

A programming language oriented towards analytics, and its integrated development environment
R is a free software environment for statistical computing and graphics. It is a GNU project.
One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed.
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes
  • an effective data handling and storage facility,
  • a suite of operators for calculations on arrays, in particular matrices,
  • a large, coherent, integrated collection of intermediate tools for data analysis,
  • graphical facilities for data analysis and display either on-screen or on hardcopy, and
  • a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
Besides the R language, there is an IDE called R Studio, which offers all the facilities of any IDE (syntax highlighting, debugging, integrated help, …)


A programming language, not specifically oriented towards analytics and data science
A programming language invented at Google, inspired by C and Pascal. It is meant to be simple to learn and quick to compile.
This is not specifically oriented to the Big Data & analytics world.


A scripting and programming language, not related to analytics or Big Data as such
It works as a script (interpreted) or as a binary program (compiled).
It comes with tons of libraries that extend the core of the language.
This is not specifically oriented to the Big Data & analytics world.


Another programming language, not related to analytics or Big Data as such
A dynamic, open source programming language with a focus on simplicity and productivity. It has an elegant syntax that is natural to read and easy to write.
This is not specifically oriented to the Big Data & analytics world.