Big data definitions

Before digging into this world of huge amounts of data, streaming data flows and analytic applications, let's fix some basic ideas and define its ground concepts.

Big Data

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time. Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.
Big data requires a set of techniques and technologies with new forms of integration to reveal insights from datasets that are diverse, complex, and of a massive scale.
In a 2001 research report and related lectures, Gartner analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources).

Gartner, and now much of the industry, continue to use this "3Vs" model for describing big data.
In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
Gartner's definition of the 3Vs is still widely used, and in agreement with a consensual definition that states that "Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value".
Additionally, some organizations add a fourth V, "Veracity", to describe it, a revision challenged by some industry authorities.
The 3Vs have been expanded to other complementary characteristics of big data:
  • Volume: big data doesn't sample; it just observes and tracks what happens
  • Velocity: big data is often available in real-time
  • Variety: big data draws from text, images, audio, video; plus it completes missing pieces through data fusion
  • Machine Learning: big data often doesn't ask why and simply detects patterns
  • Digital footprint: big data is often a cost-free byproduct of digital interaction
The growing maturity of the concept more starkly delineates the difference between big data and Business Intelligence:
  • Business Intelligence uses descriptive statistics on data with high information density to measure things, detect trends, etc.
  • Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density[26] to reveal relationships and dependencies, or to perform predictions of outcomes and behaviours.
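
The contrast between the two kinds of statistics can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the observations are invented, and an ordinary least-squares line stands in for the far richer inductive methods used on real big data.

```python
from statistics import mean

# Hypothetical observations: (advertising spend, units sold).
observations = [(1.0, 12.0), (2.0, 19.0), (3.0, 31.0), (4.0, 38.0)]
spend = [x for x, _ in observations]
sold = [y for _, y in observations]

# Descriptive statistics (the BI style): summarize what was measured.
average_sold = mean(sold)

# Inductive statistics (the big data style): infer a law from the data,
# here a straight line fitted by ordinary least squares, then use it to
# predict an outcome that was never observed.
mean_x, mean_y = mean(spend), mean(sold)
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in observations)
    / sum((x - mean_x) ** 2 for x in spend)
)
intercept = mean_y - slope * mean_x
predicted_sold_at_5 = slope * 5.0 + intercept

print(average_sold, slope, intercept, predicted_sold_at_5)
```

The first number only describes the past; the fitted line is a (very simple) inferred relationship that can be applied to new inputs.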
In a popular tutorial article published in IEEE Access Journal, the authors classified existing definitions of big data into three categories:
  • Attribute Definition
  • Comparative Definition
  • Architectural Definition.
The authors also presented a big-data technology map that illustrates its key technological evolutions.

Data Lake

A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analysed to help answer the question. 
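
The idea of a flat store with unique identifiers, metadata tags and structure applied only at query time can be sketched as follows. This is a minimal in-memory illustration, not how any real data lake product works; the class `DataLake` and its methods `put` and `find` are hypothetical names invented for this example.

```python
import json
import uuid


class DataLake:
    """Toy in-memory lake: a flat store of raw payloads keyed by unique ID."""

    def __init__(self):
        self._objects = {}  # object_id -> (raw_bytes, metadata_tags)

    def put(self, raw: bytes, **tags) -> str:
        """Store a payload as-is, with no schema check, and return its ID."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (raw, dict(tags))
        return object_id

    def find(self, **tags):
        """Yield raw payloads whose metadata matches every given tag."""
        for raw, meta in self._objects.values():
            if all(meta.get(key) == value for key, value in tags.items()):
                yield raw


lake = DataLake()
# Structured, semi-structured and binary data sit side by side, untouched.
lake.put(b'{"customer": "c1", "amount": 12.5}', source="web_shop", fmt="json")
lake.put(b"c2;7.25", source="pos_terminal", fmt="csv")
lake.put(b"\x89PNG...", source="web_shop", fmt="png")

# Structure is applied only at read time, once a business question arises:
# "how much did web shop customers spend?"
web_orders = [json.loads(raw) for raw in lake.find(source="web_shop", fmt="json")]
total_spent = sum(order["amount"] for order in web_orders)
print(total_spent)
```

Nothing was validated or reshaped when the data was stored; the JSON parsing happens only for the small subset selected by its tags.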

The term Data Lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization's data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it resides on Hadoop's cluster nodes of commodity computers.

Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried.

As a marketer, you may hear rumblings that your organization is setting up a data lake and/or your marketing data warehouse is a candidate to be migrated to this data lake. It’s important to recognize that while both the data warehouse and data lake are storage repositories, the data lake is not Data Warehouse 2.0 nor is it a replacement for the data warehouse.

So, to answer the question "isn’t a data lake just the data warehouse revisited?", my take is no. A data lake is not a data warehouse. Each is optimized for a different purpose, and the goal is to use each one for what it was designed to do. In other words, use the best tool for the job.

Data Science

Data Science is an interdisciplinary field about processes and systems that extract knowledge or insights from data in various forms, either structured or unstructured. It is a continuation of data analysis fields such as statistics, data mining, and predictive analytics, and is similar to Knowledge Discovery in Databases (KDD).
Data Science draws on disciplines including:
  • mathematics
  • statistics
  • chemometrics (the science of extracting information from chemical systems by data-driven means)
  • information science
  • computer science:
    • signal processing
    • probability models
    • machine learning
    • statistical learning
    • data mining
    • databases
    • data engineering
    • pattern recognition and learning
    • visualization
    • predictive analytics
    • uncertainty modeling
    • data warehousing
    • data compression
    • computer programming
    • artificial intelligence
    • high-performance computing.

Data Warehouse

A data warehouse is a federated repository for all the data that an enterprise's various business systems collect. The repository may be physical or logical. Data warehouses store current and historical data and are used for creating analytical reports for knowledge workers throughout the enterprise.

Examples of reports could range from annual and quarterly comparisons and trends to detailed daily sales analyses.
Data warehousing emphasizes the capture of data from diverse sources for useful analysis and access, but does not generally start from the point-of-view of the end user who may need access to specialized, sometimes local databases. The latter idea is known as the data mart.

The typical extract-transform-load (ETL)-based data warehouse uses staging, data integration, and access layers to house its key functions. The staging layer or staging database stores raw data extracted from each of the disparate source data systems. The integration layer integrates the disparate data sets by transforming the data from the staging layer, often storing this transformed data in an operational data store (ODS) database. The integrated data is then moved to yet another database, often called the data warehouse database, where it is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema. The access layer helps users retrieve data.
This definition of the data warehouse focuses on data storage. The main source of the data is cleaned, transformed, catalogued and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support. However, the means to retrieve and analyse data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata.
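
The arrangement of facts and dimensions into a star schema can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration only; the table names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes used to group and filter.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);

    -- Fact table: one row per sale, holding measures and keys into the dimensions.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );

    INSERT INTO dim_product VALUES (1, 'Laptop', 'Hardware'), (2, 'Editor', 'Software');
    INSERT INTO dim_date VALUES (10, 2023, 1), (11, 2023, 2);
    INSERT INTO fact_sales VALUES (1, 10, 900.0), (2, 10, 50.0), (1, 11, 1100.0);
""")

# A typical analytical report: sales per category and quarter.
report = conn.execute("""
    SELECT p.category, d.year, d.quarter, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year, d.quarter
    ORDER BY p.category, d.year, d.quarter
""").fetchall()

for row in report:
    print(row)
```

The fact table sits at the center and each dimension is one join away, which is what gives the schema its star shape.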


A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. Data marts are small slices of the data warehouse. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department. In some deployments, each department or business unit is considered the owner of its data mart including all the hardware, software and data.

This enables each department to use, manipulate and develop their data any way they see fit without altering information inside other data marts or the data warehouse. In other deployments where conformed dimensions are used, this business unit ownership will not hold true for shared dimensions like customer, product, etc.
So we can see a data mart as a store of bottled water: cleansed, packaged and structured for easy consumption.
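
As a rough sketch of the department-owned deployment described above, the mart can be pictured as a department-sized copy of one slice of the warehouse. The table names `warehouse_sales` and `mart_marketing` and the data are hypothetical, and real deployments usually populate marts through an ETL job rather than a single statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_sales (department TEXT, region TEXT, amount REAL);
    INSERT INTO warehouse_sales VALUES
        ('marketing', 'EU', 120.0),
        ('marketing', 'US', 80.0),
        ('finance', 'EU', 300.0);

    -- The mart holds only the slice relevant to one department.
    CREATE TABLE mart_marketing AS
        SELECT region, amount FROM warehouse_sales WHERE department = 'marketing';
""")

# The department reshapes its own mart; the warehouse stays untouched.
conn.execute("UPDATE mart_marketing SET amount = amount * 2 WHERE region = 'EU'")

mart_rows = conn.execute("SELECT COUNT(*) FROM mart_marketing").fetchone()[0]
mart_total = conn.execute("SELECT SUM(amount) FROM mart_marketing").fetchone()[0]
warehouse_total = conn.execute("SELECT SUM(amount) FROM warehouse_sales").fetchone()[0]
print(mart_rows, mart_total, warehouse_total)
```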

Business Intelligence

Business intelligence (BI), put simply, is any technology item that intelligently provides actionable information for business purposes. The phrase encompasses a range of software platforms and solutions that allow business stakeholders and decision-makers to quickly see up-to-date data relevant to their operational roles.

A very popular piece of many BI solutions is the dashboard: a configurable, highly personalized set of data tailored to the role of the user using it.
Other important parts of the BI solutions are:
  • Integration layer and data warehouse
  • Data cubes and analytics
  • Key performance indicators (KPI) and reports
Data integration is an important facet of many business intelligence solutions. These business intelligence solutions often depend on data drawn from multiple points across an organization.
BI brings the following benefits:
  • Automated report delivery, pre-calculation of important metrics and centralization of organizational data mean that information is available fast.
  • Using real-time data directly from source systems reduces risk and increases accuracy.
  • Centralization of information and role-driven dashboards make important trends visible to decision-makers.

Extract, transform and load process (ETL)

Different software systems across an organization store their information in a variety of different formats, numerical precisions and units, so this data must be transformed before being copied to the data warehouse. The process of reading the data from each system, transforming it and placing it into the warehouse is called an Extract, Transform and Load process, or ETL.
  1. Extract step: copy data out of the business system to get a snapshot at a given time and to minimize the impact of the ETL tool on the system's performance.
  2. Transform step: apply different techniques to prepare the data for the data warehouse.
  3. Load step: move the transformed data into the warehouse.
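
The three steps above can be sketched as three small Python functions. This is a minimal illustration under stated assumptions: the two source systems (`crm_rows`, `shop_rows`), their field names and the `revenue` table are all invented for this example, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source records: two business systems using different
# field names, number formats and units.
crm_rows = [{"customer": "Alice", "revenue_eur": "1200.50"}]
shop_rows = [{"client_name": "BOB", "revenue_cents": 99900}]


def extract():
    """Extract: take a snapshot copy so later steps never touch the live systems."""
    return [dict(row) for row in crm_rows], [dict(row) for row in shop_rows]


def transform(crm_snapshot, shop_snapshot):
    """Transform: unify field names, text casing, numeric type and unit (euros)."""
    unified = []
    for row in crm_snapshot:
        unified.append((row["customer"].title(), float(row["revenue_eur"])))
    for row in shop_snapshot:
        unified.append((row["client_name"].title(), row["revenue_cents"] / 100))
    return unified


def load(rows, conn):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS revenue (customer TEXT, revenue_eur REAL)")
    conn.executemany("INSERT INTO revenue VALUES (?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(*extract()), warehouse)

loaded = warehouse.execute(
    "SELECT customer, revenue_eur FROM revenue ORDER BY customer"
).fetchall()
print(loaded)
```

Keeping the steps as separate functions mirrors the separation in the text: the source systems are read once, and all cleaning happens on the snapshot.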

Online Analytical Processing (OLAP)

OLAP is built around the data cube: a structure that points to the information stored in the data warehouse and pre-computes sums, averages, counts and other aggregations on a regular basis.
A data cube also defines common calculations and filters, which speeds up the generation of reports and dashboards.
You can have more than one cube; cubes may overlap or contain one another, and each can have its own access level.
OLAP makes it possible to turn data into knowledge through data mining (finding trends and anomalies).
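
The pre-computation idea behind a data cube can be sketched as follows. This is a simplified illustration, not the design of any particular OLAP engine: the fact rows, the three dimensions and the helper names `build_cube` and `query` are assumptions made up for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (region, product, year, sales amount).
facts = [
    ("EU", "laptop", 2023, 900.0),
    ("EU", "phone", 2023, 400.0),
    ("US", "laptop", 2023, 1100.0),
    ("US", "laptop", 2024, 1300.0),
]
DIMENSIONS = ("region", "product", "year")


def build_cube(rows):
    """Pre-compute sum and count for every combination of dimension values."""
    cube = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for *dim_values, amount in rows:
        named = dict(zip(DIMENSIONS, dim_values))
        for size in range(len(DIMENSIONS) + 1):
            for dims in combinations(DIMENSIONS, size):
                key = tuple((dim, named[dim]) for dim in dims)
                cube[key]["sum"] += amount
                cube[key]["count"] += 1
    return cube


def query(cube, **filters):
    """Look up a pre-computed cell; dimensions left out are aggregated over."""
    key = tuple((dim, filters[dim]) for dim in DIMENSIONS if dim in filters)
    cell = cube.get(key, {"sum": 0.0, "count": 0})
    average = cell["sum"] / cell["count"] if cell["count"] else 0.0
    return {"sum": cell["sum"], "count": cell["count"], "avg": average}


cube = build_cube(facts)
print(query(cube))                               # grand total over all facts
print(query(cube, region="EU"))                  # one dimension fixed
print(query(cube, product="laptop", year=2023))  # two dimensions fixed
```

All the aggregation work happens once in `build_cube`; each report query is then a single dictionary lookup, which is where the speed boost comes from.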


Purpose of sharding

Database systems with large data sets and high throughput applications can challenge the capacity of a single server. High query rates can exhaust the CPU capacity of the server. Larger data sets exceed the storage capacity of a single machine. Finally, working set sizes larger than the system’s RAM stress the I/O capacity of disk drives.
To address these issues of scale, database systems have two basic approaches: vertical scaling and sharding.

Vertical scaling

Vertical scaling adds more CPU and storage resources to increase capacity. Scaling by adding capacity has limitations: high performance systems with large numbers of CPUs and large amounts of RAM are disproportionately more expensive than smaller systems. Additionally, cloud-based providers may only allow users to provision smaller instances. As a result there is a practical maximum capability for vertical scaling.


Sharding, or horizontal scaling, by contrast, divides the data set and distributes the data over multiple servers, or shards. Each shard is an independent database, and collectively, the shards make up a single logical database.
Sharding addresses the challenge of scaling to support high throughput and large data sets:
  • Sharding reduces the number of operations each shard handles. Each shard processes fewer operations as the cluster grows. As a result, a cluster can increase capacity and throughput horizontally. For example, to insert data, the application only needs to access the shard responsible for that record.
  • Sharding reduces the amount of data that each server needs to store. Each shard stores less data as the cluster grows. For example, if a database has a 1 terabyte data set, and there are 4 shards, then each shard might hold only 256 GB of data. If there are 40 shards, then each shard might hold only about 25 GB of data.
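
Both points can be sketched in Python. The text does not say how a record is assigned to a shard, so hashing the record key is an assumption made here for illustration (real systems may also use ranges or other strategies); `ShardedStore` and `per_shard_gb` are hypothetical names, and plain dictionaries stand in for the independent databases.

```python
import hashlib


class ShardedStore:
    """Toy logical database made of several independent shards."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key: str) -> int:
        """Pick a shard from a stable hash of the key (an assumed strategy)."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def insert(self, key, record):
        # Only the shard responsible for this record is accessed.
        self.shards[self._shard_for(key)][key] = record

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)


store = ShardedStore(shard_count=4)
for i in range(1000):
    store.insert(f"user-{i}", {"id": i})

# Each shard holds only part of the data set.
sizes = [len(shard) for shard in store.shards]
print(sizes, sum(sizes))


def per_shard_gb(total_gb, shard_count):
    """Idealized storage per shard, assuming a perfectly even distribution."""
    return total_gb / shard_count


print(per_shard_gb(1024, 4))   # 1 TB over 4 shards
print(per_shard_gb(1024, 40))  # 1 TB over 40 shards
```

With 1 TB taken as 1024 GB, the even split gives 256 GB per shard for 4 shards and 25.6 GB for 40, matching the figures in the text; in practice the distribution is rarely perfectly even.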