to a cluster) (Figure 1).
The big data architecture (Figure 2) contains a filesystem at the lowest level, which allows creation of files and directories (the Hadoop Distributed File System, or HDFS, or the Google File System). This layer is highly scalable and available because data is replicated across machines. A hypertable (Google's Bigtable) is the database layer that creates tables indexed by a primary key. Each row has cells with related information, and each cell is identified by a row key, a column name (or column family), a column qualifier (a column instance), and a timestamp.
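As a rough sketch (not any particular product's API), this cell addressing can be illustrated with a small in-memory Python stand-in; the class and field names here are hypothetical:

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for a hypertable/Bigtable-style table:
# each cell is addressed by (row key, column family, column qualifier, timestamp).
class ToyHypertable:
    def __init__(self):
        self._cells = defaultdict(dict)

    def put(self, row_key, family, qualifier, timestamp, value):
        # Store the value under the full cell address.
        self._cells[(row_key, f"{family}:{qualifier}")][timestamp] = value

    def get_latest(self, row_key, family, qualifier):
        # Return the most recent version of the cell, if any.
        versions = self._cells.get((row_key, f"{family}:{qualifier}"), {})
        return versions[max(versions)] if versions else None

table = ToyHypertable()
table.put("sample-001", "assay", "result", timestamp=1, value=42.0)
table.put("sample-001", "assay", "result", timestamp=2, value=43.5)
print(table.get_latest("sample-001", "assay", "result"))  # 43.5
```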
Within the hypertable system, a distributed file system (DFS) broker handles all filesystem requests. A range server handles reading and writing of data. A master creates and deletes tables and balances range server loads. Hyperspace provides a filesystem for metadata.
In conjunction with the hypertable, MapReduce is a parallel computation framework (the algorithm) that processes and aggregates the data. Hadoop contains its own implementation of this framework.
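As an illustration only, the map and reduce phases can be sketched in plain Python with a word count; this is not the Hadoop API itself:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(record):
    # Emit (key, value) pairs; here, one pair per word.
    for word in record.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # Aggregate all values that share a key.
    return (key, sum(values))

records = ["big data needs big tools", "data tools scale"]

# Shuffle/sort step: group intermediate pairs by key.
pairs = sorted(pair for rec in records for pair in map_phase(rec))
results = [reduce_phase(k, (v for _, v in grp))
           for k, grp in groupby(pairs, key=itemgetter(0))]
print(results)  # e.g. [('big', 2), ('data', 2), ...]
```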
At the top of this architecture is a runtime scripting language (Sawzall, Pig or Hive)
that performs statistical analysis. Pig is a
procedural language that allows querying
of semi-structured data sets using Hadoop.
Hive has a simple query language based on
SQL that allows summarization, querying
and analysis. It is not designed for online
transaction processing, but is best used for
batch jobs. Complex extract, transform, load (ETL) work can be done either by chaining scripts together so that the output of one is the input to another, or by using a workflow engine such as Apache Oozie, with actions arranged in a directed acyclic graph (DAG) to gate execution. Oozie workflow definitions are written in hPDL, an XML process definition language; Oozie starts jobs in the Hadoop cluster and controls them through workflows that contain flow and action nodes. Apache Sqoop can be used to transfer data between Hadoop and external datastores; it can populate tables in Hive and integrates with Oozie. Apache Flume supports multi-hop, fan-in and fan-out flows, with contextual and backup routes, to provide reliable delivery and manage failures.
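The script-chaining style of ETL mentioned above can be sketched in Python as a pipeline in which each step's output is the next step's input; the extract, transform and load functions here are hypothetical placeholders:

```python
import csv
import io

def extract(raw_text):
    # Parse raw CSV text into dictionaries (the "extract" step).
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    # Clean and convert fields (the "transform" step).
    return [{"id": r["id"], "value": float(r["value"])} for r in rows]

def load(rows, sink):
    # Write the cleaned rows to their destination (the "load" step).
    sink.extend(rows)
    return sink

raw = "id,value\nA,1.5\nB,2.0\n"
warehouse = []
# Chain the steps: the output of each one is the input to the next.
load(transform(extract(raw)), warehouse)
print(warehouse)
```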
Analysis of big data presents its own challenge. Adopt a strategy of breaking the data into a relevant segment focused on answering a simple question, then add data sets where needed, perhaps splitting the analysis across different teams with complementary analytical skills. Specific analytical tools that can be applied include agent-based modeling, neural networks, factor analysis, cluster analysis and time series analysis.
Agent-based models consist of a system
of agents and their relationships. An agent
is an autonomous entity that can make its
own decisions according to a set of rules.
This analysis is applied in complex human
systems, such as business and marketing.
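As a toy illustration, an agent-based model might be sketched as follows; the adoption rule and parameters are assumptions chosen only to show the structure of agents and rules:

```python
import random

class Agent:
    """An autonomous entity that decides according to a simple rule."""
    def __init__(self, adopted=False):
        self.adopted = adopted

    def step(self, neighbors, threshold=0.3):
        # Rule (hypothetical): adopt if enough neighbors already have.
        if not self.adopted and neighbors:
            share = sum(n.adopted for n in neighbors) / len(neighbors)
            self.adopted = share >= threshold

random.seed(1)
agents = [Agent(adopted=(random.random() < 0.1)) for _ in range(100)]
for _ in range(10):  # simulate 10 time steps
    for a in agents:
        a.step(random.sample(agents, 5))
print(sum(a.adopted for a in agents), "of", len(agents), "agents adopted")
```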
Neural networks predict responses
from a flexible network of input variables,
whereas factor analysis is used to reduce
the number of dimensions of the data.
Factor analysis is related to principal component analysis, in which linear combinations of the original variables are created such that the first component captures the most variation, the second component the next most, and so on.
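A minimal sketch of the principal-component idea, assuming NumPy is available and using synthetic data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 observations, 5 variables
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]  # make two variables correlated

Xc = X - X.mean(axis=0)                  # center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = s**2 / np.sum(s**2)          # variance share of each component
print("variance explained:", np.round(explained, 3))

# Project onto the first two components (linear combinations of the originals).
scores = Xc @ Vt[:2].T
print(scores.shape)                      # (200, 2)
```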
K-means clustering can be used on large data sets; it functions by assigning points to clusters and recalculating cluster centers in order to divide the data into sets that can be analyzed more thoroughly.
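A compact k-means sketch, assuming NumPy and synthetic two-cluster data (in practice, a library implementation would typically be used):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two loose groups of points.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

k = 2
centers = X[rng.choice(len(X), k, replace=False)]
for _ in range(20):
    # Assign each point to its nearest cluster center.
    labels = np.argmin(np.linalg.norm(X[:, None] - centers, axis=2), axis=1)
    # Recalculate each center as the mean of its assigned points.
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(np.round(centers, 2))
```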
Time series analysis may also prove beneficial, using autoregressive integrated moving average (ARIMA) or smoothing models, with characterization of process disturbances and autocorrelation.
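One possible sketch, assuming the statsmodels library is available, fits an ARIMA model to a synthetic series; the order (1, 1, 1) is an arbitrary illustrative choice:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# Synthetic trending series with autocorrelated noise.
noise = np.cumsum(rng.normal(0, 0.5, 120))
y = np.linspace(0, 10, 120) + noise

model = ARIMA(y, order=(1, 1, 1))  # AR, differencing and MA orders
fitted = model.fit()
print(fitted.forecast(steps=5))    # forecast the next five points
```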
The iterative nature of refining the question, query and analysis is represented in Figure 3.
Big data evolved from the need to accommodate large data sets of varying data types that update at increasing
speed. The key is the development of a suitable architecture and selection of appropriate tools, often drawn from data mining and statistical analysis.
Figure 2: Big data architecture
Figure 3: Big data analytics
Mark Anawis is a Principal Scientist
and ASQ Six Sigma Black Belt at Abbott.
He may be reached at