can be compared using criteria such as
Akaike’s Information Criterion (AIC),
Schwarz’s Bayesian Criterion (SBC), Mean
Absolute Error (MAE) and Mean Absolute
Percentage Error (MAPE), where smaller
values indicate a better fit, and RSquare,
where larger values indicate a better fit
(Figure 3).
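As a rough illustration of how such criteria can be computed, the sketch below scores two sets of fitted values against the same observations. The data, the function name `fit_criteria` and the parameter count are hypothetical. The AIC and SBC use the common least-squares forms (n times the log of the mean squared error, plus a penalty on the number of parameters), which is an assumption; JMP and other packages may report these criteria on a different scale.

```python
import math

def fit_criteria(actual, predicted, n_params):
    """Return common fit criteria for one model's fitted values.

    Assumes no actual value is zero (needed for MAPE) and that the
    fit is not perfect (the log of a zero error is undefined).
    """
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    sse = sum(r * r for r in residuals)            # sum of squared errors
    mean_actual = sum(actual) / n
    sst = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {
        # Least-squares forms, up to an additive constant (assumption).
        "AIC": n * math.log(sse / n) + 2 * n_params,
        "SBC": n * math.log(sse / n) + n_params * math.log(n),
        "RSquare": 1 - sse / sst,
        "MAE": sum(abs(r) for r in residuals) / n,
        "MAPE": 100 * sum(abs(r / a) for r, a in zip(residuals, actual)) / n,
    }

# Hypothetical data: six observations and two competing sets of fitted values.
actual = [10.0, 12.0, 14.0, 13.0, 15.0, 17.0]
model_a = [10.5, 11.5, 13.5, 13.5, 15.5, 16.5]
model_b = [11.0, 11.0, 15.0, 12.0, 16.0, 16.0]

crit_a = fit_criteria(actual, model_a, n_params=2)
crit_b = fit_criteria(actual, model_b, n_params=2)

for name in ("AIC", "SBC", "RSquare", "MAE", "MAPE"):
    print(name, round(crit_a[name], 3), round(crit_b[name], 3))
```

In this made-up example, model A has the lower AIC, SBC, MAE and MAPE and the higher RSquare, so every criterion points the same way. With real models the criteria can disagree, since AIC and SBC penalize extra parameters while the others do not.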
Spectral analysis is used to decompose a
time series into sinusoidal functions of
different wavelengths in order to identify
seasonal variations of different lengths.
This can be extended in cross-spectrum
analysis to the simultaneous analysis of two
series to uncover their correlations at
different frequencies. The cross-spectrum consists
of complex numbers that can be smoothed
to calculate cross-density and quadrature
density values, which are combined into a
cross-amplitude, a measure of covariance
between the frequency components in the
two series. Since the sine and cosine functions
are orthogonal, their squared coefficients
can be added for each frequency to produce
a periodogram, whose value at a given
frequency can be interpreted in terms of the
variance of the data at that frequency.
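The periodogram calculation can be sketched directly from that description: estimate a cosine and a sine coefficient at each frequency, then add their squares. This is a minimal illustration rather than a production routine. The function name `periodogram`, the N/2 scaling of the summed squares and the synthetic series are assumptions made for the example.

```python
import math

def periodogram(series):
    """Return (frequency, periodogram value) pairs for a single series.

    Frequencies are k/N cycles per observation for k = 1 .. (N-1)//2.
    The Nyquist frequency is skipped because its scaling differs.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the mean (frequency 0)
    result = []
    for k in range(1, (n - 1) // 2 + 1):
        angle = 2 * math.pi * k / n
        # Coefficients on the cosine and sine terms at this frequency.
        a = (2 / n) * sum(x * math.cos(angle * t) for t, x in enumerate(centered))
        b = (2 / n) * sum(x * math.sin(angle * t) for t, x in enumerate(centered))
        # Orthogonality lets the squared coefficients simply be added.
        result.append((k / n, (n / 2) * (a * a + b * b)))
    return result

# Hypothetical series: a strong 12-observation cycle plus a weaker
# 4-observation cycle, sampled over 24 observations.
series = [
    3 * math.sin(2 * math.pi * t / 12) + math.cos(2 * math.pi * t / 4)
    for t in range(24)
]

pgram = periodogram(series)
peak_freq, peak_value = max(pgram, key=lambda pair: pair[1])
print("dominant period:", round(1 / peak_freq, 2), "observations")
```

The largest periodogram value falls at a frequency of 1/12, which corresponds to the 12-observation cycle built into the synthetic series.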
The time requirements associated with
performing spectral analysis led to the
development of the fast Fourier transform
(FFT) algorithm, where the time required is
proportional to N*log2(N), although the classic
algorithm requires the number of observations
in the series to be padded to a power of 2.
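To show why the power-of-2 length matters, here is a minimal sketch of a recursive radix-2 FFT with zero padding. It is an illustration only: the function names are hypothetical, zero padding is just one way to reach a power of 2, and many modern FFT libraries also handle other lengths efficiently.

```python
import cmath

def next_power_of_two(n):
    """Smallest power of 2 that is >= n (for n >= 1)."""
    p = 1
    while p < n:
        p *= 2
    return p

def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of 2."""
    n = len(values)
    if n == 1:
        return list(values)
    # Split into even- and odd-indexed halves. Halving at every level is
    # what gives the N*log2(N) cost and why N must be a power of 2.
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def padded_fft(series):
    """Zero-pad a series to the next power of 2, then apply the FFT."""
    target = next_power_of_two(len(series))
    padded = list(series) + [0.0] * (target - len(series))
    return fft(padded)

# Hypothetical series of 6 observations, padded to 8 before transforming.
spectrum = padded_fft([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(len(spectrum), "frequency components")
```

Note that padding changes the set of frequencies at which the transform is evaluated (here 8 instead of 6), so results should be read against the padded length.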
Time series analysis is used in weather
forecasting, economics, pattern recognition
and statistics. It involves an exploratory
phase to characterize the data and separate
the underlying trend from the random error.
External factors that may interrupt the
trend need to be taken into account.
Analyzing serial dependence, as well as
seasonal and cyclic components, leads to
models that can be compared to identify a
best-fit model for forecasting.
*Note: Figures 1 and 3 were generated using
JMP v. 10 software.
Mark Anawis is a Principal Scientist and ASQ
Six Sigma Black Belt at Abbott. He may be
reached at editor@ScientificComputing.com.
Figure 3: Different models can be compared using criteria such as Akaike’s Information Criterion (AIC), Schwarz’s Bayesian Criterion (SBC),
Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE), where smaller values indicate a better fit, and RSquare, where larger values indicate a better fit.
30 ScientificComputing.com June 2014
Instead, expect full immersion into the
multifaceted aspects of data science from
multiple points of view. This is a survey of
the existing landscape…” but it is not an
extensive how-to manual.
For those new to R, a lot more introduction
is needed than mere snippets of code. I
find that many authors, and even commercial
vendors touting the marvels of their software,
leave out the very first step of actually
getting the data into the program. Usually,
they have the data set cleaned, prepped
and pre-loaded into the program. In practice,
getting the data in can require extra steps
in areas where databases need to be matched
as to data and labels.
There are other small technical glitches,
such as Figure 4.1, where the text is far
too small and light to read. Also, much in
that chapter sounds as if it was addressed
primarily to the IT department, so their
comments above ring true: you already
need (what I consider to be advanced)
knowledge of statistics and computer
programming, as well as some domain
knowledge in the area of work.
In summary, this interesting book
does have many useful hints, tips and tricks
for addressing specific types of problems, as
well as pitfalls. The hammer and nail story
with linear regression is classic! Explanations
of algorithms are excellent, and there are also
interesting asides on people and the history
of algorithms, statistics, etc. It
was also very nice to see all known versions of
key words describing variables and analytic
features, which are often quite confusing to the
novice. I would have appreciated far more scientific
examples than the business ones that were
in abundance. However, author/contributor
backgrounds must be considered.
Interested readers are strongly urged
to go to the book’s site at Amazon.com
and read sections of the scanned-in pages.
While having many pluses, this book is not
for every budding data analyst.
Doing Data Science: Straight Talk from the
Front Line, by Rachel Schutt and Cathy
O’Neil. O’Reilly Media, Inc.
Sebastopol, CA. 406 pp. (2014). $39.99.
John Wass is a statistician based in
Chicago, IL. He may be reached at