(e.g. “For information only”).
e. Normalize spelling: Use or create a dictionary to correct misspellings using fuzzy matching.
f. Case normalization: Convert all text to lower case so that words differing only in case are not counted separately.
g. Eliminate punctuation: Remove punctuation so that words with and without punctuation are not counted separately.
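The cleanup steps above (spelling normalization, case normalization and punctuation removal) can be sketched in Python. The mini-dictionary and sample text are invented for illustration, and the standard library's difflib stands in for whatever fuzzy matcher a real system would use:

```python
import difflib
import string

# Hypothetical mini-dictionary of known terms; a real system
# would use a full lexicon for the domain.
DICTIONARY = ["mining", "cluster", "document"]

def clean(text):
    # Case normalization: lower-case everything.
    text = text.lower()
    # Eliminate punctuation so "mining," and "mining" count as one term.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Normalize spelling by fuzzy matching each word against the dictionary.
    words = []
    for w in text.split():
        match = difflib.get_close_matches(w, DICTIONARY, n=1, cutoff=0.8)
        words.append(match[0] if match else w)
    return words

print(clean("Documnet minning, Cluster!"))  # ['document', 'mining', 'cluster']
```

Words with no close dictionary match pass through unchanged, which avoids "correcting" legitimate out-of-vocabulary terms.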
3. Reduce dimensionality and select terms:
a. Term Document Matrix (TDM): Create a two-dimensional matrix where each document is one row and each column is a term from the abbreviated list generated by cleaning up the text. The relationship between a row and a column is represented by indices. Singular Value Decomposition (SVD), which is related to principal components analysis, is used to expose the underlying meaning and structure by reducing the dimensionality.
b. Indices: At the simplest level, this can be the count, or number of times a term appears in a document. Log or binary frequencies can be used to dampen large numbers of occurrences. The most commonly used index is term frequency-inverse document frequency (TF-IDF). It represents the relative importance of a term, reflecting both the relative frequency of occurrence of terms and their document frequency.
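A minimal sketch of this step using NumPy, with a toy corpus invented for illustration: raw counts form the TDM, TF-IDF reweights them by document frequency, and SVD projects each document into a lower-dimensional space:

```python
import numpy as np

# Toy corpus of already-cleaned terms (invented for illustration).
docs = [["data", "mining", "data"],
        ["text", "mining"],
        ["text", "data"]]
terms = sorted({t for d in docs for t in d})

# Term Document Matrix: one row per document, one column per term,
# with the raw count as the index relating row to column.
tdm = np.array([[d.count(t) for t in terms] for d in docs], dtype=float)

# TF-IDF: dampen counts by how widespread a term is across documents.
df = (tdm > 0).sum(axis=0)            # document frequency of each term
tfidf = tdm * np.log(len(docs) / df)

# SVD reduces dimensionality to expose latent structure.
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
reduced = U[:, :2] * s[:2]            # each document as a 2-D vector
```

Truncating to the top two singular values here mirrors how SVD-based methods keep only the strongest dimensions of meaning.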
4. Extract knowledge (examples of methods):
a. Classification: Assign terms to a predetermined set of categories. A training data set of documents with known categories is used. This is applied in genre detection, spam filtering and Web page categorization.
b. Clustering: Terms are placed into naturally occurring but meaningful groups. This is used in document retrieval to enable improved Web searches and analysis of large document collections.
c. Association: Identify terms that are frequently found together. In the retail industry this is known as market basket analysis, which identifies items that are frequently bought together.
d. Trend analysis: Identify time-dependent changes in a term. This is used, for example, to detect the rising popularity of a technology.
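As a small illustration of the association step above, co-occurring term pairs can be counted directly; the documents here are invented for the example:

```python
from itertools import combinations
from collections import Counter

# Toy "baskets": the set of terms appearing together in each document.
docs = [{"data", "mining", "text"},
        {"data", "mining"},
        {"text", "web"}]

# Count how often each pair of terms co-occurs across documents.
pairs = Counter()
for d in docs:
    for a, b in combinations(sorted(d), 2):
        pairs[(a, b)] += 1

print(pairs.most_common(1))  # [(('data', 'mining'), 2)]
```

Real association mining adds support and confidence thresholds on top of these raw co-occurrence counts, but the counting itself is the core of the idea.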
Figure 1: Text Mining and Related Fields
Figure 2: Text Mining Decision Tree