311x Filetype PDF File size 3.20 MB Source: itlab.uta.edu
Data Mining What is clustering
Cluster Analysis: Basic Concepts
and Algorithms Clustering: the process of grouping a set of objects into
classes of similar objects
Lecture Notes for Chapter 7 Documents within a cluster should be similar.
Documents from different clusters should be dissimilar.
The commonest form of unsupervised learning
Introduction to Data Mining Unsupervised learning = learning from raw data, as opposed to
by supervised data where a classification of examples/samples is
given/known
Tan, Steinbach, Kumar A common and important task that finds many applications in IR
+ and other places
Other sources
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 2
Intra- and inter-cluster distance
A data set with clear cluster structure Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
How would you from (or unrelated to) the objects in other groups
design an algorithm Inter-cluster
for finding the three Intra-cluster distances are
clusters in this case? distances are maximized
minimized
Humans do it effortlessly
3 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#›
Historic application of clustering Applications
Location of new stores
Pizza delivery locations
Distribution centers (e.g., Amazon, …)
ATM machines
Location of artilleries in combat
Need to be careful about distance metric used
If you end up picking a place on the other side of the river with
only one bridge, it may not be a wise decision
Placement of artillery. Hills and other obstacles are not taken care
of in Euclidian distance!
6
Applications of Cluster Analysis What is not Cluster Analysis?
Understanding Discovered Clusters Industry Group Supervised classification
Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN,
1 Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,
– Group related documents DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN, Technology1-DOWN – Have class label information
Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down,
Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN,
for browsing, Sun-DOWN
Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN,
2 ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,
– group genes and proteins Computer-Assoc-DOWN,Circuit-City-DOWN, Technology2-DOWN
Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Simple segmentation
that have similar Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN
Fannie-Mae-DOWN,Fed-Home-Loan-DOWN, – Dividing students into different registration groups
functionality, or 3 MBNA-Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN
Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP, alphabetically, by last name, zip code, age, …
4 Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP, Oil-UP
– group stocks with similar Schlumberger-UP
price fluctuations Results of a query
– Groupings are a result of an external specification
Summarization
– Reduce the size of large Graph partitioning
data sets – Some mutual relevance and synergy, but areas are not
Clustering precipitation identical
in Australia
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#›
Notion of a Cluster can be Ambiguous Types of Clusterings
A clustering is a set of clusters
Important distinction between hierarchical and
partitional sets of clusters
How many clusters? Six Clusters Partitional Clustering
– A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one subset
Hierarchical clustering
Two Clusters Four Clusters – A set of nested clusters organized as a hierarchical tree
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#›
Partitional Clustering Hierarchical Clustering
p1
p3 p4
p2
p1 p2 p3 p4
Traditional Hierarchical Clustering Traditional Dendrogram
p1
p3 p4
p2
p1 p2 p3 p4
Original Points A Partitional Clustering Non-traditional Hierarchical Clustering Non-traditional Dendrogram
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#›
Other Distinctions Between Sets of Clusters Types of Clusters
Exclusive versus non-exclusive Well-separated clusters
– In non-exclusive clusterings, points may belong to multiple
clusters.
– Can represent multiple classes or ‘border’ points Center-based clusters
Fuzzy versus non-fuzzy
– In fuzzy clustering, a point belongs to every cluster with some Contiguous clusters
weight between 0 and 1
– Weights must sum to 1
– Probabilistic clustering has similar characteristics Density-based clusters
Partial versus complete
– In some cases, we only want to cluster some of the data Property or Conceptual
Heterogeneous versus homogeneous
– Cluster of widely different sizes, shapes, and densities Described by an Objective Function
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#›
Types of Clusters: Well-Separated Types of Clusters: Center-Based
Well-Separated Clusters: Center-based
– A cluster is a set of points such that any point in a cluster is – A cluster is a set of objects such that an object in a cluster is
closer (or more similar) to every other point in the cluster than closer (more similar) to the “center” of a cluster, than to the
to any point not in the cluster. center of any other cluster
– The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most “representative”
point of a cluster
– Medoid is an actual point where as centroid need not be.
3 well-separated clusters 4 center-based clusters
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#›
no reviews yet
Please Login to review.