For each of the following types of data or clusters, discuss briefly if (1) sam- pling will cause problems for this approach and (2) what those problems are. Assume that the sampling technique randomly chooses points from the to- tal set of m points and that any unmentioned characteristics of the data or clusters are as optimal as possible. In other words, focus only on problems caused by the particular characteristic mentioned. Finally, assume that K is very much less than m.

Hierarchical clustering algorithms require O(m2 log(m)) time, and conse-
quently, are impractical to use directly on larger data sets. One possible

technique for reducing the time required is to sample the data set. For ex-
ample, if K clusters are desired and ?m points are sampled from the m

points, then a hierarchical clustering algorithm will produce a hierarchical

clustering in roughly O(m) time. K clusters can be extracted from this hier-
archical clustering by taking the clusters on the Kth level of the dendrogram.

The remaining points can then be assigned to a cluster in linear time, by
using various strategies. To give a specific example, the centroids of the K
clusters can be computed, and then each of the m ? ?m remaining points
can be assigned to the cluster associated with the closest centroid.
(a) Data with very different sized clusters.
(b) High-dimensional data.
(c) Data with outliers, i.e., atypical points.
(d) Data with highly irregular regions.
(e) Data with globular clusters.
(f) Data with widely different densities.
(g) Data with a small percentage of noise points.
(h) Non-Euclidean data.
(i) Euclidean data.
(j) Data with many and mixed attribute types.

(a) This can be a problem, particularly if the number of points in a cluster
is small. For example, if we have a thousand points, with two clusters,
one of size 900 and one of size 100, and take a 5% sample, then we
will, on average, end up with 45 points from the first cluster and 5
points from the second cluster. Five points is much easier to miss or
cluster improperly than 50. Also, the second cluster will sometimes
be represented by fewer than 5 points, just by the nature of random
samples.
(b) This can be a problem because data in high-dimensional space is typi-
cally sparse and more points may be needed to define the structure of
a cluster in high-dimensional space.
(c) By definition, outliers are not very frequent and most of them will be
omitted when sampling. Thus, if finding the correct clustering depends
on having the outliers present, the clustering produced by sampling will
likely be misleading. Otherwise, it is beneficial.
(d) This can be a problem because the structure of the border can be lost
when sampling unless a large number of points are sampled.
(e) This is typically not a problem since not as many points need to be
sampled to retain the structure of a globular cluster as an irregular
one.
(f) In this case the data will tend to come from the denser region. Note
that the effect of sampling is to reduce the density of all clusters by
the sampling factor, e.g., if we take a 10% sample, then the density of
the clusters is decreased by a factor of 10. For clusters that aren’t very
dense to begin with, this may means that they are now treated as noise
or outliers.
(g) Sampling will not cause a problem. Actually, since we would like to
exclude noise, and since the amount of noise is small, this may be
beneficial.
(h) This has no particular impact.
(i) This has no particular impact.
(j) Many attributes was discussed under high-dimensionality. Mixed at-
tributes have no particular impact.

Computer Science & Information Technology

You might also like to view...

When you create a link to another file, the linked file is not included within the PowerPoint file.

Answer the following statement true (T) or false (F)

Computer Science & Information Technology

COGNITIVE ASSESSMENT When a document contains text displayed in 10-point Cambria, to what does the 10-point refer?

A. the font style B. the number of characters that can fit in one linear inch of text C. the number of colors in which a character can be displayed D. the font size

Computer Science & Information Technology