Given K equally sized clusters, the probability that a randomly chosen initial centroid will come from any given cluster is 1/K, but the probability that each cluster will have exactly one initial centroid is much lower. (It should be clear that having one initial centroid in each cluster is a good starting situation for K-means.) In general, if there are K clusters and each cluster has n points, then the probability, p, of selecting in a sample of size K one initial centroid from each cluster is given by Equation 8.1. (This assumes sampling with replacement.) From this formula we can calculate, for example, that the chance of having one initial centroid from each of four clusters is 4!/44 = 0.0938.
(a) Plot the probability of obtaining one point from each cluster in a sample
of size K for values of K between 2 and 100.
For K clusters, K = 10, 100, and 1000, find the probability that a
sample of size 2K contains at least one point from each cluster. You can
use either mathematical methods or statistical simulation to determine
the answer.
(a) The solution is shown in Figure 4. Note that the probability is essen-
tially 0 by the time K = 10.
(b) We used simulation to compute the answer. Respectively, the proba-
bilities are 0.21, < 10?6, and < 10?6.
Proceeding analytically, the probability that a point doesn’t come from
a particular cluster is, 1 ? 1
K , and thus, the probability that all 2K
points don’t come from a particular cluster is (1 ? 1
K )2K. Hence, the
probability that at least one of the 200 points comes from a particular
cluster is 1 ? (1 ? 1
K )2K. If we assume independence (which is too
optimistic, but becomes approximately true for larger values of K), then
an upper bound for the probability that all clusters are represented in
the final sample is given by (1?(1? 1
K )2K)K. The values given by this
bound are 0.27, 5.7e-07, and 8.2e-64, respectively.
You might also like to view...
Open-source software does not limit users' rights to either source or object code
Indicate whether the statement is true or false
Software publishers often distribute beta, or test, versions of their software to select users so they can help identify problems before final software is released.
Answer the following statement true (T) or false (F)