The k-Means algorithm uses a similarity metric based on the distance between a record and a cluster centroid. If the attributes of the records are not quantitative but categorical in nature, such as Income Level with values {low, medium, high}, Married with values {yes, no}, or State of Residence with values {Alabama, Alaska, …, Wyoming}, then the distance metric is not meaningful. Define a more suitable similarity metric that can be used for clustering data records that contain categorical data.
What will be an ideal response?
We can define a distance metric or rather a similarity metric between
two records based on the number of common values the two records have
across all dimensions.
For n-dimensional data records, we define the similarity of two records
rj and rk as
similarity(rj, rk) = sim(rj1, rk1) + sim(rj2, rk2) + ... + sim(rjn, rkn)
where rji is the value of record rj in dimension i, and sim(rji, rki) is
1 if rji = rki and 0 otherwise.
In this case, higher similarity means the records are closer together:
a high similarity plays the role that a small distance plays in the
usual distance metric.
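The definition above can be written compactly, and one possible way to turn it into a distance is to count mismatching attributes instead of matching ones. The symbol d below is an assumption introduced here for illustration; it does not appear in the original answer.

```latex
\[
\mathrm{similarity}(r_j, r_k) = \sum_{i=1}^{n} \mathrm{sim}(r_{ji}, r_{ki}),
\qquad
d(r_j, r_k) = n - \mathrm{similarity}(r_j, r_k)
\]
```

Under this sketch, d is 0 when two records agree on every attribute and n when they agree on none, so a larger similarity always corresponds to a smaller d.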
For example, consider the following 3 records:
RID  INCOME LEVEL  MARRIED  STATE
1    high          yes      ny
2    low           no       ny
3    high          yes      ca
We have the following similarity values:
similarity(1,2) = 1
similarity(1,3) = 2
similarity(2,3) = 0
Records 1 and 3 are the most similar (or closest).
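The worked example can be checked with a short script. This is a minimal sketch, not a definitive implementation: the function name similarity follows the answer's notation, while the names records, pairwise, and most_similar are assumptions chosen here for illustration.

```python
from itertools import combinations


def similarity(a, b):
    """Count the attributes on which two records hold the same value."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(1 for x, y in zip(a, b) if x == y)


# Records from the example, keyed by RID: (income level, married, state).
records = {
    1: ("high", "yes", "ny"),
    2: ("low", "no", "ny"),
    3: ("high", "yes", "ca"),
}

# Similarity for every unordered pair of records.
pairwise = {
    (j, k): similarity(records[j], records[k])
    for j, k in combinations(sorted(records), 2)
}

# The pair with the highest similarity is the closest pair.
most_similar = max(pairwise, key=pairwise.get)

print(pairwise)      # {(1, 2): 1, (1, 3): 2, (2, 3): 0}
print(most_similar)  # (1, 3)
```

The comparison uses plain equality only, so it applies to any categorical attribute without requiring an ordering or numeric encoding of the values.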