The k-Means algorithm uses a similarly metric of distance between a record and a cluster centroid. If the attributes of the records are not quantitative but categorical in nature, such as Income Level with values {low, medium, high} or Married with values {yes, no} or State of Residence with values {Alabama, Alaska,…, Wyoming} then the distance metric is not meaningful. Define a more suitable similarity metric that can be used for clustering data records that contain categorical data.

What will be an ideal response?

We can define a distance metric or rather a similarity metric between
two records based on the number of common values the two records have
across all dimensions.

For n dimensional data records, we define the similarity of two records
rj and rk as

similarity(rj, rk) = sim(rj1, rk1) + sim(rj2, rk2) + ... + sim(rjn, rkn)

where sim(rji, rki) is 1 if rji = rki else 0.

In this case, higher similarity means the records are closer together
with respect to the usual distance metric.

For example, consider the following 3 records:

RID INCOME LEVEL MARRIED STATE
1 high yes ny
2 low no ny
3 high yes ca

We have the following similarity values:
similarity(1,2) = 1
similarity(1,3) = 2
similarity(2,3) = 0

Records 1 and 3 are the most similar (or closest).

Computer Science & Information Technology

You might also like to view...

A SELECT clause lists which tables hold the fields used in the FROM clause

Indicate whether the statement is true or false

View

Computer Science & Information Technology

PDF is an acronym for ____.

A. Printable Document Format B. Portable Document Format C. Publisher Document Format D. Portable Descriptive Format

View

Computer Science & Information Technology