Clusters of documents can be summarized by finding the top terms (words) for the documents in the cluster, e.g., by taking the most frequent k terms, where k is a constant, say 10, or by taking all terms that occur more frequently than a specified threshold. Suppose that K-means is used to find clusters of both documents and words for a document data set.

(a) How might a set of term clusters defined by the top terms in a document
cluster differ from the word clusters found by clustering the terms with
K-means?
(b) How could term clustering be used to define clusters of documents?

(a) First, the term clusters defined by top terms could, and likely would,
overlap somewhat, since the same term can be a top term in more than one
document cluster. Second, it is likely that many terms would not appear
in any of the clusters formed by the top terms. In contrast, a K-means
clustering of the terms would cover all the terms and would not be
overlapping.
(b) An obvious approach would be to take the top documents for a term
cluster; i.e., those documents that most frequently contain the terms in
the cluster.
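
Both answers can be illustrated with a small sketch. The toy documents, the document cluster labels, and the term clusters below are made up for illustration and simply stand in for K-means output; the helper names `top_terms` and `top_docs` are hypothetical, and the scoring in `top_docs` (count of tokens that belong to the term cluster) is one possible reading of "most frequently contain".

```python
from collections import Counter

def top_terms(docs, doc_labels, k):
    """Map each document cluster to the set of its k most frequent terms."""
    counts = {}
    for tokens, label in zip(docs, doc_labels):
        counts.setdefault(label, Counter()).update(tokens)
    return {c: {t for t, _ in cnt.most_common(k)} for c, cnt in counts.items()}

def top_docs(docs, term_cluster, n):
    """Indices of the n documents that contain the cluster's terms most often."""
    scores = [sum(1 for t in tokens if t in term_cluster) for tokens in docs]
    # sorted() is stable, so ties keep the original document order.
    return sorted(range(len(docs)), key=lambda i: -scores[i])[:n]

# Toy corpus (assumed): each document is a list of tokens.
docs = [
    ["data", "mining", "cluster", "data", "mining"],
    ["cluster", "data", "kmeans"],
    ["stock", "market", "data", "stock", "data"],
    ["market", "price", "stock"],
]
doc_labels = [0, 0, 1, 1]  # assumed result of clustering the documents
vocab = {t for tokens in docs for t in tokens}

# (a) Term clusters defined by top terms can overlap and can miss terms.
tops = top_terms(docs, doc_labels, k=3)
overlap = tops[0] & tops[1]
uncovered = vocab - (tops[0] | tops[1])

# (b) Term clusters (assumed result of clustering the terms) form a
# partition of the vocabulary; each one selects its top documents.
term_clusters = {
    0: {"data", "mining", "cluster", "kmeans"},
    1: {"stock", "market", "price"},
}
doc_clusters = {c: top_docs(docs, terms, n=2) for c, terms in term_clusters.items()}
```

In this toy data, "data" is a top term of both document clusters, while "kmeans" and "price" appear in neither top-term set; the assumed term clusters, by contrast, cover every term exactly once.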

Computer Science & Information Technology
