Many statistical tests for outliers were developed in an environment in which a few hundred observations was a large data set. We explore the limitations of such approaches.
(a) For a set of 1,000,000 values, how likely are we to have outliers according
to the test that says a value is an outlier if it is more than three standard
deviations from the average? (Assume a normal distribution.)
(b) Does the approach that states an outlier is an object of unusually low
probability need to be adjusted when dealing with large data sets? If
so, how?
(a) This question should have asked how many outliers we would have, since
its purpose is to show that even a small probability of an outlier yields
a large number of outliers in a large data set. The probability that any
single value is an outlier is unaffected by the number of objects.
That probability is 0.00135 for a one-sided deviation of 3 standard
deviations, or 0.0027 for a two-sided deviation. Thus, the expected number
of anomalous objects is about 1,350 or 2,700, respectively.
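The tail probabilities and expected counts quoted in this answer can be checked with a short calculation. The following is a minimal sketch, assuming a standard normal distribution as the question states; the helper name `upper_tail` is illustrative and not part of the original solution.

```python
import math

def upper_tail(k):
    """P(Z > k) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(k / math.sqrt(2))

n = 1_000_000                 # size of the data set in the question
one_sided = upper_tail(3.0)   # value more than 3 std devs above the mean
two_sided = 2 * one_sided     # value more than 3 std devs from the mean, either side

print(f"one-sided: p = {one_sided:.5f}, expected count = {n * one_sided:.0f}")
print(f"two-sided: p = {two_sided:.5f}, expected count = {n * two_sided:.0f}")
```

The per-value probability does not depend on `n`; only the expected count scales with the size of the data set.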
(b) Under the specified definition, a million objects contain thousands of
outliers. We can either accept all of these objects as outliers or
raise the threshold so that fewer objects qualify.
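One way to raise the threshold is to choose it so that the expected number of flagged values stays fixed as the data set grows. The sketch below assumes a normal distribution and a two-sided test; the function name `threshold_for_expected_count` and the target of one expected outlier are illustrative assumptions, not part of the original solution.

```python
from statistics import NormalDist

def threshold_for_expected_count(n, expected_outliers=1.0):
    """Number of standard deviations k such that n * P(|Z| > k) equals
    expected_outliers, for a standard normal Z (two-sided test)."""
    p_total = expected_outliers / n   # allowed outlier probability per value
    # Split the probability across both tails; use the lower tail and negate
    # the quantile, which avoids computing 1 - (a very small number).
    return -NormalDist().inv_cdf(p_total / 2.0)

k = threshold_for_expected_count(1_000_000)
print(f"threshold for about one expected outlier: {k:.2f} standard deviations")
```

With this rule the threshold grows slowly with the number of objects, so the three-standard-deviation test corresponds to a much smaller data set than one million values.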