Unit 4 sec 2.3
Checking and cleaning data
21 November 2015
15:12
First, there are some
blank cells in the spreadsheet where data are missing. Second, you probably
also noticed the value 99 appearing in columns that otherwise contain
only small integer values.
One explanation for these, other than a typing mistake, is
that numbers such as 99 are sometimes used as codes to signal a missing value. Good
documentation of a data file should make the presence and value of a missing
data code clear, but with secondary data this information can get lost.
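As a rough illustration of the point above, the following Python sketch treats both blank cells and the code 99 as missing. It assumes that 99 really is a missing-data code in this column, which the notes suggest but cannot confirm; the column values and the names MISSING_CODE and clean_cell are invented for the example.

```python
# Assumption: 99 is used as a missing-data code in this column.
MISSING_CODE = 99


def clean_cell(cell):
    """Return None for a blank cell or the missing-data code, else the integer value."""
    text = str(cell).strip()
    if text == "":
        return None
    value = int(text)
    return None if value == MISSING_CODE else value


# Invented column of small integers, with one blank cell and one 99.
raw_column = ["2", "", "1", "99", "3"]
cleaned = [clean_cell(c) for c in raw_column]
print(cleaned)  # [2, None, 1, None, 3]
```

Marking missing values explicitly, rather than leaving 99 in place, stops the code from being counted as a real measurement when calculating a mean or a maximum.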
Column H contains the
weights of the babies, in kilograms. The minimum weight is 2.05 kg, which is not
a great deal less than the next smallest weights, 2.22 kg and 2.44 kg. The largest
baby's weight, however, appears to be 34 kg. Most probably the decimal point
was missed out of a weight of 3.4 kg, but without confirmation from the
original data collection source there is no certainty that this is the
explanation.
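The check described above can be sketched in Python. The weights 2.05, 2.22, 2.44 and 34 are taken from the notes; the other values are invented, and the plausible range of 0.5 kg to 6.5 kg is an assumption made only for this example. The sketch flags the suspect value but does not change it to 3.4, since that correction is not confirmed.

```python
# 2.05, 2.22, 2.44 and 34.0 come from the notes; 3.1 and 3.6 are invented.
weights_kg = [3.1, 2.22, 34.0, 2.05, 3.6, 2.44]

# Assumed plausible range for a baby's weight in kg (illustrative only).
LOW, HIGH = 0.5, 6.5

# Look at the extremes first, as in the notes.
smallest_three = sorted(weights_kg)[:3]
largest = max(weights_kg)

# Flag, but do not alter, any value outside the assumed range.
suspect = [w for w in weights_kg if not (LOW <= w <= HIGH)]

print(smallest_three)  # [2.05, 2.22, 2.44]
print(largest)         # 34.0
print(suspect)         # [34.0]
```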
Outliers
One or more data values that are considerably smaller or
larger than the other values in the same dataset are called outliers. An outlier
may be the result of an error, as the weight of 34 kg appears to be. However, as
in the case of the tallest mother mentioned in the solution to Activity 10, often
there is no such obvious reason and the outlier may just be an unusual, but not
unreasonable, observation. Either way, there are sophisticated techniques available
to deal with outliers, but these are not explored here.
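To make "considerably smaller or larger" concrete, here is a minimal Python sketch of one common rule of thumb: flag any value lying more than 1.5 times the interquartile range beyond the quartiles. This rule is not taken from the notes and is not one of the sophisticated techniques mentioned; the function name flag_outliers, the multiplier k and most of the data values are assumptions for illustration.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# 2.05, 2.22, 2.44 and 34.0 come from the notes; the rest are invented.
weights_kg = [2.05, 2.22, 2.44, 2.9, 3.1, 3.2, 3.3, 3.4, 3.6, 3.8, 4.1, 34.0]
flagged = flag_outliers(weights_kg)
print(flagged)  # [34.0]
```

A flagged value is only a candidate for closer inspection. As with the tallest mother, it may turn out to be unusual but perfectly genuine.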