Saturday 21 November 2015

Unit 4 sec 2.3 Checking and cleaning data

21 November 2015
15:12

First, there are some blank cells in the spreadsheet where data are missing. Second, you probably also noticed the value 99 appearing in columns that otherwise contain only small integer values.
An explanation for these, other than a typing mistake, is that numbers such as 99 are sometimes used as codes that signal a missing value. Good documentation of the data file should make the presence and value of a missing data code clear, but with secondary data this information can get lost.
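As a minimal sketch of how blanks and a missing data code might be handled in Python: the column values below are made up for illustration, and treating 99 as the missing data code is an assumption that should be checked against the data file's documentation.

```python
MISSING_CODE = 99  # assumed missing data code; confirm against the documentation


def clean_cell(raw, missing_code=MISSING_CODE):
    """Return the cell as an int, or None if it is blank or holds the missing code."""
    text = raw.strip()
    if text == "":
        return None  # blank cell: data are missing
    value = int(text)
    return None if value == missing_code else value


# Hypothetical column of small integer values, as read from a spreadsheet
raw_column = ["2", "0", "", "99", "1", "3"]

cleaned = [clean_cell(cell) for cell in raw_column]
n_missing = sum(value is None for value in cleaned)

print(cleaned)    # blanks and 99s both become None
print(n_missing)  # how many values are missing in this column
```

Recording both kinds of missing value as None keeps them out of later calculations such as means, where a stray 99 would distort the result.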
Column H contains the weights of the babies, in kilograms. The minimum weight is 2.05 kg, which is not a great deal less than the next smallest weights, 2.22 kg and 2.44 kg. The largest baby’s weight, however, appears to be 34 kg. Most probably the decimal point was missed out of a weight of 3.4 kg, but without confirmation from the original data collection source there is no certainty that this is the explanation.
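The same check on the extremes of a column can be sketched in Python. Only the weights 2.05, 2.22, 2.44 and 34 come from the notes above; the other values and the plausibility range are assumptions chosen purely for illustration.

```python
# Hypothetical weights in kg; only 2.05, 2.22, 2.44 and 34 come from the notes
weights_kg = [3.1, 2.22, 34.0, 2.05, 3.6, 2.44, 3.9]

ordered = sorted(weights_kg)
smallest_three = ordered[:3]  # compare the minimum with its nearest neighbours
largest = ordered[-1]

# Assumed plausible range for a baby's weight, for illustration only
PLAUSIBLE_MIN_KG, PLAUSIBLE_MAX_KG = 0.5, 7.0

# Flag suspect values for checking; do not correct them automatically
suspect = [w for w in weights_kg if not PLAUSIBLE_MIN_KG <= w <= PLAUSIBLE_MAX_KG]

print(smallest_three)
print(largest)
print(suspect)
```

The sketch only flags the 34 kg value. Changing it to 3.4 kg would need confirmation from the original data collection source.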

Outliers

Data values that are considerably smaller or larger than the other values in the same dataset are called outliers. An outlier can result from an error, as with the 34 kg weight above. However, as in the case of the tallest mother mentioned in the solution to Activity 10, often there is no such obvious reason and the outlier may just be an unusual, but not unreasonable, observation. Either way, there are sophisticated techniques available to deal with outliers, but these are not explored here.
