Descriptive statistics (use a stats package, or even a spreadsheet such as Excel, to help you).

Consider the following numbers:

A	B	C	D	E	F
1	0	16	7	1	1
2	7	17	62	12	2
3	0	9	0	5	4
4	7	18	35	18	8
5	7	13	5	28	16
6	8	11	10	78	32
7	9	13	14	0	64
8	2	10	48	46	128
9	7	16	0	23	256
10	3	10	13	23	512
11	4	14	8	11	1024
12	4	12	9	34	2048
13	3	22	5	10	4096
14	0	10	59	5	8192
15	5	13	96	24	16384
16	7	22	97	43	32768

For each column, find the (arithmetic) mean, median, and standard deviation. How well do these conventional statistics describe the basic characteristics of the data?
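As a minimal sketch of the computation asked for here, the following base R code enters the table by hand and computes the three statistics for every column. The object names `problem1` and `desc` are illustrative choices, and column F is entered as 2^(0:15) on the observation that it doubles at each row.

```r
# Enter the table by hand (values copied from the table above).
problem1 <- data.frame(
  A = 1:16,
  B = c(0, 7, 0, 7, 7, 8, 9, 2, 7, 3, 4, 4, 3, 0, 5, 7),
  C = c(16, 17, 9, 18, 13, 11, 13, 10, 16, 10, 14, 12, 22, 10, 13, 22),
  D = c(7, 62, 0, 35, 5, 10, 14, 48, 0, 13, 8, 9, 5, 59, 96, 97),
  E = c(1, 12, 5, 18, 28, 78, 0, 46, 23, 23, 11, 34, 10, 5, 24, 43),
  F = 2^(0:15)  # 1, 2, 4, ..., 32768
)

# One column of results per variable; rows are mean, median, and sd.
desc <- sapply(problem1, function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x))
})
print(round(desc, 2))
```

Comparing the mean and median rows gives a quick check on skew: where they differ greatly (as in F), the mean and standard deviation say little about a typical value.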

Examining the data suggests transformations that might better capture the underlying characteristics of each column. Which transformations would you recommend to make the data easier to understand? Find the same descriptive statistics for the transformed data.


The following code in the R system will do this. (Note that I am shortcutting the input step by copying the data to the clipboard and using a procedure to read the clipboard. My "read.clipboard()" function is meant to combine the code for PCs and Macs into one function. You can get it by downloading my "useful.r" routines.)
source("http://personality-project.org/r/useful.r")   #get a small package of psychometrically useful functions
problem1 <- read.clipboard()   #after first copying the table with the header row from above
summary(problem1) #get the basic summary statistics
boxplot(problem1) #show this graphically
pairs.panels(problem1) #show a graphic with scatterplots and histograms
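If the clipboard route is not convenient, one alternative that is not part of "useful.r" is base R's read.table() with its text argument, which reads the table from a string pasted into the script. This is only a sketch of that approach; the result should be the same data frame that read.clipboard() is meant to produce.

```r
# Read the table from an inline string instead of the clipboard.
# header = TRUE takes the column names A..F from the first non-blank line.
problem1 <- read.table(header = TRUE, text = "
A  B  C  D  E  F
1  0 16  7  1  1
2  7 17 62 12  2
3  0  9  0  5  4
4  7 18 35 18  8
5  7 13  5 28 16
6  8 11 10 78 32
7  9 13 14  0 64
8  2 10 48 46 128
9  7 16  0 23 256
10 3 10 13 23 512
11 4 14  8 11 1024
12 4 12  9 34 2048
13 3 22  5 10 4096
14 0 10 59  5 8192
15 5 13 96 24 16384
16 7 22 97 43 32768
")

str(problem1)      # 16 observations of 6 integer variables
summary(problem1)  # same basic summary as above
```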

This produces the following output:
> problem1 <- read.clipboard()   #after first copying the table with the header row from above
> summary(problem1)
       A               B               C               D               E        
 Min.   : 1.00   Min.   :0.000   Min.   : 9.00   Min.   : 0.00   Min.   : 0.00  
 1st Qu.: 4.75   1st Qu.:2.750   1st Qu.:10.75   1st Qu.: 6.50   1st Qu.: 8.75  
 Median : 8.50   Median :4.500   Median :13.00   Median :11.50   Median :20.50  
 Mean   : 8.50   Mean   :4.562   Mean   :14.12   Mean   :29.25   Mean   :22.56  
 3rd Qu.:12.25   3rd Qu.:7.000   3rd Qu.:16.25   3rd Qu.:50.75   3rd Qu.:29.50  
 Max.   :16.00   Max.   :9.000   Max.   :22.00   Max.   :97.00   Max.   :78.00  
       F        
 Min.   :    1  
 1st Qu.:   14  
 Median :  192  
 Mean   : 4096  
 3rd Qu.: 2560  
 Max.   :32768  
> boxplot(problem1) #show this graphically
> pairs.panels(problem1) #show a graphic with scatterplots and histograms

Note that the boxplot isn't very helpful, because the range of variable F is so great. What happens if we do a log transform of the data?
logprob <- log(problem1)
summary(logprob)
boxplot(logprob)

> logprob <- log(problem1)
> summary(logprob)
       A               B                C               D               E        
 Min.   :0.000   Min.   :  -Inf   Min.   :2.197   Min.   : -Inf   Min.   : -Inf  
 1st Qu.:1.554   1st Qu.:0.9972   1st Qu.:2.374   1st Qu.:1.862   1st Qu.:2.129  
 Median :2.138   Median :1.4979   Median :2.565   Median :2.434   Median :3.013  
 Mean   :1.917   Mean   :  -Inf   Mean   :2.611   Mean   : -Inf   Mean   : -Inf  
 3rd Qu.:2.505   3rd Qu.:1.9459   3rd Qu.:2.788   3rd Qu.:3.923   3rd Qu.:3.381  
 Max.   :2.773   Max.   :2.1972   Max.   :3.091   Max.   :4.575   Max.   :4.357  
       F         
 Min.   : 0.000  
 1st Qu.: 2.599  
 Median : 5.199  
 Mean   : 5.199  
 3rd Qu.: 7.798  
 Max.   :10.397  
> boxplot(logprob)
Warning messages: 
1: Outlier (-Inf) in 2nd boxplot are NOT drawn in: bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group ==  
2: Outlier (-Inf) in 4th boxplot are NOT drawn in: bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group ==  
3: Outlier (-Inf) in 5th boxplot are NOT drawn in: bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group ==  
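In the logged summary above, the mean and median of F coincide (5.199) and the quartiles are evenly spaced, whereas on the raw scale the mean (4096) is far from the median (192). The sketch below checks this for F alone. The variable names are illustrative, and the base-2 log is a substitution for the natural log used above, chosen only because it makes the doubling pattern explicit.

```r
# Column F doubles at every row: 1, 2, 4, ..., 32768.
f_raw <- 2^(0:15)

# Raw scale: mean and median disagree badly.
raw_stats <- c(mean = mean(f_raw), median = median(f_raw), sd = sd(f_raw))

# Natural-log scale: the values are equally spaced, so mean and median agree.
f_log <- log(f_raw)
log_stats <- c(mean = mean(f_log), median = median(f_log), sd = sd(f_log))

# Base-2 logs recover the exponents 0, 1, ..., 15 directly.
f_log2 <- log2(f_raw)

# Back-transforming the mean of the logs gives the geometric mean.
geo_mean <- exp(mean(f_log))

print(round(raw_stats, 2))
print(round(log_stats, 3))
print(f_log2)
print(geo_mean)
```

The geometric mean lands close to the raw median, which is one way to argue that the log scale describes F better than the raw scale does.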

The warnings about the boxplot arise because columns B, D, and E contain zeros, and log(0) is -Inf. Try adding one to each number before taking the logs.

log1prob <- log(problem1+1)
summary(log1prob)
boxplot(log1prob)


> log1prob <- log(problem1+1)
> summary(log1prob)
       A                B               C               D               E        
 Min.   :0.6931   Min.   :0.000   Min.   :2.303   Min.   :0.000   Min.   :0.000  
 1st Qu.:1.7462   1st Qu.:1.314   1st Qu.:2.463   1st Qu.:2.008   1st Qu.:2.246  
 Median :2.2499   Median :1.701   Median :2.639   Median :2.518   Median :3.061  
 Mean   :2.0941   Mean   :1.486   Mean   :2.684   Mean   :2.674   Mean   :2.698  
 3rd Qu.:2.5835   3rd Qu.:2.079   3rd Qu.:2.848   3rd Qu.:3.942   3rd Qu.:3.414  
 Max.   :2.8332   Max.   :2.303   Max.   :3.135   Max.   :4.585   Max.   :4.369  
       F          
 Min.   : 0.6931  
 1st Qu.: 2.6742  
 Median : 5.2044  
 Mean   : 5.2962  
 3rd Qu.: 7.7983  
 Max.   :10.3972  
> boxplot(log1prob)
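To make the zero problem and the add-one fix concrete, here is a small sketch using column B only. It shows that a single zero drives the mean of the logs to -Inf while leaving the median finite, and that base R's log1p() computes the same thing as log(x + 1). The variable names are illustrative.

```r
# Column B from the table above; it contains three zeros.
B <- c(0, 7, 0, 7, 7, 8, 9, 2, 7, 3, 4, 4, 3, 0, 5, 7)

n_zero      <- sum(B == 0)      # how many zeros there are
mean_log    <- mean(log(B))     # -Inf, because log(0) is -Inf
median_log  <- median(log(B))   # still finite: the zeros only sit in the lower tail

# Adding one before logging keeps every value finite.
# log1p(x) is base R's equivalent of log(x + 1).
b_log1p     <- log1p(B)
mean_log1p  <- mean(b_log1p)

# expm1() undoes log1p(), returning a summary on the original scale.
back_transformed <- expm1(mean_log1p)

print(c(n_zero = n_zero, mean_log = mean_log, median_log = median_log))
print(c(mean_log1p = mean_log1p, back_transformed = back_transformed))
```

Note that adding one is a convenience rather than a principled choice: it shifts every value, so statistics on log(x + 1) are not directly comparable to those on log(x).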



The final boxplot is shown below. (I have not shown the ones that are less useful.)