Some comments about characters and numbers

Most of the data we analyze are numeric, but sometimes we will have variables that are character. This leads to some interesting problems.

Consider the following dataset created by a fellow student.

First, we need to remember to make psych and psychTools active (read.file comes from psychTools).

library(psych)
library(psychTools)
fn <- "https://personality-project.org/courses/350/datasets/hp.csv"
hp <- read.file(fn)
## Data from the .csv file https://personality-project.org/courses/350/datasets/hp.csv has been loaded.
headTail(hp) #just show the first and last 4 lines
##                 Name Gender House Likability Has.Glasses    sex   Gen
## 1       Harry Potter      0     1         10           1   male   Man
## 2   Hermione Granger      1     1         10           0 female Woman
## 3        Ron Weasley      0     1         10           0   male   Man
## 4      Ginny Weasley      1     1         10           0 female Woman
## ...             <NA>    ...   ...        ...         ...   <NA>  <NA>
## 23           Quirrel      0     3          1           0   male   Man
## 24          Slughorn      0     4          6           0   male   Man
## 25             Goyle      0     4          1           0   male   Man
## 26   Pansy Parkinson      1     4          1           0 female Woman
summary(hp)  #this is the base R way of summarizing
##      Name               Gender           House         Likability      Has.Glasses    
##  Length:26          Min.   :0.0000   Min.   :1.000   Min.   : 1.000   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.: 1.750   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :3.000   Median : 5.000   Median :0.0000  
##                     Mean   :0.4231   Mean   :2.654   Mean   : 5.346   Mean   :0.1538  
##                     3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.: 8.000   3rd Qu.:0.0000  
##                     Max.   :1.0000   Max.   :4.000   Max.   :10.000   Max.   :1.0000  
##      sex                Gen           
##  Length:26          Length:26         
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
describe(hp)  #this is the psych way of describing 
##             vars  n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## Name*          1 26 13.50 7.65   13.5   13.50 9.64   1  26    25  0.00    -1.34 1.50
## Gender         2 26  0.42 0.50    0.0    0.41 0.00   0   1     1  0.29    -1.99 0.10
## House          3 26  2.65 1.20    3.0    2.68 1.48   1   4     3 -0.15    -1.59 0.23
## Likability     4 26  5.35 3.30    5.0    5.32 5.19   1  10     9 -0.04    -1.42 0.65
## Has.Glasses    5 26  0.15 0.37    0.0    0.09 0.00   0   1     1  1.81     1.33 0.07
## sex*           6 26  1.58 0.50    2.0    1.59 0.00   1   2     1 -0.29    -1.99 0.10
## Gen*           7 26  1.42 0.50    1.0    1.41 0.00   1   2     1  0.29    -1.99 0.10

Because some of the data are character, the cor function will not work. describe and lowerCor convert the character data to numeric using the char2numeric function and then do the normal operations on the data.
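As a small illustration of why cor fails, here is a sketch using a made-up three-row data frame. The object toy and its values are hypothetical (they are not the hp data), and only base R is assumed.

```r
# A toy data frame (hypothetical values, not the hp data)
toy <- data.frame(
  Name       = c("A", "B", "C"),
  Likability = c(10, 5, 1),
  House      = c(1, 2, 4),
  stringsAsFactors = FALSE
)

# cor() stops with an error because one column is character;
# tryCatch() captures the error message instead of halting
result <- tryCatch(cor(toy), error = function(e) conditionMessage(e))
print(result)

# One workaround: keep just the numeric columns before calling cor()
numeric.only <- toy[sapply(toy, is.numeric)]
r <- cor(numeric.only)
print(round(r, 2))
```

Dropping the character columns avoids the error but throws away information, which is why describe and lowerCor convert those columns instead.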

But this leads to some confusion, because character values are converted to numeric values in alphabetical order. Thus, female becomes 1 and male becomes 2, but Man becomes 1 and Woman becomes 2.
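The alphabetical coding can be mimicked with base R factors. This is only a sketch: it assumes that factor coding is a fair stand-in for what char2numeric does, and the short vectors below are made up for illustration.

```r
# Character vectors coded by the alphabetical order of their values,
# using base R factors (an assumed stand-in for char2numeric)
sex <- c("male", "female", "male", "female")
Gen <- c("Man", "Woman", "Man", "Woman")

sex.num <- as.numeric(factor(sex))  # female = 1, male = 2
Gen.num <- as.numeric(factor(Gen))  # Man = 1, Woman = 2

# Show the original values next to their numeric codes
print(data.frame(sex, sex.num, Gen, Gen.num))
```

The same people get a 2 on one variable and a 1 on the other, simply because "female" sorts before "male" while "Man" sorts before "Woman".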

To let you know that it has automatically done this conversion, psych adds an * to the variable name. Thus Name, sex, and Gen are shown as Name*, sex*, and Gen*.

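Look at the correlations. sex* and Gen* are perfectly negatively correlated (-1.00), even though they carry the same information, because the two alphabetical codings run in opposite directions. For the same reason, Gender correlates -1.00 with sex* but 1.00 with Gen*.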

lowerCor(hp)
##             Name* Gendr House Lkblt Hs.Gl sex*  Gen* 
## Name*        1.00                                    
## Gender      -0.02  1.00                              
## House        0.17 -0.21  1.00                        
## Likability  -0.19  0.03 -0.70  1.00                  
## Has.Glasses -0.21 -0.15 -0.42  0.38  1.00            
## sex*         0.02 -1.00  0.21 -0.03  0.15  1.00      
## Gen*        -0.02  1.00 -0.21  0.03 -0.15 -1.00  1.00
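If the reversed sign is unwanted, one option is to set the coding order explicitly rather than relying on alphabetical order. The sketch below uses made-up vectors and base R factors with a levels argument; it is one possible approach under those assumptions, not something the psych functions do for you.

```r
# Made-up vectors that carry the same information in two different labelings
sex <- c("male", "female", "male", "female", "male")
Gen <- c("Man", "Woman", "Man", "Woman", "Man")

# Default alphabetical coding: female = 1, male = 2 but Man = 1, Woman = 2
r.default <- cor(as.numeric(factor(sex)), as.numeric(factor(Gen)))

# Explicit levels: male = 1, female = 2, so both variables point the same way
sex.recoded <- as.numeric(factor(sex, levels = c("male", "female")))
r.recoded <- cor(sex.recoded, as.numeric(factor(Gen)))

print(round(c(default = r.default, recoded = r.recoded), 2))
```

Either coding is defensible; what matters is knowing which value received the higher number before interpreting the sign of a correlation.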

This is easy to see if we show the data after we convert it using char2numeric.

converted <- char2numeric(hp)
headTail(converted)
##     Name. Gender House Likability Has.Glasses sex. Gen.
## 1      11      0     1         10           1    2    1
## 2      12      1     1         10           0    1    2
## 3      20      0     1         10           0    2    1
## 4       8      1     1         10           0    1    2
## ...   ...    ...   ...        ...         ...  ...  ...
## 23     19      0     3          1           0    2    1
## 24     22      0     4          6           0    2    1
## 25      9      0     4          1           0    2    1
## 26     17      1     4          1           0    1    2