Most of the data we analyze is numeric. But sometimes we will have variables that are character. This will lead to some interesting problems.
Consider the following dataset created by a fellow student.
First, we need to remember to make psych and psychTools active.
library(psych)
library(psychTools)
fn <- "https://personality-project.org/courses/350/datasets/hp.csv"
hp <- read.file(fn)
## Data from the .csv file https://personality-project.org/courses/350/datasets/hp.csv has been loaded.
headTail(hp) #just show the first and last 4 lines
## Name Gender House Likability Has.Glasses sex Gen
## 1 Harry Potter 0 1 10 1 male Man
## 2 Hermione Granger 1 1 10 0 female Woman
## 3 Ron Weasley 0 1 10 0 male Man
## 4 Ginny Weasley 1 1 10 0 female Woman
## ... <NA> ... ... ... ... <NA> <NA>
## 23 Quirrel 0 3 1 0 male Man
## 24 Slughorn 0 4 6 0 male Man
## 25 Goyle 0 4 1 0 male Man
## 26 Pansy Parkinson 1 4 1 0 female Woman
summary(hp) #this is the R way of summarizing
## Name Gender House Likability Has.Glasses
## Length:26 Min. :0.0000 Min. :1.000 Min. : 1.000 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.: 1.750 1st Qu.:0.0000
## Mode :character Median :0.0000 Median :3.000 Median : 5.000 Median :0.0000
## Mean :0.4231 Mean :2.654 Mean : 5.346 Mean :0.1538
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.: 8.000 3rd Qu.:0.0000
## Max. :1.0000 Max. :4.000 Max. :10.000 Max. :1.0000
## sex Gen
## Length:26 Length:26
## Class :character Class :character
## Mode :character Mode :character
##
##
##
describe(hp) #this is the psych way of describing
## vars n mean sd median trimmed mad min max range skew kurtosis se
## Name* 1 26 13.50 7.65 13.5 13.50 9.64 1 26 25 0.00 -1.34 1.50
## Gender 2 26 0.42 0.50 0.0 0.41 0.00 0 1 1 0.29 -1.99 0.10
## House 3 26 2.65 1.20 3.0 2.68 1.48 1 4 3 -0.15 -1.59 0.23
## Likability 4 26 5.35 3.30 5.0 5.32 5.19 1 10 9 -0.04 -1.42 0.65
## Has.Glasses 5 26 0.15 0.37 0.0 0.09 0.00 0 1 1 1.81 1.33 0.07
## sex* 6 26 1.58 0.50 2.0 1.59 0.00 1 2 1 -0.29 -1.99 0.10
## Gen* 7 26 1.42 0.50 1.0 1.41 0.00 1 2 1 0.29 -1.99 0.10
That some of the data are character means that the `cor` function will not work. `describe` and `lowerCor` convert the character data to numeric using the `char2numeric` function and then do normal operations on the data.
But this leads to some confusion, in that characters are converted to numeric values in alphabetical order. Thus, `female` becomes 1 and `male` becomes 2, but `Man` becomes 1 and `Woman` becomes 2.
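The alphabetical coding described above can be illustrated with a small sketch in base R. The data frame `toy` and its values are made up for illustration (they are not the hp.csv data), and `as.numeric(factor(...))` is used here only as a base R analogue of what `char2numeric` does, not as its actual implementation.

```r
# Toy data with two character columns (hypothetical values, not hp.csv)
toy <- data.frame(sex = c("male", "female", "male", "female"),
                  Gen = c("Man", "Woman", "Man", "Woman"),
                  Likability = c(10, 10, 1, 6),
                  stringsAsFactors = FALSE)

# cor() stops on non-numeric input; capture the message instead of halting
result <- tryCatch(cor(toy), error = function(e) conditionMessage(e))
print(result)

# Factor levels are sorted alphabetically, so the codes follow that order
sex_codes <- as.numeric(factor(toy$sex))  # female = 1, male = 2
Gen_codes <- as.numeric(factor(toy$Gen))  # Man = 1, Woman = 2

# Show each label next to the code it received
print(data.frame(sex = toy$sex, sex_codes, Gen = toy$Gen, Gen_codes))
```

The same person is coded 2 on one variable and 1 on the other, which is the source of the confusion.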
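To let you know that it has automatically done this conversion, it adds an * to the variable name. Thus sex and Gen are renamed as sex* and Gen*.
Look at the correlations. sex* and Gen* are negatively correlated (r = -1.00) because the two codings run in opposite directions: males are 2 on sex* but 1 on Gen*.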
lowerCor(hp)
## Name* Gendr House Lkblt Hs.Gl sex* Gen*
## Name* 1.00
## Gender -0.02 1.00
## House 0.17 -0.21 1.00
## Likability -0.19 0.03 -0.70 1.00
## Has.Glasses -0.21 -0.15 -0.42 0.38 1.00
## sex* 0.02 -1.00 0.21 -0.03 0.15 1.00
## Gen* -0.02 1.00 -0.21 0.03 -0.15 -1.00 1.00
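The sign flip in the table above can be reproduced with a short sketch on made-up vectors. It again assumes that `as.numeric(factor(...))` is a fair stand-in for the alphabetical coding. Setting the factor levels explicitly is shown as one possible way to control the direction of the coding, not as something the psych functions do for you.

```r
# Two character variables that carry the same information (hypothetical values)
sex <- c("male", "female", "male", "female", "male")
Gen <- c("Man", "Woman", "Man", "Woman", "Man")

# Alphabetical coding: female = 1, male = 2; Man = 1, Woman = 2
sex_alpha <- as.numeric(factor(sex))
Gen_alpha <- as.numeric(factor(Gen))
r_alpha <- cor(sex_alpha, Gen_alpha)   # codings run in opposite directions

# Explicit level order: male = 1, female = 2, now matching Gen
sex_fixed <- as.numeric(factor(sex, levels = c("male", "female")))
r_fixed <- cor(sex_fixed, Gen_alpha)   # codings run in the same direction

print(c(alphabetical = r_alpha, explicit = r_fixed))
```

The size of the correlation does not change, only its sign, so the direction of coding needs to be known before interpreting it.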
This is easy to see if we show the data after we convert it using char2numeric.
converted <- char2numeric(hp)
headTail(converted)
## Name. Gender House Likability Has.Glasses sex. Gen.
## 1 11 0 1 10 1 2 1
## 2 12 1 1 10 0 1 2
## 3 20 0 1 10 0 2 1
## 4 8 1 1 10 0 1 2
## ... ... ... ... ... ... ... ...
## 23 19 0 3 1 0 2 1
## 24 22 0 4 6 0 2 1
## 25 9 0 4 1 0 2 1
## 26 17 1 4 1 0 1 2
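When it is not obvious which label received which code, a cross-tabulation of the original labels against the numeric codes makes the mapping explicit. The sketch below uses a made-up vector and the base R `table` function. The codes come from `as.numeric(factor(...))`, which is assumed here to mirror the alphabetical conversion.

```r
# Hypothetical character variable and its alphabetical numeric codes
sex <- c("male", "female", "male", "female")
codes <- as.numeric(factor(sex))

# Each label should fall in exactly one code column
mapping <- table(label = sex, code = codes)
print(mapping)
```

Every row of the table has a single non-zero cell, and the column it falls in is the code for that label.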