1. Data Manipulation
We may manipulate data in R
using the tidyverse
packages.
dplyr
for data frame manipulating and filteringtidyr
for data frame restructuringpurr
for functional programmingreadr
for persisting and depersisting data
Visit the official tidyverse website for more information.
1.1. dplyr
Let’s use the dplyr
package to manipulate data frames. This package uses the following verbs
to manipulate data frames.
select
extracts columns of interestsfilter
removes columnsmutate
changes valuesarrange
reorders rowssummarize
applies aggregate computations to datajoin
merges data into a single data frame
Note that these verbs
provide a grammar
for data manipulation.
[1]:
suppressMessages({
library('dplyr')
})
df <- data.frame(
age = c(18, 16, 15, 19),
grade = c('A', 'B', 'C', 'B'),
name = c('Jane', 'Joyce', 'Joe', 'John'),
male = c(FALSE, FALSE, TRUE, TRUE),
stringsAsFactors=FALSE
)
print(df)
age grade name male
1 18 A Jane FALSE
2 16 B Joyce FALSE
3 15 C Joe TRUE
4 19 B John TRUE
1.1.1. select
Example of select.
[2]:
g <- select(df, grade)
print(g)
grade
1 A
2 B
3 C
4 B
1.1.2. filter
Example of filter.
[3]:
m <- filter(df, male == TRUE)
print(m)
age grade name male
1 15 C Joe TRUE
2 19 B John TRUE
1.1.3. mutate
Example of mutate.
[4]:
n <- mutate(df, status=ifelse(age < 18, 'minor', 'adult'))
print(n)
age grade name male status
1 18 A Jane FALSE adult
2 16 B Joyce FALSE minor
3 15 C Joe TRUE minor
4 19 B John TRUE adult
1.1.4. arrange
Example of arrange.
[5]:
n <- arrange(df, age)
print(n)
age grade name male
1 15 C Joe TRUE
2 16 B Joyce FALSE
3 18 A Jane FALSE
4 19 B John TRUE
1.1.5. summarize
Example of summarize.
[6]:
n <- summarize(df, avg_age=mean(age))
print(n)
avg_age
1 17
1.1.6. pipe
You may chain multiple verbs
by using the pipe
operator %>%
. In the example below, we filter the data frame df
for students whose ages are less than 18 and male, and then compute the average age.
[7]:
n <- df %>%
filter(age < 18) %>%
filter(male == TRUE) %>%
summarize(avg_age=mean(age))
print(n)
avg_age
1 15
1.1.7. group_by
Here’s an example of using the group_by
function. The group_by
function will return a tibble
object that looks like a data frame, but keeps track of the groups.
[8]:
n <- group_by(df, male)
print(n)
# A tibble: 4 x 4
# Groups: male [2]
age grade name male
<dbl> <chr> <chr> <lgl>
1 18 A Jane FALSE
2 16 B Joyce FALSE
3 15 C Joe TRUE
4 19 B John TRUE
You may also apply verbs
to the tibble above. Below, we compute the average age, but notice how they are separated into male and female (as a result of the group_by
function)?
[9]:
r <- n %>% summarize(avg_age=mean(age))
print(r)
# A tibble: 2 x 2
male avg_age
<lgl> <dbl>
1 FALSE 17
2 TRUE 17
1.1.8. join
You may join data frames using SQL
like joins.
left_join
joins two data frames return all records from the left data frameinner_join
joins two data frames returning all records that only exist in both data framesright_join
joins two data frames returning all records from the right data framefull_join
joins two data frames all records from both data frames
[10]:
subjects <- data.frame(
name = c('Jane', 'Joyce', 'Joe', 'Jeremy'),
course = c('Statistics', 'Calculus', 'Statistics', 'Calculus'),
stringsAsFactors=FALSE
)
Left join.
[11]:
n <- left_join(df, subjects, by='name')
print(n)
age grade name male course
1 18 A Jane FALSE Statistics
2 16 B Joyce FALSE Calculus
3 15 C Joe TRUE Statistics
4 19 B John TRUE <NA>
Inner join.
[12]:
n <- inner_join(df, subjects, by='name')
print(n)
age grade name male course
1 18 A Jane FALSE Statistics
2 16 B Joyce FALSE Calculus
3 15 C Joe TRUE Statistics
Right join.
[13]:
n <- right_join(df, subjects, by='name')
print(n)
age grade name male course
1 18 A Jane FALSE Statistics
2 16 B Joyce FALSE Calculus
3 15 C Joe TRUE Statistics
4 NA <NA> Jeremy NA Calculus
Full join.
[14]:
n <- full_join(df, subjects, by='name')
print(n)
age grade name male course
1 18 A Jane FALSE Statistics
2 16 B Joyce FALSE Calculus
3 15 C Joe TRUE Statistics
4 19 B John TRUE <NA>
5 NA <NA> Jeremy NA Calculus
1.2. tidyr
The package tidyr
is used to restructure data. Its semantic concepts are as follows.
variables
are columnsobservations
are rowsvalues
are elements
A nice cheat sheat is available.
[15]:
library('tidyr')
df <- data.frame(
name = c('Jane', 'Joyce', 'Joe', 'John'),
quizz1 = c(90, 89, 75, 91),
quizz2 = c(95, 91, 85, 89),
quizz3 = c(92, 82, 80, 93),
stringsAsFactors=FALSE
)
print(df)
name quizz1 quizz2 quizz3
1 Jane 90 95 92
2 Joyce 89 91 82
3 Joe 75 85 80
4 John 91 89 93
1.2.1. gather
The gather
function changes the wide
format (column-oriented) of df
to a long
format (row-oriented).
[16]:
n <- gather(df, key=quizz, value=score, -name)
print(n)
name quizz score
1 Jane quizz1 90
2 Joyce quizz1 89
3 Joe quizz1 75
4 John quizz1 91
5 Jane quizz2 95
6 Joyce quizz2 91
7 Joe quizz2 85
8 John quizz2 89
9 Jane quizz3 92
10 Joyce quizz3 82
11 Joe quizz3 80
12 John quizz3 93
The newer API has pivot_longer
.
[17]:
n <- df %>% pivot_longer(-name, names_to='quizz', values_to='score')
print(n)
# A tibble: 12 x 3
name quizz score
<chr> <chr> <dbl>
1 Jane quizz1 90
2 Jane quizz2 95
3 Jane quizz3 92
4 Joyce quizz1 89
5 Joyce quizz2 91
6 Joyce quizz3 82
7 Joe quizz1 75
8 Joe quizz2 85
9 Joe quizz3 80
10 John quizz1 91
11 John quizz2 89
12 John quizz3 93
[18]:
r <- n %>%
summarize(avg_score=mean(score))
print(r)
# A tibble: 1 x 1
avg_score
<dbl>
1 87.7
[19]:
r <- n %>%
group_by(name) %>%
summarize(avg_score=mean(score))
print(r)
# A tibble: 4 x 2
name avg_score
<chr> <dbl>
1 Jane 92.3
2 Joe 80
3 John 91
4 Joyce 87.3
1.2.2. spread
The spread
function converts the data frame from long (row-oriented) to wide (column-oriented) form.
[20]:
r <- spread(n, key=quizz, value=score)
print(r)
# A tibble: 4 x 4
name quizz1 quizz2 quizz3
<chr> <dbl> <dbl> <dbl>
1 Jane 90 95 92
2 Joe 75 85 80
3 John 91 89 93
4 Joyce 89 91 82
The newer API has pivot_wider
.
[21]:
r <- n %>% pivot_wider(names_from='quizz', values_from='score')
print(r)
# A tibble: 4 x 4
name quizz1 quizz2 quizz3
<chr> <dbl> <dbl> <dbl>
1 Jane 90 95 92
2 Joyce 89 91 82
3 Joe 75 85 80
4 John 91 89 93
1.2.3. unite
The unite
function merges several columns.
[22]:
df <- data.frame(
name = c('Jane', 'Joyce', 'Joe', 'John'),
dobMonth = c(1, 2, 3, 4),
dobDay = c(5, 10, 15, 20),
dobYear = c(89, 88, 87, 91),
stringsAsFactors=FALSE
)
print(df)
name dobMonth dobDay dobYear
1 Jane 1 5 89
2 Joyce 2 10 88
3 Joe 3 15 87
4 John 4 20 91
[23]:
n <- df %>% unite(dobMonth, dobDay, dobYear, col='dob', sep='-')
print(n)
name dob
1 Jane 1-5-89
2 Joyce 2-10-88
3 Joe 3-15-87
4 John 4-20-91
1.2.4. separate
The separate
function breaks apart a column.
[24]:
df <- data.frame(
name = c('Jane', 'Joyce', 'Joe', 'John'),
dob = c('1-5-89', '2-10-88', '3-15-87', '4-20-91'),
stringsAsFactors=FALSE
)
print(df)
name dob
1 Jane 1-5-89
2 Joyce 2-10-88
3 Joe 3-15-87
4 John 4-20-91
[25]:
n <- df %>% separate(dob, sep='-', into=c('dobMonth', 'dobDay', 'dobYear'))
print(n)
name dobMonth dobDay dobYear
1 Jane 1 5 89
2 Joyce 2 10 88
3 Joe 3 15 87
4 John 4 20 91
1.3. purr
The purr
library enables functional programming over data frames. A purr cheat sheat is available.
[26]:
library('purrr')
df <- data.frame(
name = c('Jane', 'Joyce', 'Joe', 'John'),
quizz1 = c(90, 89, 75, 91),
quizz2 = c(95, 91, 85, 89),
quizz3 = c(92, 82, 80, 93),
stringsAsFactors=FALSE
)
print(df)
name quizz1 quizz2 quizz3
1 Jane 90 95 92
2 Joyce 89 91 82
3 Joe 75 85 80
4 John 91 89 93
1.3.1. modify_if
The modify_if
function modifies a variable (column) if the specified condition is satisified. Below, we use the comprehenr
package to mimic a vector comprehension (or, in Python
, a list comprehension
).
[27]:
library('comprehenr')
letterGrade = function(score) {
if (score >= 90) {
return('A')
} else if (score >= 80) {
return('B')
} else if (score >= 70) {
return('C')
} else if (score >= 60) {
return('D')
} else {
return('F')
}
}
letterGrades = function(scores) {
to_vec(for(s in scores) letterGrade(s))
}
n <- df %>%
modify_if(is.numeric, .f=letterGrades)
print(n)
name quizz1 quizz2 quizz3
1 Jane A A A
2 Joyce B A B
3 Joe C B B
4 John A B A
1.3.2. modify_at
The modify_at
function modifies the specified columns.
[28]:
n <- df %>%
modify_at(.at=c('quizz1', 'quizz2', 'quizz3'), .f=letterGrades)
print(n)
name quizz1 quizz2 quizz3
1 Jane A A A
2 Joyce B A B
3 Joe C B B
4 John A B A
1.4. readr
The readr
package is used to write and read data. A variety of formats is supported.
csv
comma separatedtsv
tab separateddelim
delimitedfwf
fixed width filestable
tablular file with columns separated by spacelog
web log files
[29]:
library('readr')
df <- data.frame(
name = c('Jane', 'Joyce', 'Joe', 'John'),
dob = c('1-5-89', '2-10-88', '3-15-87', '4-20-91'),
stringsAsFactors=FALSE
)
1.4.1. write_csv
Use write_csv
to write data to a CSV file.
[30]:
df %>% write_csv(path='students.csv')
1.4.2. read_csv
Use read_csv
to read CSV data. Note that the object returned will be a tibble
.
[31]:
df <- read_csv(file='students.csv')
print(df)
Parsed with column specification:
cols(
name = col_character(),
dob = col_character()
)
# A tibble: 4 x 2
name dob
<chr> <chr>
1 Jane 1-5-89
2 Joyce 2-10-88
3 Joe 3-15-87
4 John 4-20-91
Note that you may specify the schema as well.
[32]:
df <- read_csv(file='students.csv', col_types=cols(
name = col_character(),
dob = col_character()
))
print(df)
# A tibble: 4 x 2
name dob
<chr> <chr>
1 Jane 1-5-89
2 Joyce 2-10-88
3 Joe 3-15-87
4 John 4-20-91