1. Data Manipulation
We may manipulate data in R using the tidyverse packages.
dplyrfor data frame manipulating and filteringtidyrfor data frame restructuringpurrfor functional programmingreadrfor persisting and depersisting data
Visit the official tidyverse website for more information.
1.1. dplyr
Let’s use the dplyr package to manipulate data frames. This package uses the following verbs to manipulate data frames.
selectextracts columns of interestsfilterremoves columnsmutatechanges valuesarrangereorders rowssummarizeapplies aggregate computations to datajoinmerges data into a single data frame
Note that these verbs provide a grammar for data manipulation.
[1]:
suppressMessages({
library('dplyr')
})
df <- data.frame(
age = c(18, 16, 15, 19),
grade = c('A', 'B', 'C', 'B'),
name = c('Jane', 'Joyce', 'Joe', 'John'),
male = c(FALSE, FALSE, TRUE, TRUE),
stringsAsFactors=FALSE
)
print(df)
age grade name male
1 18 A Jane FALSE
2 16 B Joyce FALSE
3 15 C Joe TRUE
4 19 B John TRUE
1.1.1. select
Example of select.
[2]:
g <- select(df, grade)
print(g)
grade
1 A
2 B
3 C
4 B
1.1.2. filter
Example of filter.
[3]:
m <- filter(df, male == TRUE)
print(m)
age grade name male
1 15 C Joe TRUE
2 19 B John TRUE
1.1.3. mutate
Example of mutate.
[4]:
n <- mutate(df, status=ifelse(age < 18, 'minor', 'adult'))
print(n)
age grade name male status
1 18 A Jane FALSE adult
2 16 B Joyce FALSE minor
3 15 C Joe TRUE minor
4 19 B John TRUE adult
1.1.4. arrange
Example of arrange.
[5]:
n <- arrange(df, age)
print(n)
age grade name male
1 15 C Joe TRUE
2 16 B Joyce FALSE
3 18 A Jane FALSE
4 19 B John TRUE
1.1.5. summarize
Example of summarize.
[6]:
n <- summarize(df, avg_age=mean(age))
print(n)
avg_age
1 17
1.1.6. pipe
You may chain multiple verbs by using the pipe operator %>%. In the example below, we filter the data frame df for students whose ages are less than 18 and male, and then compute the average age.
[7]:
n <- df %>%
filter(age < 18) %>%
filter(male == TRUE) %>%
summarize(avg_age=mean(age))
print(n)
avg_age
1 15
1.1.7. group_by
Here’s an example of using the group_by function. The group_by function will return a tibble object that looks like a data frame, but keeps track of the groups.
[8]:
n <- group_by(df, male)
print(n)
# A tibble: 4 x 4
# Groups: male [2]
age grade name male
<dbl> <chr> <chr> <lgl>
1 18 A Jane FALSE
2 16 B Joyce FALSE
3 15 C Joe TRUE
4 19 B John TRUE
You may also apply verbs to the tibble above. Below, we compute the average age, but notice how they are separated into male and female (as a result of the group_by function)?
[9]:
r <- n %>% summarize(avg_age=mean(age))
print(r)
# A tibble: 2 x 2
male avg_age
<lgl> <dbl>
1 FALSE 17
2 TRUE 17
1.1.8. join
You may join data frames using SQL like joins.
left_joinjoins two data frames return all records from the left data frameinner_joinjoins two data frames returning all records that only exist in both data framesright_joinjoins two data frames returning all records from the right data framefull_joinjoins two data frames all records from both data frames
[10]:
subjects <- data.frame(
name = c('Jane', 'Joyce', 'Joe', 'Jeremy'),
course = c('Statistics', 'Calculus', 'Statistics', 'Calculus'),
stringsAsFactors=FALSE
)
Left join.
[11]:
n <- left_join(df, subjects, by='name')
print(n)
age grade name male course
1 18 A Jane FALSE Statistics
2 16 B Joyce FALSE Calculus
3 15 C Joe TRUE Statistics
4 19 B John TRUE <NA>
Inner join.
[12]:
n <- inner_join(df, subjects, by='name')
print(n)
age grade name male course
1 18 A Jane FALSE Statistics
2 16 B Joyce FALSE Calculus
3 15 C Joe TRUE Statistics
Right join.
[13]:
n <- right_join(df, subjects, by='name')
print(n)
age grade name male course
1 18 A Jane FALSE Statistics
2 16 B Joyce FALSE Calculus
3 15 C Joe TRUE Statistics
4 NA <NA> Jeremy NA Calculus
Full join.
[14]:
n <- full_join(df, subjects, by='name')
print(n)
age grade name male course
1 18 A Jane FALSE Statistics
2 16 B Joyce FALSE Calculus
3 15 C Joe TRUE Statistics
4 19 B John TRUE <NA>
5 NA <NA> Jeremy NA Calculus
1.2. tidyr
The package tidyr is used to restructure data. Its semantic concepts are as follows.
variablesare columnsobservationsare rowsvaluesare elements
A nice cheat sheat is available.
[15]:
library('tidyr')
df <- data.frame(
name = c('Jane', 'Joyce', 'Joe', 'John'),
quizz1 = c(90, 89, 75, 91),
quizz2 = c(95, 91, 85, 89),
quizz3 = c(92, 82, 80, 93),
stringsAsFactors=FALSE
)
print(df)
name quizz1 quizz2 quizz3
1 Jane 90 95 92
2 Joyce 89 91 82
3 Joe 75 85 80
4 John 91 89 93
1.2.1. gather
The gather function changes the wide format (column-oriented) of df to a long format (row-oriented).
[16]:
n <- gather(df, key=quizz, value=score, -name)
print(n)
name quizz score
1 Jane quizz1 90
2 Joyce quizz1 89
3 Joe quizz1 75
4 John quizz1 91
5 Jane quizz2 95
6 Joyce quizz2 91
7 Joe quizz2 85
8 John quizz2 89
9 Jane quizz3 92
10 Joyce quizz3 82
11 Joe quizz3 80
12 John quizz3 93
The newer API has pivot_longer.
[17]:
n <- df %>% pivot_longer(-name, names_to='quizz', values_to='score')
print(n)
# A tibble: 12 x 3
name quizz score
<chr> <chr> <dbl>
1 Jane quizz1 90
2 Jane quizz2 95
3 Jane quizz3 92
4 Joyce quizz1 89
5 Joyce quizz2 91
6 Joyce quizz3 82
7 Joe quizz1 75
8 Joe quizz2 85
9 Joe quizz3 80
10 John quizz1 91
11 John quizz2 89
12 John quizz3 93
[18]:
r <- n %>%
summarize(avg_score=mean(score))
print(r)
# A tibble: 1 x 1
avg_score
<dbl>
1 87.7
[19]:
r <- n %>%
group_by(name) %>%
summarize(avg_score=mean(score))
print(r)
# A tibble: 4 x 2
name avg_score
<chr> <dbl>
1 Jane 92.3
2 Joe 80
3 John 91
4 Joyce 87.3
1.2.2. spread
The spread function converts the data frame from long (row-oriented) to wide (column-oriented) form.
[20]:
r <- spread(n, key=quizz, value=score)
print(r)
# A tibble: 4 x 4
name quizz1 quizz2 quizz3
<chr> <dbl> <dbl> <dbl>
1 Jane 90 95 92
2 Joe 75 85 80
3 John 91 89 93
4 Joyce 89 91 82
The newer API has pivot_wider.
[21]:
r <- n %>% pivot_wider(names_from='quizz', values_from='score')
print(r)
# A tibble: 4 x 4
name quizz1 quizz2 quizz3
<chr> <dbl> <dbl> <dbl>
1 Jane 90 95 92
2 Joyce 89 91 82
3 Joe 75 85 80
4 John 91 89 93
1.2.3. unite
The unite function merges several columns.
[22]:
df <- data.frame(
name = c('Jane', 'Joyce', 'Joe', 'John'),
dobMonth = c(1, 2, 3, 4),
dobDay = c(5, 10, 15, 20),
dobYear = c(89, 88, 87, 91),
stringsAsFactors=FALSE
)
print(df)
name dobMonth dobDay dobYear
1 Jane 1 5 89
2 Joyce 2 10 88
3 Joe 3 15 87
4 John 4 20 91
[23]:
n <- df %>% unite(dobMonth, dobDay, dobYear, col='dob', sep='-')
print(n)
name dob
1 Jane 1-5-89
2 Joyce 2-10-88
3 Joe 3-15-87
4 John 4-20-91
1.2.4. separate
The separate function breaks apart a column.
[24]:
df <- data.frame(
name = c('Jane', 'Joyce', 'Joe', 'John'),
dob = c('1-5-89', '2-10-88', '3-15-87', '4-20-91'),
stringsAsFactors=FALSE
)
print(df)
name dob
1 Jane 1-5-89
2 Joyce 2-10-88
3 Joe 3-15-87
4 John 4-20-91
[25]:
n <- df %>% separate(dob, sep='-', into=c('dobMonth', 'dobDay', 'dobYear'))
print(n)
name dobMonth dobDay dobYear
1 Jane 1 5 89
2 Joyce 2 10 88
3 Joe 3 15 87
4 John 4 20 91
1.3. purr
The purr library enables functional programming over data frames. A purr cheat sheat is available.
[26]:
library('purrr')
df <- data.frame(
name = c('Jane', 'Joyce', 'Joe', 'John'),
quizz1 = c(90, 89, 75, 91),
quizz2 = c(95, 91, 85, 89),
quizz3 = c(92, 82, 80, 93),
stringsAsFactors=FALSE
)
print(df)
name quizz1 quizz2 quizz3
1 Jane 90 95 92
2 Joyce 89 91 82
3 Joe 75 85 80
4 John 91 89 93
1.3.1. modify_if
The modify_if function modifies a variable (column) if the specified condition is satisified. Below, we use the comprehenr package to mimic a vector comprehension (or, in Python, a list comprehension).
[27]:
library('comprehenr')
letterGrade = function(score) {
if (score >= 90) {
return('A')
} else if (score >= 80) {
return('B')
} else if (score >= 70) {
return('C')
} else if (score >= 60) {
return('D')
} else {
return('F')
}
}
letterGrades = function(scores) {
to_vec(for(s in scores) letterGrade(s))
}
n <- df %>%
modify_if(is.numeric, .f=letterGrades)
print(n)
name quizz1 quizz2 quizz3
1 Jane A A A
2 Joyce B A B
3 Joe C B B
4 John A B A
1.3.2. modify_at
The modify_at function modifies the specified columns.
[28]:
n <- df %>%
modify_at(.at=c('quizz1', 'quizz2', 'quizz3'), .f=letterGrades)
print(n)
name quizz1 quizz2 quizz3
1 Jane A A A
2 Joyce B A B
3 Joe C B B
4 John A B A
1.4. readr
The readr package is used to write and read data. A variety of formats is supported.
csvcomma separatedtsvtab separateddelimdelimitedfwffixed width filestabletablular file with columns separated by spacelogweb log files
[29]:
library('readr')
df <- data.frame(
name = c('Jane', 'Joyce', 'Joe', 'John'),
dob = c('1-5-89', '2-10-88', '3-15-87', '4-20-91'),
stringsAsFactors=FALSE
)
1.4.1. write_csv
Use write_csv to write data to a CSV file.
[30]:
df %>% write_csv(path='students.csv')
1.4.2. read_csv
Use read_csv to read CSV data. Note that the object returned will be a tibble.
[31]:
df <- read_csv(file='students.csv')
print(df)
Parsed with column specification:
cols(
name = col_character(),
dob = col_character()
)
# A tibble: 4 x 2
name dob
<chr> <chr>
1 Jane 1-5-89
2 Joyce 2-10-88
3 Joe 3-15-87
4 John 4-20-91
Note that you may specify the schema as well.
[32]:
df <- read_csv(file='students.csv', col_types=cols(
name = col_character(),
dob = col_character()
))
print(df)
# A tibble: 4 x 2
name dob
<chr> <chr>
1 Jane 1-5-89
2 Joyce 2-10-88
3 Joe 3-15-87
4 John 4-20-91