1. Data Manipulation

We may manipulate data in R using the tidyverse packages.

dplyr for data frame manipulating and filtering
tidyr for data frame restructuring
purr for functional programming
readr for persisting and depersisting data

Visit the official tidyverse website for more information.

1.1. dplyr

Let’s use the dplyr package to manipulate data frames. This package uses the following verbs to manipulate data frames.

select extracts columns of interests
filter removes columns
mutate changes values
arrange reorders rows
summarize applies aggregate computations to data
join merges data into a single data frame

Note that these verbs provide a grammar for data manipulation.

[1]:

suppressMessages({
    library('dplyr')
})


df <- data.frame(
    age = c(18, 16, 15, 19),
    grade = c('A', 'B', 'C', 'B'),
    name = c('Jane', 'Joyce', 'Joe', 'John'),
    male = c(FALSE, FALSE, TRUE, TRUE),
    stringsAsFactors=FALSE
)

print(df)

  age grade  name  male
1  18     A  Jane FALSE
2  16     B Joyce FALSE
3  15     C   Joe  TRUE
4  19     B  John  TRUE

1.1.1. select

Example of select.

[2]:

g <- select(df, grade)
print(g)

  grade
1     A
2     B
3     C
4     B

1.1.2. filter

Example of filter.

[3]:

m <- filter(df, male == TRUE)
print(m)

  age grade name male
1  15     C  Joe TRUE
2  19     B John TRUE

1.1.3. mutate

Example of mutate.

[4]:

n <- mutate(df, status=ifelse(age < 18, 'minor', 'adult'))
print(n)

  age grade  name  male status
1  18     A  Jane FALSE  adult
2  16     B Joyce FALSE  minor
3  15     C   Joe  TRUE  minor
4  19     B  John  TRUE  adult

1.1.4. arrange

Example of arrange.

[5]:

n <- arrange(df, age)
print(n)

  age grade  name  male
1  15     C   Joe  TRUE
2  16     B Joyce FALSE
3  18     A  Jane FALSE
4  19     B  John  TRUE

1.1.5. summarize

Example of summarize.

[6]:

n <- summarize(df, avg_age=mean(age))
print(n)

  avg_age
1      17

1.1.6. pipe

You may chain multiple verbs by using the pipe operator %>%. In the example below, we filter the data frame df for students whose ages are less than 18 and male, and then compute the average age.

[7]:

n <- df %>%
    filter(age < 18) %>%
    filter(male == TRUE) %>%
    summarize(avg_age=mean(age))
print(n)

  avg_age
1      15

1.1.7. group_by

Here’s an example of using the group_by function. The group_by function will return a tibble object that looks like a data frame, but keeps track of the groups.

[8]:

n <- group_by(df, male)
print(n)

# A tibble: 4 x 4
# Groups:   male [2]
    age grade name  male
  <dbl> <chr> <chr> <lgl>
1    18 A     Jane  FALSE
2    16 B     Joyce FALSE
3    15 C     Joe   TRUE
4    19 B     John  TRUE

You may also apply verbs to the tibble above. Below, we compute the average age, but notice how they are separated into male and female (as a result of the group_by function)?

[9]:

r <- n %>% summarize(avg_age=mean(age))
print(r)

# A tibble: 2 x 2
  male  avg_age
  <lgl>   <dbl>
1 FALSE      17
2 TRUE       17

1.1.8. join

You may join data frames using SQL like joins.

left_join joins two data frames return all records from the left data frame
inner_join joins two data frames returning all records that only exist in both data frames
right_join joins two data frames returning all records from the right data frame
full_join joins two data frames all records from both data frames

[10]:

subjects <- data.frame(
    name = c('Jane', 'Joyce', 'Joe', 'Jeremy'),
    course = c('Statistics', 'Calculus', 'Statistics', 'Calculus'),
    stringsAsFactors=FALSE
)

Left join.

[11]:

n <- left_join(df, subjects, by='name')
print(n)

  age grade  name  male     course
1  18     A  Jane FALSE Statistics
2  16     B Joyce FALSE   Calculus
3  15     C   Joe  TRUE Statistics
4  19     B  John  TRUE       <NA>

Inner join.

[12]:

n <- inner_join(df, subjects, by='name')
print(n)

  age grade  name  male     course
1  18     A  Jane FALSE Statistics
2  16     B Joyce FALSE   Calculus
3  15     C   Joe  TRUE Statistics

Right join.

[13]:

n <- right_join(df, subjects, by='name')
print(n)

  age grade   name  male     course
1  18     A   Jane FALSE Statistics
2  16     B  Joyce FALSE   Calculus
3  15     C    Joe  TRUE Statistics
4  NA  <NA> Jeremy    NA   Calculus

Full join.

[14]:

n <- full_join(df, subjects, by='name')
print(n)

  age grade   name  male     course
1  18     A   Jane FALSE Statistics
2  16     B  Joyce FALSE   Calculus
3  15     C    Joe  TRUE Statistics
4  19     B   John  TRUE       <NA>
5  NA  <NA> Jeremy    NA   Calculus

1.2. tidyr

The package tidyr is used to restructure data. Its semantic concepts are as follows.

variables are columns
observations are rows
values are elements

A nice cheat sheat is available.

[15]:

library('tidyr')

df <- data.frame(
    name = c('Jane', 'Joyce', 'Joe', 'John'),
    quizz1 = c(90, 89, 75, 91),
    quizz2 = c(95, 91, 85, 89),
    quizz3 = c(92, 82, 80, 93),
    stringsAsFactors=FALSE
)

print(df)

   name quizz1 quizz2 quizz3
1  Jane     90     95     92
2 Joyce     89     91     82
3   Joe     75     85     80
4  John     91     89     93

1.2.1. gather

The gather function changes the wide format (column-oriented) of df to a long format (row-oriented).

[16]:

n <- gather(df, key=quizz, value=score, -name)
print(n)

    name  quizz score
1   Jane quizz1    90
2  Joyce quizz1    89
3    Joe quizz1    75
4   John quizz1    91
5   Jane quizz2    95
6  Joyce quizz2    91
7    Joe quizz2    85
8   John quizz2    89
9   Jane quizz3    92
10 Joyce quizz3    82
11   Joe quizz3    80
12  John quizz3    93

The newer API has pivot_longer.

[17]:

n <- df %>% pivot_longer(-name, names_to='quizz', values_to='score')
print(n)

# A tibble: 12 x 3
   name  quizz  score
   <chr> <chr>  <dbl>
 1 Jane  quizz1    90
 2 Jane  quizz2    95
 3 Jane  quizz3    92
 4 Joyce quizz1    89
 5 Joyce quizz2    91
 6 Joyce quizz3    82
 7 Joe   quizz1    75
 8 Joe   quizz2    85
 9 Joe   quizz3    80
10 John  quizz1    91
11 John  quizz2    89
12 John  quizz3    93

[18]:

r <- n %>%
    summarize(avg_score=mean(score))
print(r)

# A tibble: 1 x 1
  avg_score
      <dbl>
1      87.7

[19]:

r <- n %>%
    group_by(name) %>%
    summarize(avg_score=mean(score))
print(r)

# A tibble: 4 x 2
  name  avg_score
  <chr>     <dbl>
1 Jane       92.3
2 Joe        80
3 John       91
4 Joyce      87.3

1.2.2. spread

The spread function converts the data frame from long (row-oriented) to wide (column-oriented) form.

[20]:

r <- spread(n, key=quizz, value=score)
print(r)

# A tibble: 4 x 4
  name  quizz1 quizz2 quizz3
  <chr>  <dbl>  <dbl>  <dbl>
1 Jane      90     95     92
2 Joe       75     85     80
3 John      91     89     93
4 Joyce     89     91     82

The newer API has pivot_wider.

[21]:

r <- n %>% pivot_wider(names_from='quizz', values_from='score')
print(r)

# A tibble: 4 x 4
  name  quizz1 quizz2 quizz3
  <chr>  <dbl>  <dbl>  <dbl>
1 Jane      90     95     92
2 Joyce     89     91     82
3 Joe       75     85     80
4 John      91     89     93

1.2.3. unite

The unite function merges several columns.

[22]:

df <- data.frame(
    name = c('Jane', 'Joyce', 'Joe', 'John'),
    dobMonth = c(1, 2, 3, 4),
    dobDay = c(5, 10, 15, 20),
    dobYear = c(89, 88, 87, 91),
    stringsAsFactors=FALSE
)

print(df)

   name dobMonth dobDay dobYear
1  Jane        1      5      89
2 Joyce        2     10      88
3   Joe        3     15      87
4  John        4     20      91

[23]:

n <- df %>% unite(dobMonth, dobDay, dobYear, col='dob', sep='-')
print(n)

   name     dob
1  Jane  1-5-89
2 Joyce 2-10-88
3   Joe 3-15-87
4  John 4-20-91

1.2.4. separate

The separate function breaks apart a column.

[24]:

df <- data.frame(
    name = c('Jane', 'Joyce', 'Joe', 'John'),
    dob = c('1-5-89', '2-10-88', '3-15-87', '4-20-91'),
    stringsAsFactors=FALSE
)

print(df)

   name     dob
1  Jane  1-5-89
2 Joyce 2-10-88
3   Joe 3-15-87
4  John 4-20-91

[25]:

n <- df %>% separate(dob, sep='-', into=c('dobMonth', 'dobDay', 'dobYear'))
print(n)

   name dobMonth dobDay dobYear
1  Jane        1      5      89
2 Joyce        2     10      88
3   Joe        3     15      87
4  John        4     20      91

1.3. purr

The purr library enables functional programming over data frames. A purr cheat sheat is available.

[26]:

library('purrr')

df <- data.frame(
    name = c('Jane', 'Joyce', 'Joe', 'John'),
    quizz1 = c(90, 89, 75, 91),
    quizz2 = c(95, 91, 85, 89),
    quizz3 = c(92, 82, 80, 93),
    stringsAsFactors=FALSE
)

print(df)

   name quizz1 quizz2 quizz3
1  Jane     90     95     92
2 Joyce     89     91     82
3   Joe     75     85     80
4  John     91     89     93

1.3.1. modify_if

The modify_if function modifies a variable (column) if the specified condition is satisified. Below, we use the comprehenr package to mimic a vector comprehension (or, in Python, a list comprehension).

[27]:

library('comprehenr')

letterGrade = function(score) {
    if (score >= 90) {
        return('A')
    } else if (score >= 80) {
        return('B')
    } else if (score >= 70) {
        return('C')
    } else if (score >= 60) {
        return('D')
    } else {
        return('F')
    }
}

letterGrades = function(scores) {
    to_vec(for(s in scores) letterGrade(s))
}

n <- df %>%
        modify_if(is.numeric, .f=letterGrades)
print(n)

   name quizz1 quizz2 quizz3
1  Jane      A      A      A
2 Joyce      B      A      B
3   Joe      C      B      B
4  John      A      B      A

1.3.2. modify_at

The modify_at function modifies the specified columns.

[28]:

n <- df %>%
        modify_at(.at=c('quizz1', 'quizz2', 'quizz3'), .f=letterGrades)
print(n)

   name quizz1 quizz2 quizz3
1  Jane      A      A      A
2 Joyce      B      A      B
3   Joe      C      B      B
4  John      A      B      A

1.4. readr

The readr package is used to write and read data. A variety of formats is supported.

csv comma separated
tsv tab separated
delim delimited
fwf fixed width files
table tablular file with columns separated by space
log web log files

[29]:

library('readr')

df <- data.frame(
    name = c('Jane', 'Joyce', 'Joe', 'John'),
    dob = c('1-5-89', '2-10-88', '3-15-87', '4-20-91'),
    stringsAsFactors=FALSE
)

1.4.1. write_csv

Use write_csv to write data to a CSV file.

[30]:

df %>% write_csv(path='students.csv')

1.4.2. read_csv

Use read_csv to read CSV data. Note that the object returned will be a tibble.

[31]:

df <- read_csv(file='students.csv')
print(df)

Parsed with column specification:
cols(
  name = col_character(),
  dob = col_character()
)

# A tibble: 4 x 2
  name  dob
  <chr> <chr>
1 Jane  1-5-89
2 Joyce 2-10-88
3 Joe   3-15-87
4 John  4-20-91

Note that you may specify the schema as well.

[32]:

df <- read_csv(file='students.csv', col_types=cols(
    name = col_character(),
    dob = col_character()
))

print(df)

# A tibble: 4 x 2
  name  dob
  <chr> <chr>
1 Jane  1-5-89
2 Joyce 2-10-88
3 Joe   3-15-87
4 John  4-20-91