1. Data Manipulation

We may manipulate data in R using the tidyverse packages.

  • dplyr for data frame manipulating and filtering

  • tidyr for data frame restructuring

  • purr for functional programming

  • readr for persisting and depersisting data

Visit the official tidyverse website for more information.

1.1. dplyr

Let’s use the dplyr package to manipulate data frames. This package uses the following verbs to manipulate data frames.

  • select extracts columns of interests

  • filter removes columns

  • mutate changes values

  • arrange reorders rows

  • summarize applies aggregate computations to data

  • join merges data into a single data frame

Note that these verbs provide a grammar for data manipulation.

[1]:
suppressMessages({
    library('dplyr')
})


df <- data.frame(
    age = c(18, 16, 15, 19),
    grade = c('A', 'B', 'C', 'B'),
    name = c('Jane', 'Joyce', 'Joe', 'John'),
    male = c(FALSE, FALSE, TRUE, TRUE),
    stringsAsFactors=FALSE
)

print(df)
  age grade  name  male
1  18     A  Jane FALSE
2  16     B Joyce FALSE
3  15     C   Joe  TRUE
4  19     B  John  TRUE

1.1.1. select

Example of select.

[2]:
g <- select(df, grade)
print(g)
  grade
1     A
2     B
3     C
4     B

1.1.2. filter

Example of filter.

[3]:
m <- filter(df, male == TRUE)
print(m)
  age grade name male
1  15     C  Joe TRUE
2  19     B John TRUE

1.1.3. mutate

Example of mutate.

[4]:
n <- mutate(df, status=ifelse(age < 18, 'minor', 'adult'))
print(n)
  age grade  name  male status
1  18     A  Jane FALSE  adult
2  16     B Joyce FALSE  minor
3  15     C   Joe  TRUE  minor
4  19     B  John  TRUE  adult

1.1.4. arrange

Example of arrange.

[5]:
n <- arrange(df, age)
print(n)
  age grade  name  male
1  15     C   Joe  TRUE
2  16     B Joyce FALSE
3  18     A  Jane FALSE
4  19     B  John  TRUE

1.1.5. summarize

Example of summarize.

[6]:
n <- summarize(df, avg_age=mean(age))
print(n)
  avg_age
1      17

1.1.6. pipe

You may chain multiple verbs by using the pipe operator %>%. In the example below, we filter the data frame df for students whose ages are less than 18 and male, and then compute the average age.

[7]:
n <- df %>%
    filter(age < 18) %>%
    filter(male == TRUE) %>%
    summarize(avg_age=mean(age))
print(n)
  avg_age
1      15

1.1.7. group_by

Here’s an example of using the group_by function. The group_by function will return a tibble object that looks like a data frame, but keeps track of the groups.

[8]:
n <- group_by(df, male)
print(n)
# A tibble: 4 x 4
# Groups:   male [2]
    age grade name  male
  <dbl> <chr> <chr> <lgl>
1    18 A     Jane  FALSE
2    16 B     Joyce FALSE
3    15 C     Joe   TRUE
4    19 B     John  TRUE

You may also apply verbs to the tibble above. Below, we compute the average age, but notice how they are separated into male and female (as a result of the group_by function)?

[9]:
r <- n %>% summarize(avg_age=mean(age))
print(r)
# A tibble: 2 x 2
  male  avg_age
  <lgl>   <dbl>
1 FALSE      17
2 TRUE       17

1.1.8. join

You may join data frames using SQL like joins.

  • left_join joins two data frames return all records from the left data frame

  • inner_join joins two data frames returning all records that only exist in both data frames

  • right_join joins two data frames returning all records from the right data frame

  • full_join joins two data frames all records from both data frames

[10]:
subjects <- data.frame(
    name = c('Jane', 'Joyce', 'Joe', 'Jeremy'),
    course = c('Statistics', 'Calculus', 'Statistics', 'Calculus'),
    stringsAsFactors=FALSE
)

Left join.

[11]:
n <- left_join(df, subjects, by='name')
print(n)
  age grade  name  male     course
1  18     A  Jane FALSE Statistics
2  16     B Joyce FALSE   Calculus
3  15     C   Joe  TRUE Statistics
4  19     B  John  TRUE       <NA>

Inner join.

[12]:
n <- inner_join(df, subjects, by='name')
print(n)
  age grade  name  male     course
1  18     A  Jane FALSE Statistics
2  16     B Joyce FALSE   Calculus
3  15     C   Joe  TRUE Statistics

Right join.

[13]:
n <- right_join(df, subjects, by='name')
print(n)
  age grade   name  male     course
1  18     A   Jane FALSE Statistics
2  16     B  Joyce FALSE   Calculus
3  15     C    Joe  TRUE Statistics
4  NA  <NA> Jeremy    NA   Calculus

Full join.

[14]:
n <- full_join(df, subjects, by='name')
print(n)
  age grade   name  male     course
1  18     A   Jane FALSE Statistics
2  16     B  Joyce FALSE   Calculus
3  15     C    Joe  TRUE Statistics
4  19     B   John  TRUE       <NA>
5  NA  <NA> Jeremy    NA   Calculus

1.2. tidyr

The package tidyr is used to restructure data. Its semantic concepts are as follows.

  • variables are columns

  • observations are rows

  • values are elements

A nice cheat sheat is available.

[15]:
library('tidyr')

df <- data.frame(
    name = c('Jane', 'Joyce', 'Joe', 'John'),
    quizz1 = c(90, 89, 75, 91),
    quizz2 = c(95, 91, 85, 89),
    quizz3 = c(92, 82, 80, 93),
    stringsAsFactors=FALSE
)

print(df)
   name quizz1 quizz2 quizz3
1  Jane     90     95     92
2 Joyce     89     91     82
3   Joe     75     85     80
4  John     91     89     93

1.2.1. gather

The gather function changes the wide format (column-oriented) of df to a long format (row-oriented).

[16]:
n <- gather(df, key=quizz, value=score, -name)
print(n)
    name  quizz score
1   Jane quizz1    90
2  Joyce quizz1    89
3    Joe quizz1    75
4   John quizz1    91
5   Jane quizz2    95
6  Joyce quizz2    91
7    Joe quizz2    85
8   John quizz2    89
9   Jane quizz3    92
10 Joyce quizz3    82
11   Joe quizz3    80
12  John quizz3    93

The newer API has pivot_longer.

[17]:
n <- df %>% pivot_longer(-name, names_to='quizz', values_to='score')
print(n)
# A tibble: 12 x 3
   name  quizz  score
   <chr> <chr>  <dbl>
 1 Jane  quizz1    90
 2 Jane  quizz2    95
 3 Jane  quizz3    92
 4 Joyce quizz1    89
 5 Joyce quizz2    91
 6 Joyce quizz3    82
 7 Joe   quizz1    75
 8 Joe   quizz2    85
 9 Joe   quizz3    80
10 John  quizz1    91
11 John  quizz2    89
12 John  quizz3    93
[18]:
r <- n %>%
    summarize(avg_score=mean(score))
print(r)
# A tibble: 1 x 1
  avg_score
      <dbl>
1      87.7
[19]:
r <- n %>%
    group_by(name) %>%
    summarize(avg_score=mean(score))
print(r)
# A tibble: 4 x 2
  name  avg_score
  <chr>     <dbl>
1 Jane       92.3
2 Joe        80
3 John       91
4 Joyce      87.3

1.2.2. spread

The spread function converts the data frame from long (row-oriented) to wide (column-oriented) form.

[20]:
r <- spread(n, key=quizz, value=score)
print(r)
# A tibble: 4 x 4
  name  quizz1 quizz2 quizz3
  <chr>  <dbl>  <dbl>  <dbl>
1 Jane      90     95     92
2 Joe       75     85     80
3 John      91     89     93
4 Joyce     89     91     82

The newer API has pivot_wider.

[21]:
r <- n %>% pivot_wider(names_from='quizz', values_from='score')
print(r)
# A tibble: 4 x 4
  name  quizz1 quizz2 quizz3
  <chr>  <dbl>  <dbl>  <dbl>
1 Jane      90     95     92
2 Joyce     89     91     82
3 Joe       75     85     80
4 John      91     89     93

1.2.3. unite

The unite function merges several columns.

[22]:
df <- data.frame(
    name = c('Jane', 'Joyce', 'Joe', 'John'),
    dobMonth = c(1, 2, 3, 4),
    dobDay = c(5, 10, 15, 20),
    dobYear = c(89, 88, 87, 91),
    stringsAsFactors=FALSE
)

print(df)
   name dobMonth dobDay dobYear
1  Jane        1      5      89
2 Joyce        2     10      88
3   Joe        3     15      87
4  John        4     20      91
[23]:
n <- df %>% unite(dobMonth, dobDay, dobYear, col='dob', sep='-')
print(n)
   name     dob
1  Jane  1-5-89
2 Joyce 2-10-88
3   Joe 3-15-87
4  John 4-20-91

1.2.4. separate

The separate function breaks apart a column.

[24]:
df <- data.frame(
    name = c('Jane', 'Joyce', 'Joe', 'John'),
    dob = c('1-5-89', '2-10-88', '3-15-87', '4-20-91'),
    stringsAsFactors=FALSE
)

print(df)
   name     dob
1  Jane  1-5-89
2 Joyce 2-10-88
3   Joe 3-15-87
4  John 4-20-91
[25]:
n <- df %>% separate(dob, sep='-', into=c('dobMonth', 'dobDay', 'dobYear'))
print(n)
   name dobMonth dobDay dobYear
1  Jane        1      5      89
2 Joyce        2     10      88
3   Joe        3     15      87
4  John        4     20      91

1.3. purr

The purr library enables functional programming over data frames. A purr cheat sheat is available.

[26]:
library('purrr')

df <- data.frame(
    name = c('Jane', 'Joyce', 'Joe', 'John'),
    quizz1 = c(90, 89, 75, 91),
    quizz2 = c(95, 91, 85, 89),
    quizz3 = c(92, 82, 80, 93),
    stringsAsFactors=FALSE
)

print(df)
   name quizz1 quizz2 quizz3
1  Jane     90     95     92
2 Joyce     89     91     82
3   Joe     75     85     80
4  John     91     89     93

1.3.1. modify_if

The modify_if function modifies a variable (column) if the specified condition is satisified. Below, we use the comprehenr package to mimic a vector comprehension (or, in Python, a list comprehension).

[27]:
library('comprehenr')

letterGrade = function(score) {
    if (score >= 90) {
        return('A')
    } else if (score >= 80) {
        return('B')
    } else if (score >= 70) {
        return('C')
    } else if (score >= 60) {
        return('D')
    } else {
        return('F')
    }
}

letterGrades = function(scores) {
    to_vec(for(s in scores) letterGrade(s))
}

n <- df %>%
        modify_if(is.numeric, .f=letterGrades)
print(n)
   name quizz1 quizz2 quizz3
1  Jane      A      A      A
2 Joyce      B      A      B
3   Joe      C      B      B
4  John      A      B      A

1.3.2. modify_at

The modify_at function modifies the specified columns.

[28]:
n <- df %>%
        modify_at(.at=c('quizz1', 'quizz2', 'quizz3'), .f=letterGrades)
print(n)
   name quizz1 quizz2 quizz3
1  Jane      A      A      A
2 Joyce      B      A      B
3   Joe      C      B      B
4  John      A      B      A

1.4. readr

The readr package is used to write and read data. A variety of formats is supported.

  • csv comma separated

  • tsv tab separated

  • delim delimited

  • fwf fixed width files

  • table tablular file with columns separated by space

  • log web log files

[29]:
library('readr')

df <- data.frame(
    name = c('Jane', 'Joyce', 'Joe', 'John'),
    dob = c('1-5-89', '2-10-88', '3-15-87', '4-20-91'),
    stringsAsFactors=FALSE
)

1.4.1. write_csv

Use write_csv to write data to a CSV file.

[30]:
df %>% write_csv(path='students.csv')

1.4.2. read_csv

Use read_csv to read CSV data. Note that the object returned will be a tibble.

[31]:
df <- read_csv(file='students.csv')
print(df)
Parsed with column specification:
cols(
  name = col_character(),
  dob = col_character()
)
# A tibble: 4 x 2
  name  dob
  <chr> <chr>
1 Jane  1-5-89
2 Joyce 2-10-88
3 Joe   3-15-87
4 John  4-20-91

Note that you may specify the schema as well.

[32]:
df <- read_csv(file='students.csv', col_types=cols(
    name = col_character(),
    dob = col_character()
))

print(df)
# A tibble: 4 x 2
  name  dob
  <chr> <chr>
1 Jane  1-5-89
2 Joyce 2-10-88
3 Joe   3-15-87
4 John  4-20-91