2. Data Structures

There are five main data structures in R.

  • vector

  • matrix

  • array

  • list

  • data frame

Note that vector, matrix and array contain elements that are all the same type (homogeneous), while list and data frame may contain elements of mixed types (heterogeneous). A list is like a vector, but heterogeneous, and a data frame is like a matrix, but heterogeneous across its columns.

2.1. Vectors

2.1.1. Basic vectors

A vector is created using the c function. Note that a mixture of data types may be passed to c, but a vector stores only one type, so the values are coerced to a common type.

[1]:
a <- c(TRUE, FALSE, -10, 10, 'X', 'Y', NA, NULL, NaN, Inf, -Inf)
print(a)
 [1] "TRUE"  "FALSE" "-10"   "10"    "X"     "Y"     NA      "NaN"   "Inf"
[10] "-Inf"

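Note that we shoved logicals, numerics and characters into the vector a, but when interrogating its type, we find that all the elements were coerced to character type. The elements of a vector are coerced to the most flexible type present, in the order logical, integer, double, character. Also note that NULL was dropped entirely, while NA stayed a missing value instead of becoming the string "NA".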

[2]:
typeof(a)
'character'
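
As a minimal sketch of the coercion order described above, the following checks the type that c produces for a few two-type mixtures. The variable names are illustrative only.

```r
# Each vector mixes two types; c() coerces to the more flexible one.
lgl_int <- c(TRUE, 2L)    # logical + integer
int_dbl <- c(2L, 3.5)     # integer + double
dbl_chr <- c(3.5, 'X')    # double + character

print(typeof(lgl_int))    # "integer"
print(typeof(int_dbl))    # "double"
print(typeof(dbl_chr))    # "character"

# NULL contributes no element, so it does not count toward the length.
with_null <- c(1, NULL, 2)
print(length(with_null))  # 2
```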

2.1.2. Length of vector

To get the length of a vector, use length.

[3]:
a <- c(1, 2, 3)
length(a)
3

2.1.3. Vector elements

It’s crazy, but true: the elements of an R vector are indexed starting from 1 instead of 0. To access an element of a vector by position (an operation called subsetting), use the brackets [] with indices starting at 1, as follows.

[4]:
a[1]
1
[5]:
a[2]
2
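
Because indexing starts at 1, the edge cases differ from 0-based languages. The sketch below, with illustrative variable names, shows the last element, an index past the end, and index 0.

```r
a <- c(1, 2, 3)

first <- a[1]           # first element
last <- a[length(a)]    # last element, since indices run from 1 to length(a)
beyond <- a[4]          # an index past the end yields NA, not an error
nothing <- a[0]         # index 0 selects nothing and yields an empty vector

print(first)            # 1
print(last)             # 3
print(beyond)           # NA
print(length(nothing))  # 0
```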

2.1.4. Named vector

A named vector is a vector in which each element is given a name. You may access the elements of a named vector by name as well as by position.

[6]:
a <- c(A=TRUE, B=FALSE, C=-10, D=10, E='X', F='Y')
print(a)
      A       B       C       D       E       F
 "TRUE" "FALSE"   "-10"    "10"     "X"     "Y"

Here, we may access the first element by a['A'] or a[1].

[7]:
print(a['A'])
     A
"TRUE"
[8]:
print(a[1])
     A
"TRUE"

Likewise, we may access the second element by a['B'] or a[2].

[9]:
print(a['B'])
      B
"FALSE"
[10]:
print(a[2])
      B
"FALSE"

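Names can also be attached after a vector is created. The following is a small sketch using the names and unname functions; the vector scores is a made-up example, not part of the data above.

```r
# A hypothetical numeric vector that starts without names.
scores <- c(10, 20, 30)
names(scores) <- c('A', 'B', 'C')

print(names(scores))         # "A" "B" "C"

# Select several elements by name at once.
picked <- scores[c('A', 'C')]
print(picked)

# unname() drops the names and leaves only the values.
b_value <- unname(scores['B'])
print(b_value)               # 20
```
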
2.1.5. Subsetting vectors

Here, we show different ways to subset multiple elements of a vector. Note that subsetting is also called slicing. We may select multiple elements using a supplied vector of indices.

[11]:
a <- c(1, 2, 3, 4, 5)
b <- a[c(2, 4)]
print(b)
[1] 2 4

We may also use the colon : operator to specify a range of elements.

[12]:
a <- c(1, 2, 3, 4, 5)
b <- a[1:2]
print(b)
[1] 1 2

We may use a logical masking vector as well to select elements.

[13]:
a <- c(1, 2, 3, 4, 5)
b <- a[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
print(b)
[1] 1 3 5

When you use a negative sign -, subsetting will exclude the matching elements.

[14]:
a <- c(1, 2, 3, 4, 5)
b <- a[-c(2, 4)]
print(b)
[1] 1 3 5
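
In practice, a logical mask is usually computed from a condition instead of being typed by hand. The sketch below filters on a > 2 and uses which to recover the matching positions; the threshold is arbitrary.

```r
a <- c(1, 2, 3, 4, 5)

# A comparison on a vector returns a logical vector of the same length.
mask <- a > 2
print(mask)         # FALSE FALSE  TRUE  TRUE  TRUE

# Use the mask to keep only the elements where the condition holds.
b <- a[mask]
print(b)            # 3 4 5

# which() converts the mask into the positions that are TRUE.
idx <- which(mask)
print(idx)          # 3 4 5
```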

2.1.6. Math operations

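Math operations on vectors are applied element by element. Note the %% operator, which is modulo, and the %*% operator, which is matrix multiplication; applied to two vectors of the same length, %*% computes their dot product and returns it as a 1x1 matrix. When you perform math operations on vectors of different lengths, R recycles the shorter one, so make sure they are of the same length or you might get unexpected results.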

[15]:
a <- c(1, 2, 3)
b <- c(4, 5, 6)
c <- a + b
d <- a - b
e <- a * b
f <- a / b
g <- a %% b
h <- a %*% b
[16]:
print(c)
[1] 5 7 9
[17]:
print(d)
[1] -3 -3 -3
[18]:
print(e)
[1]  4 10 18
[19]:
print(f)
[1] 0.25 0.40 0.50
[20]:
print(g)
[1] 1 2 3
[21]:
print(h)
     [,1]
[1,]   32
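
To illustrate what happens with unequal lengths, here is a minimal sketch of recycling. The variable names are illustrative, and suppressWarnings is used only to keep the example quiet when the lengths do not divide evenly.

```r
# The shorter vector is recycled: 1+10, 2+20, 3+10, 4+20.
even <- c(1, 2, 3, 4) + c(10, 20)
print(even)    # 11 22 13 24

# When the longer length is not a multiple of the shorter one,
# R still recycles but emits a warning.
uneven <- suppressWarnings(c(1, 2, 3) + c(10, 20))
print(uneven)  # 11 22 13

# A dot product as a plain number instead of a 1x1 matrix.
dot <- sum(c(1, 2, 3) * c(4, 5, 6))
print(dot)     # 32
```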

2.1.7. Combining

Use the c function to combine vectors.

[22]:
a <- c(1, 2, 3)
b <- c(4, 5, 6)
c <- c(a, b)
print(c)
[1] 1 2 3 4 5 6

2.1.8. Sorting

Use the sort function to sort elements of a vector.

[23]:
a <- c(10, 5, 2, 8, 7)
b <- sort(a)
print(b)
[1]  2  5  7  8 10
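
Two common variations are sketched below: sorting in descending order with the decreasing argument, and getting the sorting permutation with order instead of the sorted values.

```r
a <- c(10, 5, 2, 8, 7)

# Sort from largest to smallest.
desc <- sort(a, decreasing=TRUE)
print(desc)       # 10  8  7  5  2

# order() returns the positions that would sort the vector.
idx <- order(a)
print(idx)        # 3 2 5 4 1

# Indexing with those positions gives the same result as sort(a).
print(a[idx])     # 2  5  7  8 10
```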

2.1.9. Factors

Factors represent categorical values, and you may create them from vectors using the factor function. Note that factors have levels, which are the unique values of the factor kept in a fixed order. If no explicit levels are specified, the levels are sorted, which is alphabetical for character data.

[24]:
a <- factor(c('water', 'soda', 'tea', 'coffee'))
print(a)
[1] water  soda   tea    coffee
Levels: coffee soda tea water

Here, specific levels are placed on the factor (the order is intended to run from most caffeine to least).

[25]:
a <- factor(
        c('water', 'soda', 'tea', 'coffee'),
        levels=c('tea', 'coffee', 'soda', 'water'))
print(a)
[1] water  soda   tea    coffee
Levels: tea coffee soda water
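
The following sketch inspects how the levels are stored and, as an addition not shown above, passes ordered=TRUE so that elements can be compared by level. The variable names are illustrative.

```r
drinks <- c('water', 'soda', 'tea', 'coffee')
lv <- c('tea', 'coffee', 'soda', 'water')

a <- factor(drinks, levels=lv)

# The levels, in the order we supplied.
print(levels(a))      # "tea" "coffee" "soda" "water"

# Internally each element is stored as the position of its level.
codes <- as.integer(a)
print(codes)          # 4 3 1 2

# An ordered factor additionally supports comparisons between elements.
a_ord <- factor(drinks, levels=lv, ordered=TRUE)
tea_before_water <- a_ord[3] < a_ord[1]
print(tea_before_water)  # TRUE, because 'tea' comes before 'water' in lv
```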

2.2. Matrix

2.2.1. Creation

A matrix is created using the matrix function. Note that you may supply a vector and the number of rows and columns of the matrix during instantiation/creation. The matrix is filled column-wise by default.

[26]:
A <- matrix(c(1, 2, 3, 4), nrow=2, ncol=2)
print(A)
     [,1] [,2]
[1,]    1    3
[2,]    2    4

To create a matrix with a vector of data by row, use byrow=TRUE.

[27]:
A <- matrix(c(1, 2, 3, 4), nrow=2, ncol=2, byrow=TRUE)
print(A)
     [,1] [,2]
[1,]    1    2
[2,]    3    4

2.2.2. Subsetting matrices

To access one or more elements in a matrix, use positional indices with the brackets [] and :. The row index comes first, followed by the column index.

[28]:
A <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7, 9), nrow=3, byrow=TRUE)
print(A)
     [,1] [,2] [,3]
[1,]    0    1    2
[2,]    3    4    5
[3,]    6    7    9

Get one element.

[29]:
a <- A[1, 1]
print(a)
[1] 0

Get multiple elements (second and third rows, first column).

[30]:
a <- A[2:3, 1]
print(a)
[1] 3 6

Get multiple elements (first row, second and third columns).

[31]:
a <- A[1, 2:3]
print(a)
[1] 1 2

Get multiple elements (second and third rows, second and third columns).

[32]:
a <- A[2:3, 2:3]
print(a)
     [,1] [,2]
[1,]    4    5
[2,]    7    9
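
Note that selecting a single row or column returned a plain vector in the examples above. As a small sketch, the drop=FALSE argument keeps the result as a matrix; the variable names are illustrative.

```r
A <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7, 9), nrow=3, byrow=TRUE)

# By default a single column collapses to a vector with no dimensions.
v <- A[2:3, 1]
print(is.null(dim(v)))  # TRUE

# With drop=FALSE the result stays a 2x1 matrix.
m <- A[2:3, 1, drop=FALSE]
print(dim(m))           # 2 1
```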

2.2.3. Transposing

Use the t function to transpose a matrix.

[33]:
A <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7, 9), nrow=3, byrow=TRUE)
print(A)
     [,1] [,2] [,3]
[1,]    0    1    2
[2,]    3    4    5
[3,]    6    7    9
[34]:
a <- t(A)
print(a)
     [,1] [,2] [,3]
[1,]    0    3    6
[2,]    1    4    7
[3,]    2    5    9

2.2.4. Getting matrix dimensions

[35]:
A <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7, 9), nrow=3, byrow=TRUE)
rows <- nrow(A)
cols <- ncol(A)

print(rows)
print(cols)
[1] 3
[1] 3

2.2.5. Matrix math

[36]:
A <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow=3, byrow=TRUE)
B <- matrix(c(2, 3, 4, 5, 6, 7, 8, 9, 1), nrow=3, byrow=TRUE)
x <- c(1, 2, 3)

print(A)
print(B)
print(x)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
     [,1] [,2] [,3]
[1,]    2    3    4
[2,]    5    6    7
[3,]    8    9    1
[1] 1 2 3

Add two matrices.

[37]:
r <- A + B
print(r)
     [,1] [,2] [,3]
[1,]    3    5    7
[2,]    9   11   13
[3,]   15   17   10

Subtract two matrices.

[38]:
r <- A - B
print(r)
     [,1] [,2] [,3]
[1,]   -1   -1   -1
[2,]   -1   -1   -1
[3,]   -1   -1    8

Multiply two matrices element-wise.

[39]:
r <- A * B
print(r)
     [,1] [,2] [,3]
[1,]    2    6   12
[2,]   20   30   42
[3,]   56   72    9

Matrix product of two matrices.

[40]:
r <- A %*% B
print(r)
     [,1] [,2] [,3]
[1,]   36   42   21
[2,]   81   96   57
[3,]  126  150   93

Divide two matrices element-wise.

[41]:
r <- A / B
print(r)
      [,1]      [,2]      [,3]
[1,] 0.500 0.6666667 0.7500000
[2,] 0.800 0.8333333 0.8571429
[3,] 0.875 0.8888889 9.0000000

Multiply vector with matrix element-wise. The vector is recycled down the columns, so row i of A is multiplied by x[i].

[42]:
r <- x * A
print(r)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    8   10   12
[3,]   21   24   27

Matrix product of vector and matrix. Here, x is treated as a row vector.

[43]:
r <- x %*% A
print(r)
     [,1] [,2] [,3]
[1,]   30   36   42
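
The side on which the vector appears matters. As a sketch with illustrative variable names, R treats x as a row vector on the left of %*% and as a column vector on the right.

```r
A <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow=3, byrow=TRUE)
x <- c(1, 2, 3)

# x on the left: a 1x3 result (x as a row vector).
left <- x %*% A
print(left)   # 30 36 42

# x on the right: a 3x1 result (x as a column vector).
right <- A %*% x
print(right)  # 14 32 50
```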

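Determinant of matrix. The rows of A are linearly dependent, so the exact determinant is 0; the tiny nonzero value below is floating-point rounding error.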

[44]:
r <- det(A)
print(r)
[1] 6.661338e-16
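
Because of rounding, a determinant should be compared against a tolerance and not tested for exact equality with 0. The sketch below assumes a tolerance of 1e-9, which is an arbitrary choice for illustration, and contrasts A with the non-singular matrix B from above.

```r
A <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow=3, byrow=TRUE)
B <- matrix(c(2, 3, 4, 5, 6, 7, 8, 9, 1), nrow=3, byrow=TRUE)

tol <- 1e-9  # assumed tolerance, chosen only for this example

# A is singular: its determinant is zero up to rounding error.
a_singular <- abs(det(A)) < tol
print(a_singular)  # TRUE

# B is not singular: its determinant is 27 up to rounding error.
b_singular <- abs(det(B)) < tol
print(b_singular)  # FALSE
```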

Diagonal of matrix.

[45]:
r <- diag(A)
print(r)
[1] 1 5 9

2.3. Lists

The name list is somewhat of a misnomer in R, as lists have more capabilities than a mere list in other programming languages. A list is an ordered collection whose elements may be of different types, and when its elements are named, it also behaves like a dictionary, map or associative array. We create a list using the list function.

[46]:
a <- list(TRUE, FALSE, -10, 10, 'X', 'Y', NA, NULL, NaN, Inf, -Inf)
[47]:
typeof(a)
'list'
[48]:
print(paste(a, collapse=','))
[1] "TRUE,FALSE,-10,10,X,Y,NA,NULL,NaN,Inf,-Inf"

2.3.1. Subsetting lists

We may use brackets [] and : to access elements of a list. Note that [] always returns a list, even when only one element is selected.

[49]:
b <- a[1]
print(b)
[[1]]
[1] TRUE

[50]:
b <- a[2:5]
print(b)
[[1]]
[1] FALSE

[[2]]
[1] -10

[[3]]
[1] 10

[[4]]
[1] "X"
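
To get the element itself and not a one-element list, R provides double brackets [[]]. The following sketch contrasts the two forms; the variable names are illustrative.

```r
a <- list(TRUE, FALSE, -10, 10, 'X')

# Single brackets return a list containing the third element.
sub <- a[3]
print(typeof(sub))   # "list"

# Double brackets return the third element itself.
elem <- a[[3]]
print(typeof(elem))  # "double"
print(elem)          # -10
```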

2.3.2. Named lists

As with named vectors, we also have named lists.

[51]:
a <- list(A=TRUE, B=FALSE, C=-10, D=10)
print(a)
$A
[1] TRUE

$B
[1] FALSE

$C
[1] -10

$D
[1] 10

We may access the first element of the list a with a['A'] or a[1]. Both return a list of length one.

[52]:
b <- a['A']
print(b)
$A
[1] TRUE

[53]:
b <- a[1]
print(b)
$A
[1] TRUE
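
A named list can be used much like a dictionary. As a sketch, the $ operator and [[]] with a name both return the value itself, and assignment adds or removes entries. The key E is made up for this example.

```r
a <- list(A=TRUE, B=FALSE, C=-10, D=10)

# Look up values by name.
c_value <- a$C
d_value <- a[['D']]
print(c_value)   # -10
print(d_value)   # 10

# Add a new entry under a hypothetical key E.
a$E <- 'X'

# Assigning NULL removes an entry.
a$B <- NULL

print(names(a))  # "A" "C" "D" "E"
```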

2.3.3. List apply

The lapply function can be used to apply a function to each element of a list. Here, we get the class of each element in the list.

[54]:
a <- list(A=TRUE, B=FALSE, C=-10, D=10)
b <- lapply(a, class)
print(b)
$A
[1] "logical"

$B
[1] "logical"

$C
[1] "numeric"

$D
[1] "numeric"
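
When every result is a single value, sapply is a convenient alternative that simplifies the output to a named vector. A minimal sketch:

```r
a <- list(A=TRUE, B=FALSE, C=-10, D=10)

# sapply applies the function like lapply, then simplifies the result.
b <- sapply(a, class)

print(typeof(b))  # "character"
print(b)
```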

2.4. Data Frames

A data frame is perhaps the most powerful data structure in R. Similar data frame structures exist in Python with pandas and in Spark (Spark has DataFrame and Dataset). To create a data frame in R, use the data.frame function.

[55]:
s <- data.frame(
    age = c(18, 16, 15),
    grade = c('A', 'B', 'C'),
    name = c('Jane', 'Jack', 'Joe'),
    male = c(FALSE, TRUE, TRUE)
)

print(s)
  age grade name  male
1  18     A Jane FALSE
2  16     B Jack  TRUE
3  15     C  Joe  TRUE

2.4.1. Subsetting data frames

To access the first row.

[56]:
a <- s[1, ]
print(a)
  age grade name  male
1  18     A Jane FALSE

To access the first column.

[57]:
a <- s[, 1]
print(a)
[1] 18 16 15

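To access columns by name, use $. Note that the Levels lines in the output below come from R versions before 4.0.0, where data.frame converted character columns to factors by default; from R 4.0.0 onward, character columns stay character unless stringsAsFactors=TRUE is passed.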

[58]:
a <- s$age
print(a)
[1] 18 16 15
[59]:
a <- s$grade
print(a)
[1] A B C
Levels: A B C
[60]:
a <- s$name
print(a)
[1] Jane Jack Joe
Levels: Jack Jane Joe
[61]:
a <- s$male
print(a)
[1] FALSE  TRUE  TRUE

To access elements by filtering with positional indices.

[62]:
a <- s[1:2, 1:2]
print(a)
  age grade
1  18     A
2  16     B
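
Rows can also be filtered with a logical condition on a column, in the same way as the logical masks used for vectors. The sketch below rebuilds the data frame from above; the age threshold and the variable names are arbitrary choices for illustration.

```r
s <- data.frame(
    age = c(18, 16, 15),
    grade = c('A', 'B', 'C'),
    name = c('Jane', 'Jack', 'Joe'),
    male = c(FALSE, TRUE, TRUE)
)

# Keep the rows where age is above 15, with all columns.
older <- s[s$age > 15, ]
print(older)

# Keep the rows where male is TRUE, and only two columns by name.
boys <- s[s$male, c('name', 'age')]
print(boys)
```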

If there is missing data (NA) in your data frame, use the complete.cases function to create a logical mask vector and filter for rows (or cases) with only complete data.

[63]:
s <- data.frame(
    age = c(18, 16, 15, 19),
    grade = c('A', 'B', 'C', NA),
    name = c('Jane', 'Jack', 'Joe', 'Jerry'),
    male = c(FALSE, TRUE, TRUE, TRUE)
)

print(s)
  age grade  name  male
1  18     A  Jane FALSE
2  16     B  Jack  TRUE
3  15     C   Joe  TRUE
4  19  <NA> Jerry  TRUE
[64]:
a <- complete.cases(s)
print(a)
[1]  TRUE  TRUE  TRUE FALSE
[65]:
a <- s[complete.cases(s), ]
print(a)
  age grade name  male
1  18     A Jane FALSE
2  16     B Jack  TRUE
3  15     C  Joe  TRUE
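
Two related helpers are sketched below: is.na with colSums counts the missing values per column, and na.omit drops the incomplete rows in one step. The variable names are illustrative.

```r
s <- data.frame(
    age = c(18, 16, 15, 19),
    grade = c('A', 'B', 'C', NA),
    name = c('Jane', 'Jack', 'Joe', 'Jerry'),
    male = c(FALSE, TRUE, TRUE, TRUE)
)

# Count the missing values in each column.
missing_per_col <- colSums(is.na(s))
print(missing_per_col)

# Drop every row that contains at least one NA.
clean <- na.omit(s)
print(nrow(clean))  # 3
```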

2.4.2. Data frame functions

Let’s take a look at some functions that we may apply to data frames.

[66]:
s <- data.frame(
    v=c('large', 'small', 'small'),
    w=c(1, 2, 3),
    x=c(4, 5, 6),
    y=c(7, 8, 9),
    z=c(10, 11, 12)
)

print(s)
      v w x y  z
1 large 1 4 7 10
2 small 2 5 8 11
3 small 3 6 9 12

To get the number of rows and columns, use nrow and ncol.

[67]:
rows <- nrow(s)
cols <- ncol(s)

print(paste(rows, cols))
[1] "3 5"

To get the dimension, use dim.

[68]:
d <- dim(s)
print(d)
[1] 3 5

To get the column names, use colnames.

[69]:
n <- colnames(s)
print(n)
[1] "v" "w" "x" "y" "z"

To get the row names, use rownames.

[70]:
n <- rownames(s)
print(n)
[1] "1" "2" "3"

To peek at the top or bottom rows of the data frame, use head and tail, respectively.

[71]:
h <- head(s, 2)
print(h)
      v w x y  z
1 large 1 4 7 10
2 small 2 5 8 11
[72]:
t <- tail(s, 2)
print(t)
      v w x y  z
2 small 2 5 8 11
3 small 3 6 9 12
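
These functions combine naturally with the apply family. As a sketch, the following finds the numeric columns of the same data frame and computes their means; the variable names are illustrative.

```r
s <- data.frame(
    v=c('large', 'small', 'small'),
    w=c(1, 2, 3),
    x=c(4, 5, 6),
    y=c(7, 8, 9),
    z=c(10, 11, 12)
)

# A data frame is a list of columns, so sapply visits each column.
num_cols <- sapply(s, is.numeric)
print(num_cols)

# Mean of each numeric column.
means <- colMeans(s[, num_cols])
print(means)
```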