2. Data Structures
There are five main data structures in R.
vector
matrix
array
list
data frames
Note that a vector, matrix and array contain elements that are all of the same type (homogeneous), while a list and a data frame may contain elements of mixed types (heterogeneous). A list is like a vector, but heterogeneous, and a data frame is like a matrix, but heterogeneous (each column of a data frame is itself homogeneous, but different columns may have different types).
2.1. Vectors
2.1.1. Basic vectors
A vector is created using the c function. Note that values of different data types may be passed to c, but they do not stay mixed inside the vector.
[1]:
a <- c(TRUE, FALSE, -10, 10, 'X', 'Y', NA, NULL, NaN, Inf, -Inf)
print(a)
[1] "TRUE" "FALSE" "-10" "10" "X" "Y" NA "NaN" "Inf"
[10] "-Inf"
Note that we shoved logicals, numerics and characters into the vector a, but when interrogating its type, all the elements are coerced to character type. The most general type among the elements (here, character) is the type that all the elements are coerced into. Also note that NULL is dropped, which is why only 10 elements are printed.
[2]:
typeof(a)
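The coercion rule above can be sketched with smaller mixtures. This is a minimal illustration, assuming the usual ordering logical, integer, double, character; the variable names are made up for the example.

```r
# Each call mixes two types; typeof() reports the type R coerces to.
lgl_int <- typeof(c(TRUE, 1L))   # logical + integer
int_dbl <- typeof(c(1L, 2.5))    # integer + double
dbl_chr <- typeof(c(2.5, 'x'))   # double + character
print(c(lgl_int, int_dbl, dbl_chr))
# [1] "integer"   "double"    "character"
```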
2.1.2. Length of vector
To get the length of a vector, use length.
[3]:
a <- c(1, 2, 3)
length(a)
2.1.3. Vector elements
It’s crazy, but true: the elements of an R vector are indexed starting from 1 instead of 0. To access an element of a vector by position (an operation called subsetting), use the brackets [] with indices starting at 1, as follows.
[4]:
a[1]
[5]:
a[2]
2.1.4. Named vector
A named vector is a vector in which each element is given a name. You may access the elements of a named vector by name as well as by position.
[6]:
a <- c(A=TRUE, B=FALSE, C=-10, D=10, E='X', F='Y')
print(a)
A B C D E F
"TRUE" "FALSE" "-10" "10" "X" "Y"
Here, we may access the first element by a['A'] or a[1].
[7]:
print(a['A'])
A
"TRUE"
[8]:
print(a[1])
A
"TRUE"
Likewise, we may access the second element by a['B'] or a[2].
[9]:
print(a['B'])
B
"FALSE"
[10]:
print(a[2])
B
"FALSE"
2.1.5. Subsetting vectors
Here, we show different ways to subset multiple elements of a vector. Note that subsetting is also called slicing. We may select multiple elements using a supplied vector of indices.
[11]:
a <- c(1, 2, 3, 4, 5)
b <- a[c(2, 4)]
print(b)
[1] 2 4
We may also use the colon : operator to specify a range of elements.
[12]:
a <- c(1, 2, 3, 4, 5)
b <- a[1:2]
print(b)
[1] 1 2
We may use a logical masking vector as well to select elements.
[13]:
a <- c(1, 2, 3, 4, 5)
b <- a[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
print(b)
[1] 1 3 5
When you put a negative sign - in front of the indices, subsetting will exclude the elements at those positions.
[14]:
a <- c(1, 2, 3, 4, 5)
b <- a[-c(2, 4)]
print(b)
[1] 1 3 5
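In practice, the logical mask is usually computed from a condition rather than typed by hand. The following is a small sketch of that idea; the names mask and idx are chosen only for this example.

```r
a <- c(1, 2, 3, 4, 5)
mask <- a > 2        # logical vector: FALSE FALSE TRUE TRUE TRUE
b <- a[mask]         # keep only the elements where the mask is TRUE
idx <- which(mask)   # positions of the TRUE values
print(b)
# [1] 3 4 5
print(idx)
# [1] 3 4 5
```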
2.1.6. Math operations
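Math operations on vectors are applied element-wise, just as they are on single numerics. Note the %% operator, which is the modulo operator, and the %*% operator, which is the matrix multiplication operator; for two vectors of the same length, %*% computes the dot product and returns a 1 x 1 matrix. When you perform math operations on vectors, make sure they are of the same length, or R will recycle the shorter vector and you might get unexpected results.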
[15]:
a <- c(1, 2, 3)
b <- c(4, 5, 6)
c <- a + b
d <- a - b
e <- a * b
f <- a / b
g <- a %% b
h <- a %*% b
[16]:
print(c)
[1] 5 7 9
[17]:
print(d)
[1] -3 -3 -3
[18]:
print(e)
[1] 4 10 18
[19]:
print(f)
[1] 0.25 0.40 0.50
[20]:
print(g)
[1] 1 2 3
[21]:
print(h)
[,1]
[1,] 32
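The unexpected results with vectors of different lengths come from recycling: the shorter vector is repeated until it matches the longer one. A minimal sketch, using made-up values:

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20)
r <- a + b   # b is recycled to c(10, 20, 10, 20)
print(r)
# [1] 11 22 13 24
```

If the longer length is not a multiple of the shorter one, R still recycles but emits a warning.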
2.1.7. Combining
Use the c function to combine vectors.
[22]:
a <- c(1, 2, 3)
b <- c(4, 5, 6)
c <- c(a, b)
print(c)
[1] 1 2 3 4 5 6
2.1.8. Sorting
Use the sort function to sort elements of a vector.
[23]:
a <- c(10, 5, 2, 8, 7)
b <- sort(a)
print(b)
[1] 2 5 7 8 10
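As a small extension of the example above, sort also accepts a decreasing argument, and the related order function returns the positions that would sort the vector. This sketch reuses the same values:

```r
a <- c(10, 5, 2, 8, 7)
desc <- sort(a, decreasing=TRUE)   # largest first
idx <- order(a)                    # positions of elements in ascending order
print(desc)
# [1] 10  8  7  5  2
print(idx)
# [1] 3 2 5 4 1
print(a[idx])                      # same result as sort(a)
# [1]  2  5  7  8 10
```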
2.1.9. Factors
Factors represent categorical values, and you may create them from vectors using the factor function. Note that factors have levels, which are the unique values of the factor kept in a fixed order. If no explicit levels are specified, the levels are sorted alphabetically.
[24]:
a <- factor(c('water', 'soda', 'tea', 'coffee'))
print(a)
[1] water soda tea coffee
Levels: coffee soda tea water
Here, explicit levels are supplied to place a custom ordering on the factor.
[25]:
a <- factor(
c('water', 'soda', 'tea', 'coffee'),
levels=c('tea', 'coffee', 'soda', 'water'))
print(a)
[1] water soda tea coffee
Levels: tea coffee soda water
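To see what the levels do, it may help to look at the integer codes stored behind a factor. The sketch below reuses the factor above and adds a hypothetical ordered factor (the low/medium/high values are invented for illustration) to show that comparisons only work when ordered=TRUE is set.

```r
a <- factor(
  c('water', 'soda', 'tea', 'coffee'),
  levels=c('tea', 'coffee', 'soda', 'water'))
lv <- levels(a)          # the levels, in the order supplied
codes <- as.integer(a)   # position of each element within the levels
print(codes)
# [1] 4 3 1 2

# An ordered factor supports comparisons between its elements.
o <- factor(
  c('low', 'high', 'medium'),
  levels=c('low', 'medium', 'high'),
  ordered=TRUE)
cmp <- o[1] < o[2]       # is 'low' less than 'high'?
print(cmp)
# [1] TRUE
```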
2.2. Matrix
2.2.1. Creation
A matrix is created using the matrix function. Note that you may supply a vector and the number of rows and columns of the matrix during creation. The matrix is filled column-wise by default.
[26]:
A <- matrix(c(1, 2, 3, 4), nrow=2, ncol=2)
print(A)
[,1] [,2]
[1,] 1 3
[2,] 2 4
To create a matrix with a vector of data by row, use byrow=TRUE.
[27]:
A <- matrix(c(1, 2, 3, 4), nrow=2, ncol=2, byrow=TRUE)
print(A)
[,1] [,2]
[1,] 1 2
[2,] 3 4
2.2.2. Subsetting matrices
To access one or more elements in a matrix, use positional indices (row first, then column) with the brackets [] and :.
[28]:
A <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7, 9), nrow=3, byrow=TRUE)
print(A)
[,1] [,2] [,3]
[1,] 0 1 2
[2,] 3 4 5
[3,] 6 7 9
Get one element.
[29]:
a <- A[1, 1]
print(a)
[1] 0
Get multiple elements (second and third rows, first column).
[30]:
a <- A[2:3, 1]
print(a)
[1] 3 6
Get multiple elements (first row, second and third columns).
[31]:
a <- A[1, 2:3]
print(a)
[1] 1 2
Get multiple elements (second and third rows, second and third columns).
[32]:
a <- A[2:3, 2:3]
print(a)
[,1] [,2]
[1,] 4 5
[2,] 7 9
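Leaving an index empty selects everything along that dimension. As a brief sketch on the same matrix, the example below also uses the drop=FALSE argument, which keeps the result as a matrix instead of simplifying it to a vector.

```r
A <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7, 9), nrow=3, byrow=TRUE)
row2 <- A[2, ]               # entire second row, simplified to a vector
col1 <- A[, 1, drop=FALSE]   # entire first column, kept as a 3 x 1 matrix
print(row2)
# [1] 3 4 5
print(dim(col1))
# [1] 3 1
```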
2.2.3. Transposing
Use the t function to transpose a matrix.
[33]:
A <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7, 9), nrow=3, byrow=TRUE)
print(A)
[,1] [,2] [,3]
[1,] 0 1 2
[2,] 3 4 5
[3,] 6 7 9
[34]:
a <- t(A)
print(a)
[,1] [,2] [,3]
[1,] 0 3 6
[2,] 1 4 7
[3,] 2 5 9
2.2.4. Getting matrix dimensions
[35]:
A <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7, 9), nrow=3, byrow=TRUE)
rows <- nrow(A)
cols <- ncol(A)
print(rows)
print(cols)
[1] 3
[1] 3
2.2.5. Matrix math
[36]:
A <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow=3, byrow=TRUE)
B <- matrix(c(2, 3, 4, 5, 6, 7, 8, 9, 1), nrow=3, byrow=TRUE)
x <- c(1, 2, 3)
print(A)
print(B)
print(x)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 5 6 7
[3,] 8 9 1
[1] 1 2 3
Add two matrices.
[37]:
r <- A + B
print(r)
[,1] [,2] [,3]
[1,] 3 5 7
[2,] 9 11 13
[3,] 15 17 10
Subtract two matrices.
[38]:
r <- A - B
print(r)
[,1] [,2] [,3]
[1,] -1 -1 -1
[2,] -1 -1 -1
[3,] -1 -1 8
Multiply two matrices element-wise.
[39]:
r <- A * B
print(r)
[,1] [,2] [,3]
[1,] 2 6 12
[2,] 20 30 42
[3,] 56 72 9
Matrix product (dot product) of two matrices.
[40]:
r <- A %*% B
print(r)
[,1] [,2] [,3]
[1,] 36 42 21
[2,] 81 96 57
[3,] 126 150 93
Divide two matrices element-wise.
[41]:
r <- A / B
print(r)
[,1] [,2] [,3]
[1,] 0.500 0.6666667 0.7500000
[2,] 0.800 0.8333333 0.8571429
[3,] 0.875 0.8888889 9.0000000
Multiply vector with matrix element-wise (the vector is recycled down the columns, so row i is multiplied by x[i]).
[42]:
r <- x * A
print(r)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 8 10 12
[3,] 21 24 27
Matrix product (dot product) of vector and matrix, where x is treated as a row vector.
[43]:
r <- x %*% A
print(r)
[,1] [,2] [,3]
[1,] 30 36 42
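Determinant of matrix. Note that this A is singular, so the exact determinant is 0; the tiny nonzero value below is floating-point rounding error.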
[44]:
r <- det(A)
print(r)
[1] 6.661338e-16
Diagonal of matrix.
[45]:
r <- diag(A)
print(r)
[1] 1 5 9
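Because det returns a floating-point value, a computed determinant should be compared with 0 using a tolerance rather than ==. The following is a rough sketch; the tolerance 1e-9 is an arbitrary assumption for this example, not a recommended constant.

```r
A <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow=3, byrow=TRUE)
d <- det(A)
tol <- 1e-9                   # assumed tolerance, chosen for illustration
is_singular <- abs(d) < tol   # treat near-zero determinants as zero
print(is_singular)
# [1] TRUE
```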
2.3. Lists
The name list is somewhat of a misnomer in R, as an R list has more capabilities than a mere list in other programming languages. When its elements are named, a list in R behaves like a dictionary, map or associative array. We create a list using the list function.
[46]:
a <- list(TRUE, FALSE, -10, 10, 'X', 'Y', NA, NULL, NaN, Inf, -Inf)
[47]:
typeof(a)
[48]:
print(paste(a, collapse=','))
[1] "TRUE,FALSE,-10,10,X,Y,NA,NULL,NaN,Inf,-Inf"
2.3.1. Subsetting lists
We may use brackets [] and : to access elements of a list. Note that the result of [] is itself a list, even when only one element is selected.
[49]:
b <- a[1]
print(b)
[[1]]
[1] TRUE
[50]:
b <- a[2:5]
print(b)
[[1]]
[1] FALSE
[[2]]
[1] -10
[[3]]
[1] 10
[[4]]
[1] "X"
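The outputs above are wrapped in [[1]], [[2]] and so on because single brackets return a sub-list. To pull out the element itself, R provides double brackets [[]]. A minimal sketch of the difference, using a shortened version of the list:

```r
a <- list(TRUE, FALSE, -10, 10, 'X')
sub <- a[3]      # a list of length 1 that contains -10
elem <- a[[3]]   # the number -10 itself
print(class(sub))
# [1] "list"
print(class(elem))
# [1] "numeric"
```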
2.3.2. Named lists
As with named vectors, we also have named lists.
[51]:
a <- list(A=TRUE, B=FALSE, C=-10, D=10)
print(a)
$A
[1] TRUE
$B
[1] FALSE
$C
[1] -10
$D
[1] 10
We may access the first element of the list a with a['A'] or a[1].
[52]:
b <- a['A']
print(b)
$A
[1] TRUE
[53]:
b <- a[1]
print(b)
$A
[1] TRUE
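The dictionary-like behavior of a named list can be sketched as follows. The $ operator and double brackets [[]] return the stored value rather than a sub-list; the key 'E' added here is invented for the example.

```r
a <- list(A=TRUE, B=FALSE, C=-10, D=10)
v1 <- a$C           # look up a value by name
v2 <- a[['C']]      # equivalent lookup with double brackets
a[['E']] <- 'X'     # add a new key/value pair
a[['B']] <- NULL    # remove a key by assigning NULL
keys <- names(a)    # the remaining keys
print(keys)
# [1] "A" "C" "D" "E"
```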
2.3.3. List apply
The lapply function can be used to apply a function to each element of a list. Here, we get the class of each element in the list.
[54]:
a <- list(A=TRUE, B=FALSE, C=-10, D=10)
b <- lapply(a, class)
print(b)
$A
[1] "logical"
$B
[1] "logical"
$C
[1] "numeric"
$D
[1] "numeric"
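lapply always returns a list. When each result is a single value, the related sapply function can simplify the result to a named vector, as in this short sketch on the same list:

```r
a <- list(A=TRUE, B=FALSE, C=-10, D=10)
b <- sapply(a, class)   # simplified to a named character vector
print(b)
#         A         B         C         D 
# "logical" "logical" "numeric" "numeric" 
```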
2.4. Data Frames
A data frame is perhaps the most powerful data structure in R. Similar data frame structures exist in Python with pandas and in Spark (Spark has DataFrame and Dataset). To create a data frame in R, use the data.frame function.
[55]:
s <- data.frame(
age = c(18, 16, 15),
grade = c('A', 'B', 'C'),
name = c('Jane', 'Jack', 'Joe'),
male = c(FALSE, TRUE, TRUE)
)
print(s)
age grade name male
1 18 A Jane FALSE
2 16 B Jack TRUE
3 15 C Joe TRUE
2.4.1. Subsetting data frames
To access the first row.
[56]:
a <- s[1, ]
print(a)
age grade name male
1 18 A Jane FALSE
To access the first column.
[57]:
a <- s[, 1]
print(a)
[1] 18 16 15
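To access columns by name, use the $ operator. Note that the Levels lines in the outputs below come from R versions before 4.0, where data.frame converted strings to factors by default; from R 4.0 onward, these columns are character vectors unless stringsAsFactors=TRUE is passed.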
[58]:
a <- s$age
print(a)
[1] 18 16 15
[59]:
a <- s$grade
print(a)
[1] A B C
Levels: A B C
[60]:
a <- s$name
print(a)
[1] Jane Jack Joe
Levels: Jack Jane Joe
[61]:
a <- s$male
print(a)
[1] FALSE TRUE TRUE
To access elements by filtering with positional indices.
[62]:
a <- s[1:2, 1:2]
print(a)
age grade
1 18 A
2 16 B
If there is missing data NA in your data frame, use the complete.cases function to create a logical vector mask to filter for rows (or cases) with only complete data.
[63]:
s <- data.frame(
age = c(18, 16, 15, 19),
grade = c('A', 'B', 'C', NA),
name = c('Jane', 'Jack', 'Joe', 'Jerry'),
male = c(FALSE, TRUE, TRUE, TRUE)
)
print(s)
age grade name male
1 18 A Jane FALSE
2 16 B Jack TRUE
3 15 C Joe TRUE
4 19 <NA> Jerry TRUE
[64]:
a <- complete.cases(s)
print(a)
[1] TRUE TRUE TRUE FALSE
[65]:
a <- s[complete.cases(s), ]
print(a)
age grade name male
1 18 A Jane FALSE
2 16 B Jack TRUE
3 15 C Joe TRUE
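The same masking idea works with any logical condition on the columns, not just complete.cases. Below is a small sketch on the data frame above; the names older and boys are chosen only for this example.

```r
s <- data.frame(
  age = c(18, 16, 15, 19),
  grade = c('A', 'B', 'C', NA),
  name = c('Jane', 'Jack', 'Joe', 'Jerry'),
  male = c(FALSE, TRUE, TRUE, TRUE)
)
# Rows where age is greater than 15 (all columns kept).
older <- s[s$age > 15, ]
# Rows where male is TRUE and age is under 19, keeping two columns by name.
boys <- s[s$male & s$age < 19, c('name', 'age')]
print(rownames(older))
# [1] "1" "2" "4"
print(as.character(boys$name))
# [1] "Jack" "Joe" 
```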
2.4.2. Data frame functions
Let’s take a look at some functions that we may apply to data frames.
[66]:
s <- data.frame(
v=c('large', 'small', 'small'),
w=c(1, 2, 3),
x=c(4, 5, 6),
y=c(7, 8, 9),
z=c(10, 11, 12)
)
print(s)
v w x y z
1 large 1 4 7 10
2 small 2 5 8 11
3 small 3 6 9 12
To get the number of rows and columns, use nrow and ncol.
[67]:
rows <- nrow(s)
cols <- ncol(s)
print(paste(rows, cols))
[1] "3 5"
To get the dimension, use dim.
[68]:
d <- dim(s)
print(d)
[1] 3 5
To get the column names, use colnames.
[69]:
n <- colnames(s)
print(n)
[1] "v" "w" "x" "y" "z"
To get the row names, use rownames.
[70]:
n <- rownames(s)
print(n)
[1] "1" "2" "3"
To peek at the top or bottom rows of the data frame, use head and tail, respectively.
[71]:
h <- head(s, 2)
print(h)
v w x y z
1 large 1 4 7 10
2 small 2 5 8 11
[72]:
t <- tail(s, 2)
print(t)
v w x y z
2 small 2 5 8 11
3 small 3 6 9 12
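Since a data frame is a list of columns, functions such as sapply can be applied column by column. As a hedged sketch on the same data frame, the example below finds the numeric columns and computes their means with colMeans; the variable names are illustrative.

```r
s <- data.frame(
  v=c('large', 'small', 'small'),
  w=c(1, 2, 3),
  x=c(4, 5, 6),
  y=c(7, 8, 9),
  z=c(10, 11, 12)
)
num <- sapply(s, is.numeric)   # TRUE for numeric columns, FALSE for v
means <- colMeans(s[, num])    # mean of each numeric column
print(means)
#  w  x  y  z 
#  2  5  8 11 
```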