7. Formulas

Formulas are R’s way of expressing models against the data we have. Formulas are used nearly everywhere in R, from defining models to transforming data. The general form of a formula is

\(y \sim x_1 + x_2 + \dots\),

where

\(y\) is the output (dependent variable),
\(\sim\) separates the left and right hand sides of the formula, and
\(x_1 + x_2 + \ldots\) is the input (independent variable).

The power of formulas comes with the operators on the right hand side.

+ means to use a variable (or a set)
: indicates interactions excluding lower level ones
* indicates interactions including lower level ones
- indicates to remove variables
** or ^ indicates expanding interactions between sets of variables

Let’s look how some of these operators work. First, let’s simulate some data.

[1]:

set.seed(37)

[2]:

N <- 5
x1 <- rnorm(N, mean=5.1, sd=1)
x2 <- rnorm(N, mean=3.2, sd=1)
y <- 5 + 3.2 * x1 * x2 + rnorm(N, mean=0, sd=1)
df <- data.frame(x1=x1, x2=x2, y=y)

[3]:

df

A data.frame: 5 x 3
x1	x2	y
<dbl>	<dbl>	<dbl>
5.224754	2.867286	52.56107
5.482075	3.007840	57.80415
5.679243	4.562983	89.35053
4.806252	4.055954	68.36291
4.271651	3.415995	52.00467

With the data stored in a dataframe df, let’s build some models using formulas. Observe how we can apply R functions on the variables too (e.g. with log or exp).

y ~ x1 + x2 will create a model of \(y\) against \(x_1\) and \(x_2\)
y ~ (x1 + x2) ** 2 will create a model of \(y\) against \(x_1\), \(x_2\) and \(x_1 x_2\)
y ~ (x1 + x2)^2 will create a model of \(y\) against \(x_1\), \(x_2\) and \(x_1 x_2\) (same as above)
y ~ log(x1) + exp(x2) will create a model of \(y\) against \(\log(x_1)\) and \(e^{x_2}\)

[4]:

m1 = lm(y ~ x1 + x2, data=df)
m2 = lm(y ~ (x1 + x2) ** 2, data=df)
m3 = lm(y ~ (x1 + x2)^2, data=df)
m4 = lm(y ~ log(x1) + exp(x2), data=df)

[5]:

print(summary(m1))


Call:
lm(formula = y ~ x1 + x2, data = df)

Residuals:
       1        2        3        4        5
 0.09791 -0.23830  0.51508 -0.99120  0.61651

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -61.0669     4.4971  -13.58  0.00538 **
x1           11.6752     0.8276   14.11  0.00499 **
x2           18.3205     0.6504   28.17  0.00126 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9204 on 2 degrees of freedom
Multiple R-squared:  0.9983,    Adjusted R-squared:  0.9965
F-statistic: 574.4 on 2 and 2 DF,  p-value: 0.001738

[6]:

print(summary(m2))


Call:
lm(formula = y ~ (x1 + x2)^2, data = df)

Residuals:
       1        2        3        4        5
-0.12166  0.10999  0.00240 -0.03846  0.04773

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2.5418     7.9916  -0.318   0.8040
x1            0.8218     1.4819   0.555   0.6777
x2            2.2141     2.1902   1.011   0.4965
x1:x2         2.9760     0.4040   7.366   0.0859 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1751 on 1 degrees of freedom
Multiple R-squared:      1,     Adjusted R-squared:  0.9999
F-statistic: 1.06e+04 on 3 and 1 DF,  p-value: 0.007141

[7]:

print(summary(m3))


Call:
lm(formula = y ~ (x1 + x2)^2, data = df)

Residuals:
       1        2        3        4        5
-0.12166  0.10999  0.00240 -0.03846  0.04773

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2.5418     7.9916  -0.318   0.8040
x1            0.8218     1.4819   0.555   0.6777
x2            2.2141     2.1902   1.011   0.4965
x1:x2         2.9760     0.4040   7.366   0.0859 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1751 on 1 degrees of freedom
Multiple R-squared:      1,     Adjusted R-squared:  0.9999
F-statistic: 1.06e+04 on 3 and 1 DF,  p-value: 0.007141

[8]:

print(summary(m4))


Call:
lm(formula = y ~ log(x1) + exp(x2), data = df)

Residuals:
      1       2       3       4       5
-1.2924  1.1336 -0.3295  0.6047 -0.1165

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.08037    9.60952  -1.257  0.33562
log(x1)      35.41011    6.07079   5.833  0.02816 *
exp(x2)       0.41994    0.02107  19.929  0.00251 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.312 on 2 degrees of freedom
Multiple R-squared:  0.9965,    Adjusted R-squared:  0.9929
F-statistic: 282.1 on 2 and 2 DF,  p-value: 0.003532

What about factors or categorical variables? How do formulas work against those types of variables? Let’s generate a dataset with some categorical variables.

x1 is a continuous variable
x2 is a continuous variable
x3 is a categorical variable
x4 is a categorical variable
y is the result of \(5 + 3.2 x_1 x_2 + 6.6 x_3 x_4 + e\), where \(e = \mathcal{N}(0, 1)\)

[9]:

N <- 2000
x1 <- rnorm(N, mean=5.1, sd=1)
x2 <- rnorm(N, mean=3.2, sd=1)
x3 <- factor(sample(0:1, N, replace=TRUE), labels=c('private', 'public'))
x4 <- factor(sample(0:2, N, replace=TRUE), labels=c('low', 'med', 'high'))
y <- 5 + 3.2 * x1 * x2 + 6.6 * as.numeric(x3) * as.numeric(x4) + rnorm(N, mean=0, sd=1)
df <- data.frame(x1=x1, x2=x2, x3=x3, x4=x4, y=y)

[10]:

head(df)

A data.frame: 6 x 5
	x1	x2	x3	x4	y
	<dbl>	<dbl>	<fct>	<fct>	<dbl>
1	3.432472	2.672058	public	low	45.63740
2	2.395150	3.734765	public	med	59.05097
3	3.363076	1.725066	private	high	43.52035
4	5.503075	5.002680	public	high	132.36280
5	4.367927	2.429795	private	med	53.37352
6	4.821639	2.318772	public	med	66.96244

Here some example models that we can define against the categorical variables and a mixture of all the variables.

[11]:

m1 = lm(y ~ x3)
m2 = lm(y ~ x4)
m3 = lm(y ~ x1 + x2 + x3 + x4, data=df)
m4 = lm(y ~ (x1 + x2 + x3 + x4)^2, data=df)
m5 = lm(y ~ x1*x2 + x3*x4, data=df)
m6 = lm(y ~ x1:x2 + x3:x4, data=df)

Notice how there are 2 values for x3, which are

public
private

If we one-hot encode x3, we would expect to see new variables, one to indicate when x3 = public and one when x3 = private.

x3public
x3private

However, using a formula, the correct thing to do, which is to drop one of these new variables, is taken care of (we only see x3public as the coefficient).

[12]:

summary(m1)


Call:
lm(formula = y ~ x3)

Residuals:
    Min      1Q  Median      3Q     Max
-60.824 -14.969  -1.775  14.612  82.802

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  70.7760     0.6955  101.76   <2e-16 ***
x3public     13.3969     0.9831   13.63   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.98 on 1998 degrees of freedom
Multiple R-squared:  0.08503,   Adjusted R-squared:  0.08457
F-statistic: 185.7 on 1 and 1998 DF,  p-value: < 2.2e-16

Likewise, there are 3 values for x4, which are

low
medium
high

Again, the formula drops one of the new one-hot encoded variable (in this case, x4low) from the model.

[13]:

summary(m2)


Call:
lm(formula = y ~ x4)

Residuals:
    Min      1Q  Median      3Q     Max
-58.143 -14.989  -1.665  13.338  85.816

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  67.9659     0.8322  81.671  < 2e-16 ***
x4med         9.0121     1.1646   7.739 1.59e-14 ***
x4high       20.0773     1.1921  16.842  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.51 on 1997 degrees of freedom
Multiple R-squared:  0.1246,    Adjusted R-squared:  0.1237
F-statistic: 142.1 on 2 and 1997 DF,  p-value: < 2.2e-16

A simple model against all the continuous and categorical variables just works. Observe the coefficients:

x1
x2
x3public
x4med
x4high

[14]:

summary(m3)


Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-25.284  -2.929  -0.063   2.923  24.662

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -43.13332    0.61813  -69.78   <2e-16 ***
x1           10.13366    0.09791  103.50   <2e-16 ***
x2           16.30242    0.09683  168.36   <2e-16 ***
x3public     12.99560    0.19968   65.08   <2e-16 ***
x4med         9.73536    0.24155   40.30   <2e-16 ***
x4high       19.57656    0.24730   79.16   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.46 on 1994 degrees of freedom
Multiple R-squared:  0.9624,    Adjusted R-squared:  0.9623
F-statistic: 1.021e+04 on 5 and 1994 DF,  p-value: < 2.2e-16

If we wanted all variables and up to the 2-way interactions, a mixture of continuous and categorical variables expressed with a formula knows what to do.

[15]:

summary(m4)


Call:
lm(formula = y ~ (x1 + x2 + x3 + x4)^2, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-3.8136 -0.6552  0.0010  0.6548  3.1319

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     11.827014   0.422299  28.006   <2e-16 ***
x1              -0.051900   0.078882  -0.658    0.511
x2              -0.135147   0.115097  -1.174    0.240
x3public         6.423600   0.275072  23.352   <2e-16 ***
x4med            6.987297   0.322002  21.700   <2e-16 ***
x4high          13.358834   0.335365  39.834   <2e-16 ***
x1:x2            3.225207   0.021014 153.477   <2e-16 ***
x1:x3public      0.008138   0.043964   0.185    0.853
x1:x4med        -0.042248   0.052587  -0.803    0.422
x1:x4high       -0.031927   0.054832  -0.582    0.560
x2:x3public      0.031676   0.043548   0.727    0.467
x2:x4med        -0.034890   0.052861  -0.660    0.509
x2:x4high        0.028974   0.053573   0.541    0.589
x3public:x4med   6.569441   0.108319  60.649   <2e-16 ***
x3public:x4high 13.180202   0.110819 118.935   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9979 on 1985 degrees of freedom
Multiple R-squared:  0.9981,    Adjusted R-squared:  0.9981
F-statistic: 7.556e+04 on 14 and 1985 DF,  p-value: < 2.2e-16

Below, we only wanted interaction between x1 and x2, and only between x3 and x4. Since we use *, lower order terms are included.

[16]:

summary(m5)


Call:
lm(formula = y ~ x1 * x2 + x3 * x4, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-3.8313 -0.6466 -0.0017  0.6646  3.2065

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     11.90330    0.36588  32.533   <2e-16 ***
x1              -0.07572    0.06986  -1.084    0.279
x2              -0.12615    0.10910  -1.156    0.248
x3public         6.56763    0.07720  85.077   <2e-16 ***
x4med            6.66300    0.07740  86.088   <2e-16 ***
x4high          13.28896    0.07737 171.749   <2e-16 ***
x1:x2            3.22624    0.02091 154.259   <2e-16 ***
x3public:x4med   6.56425    0.10806  60.749   <2e-16 ***
x3public:x4high 13.18146    0.11064 119.138   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.997 on 1991 degrees of freedom
Multiple R-squared:  0.9981,    Adjusted R-squared:  0.9981
F-statistic: 1.325e+05 on 8 and 1991 DF,  p-value: < 2.2e-16

Below, we only wanted interaction between x1 and x2, and only between x3 and x4. Since we use :, lower order terms are excluded.

[17]:

summary(m6)


Call:
lm(formula = y ~ x1:x2 + x3:x4, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-3.8383 -0.6496 -0.0037  0.6631  3.2602

Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       44.538203   0.082103   542.5   <2e-16 ***
x1:x2              3.202488   0.003532   906.7   <2e-16 ***
x3private:x4low  -33.034951   0.079399  -416.1   <2e-16 ***
x3public:x4low   -26.469080   0.079061  -334.8   <2e-16 ***
x3private:x4med  -26.375155   0.079266  -332.7   <2e-16 ***
x3public:x4med   -13.241870   0.077710  -170.4   <2e-16 ***
x3private:x4high -19.747553   0.079238  -249.2   <2e-16 ***
x3public:x4high          NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9968 on 1993 degrees of freedom
Multiple R-squared:  0.9981,    Adjusted R-squared:  0.9981
F-statistic: 1.767e+05 on 6 and 1993 DF,  p-value: < 2.2e-16

If we want the resulting matrix from the formula, we can use the model.matrix() function. Observe that the (Intercept) is returned. To exclude the intercept, add - 1 to the model.

[18]:

d1 = model.matrix(y ~ x1 + x2 + x3 + x4, data=df)
d2 = model.matrix(y ~ (x1 + x2 + x3 + x4)^2, data=df)
d3 = model.matrix(y ~ x1*x2 + x3*x4, data=df)
d4 = model.matrix(y ~ x1:x2 + x3:x4, data=df)
d5 = model.matrix(y ~ x1 + x2 + x3 + x4 - 1, data=df)

[19]:

head(d1)

A matrix: 6 x 6 of type dbl
	(Intercept)	x1	x2	x3public	x4med	x4high
1	1	3.432472	2.672058	1	0	0
2	1	2.395150	3.734765	1	1	0
3	1	3.363076	1.725066	0	0	1
4	1	5.503075	5.002680	1	0	1
5	1	4.367927	2.429795	0	1	0
6	1	4.821639	2.318772	1	1	0

[20]:

head(d2)

A matrix: 6 x 15 of type dbl
	(Intercept)	x1	x2	x3public	x4med	x4high	x1:x2	x1:x3public	x1:x4med	x1:x4high	x2:x3public	x2:x4med	x2:x4high	x3public:x4med	x3public:x4high
1	1	3.432472	2.672058	1	0	0	9.171762	3.432472	0.000000	0.000000	2.672058	0.000000	0.000000	0	0
2	1	2.395150	3.734765	1	1	0	8.945324	2.395150	2.395150	0.000000	3.734765	3.734765	0.000000	1	0
3	1	3.363076	1.725066	0	0	1	5.801526	0.000000	0.000000	3.363076	0.000000	0.000000	1.725066	0	0
4	1	5.503075	5.002680	1	0	1	27.530126	5.503075	0.000000	5.503075	5.002680	0.000000	5.002680	0	1
5	1	4.367927	2.429795	0	1	0	10.613166	0.000000	4.367927	0.000000	0.000000	2.429795	0.000000	0	0
6	1	4.821639	2.318772	1	1	0	11.180282	4.821639	4.821639	0.000000	2.318772	2.318772	0.000000	1	0

[21]:

head(d3)

A matrix: 6 x 9 of type dbl
	(Intercept)	x1	x2	x3public	x4med	x4high	x1:x2	x3public:x4med	x3public:x4high
1	1	3.432472	2.672058	1	0	0	9.171762	0	0
2	1	2.395150	3.734765	1	1	0	8.945324	1	0
3	1	3.363076	1.725066	0	0	1	5.801526	0	0
4	1	5.503075	5.002680	1	0	1	27.530126	0	1
5	1	4.367927	2.429795	0	1	0	10.613166	0	0
6	1	4.821639	2.318772	1	1	0	11.180282	1	0

[22]:

head(d4)

A matrix: 6 x 8 of type dbl
	(Intercept)	x1:x2	x3public:x4low	x3private:x4med	x3public:x4med	x3private:x4high	x3public:x4high
1	1	9.171762	1	0	0	0	0
2	1	8.945324	0	0	1	0	0
3	1	5.801526	0	0	0	1	0
4	1	27.530126	0	0	0	0	1
5	1	10.613166	0	1	0	0	0
6	1	11.180282	0	0	1	0	0

[23]:

head(model.matrix(y ~ x1:x2 + x3:x4, data=df))

A matrix: 6 x 8 of type dbl
	(Intercept)	x1:x2	x3public:x4low	x3private:x4med	x3public:x4med	x3private:x4high	x3public:x4high
1	1	9.171762	1	0	0	0	0
2	1	8.945324	0	0	1	0	0
3	1	5.801526	0	0	0	1	0
4	1	27.530126	0	0	0	0	1
5	1	10.613166	0	1	0	0	0
6	1	11.180282	0	0	1	0	0

[24]:

head(d5)

A matrix: 6 x 6 of type dbl
	x1	x2	x3private	x3public	x4med	x4high
1	3.432472	2.672058	0	1	0	0
2	2.395150	3.734765	0	1	1	0
3	3.363076	1.725066	1	0	0	1
4	5.503075	5.002680	0	1	0	1
5	4.367927	2.429795	1	0	1	0
6	4.821639	2.318772	0	1	1	0