7. Formulas

Formulas are R’s way of expressing models against the data we have. Formulas are used nearly everywhere in R, from defining models to transforming data. The general form of a formula is

\(y \sim x_1 + x_2 + \dots\),

where

  • \(y\) is the output (dependent variable),

  • \(\sim\) separates the left and right hand sides of the formula, and

  • \(x_1 + x_2 + \ldots\) is the input (independent variable).

The power of formulas comes with the operators on the right hand side.

  • + means to use a variable (or a set)

  • : indicates interactions excluding lower level ones

  • * indicates interactions including lower level ones

  • - indicates to remove variables

  • ** or ^ indicates expanding interactions between sets of variables

Let’s look how some of these operators work. First, let’s simulate some data.

[1]:
set.seed(37)
[2]:
N <- 5
x1 <- rnorm(N, mean=5.1, sd=1)
x2 <- rnorm(N, mean=3.2, sd=1)
y <- 5 + 3.2 * x1 * x2 + rnorm(N, mean=0, sd=1)
df <- data.frame(x1=x1, x2=x2, y=y)
[3]:
df
A data.frame: 5 x 3
x1x2y
<dbl><dbl><dbl>
5.2247542.86728652.56107
5.4820753.00784057.80415
5.6792434.56298389.35053
4.8062524.05595468.36291
4.2716513.41599552.00467

With the data stored in a dataframe df, let’s build some models using formulas. Observe how we can apply R functions on the variables too (e.g. with log or exp).

  • y ~ x1 + x2 will create a model of \(y\) against \(x_1\) and \(x_2\)

  • y ~ (x1 + x2) ** 2 will create a model of \(y\) against \(x_1\), \(x_2\) and \(x_1 x_2\)

  • y ~ (x1 + x2)^2 will create a model of \(y\) against \(x_1\), \(x_2\) and \(x_1 x_2\) (same as above)

  • y ~ log(x1) + exp(x2) will create a model of \(y\) against \(\log(x_1)\) and \(e^{x_2}\)

[4]:
m1 = lm(y ~ x1 + x2, data=df)
m2 = lm(y ~ (x1 + x2) ** 2, data=df)
m3 = lm(y ~ (x1 + x2)^2, data=df)
m4 = lm(y ~ log(x1) + exp(x2), data=df)
[5]:
print(summary(m1))

Call:
lm(formula = y ~ x1 + x2, data = df)

Residuals:
       1        2        3        4        5
 0.09791 -0.23830  0.51508 -0.99120  0.61651

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -61.0669     4.4971  -13.58  0.00538 **
x1           11.6752     0.8276   14.11  0.00499 **
x2           18.3205     0.6504   28.17  0.00126 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9204 on 2 degrees of freedom
Multiple R-squared:  0.9983,    Adjusted R-squared:  0.9965
F-statistic: 574.4 on 2 and 2 DF,  p-value: 0.001738

[6]:
print(summary(m2))

Call:
lm(formula = y ~ (x1 + x2)^2, data = df)

Residuals:
       1        2        3        4        5
-0.12166  0.10999  0.00240 -0.03846  0.04773

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2.5418     7.9916  -0.318   0.8040
x1            0.8218     1.4819   0.555   0.6777
x2            2.2141     2.1902   1.011   0.4965
x1:x2         2.9760     0.4040   7.366   0.0859 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1751 on 1 degrees of freedom
Multiple R-squared:      1,     Adjusted R-squared:  0.9999
F-statistic: 1.06e+04 on 3 and 1 DF,  p-value: 0.007141

[7]:
print(summary(m3))

Call:
lm(formula = y ~ (x1 + x2)^2, data = df)

Residuals:
       1        2        3        4        5
-0.12166  0.10999  0.00240 -0.03846  0.04773

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2.5418     7.9916  -0.318   0.8040
x1            0.8218     1.4819   0.555   0.6777
x2            2.2141     2.1902   1.011   0.4965
x1:x2         2.9760     0.4040   7.366   0.0859 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1751 on 1 degrees of freedom
Multiple R-squared:      1,     Adjusted R-squared:  0.9999
F-statistic: 1.06e+04 on 3 and 1 DF,  p-value: 0.007141

[8]:
print(summary(m4))

Call:
lm(formula = y ~ log(x1) + exp(x2), data = df)

Residuals:
      1       2       3       4       5
-1.2924  1.1336 -0.3295  0.6047 -0.1165

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.08037    9.60952  -1.257  0.33562
log(x1)      35.41011    6.07079   5.833  0.02816 *
exp(x2)       0.41994    0.02107  19.929  0.00251 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.312 on 2 degrees of freedom
Multiple R-squared:  0.9965,    Adjusted R-squared:  0.9929
F-statistic: 282.1 on 2 and 2 DF,  p-value: 0.003532

What about factors or categorical variables? How do formulas work against those types of variables? Let’s generate a dataset with some categorical variables.

  • x1 is a continuous variable

  • x2 is a continuous variable

  • x3 is a categorical variable

  • x4 is a categorical variable

  • y is the result of \(5 + 3.2 x_1 x_2 + 6.6 x_3 x_4 + e\), where \(e = \mathcal{N}(0, 1)\)

[9]:
N <- 2000
x1 <- rnorm(N, mean=5.1, sd=1)
x2 <- rnorm(N, mean=3.2, sd=1)
x3 <- factor(sample(0:1, N, replace=TRUE), labels=c('private', 'public'))
x4 <- factor(sample(0:2, N, replace=TRUE), labels=c('low', 'med', 'high'))
y <- 5 + 3.2 * x1 * x2 + 6.6 * as.numeric(x3) * as.numeric(x4) + rnorm(N, mean=0, sd=1)
df <- data.frame(x1=x1, x2=x2, x3=x3, x4=x4, y=y)
[10]:
head(df)
A data.frame: 6 x 5
x1x2x3x4y
<dbl><dbl><fct><fct><dbl>
13.4324722.672058public low 45.63740
22.3951503.734765public med 59.05097
33.3630761.725066privatehigh 43.52035
45.5030755.002680public high132.36280
54.3679272.429795privatemed 53.37352
64.8216392.318772public med 66.96244

Here some example models that we can define against the categorical variables and a mixture of all the variables.

[11]:
m1 = lm(y ~ x3)
m2 = lm(y ~ x4)
m3 = lm(y ~ x1 + x2 + x3 + x4, data=df)
m4 = lm(y ~ (x1 + x2 + x3 + x4)^2, data=df)
m5 = lm(y ~ x1*x2 + x3*x4, data=df)
m6 = lm(y ~ x1:x2 + x3:x4, data=df)

Notice how there are 2 values for x3, which are

  • public

  • private

If we one-hot encode x3, we would expect to see new variables, one to indicate when x3 = public and one when x3 = private.

  • x3public

  • x3private

However, using a formula, the correct thing to do, which is to drop one of these new variables, is taken care of (we only see x3public as the coefficient).

[12]:
summary(m1)

Call:
lm(formula = y ~ x3)

Residuals:
    Min      1Q  Median      3Q     Max
-60.824 -14.969  -1.775  14.612  82.802

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  70.7760     0.6955  101.76   <2e-16 ***
x3public     13.3969     0.9831   13.63   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.98 on 1998 degrees of freedom
Multiple R-squared:  0.08503,   Adjusted R-squared:  0.08457
F-statistic: 185.7 on 1 and 1998 DF,  p-value: < 2.2e-16

Likewise, there are 3 values for x4, which are

  • low

  • medium

  • high

Again, the formula drops one of the new one-hot encoded variable (in this case, x4low) from the model.

[13]:
summary(m2)

Call:
lm(formula = y ~ x4)

Residuals:
    Min      1Q  Median      3Q     Max
-58.143 -14.989  -1.665  13.338  85.816

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  67.9659     0.8322  81.671  < 2e-16 ***
x4med         9.0121     1.1646   7.739 1.59e-14 ***
x4high       20.0773     1.1921  16.842  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.51 on 1997 degrees of freedom
Multiple R-squared:  0.1246,    Adjusted R-squared:  0.1237
F-statistic: 142.1 on 2 and 1997 DF,  p-value: < 2.2e-16

A simple model against all the continuous and categorical variables just works. Observe the coefficients:

  • x1

  • x2

  • x3public

  • x4med

  • x4high

[14]:
summary(m3)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-25.284  -2.929  -0.063   2.923  24.662

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -43.13332    0.61813  -69.78   <2e-16 ***
x1           10.13366    0.09791  103.50   <2e-16 ***
x2           16.30242    0.09683  168.36   <2e-16 ***
x3public     12.99560    0.19968   65.08   <2e-16 ***
x4med         9.73536    0.24155   40.30   <2e-16 ***
x4high       19.57656    0.24730   79.16   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.46 on 1994 degrees of freedom
Multiple R-squared:  0.9624,    Adjusted R-squared:  0.9623
F-statistic: 1.021e+04 on 5 and 1994 DF,  p-value: < 2.2e-16

If we wanted all variables and up to the 2-way interactions, a mixture of continuous and categorical variables expressed with a formula knows what to do.

[15]:
summary(m4)

Call:
lm(formula = y ~ (x1 + x2 + x3 + x4)^2, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-3.8136 -0.6552  0.0010  0.6548  3.1319

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     11.827014   0.422299  28.006   <2e-16 ***
x1              -0.051900   0.078882  -0.658    0.511
x2              -0.135147   0.115097  -1.174    0.240
x3public         6.423600   0.275072  23.352   <2e-16 ***
x4med            6.987297   0.322002  21.700   <2e-16 ***
x4high          13.358834   0.335365  39.834   <2e-16 ***
x1:x2            3.225207   0.021014 153.477   <2e-16 ***
x1:x3public      0.008138   0.043964   0.185    0.853
x1:x4med        -0.042248   0.052587  -0.803    0.422
x1:x4high       -0.031927   0.054832  -0.582    0.560
x2:x3public      0.031676   0.043548   0.727    0.467
x2:x4med        -0.034890   0.052861  -0.660    0.509
x2:x4high        0.028974   0.053573   0.541    0.589
x3public:x4med   6.569441   0.108319  60.649   <2e-16 ***
x3public:x4high 13.180202   0.110819 118.935   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9979 on 1985 degrees of freedom
Multiple R-squared:  0.9981,    Adjusted R-squared:  0.9981
F-statistic: 7.556e+04 on 14 and 1985 DF,  p-value: < 2.2e-16

Below, we only wanted interaction between x1 and x2, and only between x3 and x4. Since we use *, lower order terms are included.

[16]:
summary(m5)

Call:
lm(formula = y ~ x1 * x2 + x3 * x4, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-3.8313 -0.6466 -0.0017  0.6646  3.2065

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     11.90330    0.36588  32.533   <2e-16 ***
x1              -0.07572    0.06986  -1.084    0.279
x2              -0.12615    0.10910  -1.156    0.248
x3public         6.56763    0.07720  85.077   <2e-16 ***
x4med            6.66300    0.07740  86.088   <2e-16 ***
x4high          13.28896    0.07737 171.749   <2e-16 ***
x1:x2            3.22624    0.02091 154.259   <2e-16 ***
x3public:x4med   6.56425    0.10806  60.749   <2e-16 ***
x3public:x4high 13.18146    0.11064 119.138   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.997 on 1991 degrees of freedom
Multiple R-squared:  0.9981,    Adjusted R-squared:  0.9981
F-statistic: 1.325e+05 on 8 and 1991 DF,  p-value: < 2.2e-16

Below, we only wanted interaction between x1 and x2, and only between x3 and x4. Since we use :, lower order terms are excluded.

[17]:
summary(m6)

Call:
lm(formula = y ~ x1:x2 + x3:x4, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-3.8383 -0.6496 -0.0037  0.6631  3.2602

Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       44.538203   0.082103   542.5   <2e-16 ***
x1:x2              3.202488   0.003532   906.7   <2e-16 ***
x3private:x4low  -33.034951   0.079399  -416.1   <2e-16 ***
x3public:x4low   -26.469080   0.079061  -334.8   <2e-16 ***
x3private:x4med  -26.375155   0.079266  -332.7   <2e-16 ***
x3public:x4med   -13.241870   0.077710  -170.4   <2e-16 ***
x3private:x4high -19.747553   0.079238  -249.2   <2e-16 ***
x3public:x4high          NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9968 on 1993 degrees of freedom
Multiple R-squared:  0.9981,    Adjusted R-squared:  0.9981
F-statistic: 1.767e+05 on 6 and 1993 DF,  p-value: < 2.2e-16

If we want the resulting matrix from the formula, we can use the model.matrix() function. Observe that the (Intercept) is returned. To exclude the intercept, add - 1 to the model.

[18]:
d1 = model.matrix(y ~ x1 + x2 + x3 + x4, data=df)
d2 = model.matrix(y ~ (x1 + x2 + x3 + x4)^2, data=df)
d3 = model.matrix(y ~ x1*x2 + x3*x4, data=df)
d4 = model.matrix(y ~ x1:x2 + x3:x4, data=df)
d5 = model.matrix(y ~ x1 + x2 + x3 + x4 - 1, data=df)
[19]:
head(d1)
A matrix: 6 x 6 of type dbl
(Intercept)x1x2x3publicx4medx4high
113.4324722.672058100
212.3951503.734765110
313.3630761.725066001
415.5030755.002680101
514.3679272.429795010
614.8216392.318772110
[20]:
head(d2)
A matrix: 6 x 15 of type dbl
(Intercept)x1x2x3publicx4medx4highx1:x2x1:x3publicx1:x4medx1:x4highx2:x3publicx2:x4medx2:x4highx3public:x4medx3public:x4high
113.4324722.672058100 9.1717623.4324720.0000000.0000002.6720580.0000000.00000000
212.3951503.734765110 8.9453242.3951502.3951500.0000003.7347653.7347650.00000010
313.3630761.725066001 5.8015260.0000000.0000003.3630760.0000000.0000001.72506600
415.5030755.00268010127.5301265.5030750.0000005.5030755.0026800.0000005.00268001
514.3679272.42979501010.6131660.0000004.3679270.0000000.0000002.4297950.00000000
614.8216392.31877211011.1802824.8216394.8216390.0000002.3187722.3187720.00000010
[21]:
head(d3)
A matrix: 6 x 9 of type dbl
(Intercept)x1x2x3publicx4medx4highx1:x2x3public:x4medx3public:x4high
113.4324722.672058100 9.17176200
212.3951503.734765110 8.94532410
313.3630761.725066001 5.80152600
415.5030755.00268010127.53012601
514.3679272.42979501010.61316600
614.8216392.31877211011.18028210
[22]:
head(d4)
A matrix: 6 x 8 of type dbl
(Intercept)x1:x2x3private:x4lowx3public:x4lowx3private:x4medx3public:x4medx3private:x4highx3public:x4high
11 9.171762010000
21 8.945324000100
31 5.801526000010
4127.530126000001
5110.613166001000
6111.180282000100
[23]:
head(model.matrix(y ~ x1:x2 + x3:x4, data=df))
A matrix: 6 x 8 of type dbl
(Intercept)x1:x2x3private:x4lowx3public:x4lowx3private:x4medx3public:x4medx3private:x4highx3public:x4high
11 9.171762010000
21 8.945324000100
31 5.801526000010
4127.530126000001
5110.613166001000
6111.180282000100
[24]:
head(d5)
A matrix: 6 x 6 of type dbl
x1x2x3privatex3publicx4medx4high
13.4324722.6720580100
22.3951503.7347650110
33.3630761.7250661001
45.5030755.0026800101
54.3679272.4297951010
64.8216392.3187720110