7. Formulas
Formulas are R’s way of expressing a model in terms of the variables in our data. Formulas are used nearly everywhere in R, from defining models to transforming data. The general form of a formula is
\(y \sim x_1 + x_2 + \dots\),
where
- \(y\) is the output (dependent variable),
- \(\sim\) separates the left- and right-hand sides of the formula, and
- \(x_1 + x_2 + \ldots\) are the inputs (independent variables).
The power of formulas comes from the operators on the right-hand side.
- `+` adds a variable (or a set of variables)
- `:` specifies an interaction, excluding the lower-order terms
- `*` specifies an interaction, including the lower-order terms
- `-` removes variables
- `**` or `^` expands interactions between sets of variables
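Before fitting anything, we can preview how each operator expands by asking `terms()` for the term labels a fit would produce; no data is needed for this.

```r
# terms() parses a formula on its own; the "term.labels" attribute
# lists the terms a model fit would actually include.
attr(terms(y ~ x1 + x2), "term.labels")       # "x1" "x2"
attr(terms(y ~ x1:x2), "term.labels")         # "x1:x2" only
attr(terms(y ~ x1 * x2), "term.labels")       # "x1" "x2" "x1:x2"
attr(terms(y ~ x1 * x2 - x2), "term.labels")  # "x1" "x1:x2"
attr(terms(y ~ (x1 + x2)^2), "term.labels")   # "x1" "x2" "x1:x2"
```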
Let’s look at how some of these operators work. First, let’s simulate some data.
[1]:
set.seed(37)
[2]:
N <- 5
x1 <- rnorm(N, mean=5.1, sd=1)
x2 <- rnorm(N, mean=3.2, sd=1)
y <- 5 + 3.2 * x1 * x2 + rnorm(N, mean=0, sd=1)
df <- data.frame(x1=x1, x2=x2, y=y)
[3]:
df
| x1 | x2 | y | 
|---|---|---|
| <dbl> | <dbl> | <dbl> | 
| 5.224754 | 2.867286 | 52.56107 | 
| 5.482075 | 3.007840 | 57.80415 | 
| 5.679243 | 4.562983 | 89.35053 | 
| 4.806252 | 4.055954 | 68.36291 | 
| 4.271651 | 3.415995 | 52.00467 | 
With the data stored in a data frame `df`, let’s build some models using formulas. Observe how we can also apply R functions to the variables (e.g. with `log` or `exp`).
- `y ~ x1 + x2` will create a model of \(y\) against \(x_1\) and \(x_2\)
- `y ~ (x1 + x2) ** 2` will create a model of \(y\) against \(x_1\), \(x_2\) and \(x_1 x_2\)
- `y ~ (x1 + x2)^2` will create a model of \(y\) against \(x_1\), \(x_2\) and \(x_1 x_2\) (same as above)
- `y ~ log(x1) + exp(x2)` will create a model of \(y\) against \(\log(x_1)\) and \(e^{x_2}\)
[4]:
m1 <- lm(y ~ x1 + x2, data=df)
m2 <- lm(y ~ (x1 + x2) ** 2, data=df)
m3 <- lm(y ~ (x1 + x2)^2, data=df)
m4 <- lm(y ~ log(x1) + exp(x2), data=df)
[5]:
print(summary(m1))
Call:
lm(formula = y ~ x1 + x2, data = df)
Residuals:
       1        2        3        4        5
 0.09791 -0.23830  0.51508 -0.99120  0.61651
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -61.0669     4.4971  -13.58  0.00538 **
x1           11.6752     0.8276   14.11  0.00499 **
x2           18.3205     0.6504   28.17  0.00126 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9204 on 2 degrees of freedom
Multiple R-squared:  0.9983,    Adjusted R-squared:  0.9965
F-statistic: 574.4 on 2 and 2 DF,  p-value: 0.001738
[6]:
print(summary(m2))
Call:
lm(formula = y ~ (x1 + x2)^2, data = df)
Residuals:
       1        2        3        4        5
-0.12166  0.10999  0.00240 -0.03846  0.04773
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2.5418     7.9916  -0.318   0.8040
x1            0.8218     1.4819   0.555   0.6777
x2            2.2141     2.1902   1.011   0.4965
x1:x2         2.9760     0.4040   7.366   0.0859 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1751 on 1 degrees of freedom
Multiple R-squared:      1,     Adjusted R-squared:  0.9999
F-statistic: 1.06e+04 on 3 and 1 DF,  p-value: 0.007141
[7]:
print(summary(m3))
Call:
lm(formula = y ~ (x1 + x2)^2, data = df)
Residuals:
       1        2        3        4        5
-0.12166  0.10999  0.00240 -0.03846  0.04773
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2.5418     7.9916  -0.318   0.8040
x1            0.8218     1.4819   0.555   0.6777
x2            2.2141     2.1902   1.011   0.4965
x1:x2         2.9760     0.4040   7.366   0.0859 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1751 on 1 degrees of freedom
Multiple R-squared:      1,     Adjusted R-squared:  0.9999
F-statistic: 1.06e+04 on 3 and 1 DF,  p-value: 0.007141
[8]:
print(summary(m4))
Call:
lm(formula = y ~ log(x1) + exp(x2), data = df)
Residuals:
      1       2       3       4       5
-1.2924  1.1336 -0.3295  0.6047 -0.1165
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.08037    9.60952  -1.257  0.33562
log(x1)      35.41011    6.07079   5.833  0.02816 *
exp(x2)       0.41994    0.02107  19.929  0.00251 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.312 on 2 degrees of freedom
Multiple R-squared:  0.9965,    Adjusted R-squared:  0.9929
F-statistic: 282.1 on 2 and 2 DF,  p-value: 0.003532
What about factors or categorical variables? How do formulas work with those types of variables? Let’s generate a dataset with some categorical variables.
- `x1` is a continuous variable
- `x2` is a continuous variable
- `x3` is a categorical variable
- `x4` is a categorical variable
- `y` is the result of \(5 + 3.2 x_1 x_2 + 6.6 x_3 x_4 + e\), where \(e \sim \mathcal{N}(0, 1)\)
[9]:
N <- 2000
x1 <- rnorm(N, mean=5.1, sd=1)
x2 <- rnorm(N, mean=3.2, sd=1)
x3 <- factor(sample(0:1, N, replace=TRUE), labels=c('private', 'public'))
x4 <- factor(sample(0:2, N, replace=TRUE), labels=c('low', 'med', 'high'))
y <- 5 + 3.2 * x1 * x2 + 6.6 * as.numeric(x3) * as.numeric(x4) + rnorm(N, mean=0, sd=1)
df <- data.frame(x1=x1, x2=x2, x3=x3, x4=x4, y=y)
[10]:
head(df)
| | x1 | x2 | x3 | x4 | y |
|---|---|---|---|---|---|
| | <dbl> | <dbl> | <fct> | <fct> | <dbl> |
| 1 | 3.432472 | 2.672058 | public | low | 45.63740 | 
| 2 | 2.395150 | 3.734765 | public | med | 59.05097 | 
| 3 | 3.363076 | 1.725066 | private | high | 43.52035 | 
| 4 | 5.503075 | 5.002680 | public | high | 132.36280 | 
| 5 | 4.367927 | 2.429795 | private | med | 53.37352 | 
| 6 | 4.821639 | 2.318772 | public | med | 66.96244 | 
Here are some example models that we can define against the categorical variables and a mixture of all the variables.
[11]:
m1 <- lm(y ~ x3)
m2 <- lm(y ~ x4)
m3 <- lm(y ~ x1 + x2 + x3 + x4, data=df)
m4 <- lm(y ~ (x1 + x2 + x3 + x4)^2, data=df)
m5 <- lm(y ~ x1*x2 + x3*x4, data=df)
m6 <- lm(y ~ x1:x2 + x3:x4, data=df)
Notice that x3 has 2 levels, which are
- public
- private
If we one-hot encode x3, we would expect to see two new indicator variables, one for when x3 = public and one for when x3 = private.
- x3public
- x3private
However, the formula takes care of the correct thing to do, which is to drop one of these new variables: only x3public appears as a coefficient, and private serves as the reference level.
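This dropping comes from R’s default treatment contrasts. As a small sketch, independent of `df`, we can inspect the encoding of a two-level factor directly with `contrasts()`:

```r
# For a two-level factor, treatment coding produces a single indicator
# column; the first level ("private", alphabetically) is the reference.
x3 <- factor(c("private", "public"))
contrasts(x3)
#         public
# private      0
# public       1
```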
[12]:
summary(m1)
Call:
lm(formula = y ~ x3)
Residuals:
    Min      1Q  Median      3Q     Max
-60.824 -14.969  -1.775  14.612  82.802
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  70.7760     0.6955  101.76   <2e-16 ***
x3public     13.3969     0.9831   13.63   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.98 on 1998 degrees of freedom
Multiple R-squared:  0.08503,   Adjusted R-squared:  0.08457
F-statistic: 185.7 on 1 and 1998 DF,  p-value: < 2.2e-16
Likewise, there are 3 levels for x4, which are
- low
- med
- high
Again, the formula drops one of the new one-hot encoded variables (in this case x4low, the reference level) from the model.
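The dropped level is simply the factor’s first level. If we wanted a different level to serve as the baseline, `relevel()` changes which one is dropped; a sketch (not tied to `df`):

```r
# "low" is listed first, so it is the reference level by default.
x4 <- factor(c("low", "med", "high"), levels = c("low", "med", "high"))
contrasts(x4)                         # columns for med and high; low is dropped
contrasts(relevel(x4, ref = "high"))  # columns for low and med; high is dropped
```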
[13]:
summary(m2)
Call:
lm(formula = y ~ x4)
Residuals:
    Min      1Q  Median      3Q     Max
-58.143 -14.989  -1.665  13.338  85.816
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  67.9659     0.8322  81.671  < 2e-16 ***
x4med         9.0121     1.1646   7.739 1.59e-14 ***
x4high       20.0773     1.1921  16.842  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.51 on 1997 degrees of freedom
Multiple R-squared:  0.1246,    Adjusted R-squared:  0.1237
F-statistic: 142.1 on 2 and 1997 DF,  p-value: < 2.2e-16
A simple model against all the continuous and categorical variables just works. Observe the coefficients:
- x1 
- x2 
- x3public 
- x4med 
- x4high 
[14]:
summary(m3)
Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = df)
Residuals:
    Min      1Q  Median      3Q     Max
-25.284  -2.929  -0.063   2.923  24.662
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -43.13332    0.61813  -69.78   <2e-16 ***
x1           10.13366    0.09791  103.50   <2e-16 ***
x2           16.30242    0.09683  168.36   <2e-16 ***
x3public     12.99560    0.19968   65.08   <2e-16 ***
x4med         9.73536    0.24155   40.30   <2e-16 ***
x4high       19.57656    0.24730   79.16   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.46 on 1994 degrees of freedom
Multiple R-squared:  0.9624,    Adjusted R-squared:  0.9623
F-statistic: 1.021e+04 on 5 and 1994 DF,  p-value: < 2.2e-16
If we want all variables and their 2-way interactions, the formula knows what to do even with a mixture of continuous and categorical variables.
[15]:
summary(m4)
Call:
lm(formula = y ~ (x1 + x2 + x3 + x4)^2, data = df)
Residuals:
    Min      1Q  Median      3Q     Max
-3.8136 -0.6552  0.0010  0.6548  3.1319
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     11.827014   0.422299  28.006   <2e-16 ***
x1              -0.051900   0.078882  -0.658    0.511
x2              -0.135147   0.115097  -1.174    0.240
x3public         6.423600   0.275072  23.352   <2e-16 ***
x4med            6.987297   0.322002  21.700   <2e-16 ***
x4high          13.358834   0.335365  39.834   <2e-16 ***
x1:x2            3.225207   0.021014 153.477   <2e-16 ***
x1:x3public      0.008138   0.043964   0.185    0.853
x1:x4med        -0.042248   0.052587  -0.803    0.422
x1:x4high       -0.031927   0.054832  -0.582    0.560
x2:x3public      0.031676   0.043548   0.727    0.467
x2:x4med        -0.034890   0.052861  -0.660    0.509
x2:x4high        0.028974   0.053573   0.541    0.589
x3public:x4med   6.569441   0.108319  60.649   <2e-16 ***
x3public:x4high 13.180202   0.110819 118.935   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9979 on 1985 degrees of freedom
Multiple R-squared:  0.9981,    Adjusted R-squared:  0.9981
F-statistic: 7.556e+04 on 14 and 1985 DF,  p-value: < 2.2e-16
Below, we only want interactions between x1 and x2, and between x3 and x4. Since we use `*`, the lower-order terms are included.
[16]:
summary(m5)
Call:
lm(formula = y ~ x1 * x2 + x3 * x4, data = df)
Residuals:
    Min      1Q  Median      3Q     Max
-3.8313 -0.6466 -0.0017  0.6646  3.2065
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     11.90330    0.36588  32.533   <2e-16 ***
x1              -0.07572    0.06986  -1.084    0.279
x2              -0.12615    0.10910  -1.156    0.248
x3public         6.56763    0.07720  85.077   <2e-16 ***
x4med            6.66300    0.07740  86.088   <2e-16 ***
x4high          13.28896    0.07737 171.749   <2e-16 ***
x1:x2            3.22624    0.02091 154.259   <2e-16 ***
x3public:x4med   6.56425    0.10806  60.749   <2e-16 ***
x3public:x4high 13.18146    0.11064 119.138   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.997 on 1991 degrees of freedom
Multiple R-squared:  0.9981,    Adjusted R-squared:  0.9981
F-statistic: 1.325e+05 on 8 and 1991 DF,  p-value: < 2.2e-16
Below, we again only want interactions between x1 and x2, and between x3 and x4. Since we use `:`, the lower-order terms are excluded.
[17]:
summary(m6)
Call:
lm(formula = y ~ x1:x2 + x3:x4, data = df)
Residuals:
    Min      1Q  Median      3Q     Max
-3.8383 -0.6496 -0.0037  0.6631  3.2602
Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       44.538203   0.082103   542.5   <2e-16 ***
x1:x2              3.202488   0.003532   906.7   <2e-16 ***
x3private:x4low  -33.034951   0.079399  -416.1   <2e-16 ***
x3public:x4low   -26.469080   0.079061  -334.8   <2e-16 ***
x3private:x4med  -26.375155   0.079266  -332.7   <2e-16 ***
x3public:x4med   -13.241870   0.077710  -170.4   <2e-16 ***
x3private:x4high -19.747553   0.079238  -249.2   <2e-16 ***
x3public:x4high          NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9968 on 1993 degrees of freedom
Multiple R-squared:  0.9981,    Adjusted R-squared:  0.9981
F-statistic: 1.767e+05 on 6 and 1993 DF,  p-value: < 2.2e-16
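The NA coefficient for x3public:x4high is not a bug: with an intercept in the model, the six x3:x4 indicator columns are perfectly collinear with it (every observation falls in exactly one cell, so the indicators sum to 1 in every row), and lm() drops one redundant coefficient. A small sketch of the redundancy, using freshly simulated factors:

```r
# Every row falls in exactly one (x3, x4) cell, so the six indicator
# columns sum to 1 row-wise, duplicating the intercept column.
set.seed(37)
x3 <- factor(sample(c("private", "public"), 10, replace = TRUE))
x4 <- factor(sample(c("low", "med", "high"), 10, replace = TRUE))
m <- model.matrix(~ x3:x4)
dummies <- m[, colnames(m) != "(Intercept)", drop = FALSE]
rowSums(dummies)  # all 1s: identical to the intercept column
```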
If we want the design matrix produced by a formula, we can use the model.matrix() function. Observe that an (Intercept) column is returned. To exclude the intercept, add `- 1` to the formula.
[18]:
d1 <- model.matrix(y ~ x1 + x2 + x3 + x4, data=df)
d2 <- model.matrix(y ~ (x1 + x2 + x3 + x4)^2, data=df)
d3 <- model.matrix(y ~ x1*x2 + x3*x4, data=df)
d4 <- model.matrix(y ~ x1:x2 + x3:x4, data=df)
d5 <- model.matrix(y ~ x1 + x2 + x3 + x4 - 1, data=df)
[19]:
head(d1)
| | (Intercept) | x1 | x2 | x3public | x4med | x4high |
|---|---|---|---|---|---|---|
| 1 | 1 | 3.432472 | 2.672058 | 1 | 0 | 0 | 
| 2 | 1 | 2.395150 | 3.734765 | 1 | 1 | 0 | 
| 3 | 1 | 3.363076 | 1.725066 | 0 | 0 | 1 | 
| 4 | 1 | 5.503075 | 5.002680 | 1 | 0 | 1 | 
| 5 | 1 | 4.367927 | 2.429795 | 0 | 1 | 0 | 
| 6 | 1 | 4.821639 | 2.318772 | 1 | 1 | 0 | 
[20]:
head(d2)
| | (Intercept) | x1 | x2 | x3public | x4med | x4high | x1:x2 | x1:x3public | x1:x4med | x1:x4high | x2:x3public | x2:x4med | x2:x4high | x3public:x4med | x3public:x4high |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 3.432472 | 2.672058 | 1 | 0 | 0 | 9.171762 | 3.432472 | 0.000000 | 0.000000 | 2.672058 | 0.000000 | 0.000000 | 0 | 0 | 
| 2 | 1 | 2.395150 | 3.734765 | 1 | 1 | 0 | 8.945324 | 2.395150 | 2.395150 | 0.000000 | 3.734765 | 3.734765 | 0.000000 | 1 | 0 | 
| 3 | 1 | 3.363076 | 1.725066 | 0 | 0 | 1 | 5.801526 | 0.000000 | 0.000000 | 3.363076 | 0.000000 | 0.000000 | 1.725066 | 0 | 0 | 
| 4 | 1 | 5.503075 | 5.002680 | 1 | 0 | 1 | 27.530126 | 5.503075 | 0.000000 | 5.503075 | 5.002680 | 0.000000 | 5.002680 | 0 | 1 | 
| 5 | 1 | 4.367927 | 2.429795 | 0 | 1 | 0 | 10.613166 | 0.000000 | 4.367927 | 0.000000 | 0.000000 | 2.429795 | 0.000000 | 0 | 0 | 
| 6 | 1 | 4.821639 | 2.318772 | 1 | 1 | 0 | 11.180282 | 4.821639 | 4.821639 | 0.000000 | 2.318772 | 2.318772 | 0.000000 | 1 | 0 | 
[21]:
head(d3)
| | (Intercept) | x1 | x2 | x3public | x4med | x4high | x1:x2 | x3public:x4med | x3public:x4high |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 3.432472 | 2.672058 | 1 | 0 | 0 | 9.171762 | 0 | 0 | 
| 2 | 1 | 2.395150 | 3.734765 | 1 | 1 | 0 | 8.945324 | 1 | 0 | 
| 3 | 1 | 3.363076 | 1.725066 | 0 | 0 | 1 | 5.801526 | 0 | 0 | 
| 4 | 1 | 5.503075 | 5.002680 | 1 | 0 | 1 | 27.530126 | 0 | 1 | 
| 5 | 1 | 4.367927 | 2.429795 | 0 | 1 | 0 | 10.613166 | 0 | 0 | 
| 6 | 1 | 4.821639 | 2.318772 | 1 | 1 | 0 | 11.180282 | 1 | 0 | 
[22]:
head(d4)
| | (Intercept) | x1:x2 | x3private:x4low | x3public:x4low | x3private:x4med | x3public:x4med | x3private:x4high | x3public:x4high |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 9.171762 | 0 | 1 | 0 | 0 | 0 | 0 | 
| 2 | 1 | 8.945324 | 0 | 0 | 0 | 1 | 0 | 0 | 
| 3 | 1 | 5.801526 | 0 | 0 | 0 | 0 | 1 | 0 | 
| 4 | 1 | 27.530126 | 0 | 0 | 0 | 0 | 0 | 1 | 
| 5 | 1 | 10.613166 | 0 | 0 | 1 | 0 | 0 | 0 | 
| 6 | 1 | 11.180282 | 0 | 0 | 0 | 1 | 0 | 0 | 
[23]:
head(model.matrix(y ~ x1:x2 + x3:x4, data=df))
| | (Intercept) | x1:x2 | x3private:x4low | x3public:x4low | x3private:x4med | x3public:x4med | x3private:x4high | x3public:x4high |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 9.171762 | 0 | 1 | 0 | 0 | 0 | 0 | 
| 2 | 1 | 8.945324 | 0 | 0 | 0 | 1 | 0 | 0 | 
| 3 | 1 | 5.801526 | 0 | 0 | 0 | 0 | 1 | 0 | 
| 4 | 1 | 27.530126 | 0 | 0 | 0 | 0 | 0 | 1 | 
| 5 | 1 | 10.613166 | 0 | 0 | 1 | 0 | 0 | 0 | 
| 6 | 1 | 11.180282 | 0 | 0 | 0 | 1 | 0 | 0 | 
[24]:
head(d5)
| | x1 | x2 | x3private | x3public | x4med | x4high |
|---|---|---|---|---|---|---|
| 1 | 3.432472 | 2.672058 | 0 | 1 | 0 | 0 | 
| 2 | 2.395150 | 3.734765 | 0 | 1 | 1 | 0 | 
| 3 | 3.363076 | 1.725066 | 1 | 0 | 0 | 1 | 
| 4 | 5.503075 | 5.002680 | 0 | 1 | 0 | 1 | 
| 5 | 4.367927 | 2.429795 | 1 | 0 | 1 | 0 | 
| 6 | 4.821639 | 2.318772 | 0 | 1 | 1 | 0 |