7. Formulas
Formulas are R’s way of expressing models against the data we have. Formulas are used nearly everywhere in R, from defining models to transforming data. The general form of a formula is
\(y \sim x_1 + x_2 + \dots\),
where
\(y\) is the output (dependent variable),
\(\sim\) separates the left and right hand sides of the formula, and
\(x_1 + x_2 + \ldots\) are the inputs (independent variables).
The power of formulas comes with the operators on the right hand side.
+ means to use a variable (or a set of variables)
: indicates interactions excluding lower-order terms
* indicates interactions including lower-order terms
- indicates to remove variables
** or ^ indicates expanding interactions between sets of variables
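As a quick check of how these operators expand, we can inspect the term labels a formula generates (a minimal sketch; `terms()` works on a formula without any data attached):

```r
# Inspect how formula operators expand into model terms.
f_plus  <- y ~ x1 + x2        # main effects only
f_colon <- y ~ x1:x2          # interaction only, no main effects
f_star  <- y ~ x1 * x2        # shorthand for x1 + x2 + x1:x2
f_minus <- y ~ x1 * x2 - x2   # expand, then remove the x2 main effect

attr(terms(f_plus), "term.labels")   # "x1" "x2"
attr(terms(f_colon), "term.labels")  # "x1:x2"
attr(terms(f_star), "term.labels")   # "x1" "x2" "x1:x2"
attr(terms(f_minus), "term.labels")  # "x1" "x1:x2"
```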
Let’s look at how some of these operators work. First, let’s simulate some data.
[1]:
set.seed(37)
[2]:
N <- 5
x1 <- rnorm(N, mean=5.1, sd=1)
x2 <- rnorm(N, mean=3.2, sd=1)
y <- 5 + 3.2 * x1 * x2 + rnorm(N, mean=0, sd=1)
df <- data.frame(x1=x1, x2=x2, y=y)
[3]:
df
x1 | x2 | y |
---|---|---|
<dbl> | <dbl> | <dbl> |
5.224754 | 2.867286 | 52.56107 |
5.482075 | 3.007840 | 57.80415 |
5.679243 | 4.562983 | 89.35053 |
4.806252 | 4.055954 | 68.36291 |
4.271651 | 3.415995 | 52.00467 |
With the data stored in a dataframe df, let’s build some models using formulas. Observe how we can also apply R functions to the variables (e.g. log or exp).
y ~ x1 + x2
will create a model of \(y\) against \(x_1\) and \(x_2\)
y ~ (x1 + x2) ** 2
will create a model of \(y\) against \(x_1\), \(x_2\) and \(x_1 x_2\)
y ~ (x1 + x2)^2
will create a model of \(y\) against \(x_1\), \(x_2\) and \(x_1 x_2\) (same as above)
y ~ log(x1) + exp(x2)
will create a model of \(y\) against \(\log(x_1)\) and \(e^{x_2}\)
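The ^ operator becomes more useful with larger sets of variables. As a sketch (with a hypothetical third variable x3), (x1 + x2 + x3)^2 expands to all main effects plus all 2-way interactions:

```r
# ^2 on a set of variables expands to main effects plus all 2-way interactions.
attr(terms(y ~ (x1 + x2 + x3)^2), "term.labels")
# "x1" "x2" "x3" "x1:x2" "x1:x3" "x2:x3"
```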
[4]:
m1 = lm(y ~ x1 + x2, data=df)
m2 = lm(y ~ (x1 + x2) ** 2, data=df)
m3 = lm(y ~ (x1 + x2)^2, data=df)
m4 = lm(y ~ log(x1) + exp(x2), data=df)
[5]:
print(summary(m1))
Call:
lm(formula = y ~ x1 + x2, data = df)
Residuals:
1 2 3 4 5
0.09791 -0.23830 0.51508 -0.99120 0.61651
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -61.0669 4.4971 -13.58 0.00538 **
x1 11.6752 0.8276 14.11 0.00499 **
x2 18.3205 0.6504 28.17 0.00126 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9204 on 2 degrees of freedom
Multiple R-squared: 0.9983, Adjusted R-squared: 0.9965
F-statistic: 574.4 on 2 and 2 DF, p-value: 0.001738
[6]:
print(summary(m2))
Call:
lm(formula = y ~ (x1 + x2)^2, data = df)
Residuals:
1 2 3 4 5
-0.12166 0.10999 0.00240 -0.03846 0.04773
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.5418 7.9916 -0.318 0.8040
x1 0.8218 1.4819 0.555 0.6777
x2 2.2141 2.1902 1.011 0.4965
x1:x2 2.9760 0.4040 7.366 0.0859 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1751 on 1 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 0.9999
F-statistic: 1.06e+04 on 3 and 1 DF, p-value: 0.007141
[7]:
print(summary(m3))
Call:
lm(formula = y ~ (x1 + x2)^2, data = df)
Residuals:
1 2 3 4 5
-0.12166 0.10999 0.00240 -0.03846 0.04773
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.5418 7.9916 -0.318 0.8040
x1 0.8218 1.4819 0.555 0.6777
x2 2.2141 2.1902 1.011 0.4965
x1:x2 2.9760 0.4040 7.366 0.0859 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1751 on 1 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 0.9999
F-statistic: 1.06e+04 on 3 and 1 DF, p-value: 0.007141
[8]:
print(summary(m4))
Call:
lm(formula = y ~ log(x1) + exp(x2), data = df)
Residuals:
1 2 3 4 5
-1.2924 1.1336 -0.3295 0.6047 -0.1165
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.08037 9.60952 -1.257 0.33562
log(x1) 35.41011 6.07079 5.833 0.02816 *
exp(x2) 0.41994 0.02107 19.929 0.00251 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.312 on 2 degrees of freedom
Multiple R-squared: 0.9965, Adjusted R-squared: 0.9929
F-statistic: 282.1 on 2 and 2 DF, p-value: 0.003532
What about factors, or categorical variables? How do formulas work with those types of variables? Let’s generate a dataset with some categorical variables.
x1 is a continuous variable
x2 is a continuous variable
x3 is a categorical variable
x4 is a categorical variable
y is the result of \(5 + 3.2 x_1 x_2 + 6.6 x_3 x_4 + e\), where \(e \sim \mathcal{N}(0, 1)\)
[9]:
N <- 2000
x1 <- rnorm(N, mean=5.1, sd=1)
x2 <- rnorm(N, mean=3.2, sd=1)
x3 <- factor(sample(0:1, N, replace=TRUE), labels=c('private', 'public'))
x4 <- factor(sample(0:2, N, replace=TRUE), labels=c('low', 'med', 'high'))
y <- 5 + 3.2 * x1 * x2 + 6.6 * as.numeric(x3) * as.numeric(x4) + rnorm(N, mean=0, sd=1)
df <- data.frame(x1=x1, x2=x2, x3=x3, x4=x4, y=y)
[10]:
head(df)
x1 | x2 | x3 | x4 | y | |
---|---|---|---|---|---|
<dbl> | <dbl> | <fct> | <fct> | <dbl> | |
1 | 3.432472 | 2.672058 | public | low | 45.63740 |
2 | 2.395150 | 3.734765 | public | med | 59.05097 |
3 | 3.363076 | 1.725066 | private | high | 43.52035 |
4 | 5.503075 | 5.002680 | public | high | 132.36280 |
5 | 4.367927 | 2.429795 | private | med | 53.37352 |
6 | 4.821639 | 2.318772 | public | med | 66.96244 |
Here are some example models that we can define using the categorical variables, as well as a mixture of all the variables.
[11]:
m1 = lm(y ~ x3)
m2 = lm(y ~ x4)
m3 = lm(y ~ x1 + x2 + x3 + x4, data=df)
m4 = lm(y ~ (x1 + x2 + x3 + x4)^2, data=df)
m5 = lm(y ~ x1*x2 + x3*x4, data=df)
m6 = lm(y ~ x1:x2 + x3:x4, data=df)
Notice how there are 2 values for x3, which are
private
public
If we one-hot encode x3, we would expect to see two new variables, one to indicate when x3 = private and one when x3 = public.
x3private
x3public
However, using a formula, the correct thing to do, which is to drop one of these new variables, is taken care of automatically (we only see x3public as a coefficient).
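The dropped reference level is easy to see on a toy factor (a minimal sketch, separate from df):

```r
# A two-level factor yields a single 0/1 dummy column; the first level
# ('private') becomes the reference level, absorbed into the intercept.
f <- factor(c('private', 'public', 'public'))
colnames(model.matrix(~ f))  # "(Intercept)" "fpublic"
```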
[12]:
summary(m1)
Call:
lm(formula = y ~ x3)
Residuals:
Min 1Q Median 3Q Max
-60.824 -14.969 -1.775 14.612 82.802
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 70.7760 0.6955 101.76 <2e-16 ***
x3public 13.3969 0.9831 13.63 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.98 on 1998 degrees of freedom
Multiple R-squared: 0.08503, Adjusted R-squared: 0.08457
F-statistic: 185.7 on 1 and 1998 DF, p-value: < 2.2e-16
Likewise, there are 3 values for x4, which are
low
med
high
Again, the formula drops one of the new one-hot encoded variables (in this case, x4low) from the model.
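If we want a different level dropped, relevel() changes which level is treated as the reference (another toy sketch):

```r
# relevel() controls which factor level is the reference (and thus dropped).
g <- factor(c('low', 'med', 'high'), levels=c('low', 'med', 'high'))
colnames(model.matrix(~ g))   # "(Intercept)" "gmed" "ghigh"

g2 <- relevel(g, ref='high')  # make 'high' the reference level instead
colnames(model.matrix(~ g2))  # "(Intercept)" "g2low" "g2med"
```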
[13]:
summary(m2)
Call:
lm(formula = y ~ x4)
Residuals:
Min 1Q Median 3Q Max
-58.143 -14.989 -1.665 13.338 85.816
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.9659 0.8322 81.671 < 2e-16 ***
x4med 9.0121 1.1646 7.739 1.59e-14 ***
x4high 20.0773 1.1921 16.842 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.51 on 1997 degrees of freedom
Multiple R-squared: 0.1246, Adjusted R-squared: 0.1237
F-statistic: 142.1 on 2 and 1997 DF, p-value: < 2.2e-16
A simple model against all the continuous and categorical variables just works. Observe the coefficients:
x1
x2
x3public
x4med
x4high
[14]:
summary(m3)
Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = df)
Residuals:
Min 1Q Median 3Q Max
-25.284 -2.929 -0.063 2.923 24.662
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -43.13332 0.61813 -69.78 <2e-16 ***
x1 10.13366 0.09791 103.50 <2e-16 ***
x2 16.30242 0.09683 168.36 <2e-16 ***
x3public 12.99560 0.19968 65.08 <2e-16 ***
x4med 9.73536 0.24155 40.30 <2e-16 ***
x4high 19.57656 0.24730 79.16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.46 on 1994 degrees of freedom
Multiple R-squared: 0.9624, Adjusted R-squared: 0.9623
F-statistic: 1.021e+04 on 5 and 1994 DF, p-value: < 2.2e-16
If we want all the variables and up to the 2-way interactions, a formula over a mixture of continuous and categorical variables knows what to do.
[15]:
summary(m4)
Call:
lm(formula = y ~ (x1 + x2 + x3 + x4)^2, data = df)
Residuals:
Min 1Q Median 3Q Max
-3.8136 -0.6552 0.0010 0.6548 3.1319
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.827014 0.422299 28.006 <2e-16 ***
x1 -0.051900 0.078882 -0.658 0.511
x2 -0.135147 0.115097 -1.174 0.240
x3public 6.423600 0.275072 23.352 <2e-16 ***
x4med 6.987297 0.322002 21.700 <2e-16 ***
x4high 13.358834 0.335365 39.834 <2e-16 ***
x1:x2 3.225207 0.021014 153.477 <2e-16 ***
x1:x3public 0.008138 0.043964 0.185 0.853
x1:x4med -0.042248 0.052587 -0.803 0.422
x1:x4high -0.031927 0.054832 -0.582 0.560
x2:x3public 0.031676 0.043548 0.727 0.467
x2:x4med -0.034890 0.052861 -0.660 0.509
x2:x4high 0.028974 0.053573 0.541 0.589
x3public:x4med 6.569441 0.108319 60.649 <2e-16 ***
x3public:x4high 13.180202 0.110819 118.935 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9979 on 1985 degrees of freedom
Multiple R-squared: 0.9981, Adjusted R-squared: 0.9981
F-statistic: 7.556e+04 on 14 and 1985 DF, p-value: < 2.2e-16
Below, we only wanted interactions between x1 and x2, and between x3 and x4. Since we use *, lower-order terms are included.
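A quick way to convince ourselves that * is just shorthand for the expanded form (a sketch on toy simulated data, not the df above):

```r
# x1*x2 expands to x1 + x2 + x1:x2 -- the two formulas give identical fits.
set.seed(1)
d <- data.frame(x1=rnorm(20), x2=rnorm(20))
d$y <- 1 + d$x1 + d$x2 + d$x1 * d$x2 + rnorm(20, sd=0.1)

ma <- lm(y ~ x1 * x2, data=d)
mb <- lm(y ~ x1 + x2 + x1:x2, data=d)
all.equal(coef(ma), coef(mb))  # TRUE
```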
[16]:
summary(m5)
Call:
lm(formula = y ~ x1 * x2 + x3 * x4, data = df)
Residuals:
Min 1Q Median 3Q Max
-3.8313 -0.6466 -0.0017 0.6646 3.2065
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.90330 0.36588 32.533 <2e-16 ***
x1 -0.07572 0.06986 -1.084 0.279
x2 -0.12615 0.10910 -1.156 0.248
x3public 6.56763 0.07720 85.077 <2e-16 ***
x4med 6.66300 0.07740 86.088 <2e-16 ***
x4high 13.28896 0.07737 171.749 <2e-16 ***
x1:x2 3.22624 0.02091 154.259 <2e-16 ***
x3public:x4med 6.56425 0.10806 60.749 <2e-16 ***
x3public:x4high 13.18146 0.11064 119.138 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.997 on 1991 degrees of freedom
Multiple R-squared: 0.9981, Adjusted R-squared: 0.9981
F-statistic: 1.325e+05 on 8 and 1991 DF, p-value: < 2.2e-16
Below, we only wanted interactions between x1 and x2, and between x3 and x4. Since we use :, lower-order terms are excluded. Notice in the output that the coefficient for x3public:x4high is NA: without the lower-order terms, the full set of x3:x4 dummy variables is linearly dependent with the intercept, so R drops one of them (a singularity).
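The effect of : on the design matrix is easy to see with toy data (a sketch, separate from df): only the product column appears, with no main effects.

```r
# With ':' alone between two continuous variables, only the product
# column x1:x2 enters the design matrix -- no x1 or x2 main effects.
d <- data.frame(x1=1:4, x2=c(2, 3, 5, 7), y=c(1, 2, 3, 4))
colnames(model.matrix(y ~ x1:x2, data=d))  # "(Intercept)" "x1:x2"
```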
[17]:
summary(m6)
Call:
lm(formula = y ~ x1:x2 + x3:x4, data = df)
Residuals:
Min 1Q Median 3Q Max
-3.8383 -0.6496 -0.0037 0.6631 3.2602
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 44.538203 0.082103 542.5 <2e-16 ***
x1:x2 3.202488 0.003532 906.7 <2e-16 ***
x3private:x4low -33.034951 0.079399 -416.1 <2e-16 ***
x3public:x4low -26.469080 0.079061 -334.8 <2e-16 ***
x3private:x4med -26.375155 0.079266 -332.7 <2e-16 ***
x3public:x4med -13.241870 0.077710 -170.4 <2e-16 ***
x3private:x4high -19.747553 0.079238 -249.2 <2e-16 ***
x3public:x4high NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9968 on 1993 degrees of freedom
Multiple R-squared: 0.9981, Adjusted R-squared: 0.9981
F-statistic: 1.767e+05 on 6 and 1993 DF, p-value: < 2.2e-16
If we want the design matrix resulting from a formula, we can use the model.matrix() function. Observe that the (Intercept) column is returned. To exclude the intercept, add - 1 to the formula.
[18]:
d1 = model.matrix(y ~ x1 + x2 + x3 + x4, data=df)
d2 = model.matrix(y ~ (x1 + x2 + x3 + x4)^2, data=df)
d3 = model.matrix(y ~ x1*x2 + x3*x4, data=df)
d4 = model.matrix(y ~ x1:x2 + x3:x4, data=df)
d5 = model.matrix(y ~ x1 + x2 + x3 + x4 - 1, data=df)
[19]:
head(d1)
(Intercept) | x1 | x2 | x3public | x4med | x4high | |
---|---|---|---|---|---|---|
1 | 1 | 3.432472 | 2.672058 | 1 | 0 | 0 |
2 | 1 | 2.395150 | 3.734765 | 1 | 1 | 0 |
3 | 1 | 3.363076 | 1.725066 | 0 | 0 | 1 |
4 | 1 | 5.503075 | 5.002680 | 1 | 0 | 1 |
5 | 1 | 4.367927 | 2.429795 | 0 | 1 | 0 |
6 | 1 | 4.821639 | 2.318772 | 1 | 1 | 0 |
[20]:
head(d2)
(Intercept) | x1 | x2 | x3public | x4med | x4high | x1:x2 | x1:x3public | x1:x4med | x1:x4high | x2:x3public | x2:x4med | x2:x4high | x3public:x4med | x3public:x4high | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 3.432472 | 2.672058 | 1 | 0 | 0 | 9.171762 | 3.432472 | 0.000000 | 0.000000 | 2.672058 | 0.000000 | 0.000000 | 0 | 0 |
2 | 1 | 2.395150 | 3.734765 | 1 | 1 | 0 | 8.945324 | 2.395150 | 2.395150 | 0.000000 | 3.734765 | 3.734765 | 0.000000 | 1 | 0 |
3 | 1 | 3.363076 | 1.725066 | 0 | 0 | 1 | 5.801526 | 0.000000 | 0.000000 | 3.363076 | 0.000000 | 0.000000 | 1.725066 | 0 | 0 |
4 | 1 | 5.503075 | 5.002680 | 1 | 0 | 1 | 27.530126 | 5.503075 | 0.000000 | 5.503075 | 5.002680 | 0.000000 | 5.002680 | 0 | 1 |
5 | 1 | 4.367927 | 2.429795 | 0 | 1 | 0 | 10.613166 | 0.000000 | 4.367927 | 0.000000 | 0.000000 | 2.429795 | 0.000000 | 0 | 0 |
6 | 1 | 4.821639 | 2.318772 | 1 | 1 | 0 | 11.180282 | 4.821639 | 4.821639 | 0.000000 | 2.318772 | 2.318772 | 0.000000 | 1 | 0 |
[21]:
head(d3)
(Intercept) | x1 | x2 | x3public | x4med | x4high | x1:x2 | x3public:x4med | x3public:x4high | |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 3.432472 | 2.672058 | 1 | 0 | 0 | 9.171762 | 0 | 0 |
2 | 1 | 2.395150 | 3.734765 | 1 | 1 | 0 | 8.945324 | 1 | 0 |
3 | 1 | 3.363076 | 1.725066 | 0 | 0 | 1 | 5.801526 | 0 | 0 |
4 | 1 | 5.503075 | 5.002680 | 1 | 0 | 1 | 27.530126 | 0 | 1 |
5 | 1 | 4.367927 | 2.429795 | 0 | 1 | 0 | 10.613166 | 0 | 0 |
6 | 1 | 4.821639 | 2.318772 | 1 | 1 | 0 | 11.180282 | 1 | 0 |
[22]:
head(d4)
(Intercept) | x1:x2 | x3private:x4low | x3public:x4low | x3private:x4med | x3public:x4med | x3private:x4high | x3public:x4high | |
---|---|---|---|---|---|---|---|---|
1 | 1 | 9.171762 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 1 | 8.945324 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 1 | 5.801526 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 1 | 27.530126 | 0 | 0 | 0 | 0 | 0 | 1 |
5 | 1 | 10.613166 | 0 | 0 | 1 | 0 | 0 | 0 |
6 | 1 | 11.180282 | 0 | 0 | 0 | 1 | 0 | 0 |
[24]:
head(d5)
x1 | x2 | x3private | x3public | x4med | x4high | |
---|---|---|---|---|---|---|
1 | 3.432472 | 2.672058 | 0 | 1 | 0 | 0 |
2 | 2.395150 | 3.734765 | 0 | 1 | 1 | 0 |
3 | 3.363076 | 1.725066 | 1 | 0 | 0 | 1 |
4 | 5.503075 | 5.002680 | 0 | 1 | 0 | 1 |
5 | 4.367927 | 2.429795 | 1 | 0 | 1 | 0 |
6 | 4.821639 | 2.318772 | 0 | 1 | 1 | 0 |
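As an aside (a small check on toy data): adding + 0 to a formula is an equivalent way to drop the intercept.

```r
# '- 1' and '+ 0' both remove the intercept column from the design matrix.
d <- data.frame(x1=1:3, y=c(2, 4, 6))
identical(colnames(model.matrix(y ~ x1 - 1, data=d)),
          colnames(model.matrix(y ~ x1 + 0, data=d)))  # TRUE
```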