4. Hypothesis Testing

[1]:
set.seed(37)

4.1. Student’s t-test

The Student's t-test compares the means of two samples to determine whether they differ. Here is a two-sided Student's t-test.

[2]:
x <- rnorm(1000, mean=0, sd=1)
y <- rnorm(1000, mean=1, sd=1)

r <- t.test(x, y, alternative='two.sided')
print(r)

        Welch Two Sample t-test

data:  x and y
t = -23.159, df = 1998, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.1425178 -0.9641235
sample estimates:
  mean of x   mean of y
-0.01839959  1.03492108
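
Note that t.test defaults to the Welch variant, which does not assume equal variances; that is why the output above is labeled a Welch test. Passing var.equal=TRUE gives the classic pooled-variance Student's t-test. A minimal sketch, reusing x and y from the cell above:

# classic Student's t-test assuming equal variances
r <- t.test(x, y, alternative='two.sided', var.equal=TRUE)
print(r)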

Here is a directional Student’s t-test to see if the mean of x is greater than the mean of y.

[3]:
x <- rnorm(1000, mean=0, sd=1)
y <- rnorm(1000, mean=1, sd=1)

r <- t.test(x, y, alternative='greater')
print(r)

        Welch Two Sample t-test

data:  x and y
t = -22.576, df = 1991.2, p-value = 1
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -1.118479       Inf
sample estimates:
 mean of x  mean of y
0.01325957 1.05574987

Here is a directional Student’s t-test to see if the mean of x is less than the mean of y.

[4]:
x <- rnorm(1000, mean=0, sd=1)
y <- rnorm(1000, mean=1, sd=1)

r <- t.test(x, y, alternative='less')
print(r)

        Welch Two Sample t-test

data:  x and y
t = -22.097, df = 1996.7, p-value < 2.2e-16
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
       -Inf -0.9224035
sample estimates:
 mean of x  mean of y
0.01069279 1.00731729

We may also perform a one-sample Student’s t-test.

[5]:
x <- rnorm(1000, mean=0, sd=1)

r <- t.test(x, mu=5)
print(r)

        One Sample t-test

data:  x
t = -159.87, df = 999, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
 -0.13452024 -0.01000024
sample estimates:
  mean of x
-0.07226024

If your data is in long format, you may use a formula to perform a Student’s t-test.

[6]:
data <- data.frame(
    score = c(90, 89, 70, 99, 100, 77, 80, 67, 70),
    gender = c(rep('girl', 5), rep('boy', 4))
)

r <- t.test(score ~ gender, data=data)
print(r)

        Welch Two Sample t-test

data:  score by gender
t = -2.6069, df = 6.0971, p-value = 0.0397
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -31.15404  -1.04596
sample estimates:
 mean in group boy mean in group girl
              73.5               89.6
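
The object returned by t.test is a list of class htest, so individual components of the result can be accessed directly. A minimal sketch, assuming the result r from the cell above:

print(r$statistic)  # the t statistic
print(r$p.value)    # the p-value
print(r$conf.int)   # the confidence interval
print(r$estimate)   # the group means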

4.2. Wilcoxon U-Test

The Wilcoxon U-test (also known as the Mann–Whitney or rank-sum test) is a non-parametric test used to compare two samples. The wilcox.test function behaves the same way as the t.test function.

[7]:
x <- rnorm(1000, mean=0, sd=1)
y <- rnorm(1000, mean=0.5, sd=1)

r <- wilcox.test(x, y)
print(r)

        Wilcoxon rank sum test with continuity correction

data:  x and y
W = 339274, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
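
For paired samples, wilcox.test can also perform the Wilcoxon signed-rank test by setting paired=TRUE. A minimal sketch using simulated before/after measurements (hypothetical data, not from the cells above):

before <- rnorm(100, mean=0, sd=1)
after <- before + rnorm(100, mean=0.3, sd=1)

# paired (signed-rank) test
r <- wilcox.test(before, after, paired=TRUE)
print(r)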

4.3. Correlation

We may also compute the correlation between two variables and test its significance.

[8]:
x <- seq(1, 1000)
y <- x * 2 + rnorm(1000, mean=5, sd=5)

c <- cor(x, y)
print(c)
[1] 0.9999633
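
cor also computes rank-based coefficients through its method argument. A minimal sketch, reusing x and y from the cell above:

print(cor(x, y, method='spearman'))  # Spearman's rho
print(cor(x, y, method='kendall'))   # Kendall's tau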

We compute the covariance with the cov function.

[9]:
x <- seq(1, 1000)
y <- x * 2 + rnorm(1000, mean=5, sd=5)

c <- cov(x, y)
print(c)
[1] 166818.4
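
The Pearson correlation is just the covariance scaled by the two standard deviations, which we can verify directly. A quick check, reusing x and y from the cell above:

print(cov(x, y) / (sd(x) * sd(y)))  # should match cor(x, y)
print(cor(x, y))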

We test the significance of the correlation with cor.test.

[10]:
x <- seq(1, 1000)
y <- x * 2 + rnorm(1000, mean=5, sd=5)

r <- cor.test(x, y)
print(r)

        Pearson's product-moment correlation

data:  x and y
t = 3806.6, df = 998, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9999610 0.9999696
sample estimates:
      cor
0.9999656
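
cor.test accepts the same method argument, so a rank-based significance test works the same way. A minimal sketch:

r <- cor.test(x, y, method='spearman')
print(r)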

4.4. Chi-squared test

A Chi-squared test is used to test for association between categorical variables in a contingency table.

[11]:
df <- data.frame(
    rural = c(10, 15, 12),
    urban = c(20, 30, 25),
    row.names=c('DC', 'MD', 'VA')
)

r <- chisq.test(df)
print(r)

        Pearson's Chi-squared test

data:  df
X-squared = 0.0090902, df = 2, p-value = 0.9955
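
The result object stores the observed and expected counts along with the Pearson residuals, which show where any association comes from. A minimal sketch, assuming the result r from the cell above:

print(r$observed)   # observed counts
print(r$expected)   # expected counts under independence
print(r$residuals)  # Pearson residuals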

A goodness-of-fit test using the Chi-squared statistic is performed as follows.

[12]:
df <- data.frame(
    rural = c(10, 15, 12),
    urban = c(20, 30, 25),
    row.names=c('DC', 'MD', 'VA')
)

r <- chisq.test(df$rural, p=df$urban, rescale.p=TRUE)
print(r)

        Chi-squared test for given probabilities

data:  df$rural
X-squared = 0.013514, df = 2, p-value = 0.9933
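
If p is omitted, chisq.test tests the counts against equal probabilities for every category. A minimal sketch testing whether the rural counts are uniform across the three states:

r <- chisq.test(df$rural)
print(r)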

4.5. Analysis of variance

4.5.1. One-way analysis of variance

A one-way analysis of variance (ANOVA) may be conducted as follows.

[13]:
library(tidyr)

df <- data.frame(
    city = c('A', 'B', 'C', 'D', 'E'),
    urban = c(20, 25, 22, 24, 21),
    rural = c(10, 15, 12, 14, 11),
    suburb = c(15, 18, 19, 20, 17)
)

df <- df %>% pivot_longer(-city, names_to='location', values_to='expense')
r <- aov(expense ~ location, data=df)
print(r)
print('-- summary below --')
print(summary(r))
Call:
   aov(formula = expense ~ location, data = df)

Terms:
                location Residuals
Sum of Squares  250.5333   49.2000
Deg. of Freedom        2        12

Residual standard error: 2.024846
Estimated effects may be unbalanced
[1] "-- summary below --"
            Df Sum Sq Mean Sq F value   Pr(>F)
location     2  250.5   125.3   30.55 1.96e-05 ***
Residuals   12   49.2     4.1
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
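
ANOVA assumes approximately normal residuals and equal variances across groups; shapiro.test and bartlett.test can check these, and kruskal.test offers a non-parametric alternative. A minimal sketch, reusing the fitted model r and the long-format df from above:

print(shapiro.test(residuals(r)))                  # normality of residuals
print(bartlett.test(expense ~ location, data=df))  # homogeneity of variance
print(kruskal.test(expense ~ location, data=df))   # non-parametric alternative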

4.5.1.1. Post-hoc test

We apply Tukey's Honestly Significant Difference (HSD) test to see which pairs differ.

[14]:
t <- TukeyHSD(r)
print(t)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = expense ~ location, data = df)

$location
             diff      lwr       upr     p adj
suburb-rural  5.4 1.983468  8.816532 0.0031673
urban-rural  10.0 6.583468 13.416532 0.0000133
urban-suburb  4.6 1.183468  8.016532 0.0095794
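
An alternative post-hoc approach is pairwise t-tests with a multiplicity correction such as Bonferroni. A minimal sketch on the same long-format data:

p <- pairwise.t.test(df$expense, df$location, p.adjust.method='bonferroni')
print(p)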

4.5.1.2. Obtaining the effects

[15]:
e <- model.tables(r, type='effects')
print(e)
Tables of effects

 location
location
 rural suburb  urban
-5.133  0.267  4.867

4.5.1.3. Obtaining the means

[16]:
m <- model.tables(r, type='means')
print(m)
Tables of means
Grand mean

17.53333

 location
location
 rural suburb  urban
  12.4   17.8   22.4
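
The same group means can be recovered directly from the data as a sanity check. A minimal sketch with aggregate:

print(aggregate(expense ~ location, data=df, FUN=mean))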

4.5.1.4. Visualizing the means

[17]:
options(repr.plot.width=4, repr.plot.height=4)

boxplot(expense ~ location, data=df)
_images/hypothesis_33_0.png

4.5.1.5. Visualizing the differences

[18]:
options(repr.plot.width=5, repr.plot.height=3)

op = par(mar = c(5, 8, 4, 2))
plot(t, cex=0.2, las=1)
par(op)
_images/hypothesis_35_0.png

4.5.2. Two-way ANOVA

A two-way ANOVA tests the effects of two factors, and their interaction, on a response. Below, we simulate expense data that depend on both location and income.

[19]:
suppressMessages({
    library('dplyr')
})

N = 5
a <- 5 + 20 * rnorm(N, mean=20, sd=1) + 4 * rnorm(N, mean=4, sd=1) # urban-high
b <- 5 + 18 * rnorm(N, mean=18, sd=1) + 2 * rnorm(N, mean=2, sd=1) # urban-low
c <- 5 + 10 * rnorm(N, mean=10, sd=1) + 4 * rnorm(N, mean=4, sd=1) # suburban-high
d <- 5 + 8 * rnorm(N, mean=8, sd=1) + 2 * rnorm(N, mean=2, sd=1) # suburban-low
e <- 5 + 5 * rnorm(N, mean=5, sd=1) + 4 * rnorm(N, mean=4, sd=1) # rural-high
f <- 5 + 3 * rnorm(N, mean=3, sd=1) + 2 * rnorm(N, mean=2, sd=1) # rural-low

df <- data.frame(
    expense=c(a, b, c, d, e, f),
    location=c(rep('urban', 2*N), rep('suburban', 2*N), rep('rural', 2*N)),
    income=c(rep('high', N), rep('low', N), rep('high', N), rep('low', N), rep('high', N), rep('low', N)),
    stringsAsFactors=TRUE
)
[20]:
r <- aov(expense ~ location * income, data=df)

print(r)
print('-- summary below --')
print(summary(r))
Call:
   aov(formula = expense ~ location * income, data = df)

Terms:
                location   income location:income Residuals
Sum of Squares  687822.6  24346.4          4833.1    7098.1
Deg. of Freedom        2        1               2        24

Residual standard error: 17.19753
Estimated effects may be unbalanced
[1] "-- summary below --"
                Df Sum Sq Mean Sq  F value   Pr(>F)
location         2 687823  343911 1162.825  < 2e-16 ***
income           1  24346   24346   82.319 3.17e-09 ***
location:income  2   4833    2417    8.171  0.00197 **
Residuals       24   7098     296
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
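
Whether the interaction term is worth keeping can also be judged by comparing the additive model to the interaction model with anova. A minimal sketch:

m1 <- aov(expense ~ location + income, data=df)  # additive model
m2 <- aov(expense ~ location * income, data=df)  # model with interaction
print(anova(m1, m2))                             # F test for the interaction term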

4.5.2.1. Two-Way ANOVA post-hoc

[21]:
t <- TukeyHSD(r)
print(t)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = expense ~ location * income, data = df)

$location
                    diff       lwr       upr p adj
suburban-rural  60.83863  41.63207  80.04519 1e-07
urban-rural    347.27478 328.06822 366.48134 0e+00
urban-suburban 286.43615 267.22959 305.64271 0e+00

$income
              diff       lwr       upr p adj
low-high -56.97529 -69.93585 -44.01473     0

$`location:income`
                                 diff        lwr         upr     p adj
suburban:high-rural:high     69.64827   36.01835  103.278196 0.0000173
urban:high-rural:high       377.50133  343.87140  411.131254 0.0000000
rural:low-rural:high        -30.95116  -64.58109    2.678762 0.0837968
suburban:low-rural:high      21.07783  -12.55210   54.707751 0.4049037
urban:low-rural:high        286.09707  252.46714  319.726994 0.0000000
urban:high-suburban:high    307.85306  274.22313  341.482983 0.0000000
rural:low-suburban:high    -100.59943 -134.22936  -66.969509 0.0000000
suburban:low-suburban:high  -48.57045  -82.20037  -14.940521 0.0019940
urban:low-suburban:high     216.44880  182.81887  250.078723 0.0000000
rural:low-urban:high       -408.45249 -442.08242 -374.822567 0.0000000
suburban:low-urban:high    -356.42350 -390.05343 -322.793579 0.0000000
urban:low-urban:high        -91.40426 -125.03418  -57.774335 0.0000002
suburban:low-rural:low       52.02899   18.39906   85.658913 0.0009103
urban:low-rural:low         317.04823  283.41831  350.678157 0.0000000
urban:low-suburban:low      265.01924  231.38932  298.649168 0.0000000

4.5.2.2. Two-Way ANOVA effects

[22]:
e <- model.tables(r, type='effects')
print(e)
Tables of effects

 location
location
   rural suburban    urban
 -136.04   -75.20   211.24

 income
income
   high     low
 28.488 -28.488

 location:income
          income
location   high    low
  rural    -13.012  13.012
  suburban  -4.202   4.202
  urban     17.214 -17.214

4.5.2.3. Two-Way ANOVA means

[23]:
m <- model.tables(r, type='means')
print(m)
Tables of means
Grand mean

168.0042

 location
location
   rural suburban    urban
    32.0     92.8    379.2

 income
income
  high    low
196.49 139.52

 location:income
          income
location   high  low
  rural     47.4  16.5
  suburban 117.1  68.5
  urban    424.9 333.5

4.5.2.4. Two-Way ANOVA means visualization

[24]:
options(repr.plot.width=5, repr.plot.height=5)

op = par(mar = c(8, 4, 4, 2))
boxplot(expense ~ location * income, data = df, cex.axis = 0.9, las=2, xlab='')
par(op)
_images/hypothesis_46_0.png

4.5.2.5. Two-Way ANOVA differences visualization

[25]:
options(repr.plot.width=5, repr.plot.height=3)

op = par(mar = c(5, 14, 4, 2))
plot(t, cex=0.2, las=1)
par(op)
_images/hypothesis_48_0.png
_images/hypothesis_48_1.png
_images/hypothesis_48_2.png

4.5.2.6. Two-Way ANOVA interaction plot

[26]:
options(repr.plot.width=5, repr.plot.height=5)

attach(df)
interaction.plot(location, income, expense)
detach(df)
_images/hypothesis_50_0.png
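
attach and detach work, but with avoids leaving the data frame's columns on the search path and is generally safer. An equivalent sketch:

with(df, interaction.plot(location, income, expense))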