3. Data Distributions
[1]:
set.seed(37)
3.1. Normal distribution
The following functions work with the normal distribution.
- rnorm: draws random samples from the normal distribution
- pnorm: returns the cumulative probability up to a given quantile
- qnorm: returns the quantile for a given cumulative probability
- dnorm: returns the density at each value in a vector
The rnorm function simply samples data points.
[2]:
s <- rnorm(10, mean=0, sd=1)
print(s)
[1] 0.1247540 0.3820746 0.5792428 -0.2937481 -0.8283492 -0.3327136
[7] -0.1921595 1.3629827 0.8559544 0.2159955
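As a small sketch of how sampling interacts with set.seed, the block below resets the seed before two identical calls and then checks the summary statistics of a larger sample. The variable names (first, second, big) are illustrative and are not part of the original notebook.

```r
# Resetting the seed makes the random draws reproducible.
set.seed(37)
first <- rnorm(5, mean=0, sd=1)
set.seed(37)
second <- rnorm(5, mean=0, sd=1)
print(identical(first, second))

# With a larger sample, the sample mean and standard deviation
# should land close to the parameters passed to rnorm.
big <- rnorm(10000, mean=0, sd=1)
print(c(mean=mean(big), sd=sd(big)))
```

Note that running this block changes the state of the random number generator, so later cells would produce different samples than the outputs shown in this section.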
The pnorm function gives you the cumulative probability up to the specified quantile.
[3]:
p <- pnorm(c(-1.0, -0.4, -0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 1.0), mean=0, sd=1)
print(p)
[1] 0.1586553 0.3445783 0.3820886 0.4207403 0.4601722 0.5000000 0.5398278
[8] 0.5792597 0.6179114 0.6554217 0.8413447
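Because pnorm returns cumulative probabilities, the probability of an interval is a difference of two calls. The sketch below uses illustrative variable names; lower.tail=FALSE is a standard pnorm argument that returns the upper-tail probability instead.

```r
# Probability that a standard normal value falls within 1 or 2
# standard deviations of the mean.
within_1sd <- pnorm(1) - pnorm(-1)
within_2sd <- pnorm(2) - pnorm(-2)

# Upper-tail probability beyond 1.96.
upper_tail <- pnorm(1.96, lower.tail=FALSE)

print(round(c(within_1sd, within_2sd, upper_tail), 4))
```

The first two values are approximately 0.6827 and 0.9545, and the upper tail is approximately 0.025.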
The qnorm function maps a cumulative probability back to its quantile, so it is the inverse of pnorm.
[4]:
q <- qnorm(p, mean=0, sd=1)
print(q)
[1] -1.0 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4 1.0
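A common use of qnorm is looking up critical values. The sketch below, with illustrative variable names, computes the two-sided 95% and 90% critical values for the standard normal and checks the round trip through pnorm.

```r
# Critical values for two-sided intervals on the standard normal.
z_95 <- qnorm(0.975)  # leaves 2.5% in each tail
z_90 <- qnorm(0.95)   # leaves 5% in each tail
print(round(c(z_95, z_90), 4))

# pnorm undoes qnorm (up to floating-point error).
print(all.equal(pnorm(qnorm(0.3)), 0.3))
```

The critical values are approximately 1.96 and 1.6449.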
The dnorm function gives the probability density at each data point. For a continuous distribution this is the height of the density curve, not a probability.
[5]:
d <- dnorm(s, mean=0, sd=1)
print(d)
[1] 0.3958498 0.3708606 0.3373280 0.3820963 0.2830817 0.3774611 0.3916443
[8] 0.1575835 0.2765766 0.3897438
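To make the meaning of the density concrete, the sketch below evaluates the standard normal density formula by hand, compares it with dnorm, and integrates dnorm over the real line. The names x0, manual, and area are illustrative.

```r
# Standard normal density: exp(-x^2 / 2) / sqrt(2 * pi)
x0 <- c(-1, 0, 1)
manual <- exp(-x0^2 / 2) / sqrt(2 * pi)
print(all.equal(manual, dnorm(x0, mean=0, sd=1)))

# The density integrates to 1 over the whole real line.
area <- integrate(dnorm, lower=-Inf, upper=Inf)$value
print(area)
```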
3.2. Other distributions
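There are other distributions besides the normal distribution. Most have the same family of functions as the normal distribution, named with the prefixes r, p, q, and d. The table below lists the d form; the Studentized range is listed as ptukey because it has only the p and q forms.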
| command | distribution |
|---|---|
| dbeta | beta |
| dbinom | binomial |
| dcauchy | Cauchy |
| dchisq | chi-squared |
| dexp | exponential |
| df | F distribution |
| dgamma | gamma |
| dgeom | geometric |
| dhyper | hypergeometric |
| dlnorm | log-normal |
| dmultinom | multinomial |
| dnbinom | negative binomial |
| dnorm | normal |
| dpois | Poisson |
| dt | Student’s t |
| dunif | uniform distribution |
| dweibull | Weibull |
| dwilcox | Wilcoxon rank sum |
| ptukey | Studentized range |
| dsignrank | Wilcoxon signed rank |
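As an illustration of the shared naming scheme, the sketch below calls the d, p, q, and r functions for the exponential distribution. The rate of 2 and the variable names are arbitrary choices for this example.

```r
rate <- 2

# Density at 0.5: rate * exp(-rate * 0.5)
dens <- dexp(0.5, rate=rate)

# Cumulative probability up to 0.5: 1 - exp(-rate * 0.5)
prob <- pexp(0.5, rate=rate)

# qexp maps the cumulative probability back to the quantile 0.5.
quant <- qexp(prob, rate=rate)

# Five random draws from the same distribution.
draws <- rexp(5, rate=rate)

print(c(density=dens, probability=prob, quantile=quant))
```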
3.3. Normality test
You may use the Shapiro-Wilk Test to test for normality. Below, we sample from a normal distribution and test for normality. Note that the p-value is greater than (let’s say) 0.05, and so we fail to reject the null hypothesis (the sample was drawn from a normal distribution).
[6]:
x <- rnorm(1000, mean=0, sd=1)
r <- shapiro.test(x)
print(r)
Shapiro-Wilk normality test
data: x
W = 0.99869, p-value = 0.6793
Now, we sample from a Poisson distribution and apply the normality test. The p-value is less than 0.05, and so we reject the null hypothesis in favor of the alternative one (the sample was not drawn from a normal distribution).
[7]:
x <- rpois(1000, lambda=3)
r <- shapiro.test(x)
print(r)
Shapiro-Wilk normality test
data: x
W = 0.94566, p-value < 2.2e-16
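The object returned by shapiro.test is a list, so the decision can be made in code rather than by reading the printed output. The sketch below assumes a significance level of 0.05 (the threshold used in the text) and uses illustrative variable names; the exact p-values depend on the random draws.

```r
set.seed(37)
alpha <- 0.05  # significance level, as assumed in the text

x_norm <- rnorm(1000, mean=0, sd=1)
x_pois <- rpois(1000, lambda=3)

res_norm <- shapiro.test(x_norm)
res_pois <- shapiro.test(x_pois)

# The test statistic W and the p-value are stored as list elements.
print(c(W=unname(res_norm$statistic), p=res_norm$p.value))
print(c(W=unname(res_pois$statistic), p=res_pois$p.value))

# TRUE means the null hypothesis of normality is rejected at alpha.
print(c(normal=res_norm$p.value < alpha, poisson=res_pois$p.value < alpha))
```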
3.4. Comparing two distributions
We may use the Kolmogorov-Smirnov Test (or KS Test) to test whether two samples come from the same distribution.
[8]:
x <- rnorm(1000, mean=0, sd=1)
y <- rnorm(1000, mean=0, sd=1)
r <- ks.test(x, y)
print(r)
Two-sample Kolmogorov-Smirnov test
data: x and y
D = 0.032, p-value = 0.6852
alternative hypothesis: two-sided
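The statistic D reported by the two-sample test is the largest vertical gap between the two empirical distribution functions. A minimal sketch, using smaller illustrative samples a and b, recomputes it with ecdf and compares it with the value from ks.test.

```r
set.seed(37)
a <- rnorm(200, mean=0, sd=1)
b <- rnorm(200, mean=0, sd=1)

# Evaluate both empirical CDFs at every observed point and
# take the largest absolute difference.
pooled <- c(a, b)
d_manual <- max(abs(ecdf(a)(pooled) - ecdf(b)(pooled)))

d_builtin <- unname(ks.test(a, b)$statistic)
print(c(manual=d_manual, builtin=d_builtin))
```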
The KS Test may also be called with a cumulative distribution function. Below, we use pnorm to test if the distribution of x follows a normal distribution with mean 0 and standard deviation 1.
[9]:
x <- rnorm(1000, mean=0, sd=1)
r <- ks.test(x, 'pnorm', mean=0, sd=1)
print(r)
One-sample Kolmogorov-Smirnov test
data: x
D = 0.02782, p-value = 0.4213
alternative hypothesis: two-sided
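For the one-sample test, D is the largest gap between the empirical distribution function of the sample and the reference distribution function, checked just before and just after each jump. The sketch below recomputes it by hand; the variable names are illustrative.

```r
set.seed(37)
x_ks <- rnorm(1000, mean=0, sd=1)
n <- length(x_ks)

# Reference CDF evaluated at the sorted sample.
cdf_sorted <- pnorm(sort(x_ks), mean=0, sd=1)

# Gap above the reference CDF (after each jump of the empirical CDF)
# and below it (before each jump).
d_plus <- max(seq_len(n) / n - cdf_sorted)
d_minus <- max(cdf_sorted - (seq_len(n) - 1) / n)
d_manual <- max(d_plus, d_minus)

d_builtin <- unname(ks.test(x_ks, 'pnorm', mean=0, sd=1)$statistic)
print(c(manual=d_manual, builtin=d_builtin))
```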
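Now we sample x from a normal distribution with mean 5 and standard deviation 1 and test it against a normal distribution with mean 0 and standard deviation 1. The test rejects the null hypothesis.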
[10]:
x <- rnorm(1000, mean=5, sd=1)
r <- ks.test(x, 'pnorm', mean=0, sd=1)
print(r)
One-sample Kolmogorov-Smirnov test
data: x
D = 0.98743, p-value < 2.2e-16
alternative hypothesis: two-sided
Here, we use ppois to test if the distribution of x, sampled from a standard normal distribution, follows a Poisson distribution with lambda 5. The test rejects the null hypothesis.
[11]:
x <- rnorm(1000, mean=0, sd=1)
r <- ks.test(x, 'ppois', 5)
print(r)
One-sample Kolmogorov-Smirnov test
data: x
D = 0.93457, p-value < 2.2e-16
alternative hypothesis: two-sided
3.5. Quantile-Quantile Plot
The Quantile-Quantile Plot (QQ Plot) is a visual way to check for normality. The idea is to see if the data points fall on a straight line.
3.5.1. qqnorm
The qqnorm function provides a way to visually test if a distribution is normal. Note the use of qqline to draw the reference line.
[12]:
x <- c(1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5)
options(repr.plot.width=4, repr.plot.height=4)
qqnorm(x)
qqline(x, lwd=2, lty=2)
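By default, qqline does not fit a regression; it draws the line through the first and third quartiles of the data and of the theoretical distribution. The sketch below reproduces that slope and intercept for the same data, under the default settings (quartile probabilities 0.25 and 0.75, normal reference distribution). The variable names are illustrative.

```r
x_small <- c(1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5)
probs <- c(0.25, 0.75)

# Sample quartiles and the matching standard normal quartiles.
y_q <- quantile(x_small, probs, names=FALSE)
x_q <- qnorm(probs)

# Line through the two quartile points.
slope <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]
print(c(intercept=intercept, slope=slope))
```

For this data the sample quartiles are 2.5 and 3.5, which gives an intercept of 3 and a slope of about 0.741.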
Here, we sample from a normal distribution and form a QQ Plot. Note that the data fall nearly on the straight line.
[13]:
x <- rnorm(100, mean=0, sd=1)
options(repr.plot.width=4, repr.plot.height=4)
qqnorm(x)
qqline(x, lwd=2, lty=2)
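The coordinates behind the plot can be computed directly: the sorted sample goes on the vertical axis and the normal quantiles at evenly spread probabilities (from ppoints) go on the horizontal axis. The sketch below, with illustrative variable names, checks this against qqnorm called with plot.it=FALSE and uses the correlation as a rough numeric summary of how straight the plot is.

```r
set.seed(37)
x_qq <- rnorm(100, mean=0, sd=1)
n <- length(x_qq)

# Theoretical quantiles and sorted observations.
theoretical <- qnorm(ppoints(n))
observed <- sort(x_qq)

# qqnorm returns the same coordinates, in the original data order.
qq <- qqnorm(x_qq, plot.it=FALSE)
print(all.equal(sort(qq$x), theoretical))
print(all.equal(sort(qq$y), observed))

# A correlation close to 1 means the points lie close to a line.
print(cor(theoretical, observed))
```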
3.5.2. qqplot
The qqplot function may be used to compare two distributions. Below, we compare the distribution of data coming from a Poisson distribution to that coming from a normal distribution.
[14]:
options(repr.plot.width=4, repr.plot.height=4)
p <- qqplot(rpois(100, lambda=3), rnorm(100, mean=5, sd=1))
abline(lm(p$y ~ p$x))
Here, we compare two normal distributions parameterized differently.
[15]:
options(repr.plot.width=4, repr.plot.height=4)
p <- qqplot(rnorm(100, mean=2, sd=1), rnorm(100, mean=5, sd=1))
abline(lm(p$y ~ p$x))
Here, we compare two normal distributions parameterized the same.
[16]:
options(repr.plot.width=4, repr.plot.height=4)
p <- qqplot(rnorm(100, mean=5, sd=1), rnorm(100, mean=5, sd=1))
abline(lm(p$y ~ p$x))