
T-test & ANOVA In R

Runhang Shu / 2019-04-02


One-sample t-test

When do we run a one-sample t-test?

  You have a continuous variable, and you want to know whether its mean is significantly different from a certain number. In other words, you compare the mean of the continuous variable with that number.

  For example, the label of a drug states that the drug will not affect sleep, and you want to know whether the drug company is cheating consumers.


We measure the extra sleep time of 10 people after taking the drug. Thus, we can set up a null hypothesis and an alternative hypothesis:

   null hypothesis: “Extra sleep = 0”;

   alternative hypothesis: “Extra sleep ≠ 0”

sleep # We use the built-in sleep dataset in R; we assume group 1 is the drug
##    extra group ID
## 1    0.7     1  1
## 2   -1.6     1  2
## 3   -0.2     1  3
## 4   -1.2     1  4
## 5   -0.1     1  5
## 6    3.4     1  6
## 7    3.7     1  7
## 8    0.8     1  8
## 9    0.0     1  9
## 10   2.0     1 10
## 11   1.9     2  1
## 12   0.8     2  2
## 13   1.1     2  3
## 14   0.1     2  4
## 15  -0.1     2  5
## 16   4.4     2  6
## 17   5.5     2  7
## 18   1.6     2  8
## 19   4.6     2  9
## 20   3.4     2 10
vec<-sleep$group==1 # Using a logical vector to select group 1
newdata<-sleep[vec,"extra"] # Drop group 2 and keep the "extra" column
t.test(newdata,mu=0,alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  newdata
## t = 1.3257, df = 9, p-value = 0.2176
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.5297804  2.0297804
## sample estimates:
## mean of x 
##      0.75

  As you can see above, the line “## alternative hypothesis: true mean is not equal to 0” is just R restating our alternative hypothesis, not a conclusion. The conventional significance level is 0.05, and since the p-value = 0.2176 > 0.05, we fail to reject the null hypothesis: there is no evidence that extra sleep differs from 0, so the data do not show that the company is cheating.

  Furthermore, instead of alternative = "two.sided", we can use alternative = "less" or alternative = "greater" for one-sided tests.
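
For example, a one-sided test of whether the true mean is greater than 0, reusing the newdata vector from above (a quick sketch; this call is not in the original post):

t.test(newdata,mu=0,alternative = "greater") # H1: true mean > 0

Because the sample mean (0.75) is above 0, the one-sided p-value is simply half of the two-sided one (0.2176/2 ≈ 0.109), still not significant.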

Two-sample t-test

  Rather than comparing the mean of one continuous variable with a certain number, a two-sample t-test compares the means of two samples to see whether they are significantly different.

Examples:
Compare population ages between two countries;
Compare the fertility/longevity of fruit flies under two treatments.

Again, we use the “sleep” dataset above to compare the effects of drug 1 and drug 2 on extra sleep time in R.

vec1<-sleep$group==1 # Using a logical vector to select group 1
drug1<-sleep[vec1,"extra"] # Drop group 2 and keep the "extra" column
vec2<-sleep$group==2 # Using a logical vector to select group 2
drug2<-sleep[vec2,"extra"] # Drop group 1 and keep the "extra" column
t.test(x=drug1,y=drug2,var.equal = T) # Here, we assume the two groups have equal variance
## 
##  Two Sample t-test
## 
## data:  drug1 and drug2
## t = -1.8608, df = 18, p-value = 0.07919
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.363874  0.203874
## sample estimates:
## mean of x mean of y 
##      0.75      2.33

  Honestly, the result upsets me a little. Even though the mean extra sleep time is 0.75 hours for drug 1 and 2.33 hours for drug 2, there is no significant difference between the two (p-value = 0.079)!

  This reminds me that just two weeks ago, in March 2019, more than 800 scientists rose up against statistical significance (I urge you to read that article). Indeed, this has always been a controversial topic: the threshold of 0.05 is arbitrary, and why should 0.051 not count as statistically significant?
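
A side note on the two-sample code above: var.equal = T forces the classic equal-variance t-test. If we do not want to assume equal variances, we can simply drop that argument, and R defaults to Welch's t-test, which adjusts the degrees of freedom:

t.test(x=drug1,y=drug2) # Welch two-sample t-test, R's default when var.equal is not set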

Paired t-test
  Actually, there is a very important prerequisite I deliberately concealed in the two-sample t-test above: we assumed that one group of 10 people took drug 1 and a different group of 10 people took drug 2.

  BUT, what if one group happens to be more sensitive to drugs than the other? It would be better if we could somehow control for individual differences. That is, we test both drugs on the same group of people, instead of giving each of two groups only one of the drugs.

  So, a paired t-test is the way to go! All we need to do is change “var.equal=T” to “paired=T”:

vec1<-sleep$group==1 
drug1<-sleep[vec1,"extra"] 
vec2<-sleep$group==2 
drug2<-sleep[vec2,"extra"]  
t.test(x=drug1,y=drug2,paired = T) 
## 
##  Paired t-test
## 
## data:  drug1 and drug2
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.4598858 -0.7001142
## sample estimates:
## mean of the differences 
##                   -1.58

Look! Now we have p-value = 0.002833. It is safe to say that the extra sleep time after consuming drug 2 is statistically different from that after drug 1.

  The paired t-test is frequently used in agricultural experiments. Field plots differ in natural soil fertility, so we can apply several treatments within each plot and use many plots as replicates. Likewise, some animals naturally have a better immune system than others, so it is better to apply the treatments to the same unit of animals.

  In short, for a paired t-test, we design the experiment so that the same sampling unit receives both treatments, controlling for inherent differences.
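
As a quick sanity check, a paired t-test is mathematically the same as a one-sample t-test on the within-pair differences; this sketch reproduces the t statistic and p-value reported above:

diffs<-drug1-drug2 # per-person difference between the two drugs
t.test(diffs,mu=0) # same result: t = -4.0621, df = 9, p-value = 0.002833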

ANOVA

  Recall that a t-test either compares the mean of a continuous variable with a certain number or compares two group means. In ANOVA, the predictor variable (x-axis) is categorical, possibly with more than two levels, and the response variable (y-axis) is continuous.

  For instance, in an insecticide experiment, the predictor variable could take the levels insecticide1, insecticide2, and insecticide3, and the response variable could be the LC50 (the lethal concentration that kills 50% of the population).

Now, I will use the ToothGrowth dataset to run ANOVA in R.

z<-ToothGrowth
library(dplyr) #load package dplyr
## Warning: package 'dplyr' was built under R version 3.5.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
newdata<-mutate(z,category=paste(z$supp,z$dose)) # use mutate() to combine the supp and dose columns into one treatment label
faccate<-factor(newdata$category,c("VC 0.5","VC 1","VC 2","OJ 0.5","OJ 1","OJ 2")) 
# Define each treatment combination as a factor, in a specified level order
plot(faccate,newdata$len,col="magenta",xlab="Treatments", cex.lab=1.5
     ,cex.axis=1,ylab="Length of Tooth") 


supp<-factor(newdata$supp,c("VC","OJ")) # Define supplement type as a factor
dose<-factor(newdata$dose,c("0.5","1","2")) # Define dose as a factor

fit1<-aov(newdata$len~supp) # Use the aov() function to run ANOVA
summary(fit1)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## supp         1    205  205.35   3.668 0.0604 .
## Residuals   58   3247   55.98                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

fit2<-aov(newdata$len~dose)
summary(fit2)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## dose         2   2426    1213   67.42 9.53e-16 ***
## Residuals   57   1026      18                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

fit3<-aov(newdata$len~supp+dose)
summary(fit3) 
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## supp         1  205.4   205.4   14.02 0.000429 ***
## dose         2 2426.4  1213.2   82.81  < 2e-16 ***
## Residuals   56  820.4    14.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(fit3) # Tukey's "honestly significant difference" (HSD) test for all pairwise comparisons
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = newdata$len ~ supp + dose)
## 
## $supp
##       diff      lwr      upr     p adj
## OJ-VC  3.7 1.720238 5.679762 0.0004293
## 
## $dose
##         diff       lwr       upr p adj
## 1-0.5  9.130  6.215909 12.044091 0e+00
## 2-0.5 15.495 12.580909 18.409091 0e+00
## 2-1    6.365  3.450909  9.279091 7e-06
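
Note that fit3 assumes the effects of supp and dose are purely additive, whereas the six-level factor used in fit4 below implicitly allows the two factors to interact. To test that interaction explicitly, one option (a sketch, not part of the original analysis) is:

fit5<-aov(newdata$len~supp*dose) # fits supp + dose + the supp:dose interaction
summary(fit5) # the supp:dose row tests whether the dose effect differs between supplements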

fit4<-aov(newdata$len~faccate)
summary(fit4)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## faccate      5 2740.1   548.0   41.56 <2e-16 ***
## Residuals   54  712.1    13.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(fit4) # Tukey's HSD for all pairwise comparisons among the six treatment combinations (output omitted here)

Last but not least, since both t-tests and ANOVA are linear models, before conducting these statistical tests we have to ensure the following assumptions are satisfied.

1. Testing homogeneity of variance

For a t-test

There is no extra step here; just run a Q-Q plot to see whether each variable is normally distributed.
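
For example, a minimal check on the two drug groups from the t-test section (assuming drug1 and drug2 are still in the workspace):

qqnorm(drug1);qqline(drug1) # points close to the line suggest approximate normality
qqnorm(drug2);qqline(drug2)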

For ANOVA

1. Run the aov() function
fit<-aov(Y~X)

2. Calculate the residuals
predict<-predict.lm(fit)
residuals<-Y-predict

3. Plot the residuals vs. the predicted values
plot(predict,residuals)
abline(a=0,b=0,col="red",lwd=3,lty="dashed")

For simple regression
Use lm() to fit a linear model:
reg<-lm(Y~X)
predict<-predict.lm(reg)
residuals<-Y-predict
plot(predict,residuals)
abline(a=0,b=0,lty="dashed")

----- or -----

reg<-lm(Y~X)
plot(reg$fitted.values,reg$residuals)
abline(h=0)

2. Testing that, for any given value of X, the Y values have normally distributed errors
1. Standardize each residual using the rstandard() function:
std<-rstandard(fit)
2. Make a Q-Q plot:
qqnorm(std);qqline(std)
hist(std) # sometimes we also create a histogram to see whether the residuals are normal

3. Independence of observations
This assumption is handled at the sampling stage: sampling should be random and unbiased.
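
Putting these checks together on the ToothGrowth model fit3 from above (a sketch; it assumes fit3 still exists in the session):

plot(fitted(fit3),residuals(fit3),xlab="Predicted",ylab="Residuals") # check homogeneity of variance
abline(h=0,lty="dashed",col="red")
std<-rstandard(fit3) # standardized residuals
qqnorm(std);qqline(std) # check normality of the errors
hist(std)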

A good predicted-vs.-residual plot should look like plot 1 (the dots are spread comparatively evenly); if your plot looks like plot 2, 3, or 4, your data need some transformation, such as a log or square root.
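
For instance, if the residual spread grows with the predicted values, transforming the response sometimes helps (a sketch, not a prescription):

fit_log<-aov(log(newdata$len)~faccate) # log-transformed response
fit_sqrt<-aov(sqrt(newdata$len)~faccate) # square-root-transformed response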

A good Q-Q plot should look like the 3rd plot below: