Once you have installed R and RStudio, you can get started with R!
IMPORTANT: R is case sensitive. This means that mean(variable) is not the same as Mean(variable)!
Many R functions require additional libraries (packages) to be loaded first. This can be done with:
library(lattice)
Sometimes the library needs to be installed first:
install.packages("lattice")
library(lattice)
Installing a library only needs to be done once. Loading the library must be done every time you start R.
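To avoid reinstalling every time, a common idiom (a minimal sketch, using lattice as the example package) is to install only when the package is missing:
# install only if the package is not available yet, then load it
if (!requireNamespace("lattice", quietly = TRUE)) {
  install.packages("lattice")
}
library(lattice)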
In the bottom right pane, you can find the tab Packages. Here you can find the help files of the packages. Help about a specific function can also be found with ?:
?xyplot
How do I get my data into R?
To read an SPSS file:
library(foreign)
data <- read.spss("C:/Documents/data.sav", to.data.frame = TRUE)
To read an Excel file:
library(readxl)
data <- read_excel("C:/Documents/data.xlsx")
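Plain-text files can be read without extra packages; a minimal sketch, assuming a hypothetical comma-separated file data.csv:
# base R, no extra package needed (the path is a hypothetical example)
data <- read.csv("C:/Documents/data.csv", header = TRUE)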
Note 1: For R to recognize the file path, you need to use forward slashes (/) instead of backward slashes (\).
Note 2: Pay attention to the name you give your data set. You want it to be distinct, but don’t make it too long!
Base is the base package in R that contains a lot of the basic functions. Examples are summary() to get some summary statistics and plot() to make plots.
There exists a collection of packages, together called the Tidyverse, that allows you to do a lot of the base R things in a slightly different way. For example, the package dplyr is used for data management, and with ggplot2 you can make beautiful graphs. The Tidyverse has a somewhat different style of coding (it has its own ‘grammar’) than most R functions.
Some people prefer working with their data with the Tidyverse, others prefer base R. In these tutorials, I will use base R for the data management and graphs in the Getting Started tutorials, and show some Tidyverse functions in the Advanced options tutorials.
In the two code chunks below, I demonstrate the difference between the two styles. Both compute the mean of an outcome (weight of chickens) for different groups (feed type).
It is not necessary to understand the code yet! This example just demonstrates how two different pieces of code give the same result.
With base R
data(chickwts)
aggregate(chickwts$weight, list(chickwts$feed), FUN=mean)
## Group.1 x
## 1 casein 323.5833
## 2 horsebean 160.2000
## 3 linseed 218.7500
## 4 meatmeal 276.9091
## 5 soybean 246.4286
## 6 sunflower 328.9167
With dplyr from the Tidyverse
library(dplyr)
data(chickwts)
chickwts %>%
  group_by(feed) %>%
  summarize(Mean = mean(weight))
## # A tibble: 6 x 2
## feed Mean
## <fct> <dbl>
## 1 casein 324.
## 2 horsebean 160.
## 3 linseed 219.
## 4 meatmeal 277.
## 5 soybean 246.
## 6 sunflower 329.
The main difference between the styles is that with the dplyr code, the elements of the code are added stepwise, whereas with base R, everything is handled in a single call.
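The stepwise style comes from the pipe operator %>%, which passes the result on its left as the first argument of the function on its right. A minimal sketch of the equivalence:
# these two lines do exactly the same thing
head(chickwts)
chickwts %>% head()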
We want to investigate and explore the data set before we perform our analyses. Often we also need to perform some data manipulation steps first.
Let’s explore a dataset.
We will use a standard data set available in R (mtcars). I will name this dataset “data”.
data(mtcars)
data <- mtcars
With the function head(), we can print the first couple of rows to check out the names of the variables and the first entries.
head(data)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
With summary() we get summary statistics of all variables. The numbers of missing values (NA) are also reported.
summary(data)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
With the $ symbol we access a variable in a data set.
summary(data$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.43 19.20 20.09 22.80 33.90
The variable names in the data set can be listed with names():
names(data)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
With str() we check the structure of the data set and the type of each variable (numeric, factor, character, integer, etc.):
str(data)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Rename a variable (here “carb” becomes “CARB”):
names(data)[names(data) == "carb"] <- "CARB"
Tip! Create a new data set with a new name. Otherwise you might lose important information in your data.
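For example, copy the data first and manipulate the copy:
data2 <- data  # work on the copy; the original "data" stays intact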
Remove observations with missings (NA) in “mpg” and “cyl”:
data2 <- subset(data, !is.na(data$mpg) & !is.na(data$cyl))
Make a new dataset, including only mpg values > 20. (Don’t forget the comma in the code!)
data2 <- data[data$mpg > 20, ]
Keep only the first observation per group. This can be used if there are multiple measurements per patient, but you want to keep only the first.
data.short <- data[!duplicated(data$gear), ]
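A minimal sketch with a hypothetical longitudinal data set (the data frame df and the identifier id are made up for illustration):
df <- data.frame(id = c(1, 1, 2, 2, 3), visit = c(1, 2, 1, 2, 1))
df.first <- df[!duplicated(df$id), ]  # keeps the first row of each id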
Drop the variables “mpg” and “cyl”:
data2 <- subset(data, select = -c(mpg, cyl))
Or keep only “mpg” and “cyl”:
data2 <- subset(data, select = c(mpg, cyl))
More options for data management can be found in the Advanced options.
Make a simple histogram (mpg = continuous variable)
hist(data$mpg)
Make a simple scatter plot (mpg & disp = continuous variables)
plot(data$mpg ~ data$disp)
Make a simple boxplot (mpg = continuous variable, cyl = categorical variable)
boxplot(data$mpg ~ data$cyl)
Improve the scatterplot by changing the color, adding a title, etc.
plot(data$mpg ~ data$disp, xlab = "Disp", ylab = "Mpg",
xlim = c(75, 475), ylim = c(10, 35),
main = "Scatterplot Mpg - Disp", col = "red", pch = 20, cex = 1.5)
For two plots next to each other, we can use the function par(mfrow = c(1,2)):
par(mfrow = c(1,2))
plot(data$mpg ~ data$disp)
plot(data$mpg ~ data$disp, xlab = "Disp", ylab = "Mpg",
xlim = c(75, 475), ylim = c(10, 35),
main = "Scatterplot Mpg - Disp", col = "red", pch = 20, cex = 1.5)
To learn how to use ggplot2 for graphics, you can go to Advanced graphs.
Mean (SD)
mean(data$mpg)
## [1] 20.09062
sd(data$mpg)
## [1] 6.026948
Median (IQR)
median(data$mpg)
## [1] 19.2
quantile(data$mpg, probs = c(0.25, 0.75))
## 25% 75%
## 15.425 22.800
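The width of the interquartile range itself can also be computed directly with IQR():
IQR(data$mpg)  # 75th percentile minus 25th percentile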
Frequencies
table(data$cyl)
##
## 4 6 8
## 11 7 14
Crosstabs
table(data$cyl, data$vs)
##
## 0 1
## 4 1 10
## 6 3 4
## 8 14 0
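To show the crosstab as proportions instead of counts, prop.table() can be wrapped around table(); margin = 1 gives row proportions:
prop.table(table(data$cyl, data$vs), margin = 1)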
With a t-test, we compare a normally distributed continuous variable between two groups. In the data, “am” is a grouping variable (0/1). Let’s explore this variable with table():
table(data$am)
##
## 0 1
## 19 13
“drat” is a continuous variable. Is it normally distributed?
par(mfrow = c(1,2))
hist(data$drat)
qqnorm(data$drat)
qqline(data$drat, col = "steelblue")
This looks pretty normal, so we can use a t-test to analyze these data.
t.test(drat ~ am, data = data)
##
## Welch Two Sample t-test
##
## data: drat by am
## t = -5.6461, df = 27.198, p-value = 5.267e-06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1.0411183 -0.4862501
## sample estimates:
## mean in group 0 mean in group 1
## 3.286316 4.050000
The output provides us with the p-value of the test, but also the 95% confidence interval and the mean in both groups.
It is possible to save the test as an R object (named “test”) and extract the p-value from it.
test <- t.test(drat ~ am, data = data)
test$p.value
## [1] 5.266742e-06
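Other elements of the saved test, such as the 95% confidence interval, can be extracted in the same way:
test$conf.int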
When the continuous variable is not normally distributed, we use the Mann-Whitney U test (or Wilcoxon rank sum test). The variable “disp” is not normally distributed, as we can see from the graphs.
par(mfrow = c(1,2))
hist(data$disp)
qqnorm(data$disp)
qqline(data$disp, col = "steelblue")
So, we use wilcox.test():
wilcox.test(disp ~ am, data = data)
##
## Wilcoxon rank sum test with continuity correction
##
## data: disp by am
## W = 214, p-value = 0.0005493
## alternative hypothesis: true location shift is not equal to 0
The p-value is <0.05, so we can reject H0.
When we want to compare a continuous variable in more than 2 groups, we can use an ANOVA. For this test, we assume that the continuous marker follows a normal distribution. The variable “gear” has three groups:
table(data$gear)
##
## 3 4 5
## 15 12 5
We will use ANOVA to test if “drat” is similar over the “gear” groups. We visualize the data first:
boxplot(data$drat ~ data$gear)
There appears to be a big difference, especially in group 3. For the ANOVA procedure in R it is better if the grouping variable is a “factor” instead of “numeric”, so we make a new variable: gear.factor.
str(data$gear)
## num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
data$gear.factor <- as.factor(data$gear)
str(data$gear.factor)
## Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
aov1 <- aov(drat ~ gear.factor, data = data)
summary(aov1)
## Df Sum Sq Mean Sq F value Pr(>F)
## gear.factor 2 6.133 3.0667 32.59 3.82e-08 ***
## Residuals 29 2.729 0.0941
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is highly significant. We can perform a post-hoc test to see which groups differ from each other. With Tukey’s HSD we adjust the p-values to correct for multiple testing.
TukeyHSD(aov1)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = drat ~ gear.factor, data = data)
##
## $gear.factor
## diff lwr upr p adj
## 4-3 0.9106667 0.6172574 1.2040759 0.0000001
## 5-3 0.7833333 0.3921210 1.1745457 0.0000856
## 5-4 -0.1273333 -0.5305858 0.2759191 0.7181847
Group 3 is indeed different from 4 and 5.
The non-parametric version of the ANOVA is the Kruskal-Wallis test.
boxplot(data$disp ~ data$gear.factor)
kruskal.test(disp ~ gear.factor, data = data)
##
## Kruskal-Wallis rank sum test
##
## data: disp by gear.factor
## Kruskal-Wallis chi-squared = 16.578, df = 2, p-value = 0.0002513
The groups of gear are also significantly different for disp values.
To see if two binary or categorical variables are associated, we can use the chi-squared test.
The variables “vs” and “am” are both binary. With table() we can make a crosstab of the variables.
tab <- with(data, table(vs, am))
tab
## am
## vs 0 1
## 0 12 6
## 1 7 7
We now perform the chi-squared test on this table:
chisq.test(tab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab
## X-squared = 0.34754, df = 1, p-value = 0.5555
The p-value is > 0.05, so we cannot reject H0: there is no evidence of an association between “vs” and “am”.
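When expected cell counts are small, Fisher’s exact test is a common alternative to the chi-squared test:
fisher.test(tab)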
Correlations are used for two continuous variables. We already made a scatterplot of “mpg” and “disp”.
plot(data$mpg ~ data$disp)
There seems to be a strong negative correlation between the variables. Is it significant?
cor.test(data$disp, data$mpg, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: data$disp and data$mpg
## S = 10415, p-value = 6.37e-13
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.9088824
Since “mpg” is not normally distributed, we use the method = "spearman" version of the test. For the Pearson correlation, use method = "pearson". The correlation is -0.9 and highly significant.
In regression we like to explain how one variable (x) influences changes in a second variable (y). Let’s start with a scatterplot for “drat” and “disp”:
plot(drat ~ disp, data = data)
With the regression model we can estimate how a one-unit change in x (disp) impacts y (drat). The model is estimated with lm(); summary() prints a summary of the model.
m1 <- lm(drat ~ disp, data = data)
summary(m1)
##
## Call:
## lm(formula = drat ~ disp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.85409 -0.23642 0.01812 0.13112 0.99196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.3034722 0.1447123 29.738 < 2e-16 ***
## disp -0.0030639 0.0005545 -5.526 5.28e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3826 on 30 degrees of freedom
## Multiple R-squared: 0.5044, Adjusted R-squared: 0.4879
## F-statistic: 30.53 on 1 and 30 DF, p-value: 5.282e-06
The p-value for disp is highly significant (5.28e-06). I can extract the coefficient table from this object:
summary(m1)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.303472222 0.1447123152 29.738120 8.065594e-24
## disp -0.003063904 0.0005544844 -5.525681 5.282022e-06
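Confidence intervals for the coefficients can be obtained with confint():
confint(m1)  # 95% confidence intervals by default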
To trust the results of the model, we need to validate the assumptions. Normality and homoskedasticity of the residuals can be checked by making plots.
par(mfrow = c(2,2))
plot(m1)
We look mainly at the two upper plots. With the first, we can check whether the residuals form a noisy cloud of points without a pattern. The qq-plot in the second graph shows whether the residuals follow a normal distribution. There are no serious deviations here.
Multiple independent (x) variables can be added to the model.
m2 <- lm(drat ~ disp + mpg + vs, data = data)
summary(m2)
##
## Call:
## lm(formula = drat ~ disp + mpg + vs, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.69295 -0.22024 -0.00385 0.18727 0.95692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.665518 0.659203 5.561 6.01e-06 ***
## disp -0.002419 0.001122 -2.156 0.0398 *
## mpg 0.028263 0.021714 1.302 0.2037
## vs -0.179902 0.195797 -0.919 0.3660
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3813 on 28 degrees of freedom
## Multiple R-squared: 0.5405, Adjusted R-squared: 0.4913
## F-statistic: 10.98 on 3 and 28 DF, p-value: 6.117e-05
Interactions can be added by using * instead of +:
m3 <- lm(drat ~ disp * vs + mpg, data = data)
summary(m3)
##
## Call:
## lm(formula = drat ~ disp * vs + mpg, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59639 -0.20250 -0.05495 0.15378 0.94027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.984823 0.653679 6.096 1.64e-06 ***
## disp -0.002403 0.001074 -2.236 0.0338 *
## vs 0.524636 0.418971 1.252 0.2212
## mpg 0.008746 0.023239 0.376 0.7096
## disp:vs -0.004128 0.002195 -1.880 0.0709 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3652 on 27 degrees of freedom
## Multiple R-squared: 0.5937, Adjusted R-squared: 0.5336
## F-statistic: 9.865 on 4 and 27 DF, p-value: 4.718e-05
Do not forget to evaluate the assumptions!
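For example, the same diagnostic plots we made for m1 also work for the interaction model:
par(mfrow = c(2, 2))
plot(m3)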
If the dependent variable is binary, we use a logistic regression model, fitted with glm():
glm1 <- glm(vs ~ drat, data = data, family = binomial(logit))
summary(glm1)
##
## Call:
## glm(formula = vs ~ drat, family = binomial(logit), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7808 -0.7719 -0.6192 1.0357 2.0465
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.4480 3.1805 -2.342 0.0192 *
## drat 1.9875 0.8663 2.294 0.0218 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.860 on 31 degrees of freedom
## Residual deviance: 37.159 on 30 degrees of freedom
## AIC: 41.159
##
## Number of Fisher Scoring iterations: 3
The value of the coefficient is interpreted by taking the exponent, which gives the odds ratio:
b1 <- summary(glm1)$coef[2, 1]
exp(b1)
## [1] 7.297034
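A confidence interval for this odds ratio can be obtained by exponentiating the confidence interval of the coefficient:
exp(confint(glm1))  # profile-likelihood intervals, exponentiated to the odds scale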
Multiple variables can be added to the model
glm2 <- glm(vs ~ drat + am + mpg, data = data, family = binomial(logit))
summary(glm2)
##
## Call:
## glm(formula = vs ~ drat + am + mpg, family = binomial(logit),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.22751 -0.40730 -0.08654 0.28217 1.67256
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.1227 6.4962 -2.482 0.0131 *
## drat 1.3874 1.7037 0.814 0.4154
## am -3.6872 1.8619 -1.980 0.0477 *
## mpg 0.6176 0.2541 2.431 0.0151 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.860 on 31 degrees of freedom
## Residual deviance: 19.956 on 28 degrees of freedom
## AIC: 27.956
##
## Number of Fisher Scoring iterations: 6
The predicted probability of ‘vs’ for specific values of the x variables can be calculated with predict()
DF <- data.frame(drat = 3.6, am = 1, mpg = 20.1)
predict(glm2, newdata = DF, type = "response")
## 1
## 0.08303771
The option type = "response" ensures the result is given as a probability.
There are several ways to save your work.
I always save my script as an .R file. Additionally, I can save separate R objects with the save() function.
If data2 is the result of several data management steps and is the data set used for the analyses, I might want to save it. I can save it as an .RData file.
save(data2, file = "C:/Documents/data2.RData")
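The saved object can be loaded back in a later R session with load(); it reappears in the environment under its original name:
load("C:/Documents/data2.RData")  # restores the object "data2"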
Reminder: for R to recognize the file path, you need to use forward slashes (/) instead of backward slashes (\).
Models that take a long time to run can also be saved in this way.
save(Model1, file = "C:/Documents/model1.RData")
If I want to use my data in Excel, I can export the data to an Excel file, for example with write.xlsx() from the xlsx package:
library("xlsx")
write.xlsx(data2, file = "C:/Documents/data2.xlsx")
Graphs can be saved in different file formats: use png(), pdf(), tiff(), etc. to open the file, then run the code for the plot. With dev.off() at the end, the plot is saved to the file.
png("C:/Documents/plots.png")
plot(data$mpg ~ data$disp, xlab = "Disp", ylab = "Mpg")
dev.off()
You can also save the whole environment you are working in:
save.image(file = "C:/Documents/MyEnvironment.RData")