Introduction Warmup
It was a good start: the DataCamp exercises are beginner-friendly.
Clustering and classification, and dimensionality reduction techniques are the two parts I am most interested in.
I heard about this course through an email sent to HYMY students.
The link to my github repository:
https://github.com/yufanyin/IODS-project
The link to my course diary:
https://yufanyin.github.io/IODS-project/
Work through the DataCamp exercises:
DataCamp 1 Hello R
DataCamp 2 Getting Hooked on R
(5.11.-11.11.2019)
Work through the DataCamp
DataCamp 3 Regression and model validation
Data wrangling
The R script for the data wrangling exercise is available here:
https://github.com/yufanyin/IODS-project/blob/master/create_learning2014.R
The result was saved as ‘learning2014.csv’.
Note: ‘(Step N)’ is used for reviewers to check my work quickly.
Read the data file (Step 1a)
learning2014 <-
read.table(file = "G:/C-Open Data Science/0-191030/IODS-project-master/learning2014.txt", stringsAsFactors = TRUE)
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
Describe the dataset briefly (Step 1b)
The data ‘learning2014’ consist of 166 observations of 7 variables. They contain students’ approaches to learning (deep, stra, surf), attitude scores, exam points and some background information (gender, age).
(Step 2)
Correlations between the different variables are shown below with the ggpairs() function.
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
p1 <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)), upper = list(continuous = wrap("cor", size = 2.8)))
p1
(Figure: ggpairs plot matrix of the learning2014 variables, coloured by gender)
The plot shows the distributions of all 166 observations for each variable. The colours represent gender: female students are shown in red and male students in blue-green. The upper-right part presents how each variable correlates with the others, which helps to spot potentially significant relationships for the regression models.
(Step 3)
Simple regression: points ~ gender
First we choose gender as the explanatory variable and fit a regression model in which exam points is the dependent variable. Result: not significant.
gender_lm <- lm(points ~ gender, data = learning2014)
summary(gender_lm)
##
## Call:
## lm(formula = points ~ gender, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.3273 -3.3273 0.5179 4.5179 10.6727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.3273 0.5613 39.776 <2e-16 ***
## genderM 1.1549 0.9664 1.195 0.234
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.887 on 164 degrees of freedom
## Multiple R-squared: 0.008632, Adjusted R-squared: 0.002587
## F-statistic: 1.428 on 1 and 164 DF, p-value: 0.2338
Simple regression: points ~ attitude
attitude_lm <- lm(points ~ attitude, data = learning2014)
summary(attitude_lm)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
There is a statistically significant relationship between points and attitude (p = 4.12e-09). Next, draw a scatter plot of points versus attitude, fit a linear model and print out the summary.
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
(Figure: scatter plot of points versus attitude with a fitted regression line)
model_point_attitude <- lm(points ~ attitude, data = learning2014)
model_point_attitude
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Coefficients:
## (Intercept) attitude
## 11.637 3.525
(Step 4)
Multiple regression model: points ~ attitude + stra
This multiple regression tests whether strategic learning (stra) influences exam points. The p-value (0.089) shows that the effect is not statistically significant at the 0.05 level. The multiple R-squared (0.205) is slightly higher than in the points ~ attitude model (0.191), which means a little more of the variance in points is explained when strategic learning is taken into account.
model_p_att_stra <- lm(points ~ attitude + stra, data = learning2014)
model_p_att_stra_s <- summary(model_p_att_stra)
model_p_att_stra_s
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
(Step 5)
The diagnostic plots support the assumptions of the regression model.
The residuals scatter around the zero line without a clear pattern, which suggests the relationship is linear. A few outliers (observations 35, 56 and 145) deviate from the rest but do not seriously disturb the analysis.
According to the Normal Q-Q plot, most of the values fall on the straight line, so the residuals are approximately normally distributed; only the outliers deviate at the tails.
plot(model_p_att_stra, which=c(1,2,5))
(Figures: Residuals vs Fitted, Normal Q-Q, and Residuals vs Leverage diagnostic plots)
(12.11.-18.11.2019)
Work through the DataCamp
DataCamp 4 Logistic regression
Data wrangling
The R script for the data wrangling exercise is available here:
https://github.com/yufanyin/IODS-project/blob/master/create_alc.R
The result was saved as ‘alc.csv’.
Note: ‘(Step N)’ is used for reviewers to check my work quickly.
Read the data file (Step 2a)
getwd()
## [1] "G:/C-Open Data Science/0-191030/IODS-project-master"
setwd("G:/C-Open Data Science/0-191030/IODS-project-master")
alc <- read.csv(file = "G:/C-Open Data Science/0-191030/IODS-project-master/alc.csv")
Print out the names of the variables in the data (Step 2b)
dim(alc)
## [1] 382 36
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "nursery" "internet" "guardian"
## [16] "traveltime" "studytime" "failures" "schoolsup" "famsup"
## [21] "paid" "activities" "higher" "romantic" "famrel"
## [26] "freetime" "goout" "Dalc" "Walc" "health"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use"
str(alc)
## 'data.frame': 382 obs. of 36 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
Describe the data set briefly (Step 2c)
The data consist of 382 observations of 35 variables.
Note: str() shows 36 variables because of an extra first column “X” (int 1 2 3 4 5 6 7 8 9 10 …).
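The extra “X” column is most likely the row index that write.csv() stores by default (row.names = TRUE). Two ways to get rid of it, assuming the same file path as above:

```r
# Option 1: use the first column as row names when reading
alc <- read.csv("G:/C-Open Data Science/0-191030/IODS-project-master/alc.csv",
                row.names = 1)

# Option 2: drop the column after reading
alc$X <- NULL
```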
The data were collected using school reports and questionnaires from students in two Portuguese secondary schools. They include students’ grades, demographic characteristics, and social and school-related features.
The two original data sets describe students’ performance in Mathematics (mat) and Portuguese language (por). In the data wrangling part they were merged into one data set, “alc”.
To analyse alcohol consumption, two new variables were created: “alc_use” is the average of weekday (“Dalc”) and weekend (“Walc”) alcohol consumption, and “high_use” is a logical variable that is TRUE when a student’s “alc_use” is higher than 2.
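The derivation of the two variables can be sketched on a toy data frame (the real code lives in create_alc.R):

```r
library(dplyr)

# Toy data frame standing in for the merged student data
toy <- data.frame(Dalc = c(1, 2, 5), Walc = c(1, 4, 5))
toy <- toy %>%
  mutate(alc_use  = (Dalc + Walc) / 2,  # mean of weekday and weekend use
         high_use = alc_use > 2)        # TRUE when the mean exceeds 2
toy
```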
Data source: https://archive.ics.uci.edu/ml/datasets/Student+Performance
Choose variables and write hypotheses (Step 3)
The purpose of the following analysis is to explore the relationships between high/low alcohol consumption and some variables. I chose the following four variables:
‘sex’: student’s sex (binary: ‘F’ - female or ‘M’ - male)
‘failures’: number of past class failures (numeric: n if 1<=n<3, else 4)
‘goout’: going out with friends (numeric: from 1 - very low to 5 - very high)
‘absences’: number of school absences (numeric: from 0 to 93)
The hypotheses are as follows.
Hypothesis 1: There is a gender difference (‘sex’) in alcohol consumption. In social science, gender affects many behaviours and preferences and is therefore a frequently tested variable.
Hypothesis 2: The number of class failures (‘failures’) is associated with alcohol consumption. Students with more class failures have higher alcohol consumption.
Hypothesis 3: The frequency of going out with friends (‘goout’) is associated with alcohol consumption. Students who go out with friends more often have higher alcohol consumption.
Hypothesis 4: The number of school absences (‘absences’) is associated with alcohol consumption. Students with more school absences have higher alcohol consumption.
(Step 4)
library(tidyr)
library(dplyr)
library(ggplot2)
library(knitr)
library(kableExtra)
alc %>% group_by(sex) %>% summarise(count = n()) -> fm
knitr::kable(fm, caption="Students' gender distribution") %>% kable_styling(bootstrap_options = "hover", full_width = FALSE, position = "center")
| sex | count |
|---|---|
| F | 198 |
| M | 184 |
There were 198 female students and 184 male students.
ggplot(data = alc, aes(x = alc_use, fill = sex)) + geom_bar() + xlab("Alcohol consumption (1: lowest / 5: highest)") + ylab("Number of students") + ggtitle("Alcohol consumption and gender") + scale_fill_discrete("Gender", labels = c("Female", "Male"))
The figure presents the distribution of alcohol consumption, taking gender into account: 1 means the lowest consumption and 5 the highest.
ggplot(data = alc, aes(x = alc_use, fill = high_use)) + facet_wrap("sex") + geom_bar() + xlab("Student's alcohol consumption") + ggtitle("High alcohol consumption by gender") + ylab("Number of students") + scale_fill_discrete("High alcohol use")
This graph shows the numbers of male and female students with high alcohol consumption. High alcohol consumption is shown in blue-green (TRUE) and lower consumption in red (FALSE). More male students than female students had high alcohol consumption.
ggplot(data = alc, aes(x = failures, fill = sex)) + geom_bar() + facet_wrap("high_use") + ggtitle("Alcohol consumption and number of class failures") + xlab("Number of class failures") + ylab("Number of students") + scale_fill_discrete("Gender", labels = c("Female", "Male"))
ggplot(data = alc, aes(x = goout, fill = sex)) + geom_bar() + facet_wrap("high_use") + ggtitle("Alcohol consumption and frequency of going out with friends") + xlab("Frequency of going out with friends (1: very low / 5: very high)") + ylab("Number of students") + scale_fill_discrete("Gender", labels = c("Female", "Male"))
ggplot(data = alc, aes(x = absences, fill = sex)) + geom_bar() + facet_wrap("high_use") + ggtitle("Alcohol consumption and number of school absences") + xlab("Number of school absences (from 0 to 93)") + ylab("Number of students") + scale_fill_discrete("Student's gender", labels = c("Female", "Male"))
(Step 5a)
Fit a logistic regression model with ‘sex’, ‘failures’, ‘goout’ and ‘absences’ as explanatory variables.
lr_glm <- glm(high_use ~ sex + failures + goout + absences, data = alc, family = "binomial")
lr_glm_sum <- summary(lr_glm)
(Step 5b)
lr_glm_sum
##
## Call:
## glm(formula = high_use ~ sex + failures + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8046 -0.7661 -0.5196 0.7383 2.3862
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.21840 0.47865 -8.813 < 2e-16 ***
## sexM 0.97670 0.25898 3.771 0.000162 ***
## failures 0.28657 0.17170 1.669 0.095111 .
## goout 0.71559 0.12089 5.919 3.24e-09 ***
## absences 0.06550 0.01685 3.888 0.000101 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 378.67 on 377 degrees of freedom
## AIC: 388.67
##
## Number of Fisher Scoring iterations: 4
knitr::kable(lr_glm_sum$coefficients, digits = 3, align = "l", caption="Model coefficients of chosen variables") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -4.218 | 0.479 | -8.813 | 0.000 |
| sexM | 0.977 | 0.259 | 3.771 | 0.000 |
| failures | 0.287 | 0.172 | 1.669 | 0.095 |
| goout | 0.716 | 0.121 | 5.919 | 0.000 |
| absences | 0.066 | 0.017 | 3.888 | 0.000 |
The output shows:
Being a male student rather than a female student significantly (p < 0.001) increased the probability of high alcohol consumption.
Class failures had a positive effect on high consumption, but the effect was not statistically significant (p = 0.095).
The frequency of going out with friends had a significant positive effect on high alcohol consumption (p < 0.001).
School absences had a significant positive effect on high alcohol consumption (p < 0.001).
(Step 5c)
Odds ratios
or_lr <- coef(lr_glm) %>% exp
or_lr
## (Intercept) sexM failures goout absences
## 0.01472214 2.65567230 1.33184898 2.04538456 1.06769512
Confidence intervals
ci_lr <- confint(lr_glm) %>% exp
## Waiting for profiling to be done...
ci_lr
## 2.5 % 97.5 %
## (Intercept) 0.005520616 0.03620626
## sexM 1.608518976 4.45004345
## failures 0.949933462 1.86808335
## goout 1.623552039 2.61096941
## absences 1.035225148 1.10698949
Print out the odds ratios with their confidence intervals
od_conf <- cbind(or_lr, ci_lr)
colnames(od_conf) <- c("odds ratios","2.5 %", "97.5 %")
knitr::kable(od_conf, digits = 3, align = "l", caption="Odds ratios and confidence intervals") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
| odds ratios | 2.5 % | 97.5 % | |
|---|---|---|---|
| (Intercept) | 0.015 | 0.006 | 0.036 |
| sexM | 2.656 | 1.609 | 4.450 |
| failures | 1.332 | 0.950 | 1.868 |
| goout | 2.045 | 1.624 | 2.611 |
| absences | 1.068 | 1.035 | 1.107 |
‘sexM’ has an odds ratio of 2.656, with a confidence interval from 1.609 to 4.450. This odds ratio tells us that the odds of high alcohol consumption for male students were about 2.7 times those of female students.
‘failures’ has an odds ratio of 1.332, with a confidence interval from 0.950 to 1.868. Class failures increased the odds of high alcohol consumption, but since the interval includes 1 the effect is not statistically significant.
‘goout’ has an odds ratio of 2.045, with a confidence interval from 1.624 to 2.611. Going out with friends significantly increased the odds of high alcohol consumption.
‘absences’ has an odds ratio of 1.068, with a confidence interval from 1.035 to 1.107. Each additional school absence slightly but significantly increased the odds of high alcohol consumption.
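As a sanity check on the interpretation above (my own addition, using the printed values): an odds ratio is significant at the 5% level exactly when its 95% confidence interval excludes 1:

```r
# 95% confidence intervals copied from the table above
ci <- rbind(sexM     = c(1.609, 4.450),
            failures = c(0.950, 1.868),
            goout    = c(1.624, 2.611),
            absences = c(1.035, 1.107))
ci[, 1] > 1 | ci[, 2] < 1   # TRUE = interval excludes 1
```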
(Step 6)
probs <- predict(lr_glm, type = "response")
alc <- mutate(alc, prob = probs)
alc <- mutate(alc, pred = prob > 0.5)
select(alc, age, failures, goout, absences, high_use, prob, pred) %>% tail(20)
## age failures goout absences high_use prob pred
## 363 17 0 3 8 FALSE 0.17542917 FALSE
## 364 17 0 5 14 FALSE 0.56870202 TRUE
## 365 18 0 4 0 FALSE 0.20488145 FALSE
## 366 18 0 2 2 FALSE 0.06560625 FALSE
## 367 18 0 3 4 TRUE 0.14068205 FALSE
## 368 18 0 3 0 FALSE 0.11188342 FALSE
## 369 17 0 4 17 TRUE 0.43966501 FALSE
## 370 18 0 4 4 TRUE 0.47069539 FALSE
## 371 18 0 3 5 FALSE 0.31703390 FALSE
## 372 17 0 3 2 FALSE 0.27608939 FALSE
## 373 19 1 2 0 FALSE 0.17887835 FALSE
## 374 18 1 3 14 TRUE 0.52713626 TRUE
## 375 18 0 3 2 FALSE 0.12557744 FALSE
## 376 18 0 3 7 FALSE 0.16615452 FALSE
## 377 19 1 2 0 FALSE 0.07581170 FALSE
## 378 18 0 4 0 FALSE 0.20488145 FALSE
## 379 18 1 1 0 FALSE 0.03855880 FALSE
## 380 18 1 1 0 FALSE 0.03855880 FALSE
## 381 17 0 5 3 TRUE 0.63011977 TRUE
## 382 18 0 1 0 TRUE 0.07404728 FALSE
pred22 <- table(high_use = alc$high_use, pred = alc$pred)
knitr::kable(pred22, align = "l", caption="2 x 2 cross tabulation" ) %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "center") %>% add_header_above(c(" " = 1, "Pred" = 2)) %>% pack_rows("High use", 1, 2)
| FALSE | TRUE | |
|---|---|---|
| High use | ||
| FALSE | 251 | 19 |
| TRUE | 62 | 50 |
The model produced 251 true negatives and 50 true positives. In the 62 false-negative cases, students had high alcohol consumption but the model predicted they did not; in the 19 false-positive cases, the model predicted high alcohol consumption when in fact it was low.
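From the 2 x 2 table, the usual classification rates can be computed directly; a small self-contained sketch using the counts above:

```r
# Counts from the cross tabulation above
TN <- 251; FP <- 19; FN <- 62; TP <- 50
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # share of correct predictions
sensitivity <- TP / (TP + FN)                   # high users correctly found
specificity <- TN / (TN + FP)                   # low users correctly found
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```

The model is much better at recognising low users than high users, which matches the large false-negative count.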
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$prob)
## [1] 0.2120419
It showed that 21.2% of all predictions were incorrect (the training error).
(Step 7)
library(boot)
cross_val<- cv.glm(data = alc, cost = loss_func, glmfit = lr_glm, K = 10)
cross_val$delta[1]
## [1] 0.2172775
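The average 10-fold cross-validation error (about 0.217) is close to the training error (0.212), so the model does not appear badly overfit. One way to explore simpler models is to cross-validate with fewer predictors; a sketch under the assumption that `alc` and `loss_func` from the steps above are still in the workspace (the exact error varies between runs because the folds are drawn randomly):

```r
library(boot)

# Hypothetical comparison model: drop the non-significant 'failures'
lr_glm2 <- glm(high_use ~ sex + goout + absences,
               data = alc, family = "binomial")
cv2 <- cv.glm(data = alc, cost = loss_func, glmfit = lr_glm2, K = 10)
cv2$delta[1]   # average proportion of wrong predictions
```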
(Step 8)
(19.11.-25.11.2019)
Work through the DataCamp
DataCamp 5 Clustering and classification
Data source
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.
Note: ‘(Step N)’ is used for reviewers to check my work quickly.
Load the Boston data from the MASS package (Step 2a)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
Print out the names of the variables in the data (Step 2b)
dim(Boston)
## [1] 506 14
colnames(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Describe the dataset briefly (Step 2c)
The data consist of 506 observations of 14 variables. They report information about housing (proportion of residential land, proportion of non-retail business, rooms per dwelling, proportion of old buildings, accessibility to employment centres and highways), population (crime rate, tax, pupil-teacher ratio, demographic structure) and environment (air quality, the Charles River) in Boston suburbs.
This data frame contains the following columns:
“crim”: per capita crime rate by town.
“zn”: proportion of residential land zoned for lots over 25,000 sq.ft.
“indus”: proportion of non-retail business acres per town.
“chas”: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
“nox”: nitrogen oxides concentration (parts per 10 million).
“rm”: average number of rooms per dwelling.
“age”: proportion of owner-occupied units built prior to 1940.
“dis”: weighted mean of distances to five Boston employment centres.
“rad”: index of accessibility to radial highways.
“tax”: full-value property-tax rate per $10,000.
“ptratio”: pupil-teacher ratio by town.
“black”: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
“lstat”: lower status of the population (percent).
“medv”: median value of owner-occupied homes in $1000s.
(Step 3a)
library(tidyr)
library(dplyr)
library(corrplot)
library(ggplot2)
library(GGally)
ov_boston <- ggpairs(Boston, mapping = aes(), title ="Overview of Boston data", lower = list(combo = wrap("facethist", bins = 20)), upper = list(continuous = wrap("cor", size = 2.8)))
ov_boston
(Step 3b)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
(Step 3c)
Calculate and print the correlation matrix
cor_matrix<-cor(Boston) %>% round(digits = 2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
Test the statistical significance of the correlations and visualize the correlation matrix.
library(corrplot)
p.mat <- cor.mtest(Boston)$p # cor.mtest() computes p-values from the raw data, not from the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", tl.cex = 0.6, p.mat = p.mat, sig.level = 0.01, title="Correlations of Boston data", mar=c(0,0,1,0))
The crime rate correlated strongly and positively with accessibility to radial highways (rad), the property-tax rate (tax) and the percentage of lower-status population (lstat).
The nitrogen oxides concentration (nox) correlated positively with the proportion of non-retail business acres (indus) and the proportion of old buildings (age), and negatively with the distance to employment centres (dis). The weighted mean of distances to the five employment centres correlated positively with the proportion of large residential lots (zn) and negatively with indus, age and nox.
(Step 4a)
The variables were standardized by scaling so that observations of different variables can be compared with each other. Each scaled variable has a mean of 0 and a standard deviation of 1, which produces both positive and negative values.
bos_scaled <- scale(Boston)
summary(bos_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
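As a quick check (my own addition, not part of the assignment): scale() subtracts each column's mean and divides by its standard deviation, which can be verified by hand for one variable:

```r
x <- MASS::Boston$crim   # any numeric column of the Boston data works
# scale() column-wise equals (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))
```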
(Step 4b)
Create a quantile vector of crim
class(bos_scaled)
## [1] "matrix"
bos_scaled <- as.data.frame(bos_scaled)
summary(bos_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
bins <- quantile(bos_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
Create a categorical variable ‘crime’ and look at the table
crime <- cut(bos_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
Remove original crim and add the new categorical value to scaled data
bos_scaled <- dplyr::select(bos_scaled, -crim)
bos_scaled <- data.frame(bos_scaled, crime)
We inspected the scaled crime-rate values and defined quantile break points, then labelled the four intervals as low (0-25%), medium low (25-50%), medium high (50-75%) and high (75-100%). The new categorical variable ‘crime’ was added to the scaled data and replaced the original ‘crim’.
(Step 4c)
Randomly choose 80% of the rows to create the train set and the test set
n_bos <- nrow(bos_scaled)
ind_bos <- sample(n_bos, size = n_bos * 0.8)
boston_train <- bos_scaled[ind_bos,]
boston_test <- bos_scaled[-ind_bos,]
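One caveat (my own note): sample() draws a different random split on every knit, so the test-set results below change between runs. If reproducibility is wanted, the seed could be fixed before sampling:

```r
set.seed(2019)   # arbitrary seed, chosen here only for illustration
n_bos <- 506     # number of rows in the scaled Boston data
ind_bos <- sample(n_bos, size = n_bos * 0.8)
length(ind_bos)  # 80% of the rows go to the train set
```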
Save the correct classes from test data
correct_classes <- boston_test$crime
correct_classes
## [1] low low med_low med_high med_high med_high med_high
## [8] med_high med_high low low med_low med_low low
## [15] low med_low med_low low low med_low low
## [22] low low med_low med_low med_low med_low med_low
## [29] med_low med_high med_high med_high med_high med_high med_high
## [36] med_high med_high med_high med_low med_low low low
## [43] low med_low med_low low low low low
## [50] med_low med_low med_low med_low med_low med_low med_low
## [57] low med_high med_high med_high med_high low med_low
## [64] low med_low low low med_low low low
## [71] med_high med_high med_high low low low low
## [78] low low high high high high high
## [85] high high high high high high high
## [92] high high high high high high high
## [99] high high high high
## Levels: low med_low med_high high
Remove the crime variable from test data set
boston_test <- dplyr::select(boston_test, -crime)
(Step 5)
Fit linear discriminant analysis on the ‘boston_train’ data set using the categorical ‘crime’ as the target variable.
lda.fit <- lda(crime ~ ., data = boston_train)
lda.fit
## Call:
## lda(crime ~ ., data = boston_train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2376238 0.2475248 0.2574257 0.2574257
##
## Group means:
## zn indus chas nox rm
## low 0.8671860 -0.8485969 -0.19030615 -0.8630622 0.4101940
## med_low -0.1172105 -0.2627239 -0.07547406 -0.5410452 -0.1847441
## med_high -0.3948894 0.1789197 0.21980846 0.3465785 0.1023238
## high -0.4872402 1.0170690 -0.04518867 1.0263404 -0.3282094
## age dis rad tax ptratio
## low -0.8339186 0.8406002 -0.6971471 -0.7348633 -0.452913751
## med_low -0.4027126 0.3202213 -0.5500476 -0.4291467 -0.009946484
## med_high 0.3592201 -0.3511215 -0.3756131 -0.2909462 -0.249941864
## high 0.7905465 -0.8503519 1.6386213 1.5144083 0.781350740
## black lstat medv
## low 0.36999537 -0.715425370 0.47076706
## med_low 0.31104125 -0.150826770 -0.03683837
## med_high 0.09915259 0.003071918 0.16883256
## high -0.79657982 0.809836268 -0.66608744
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.064713197 0.74524778 -0.908970410
## indus 0.005855833 -0.24857245 0.091501918
## chas -0.092210600 -0.10691445 0.060138540
## nox 0.395961661 -0.72837304 -1.144675082
## rm -0.089262903 -0.06315898 -0.259855052
## age 0.244696221 -0.23988852 -0.261379062
## dis -0.057066293 -0.21888082 -0.007249824
## rad 3.079385046 0.76714473 -0.417017083
## tax -0.087455797 0.11725343 0.880884627
## ptratio 0.128269592 0.01928150 -0.123383068
## black -0.137291621 0.01118952 0.079985876
## lstat 0.218280464 -0.20881241 0.217992605
## medv 0.173541485 -0.40568071 -0.226754787
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9522 0.0354 0.0125
The function for LDA biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]],
         col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
Convert the target classes to numeric and plot the LDA results
classes <- as.numeric(boston_train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes, main = "LDA biplot of Boston train set")
The linear discriminant analysis separated the observations according to crime rate category.
Note: I do not know why the arrows were not displayed. Calling lda.arrows() gave:
Error in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 = myscale * : plot.new has not been called yet
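A likely cause (my assumption): the lda.arrows() call ran in a chunk of its own, after the graphics device for the biplot had already been finished, so there was no open plot to annotate. Drawing the plot and the arrows in the same chunk should work:

```r
# Draw the biplot and the arrow overlay in one chunk so both calls
# target the same open graphics device.
classes <- as.numeric(boston_train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes,
     main = "LDA biplot of Boston train set")
lda.arrows(lda.fit, myscale = 2)  # annotates the plot created above
```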
Now we use the LDA model to predict the crime rate categories in the test set.
(Step 6a)
lda.pred <- predict(lda.fit, newdata = boston_test)
(Step 6b)
pred_table <- table(correct = correct_classes, predicted = lda.pred$class)
pred_table
## predicted
## correct low med_low med_high high
## low 20 11 0 0
## med_low 4 18 4 0
## med_high 0 6 16 0
## high 0 0 0 23
The predictions were less accurate for the lower crime rate categories: 11 of the 31 ‘low’ observations were predicted as ‘med_low’, and several ‘med_low’ and ‘med_high’ observations were misclassified into neighbouring categories. The predictions for the high crime rate category were very good (all 23 correct).
In 4.5, we performed classification. In classification, the classes are defined in advance (crime rate categories from low to high) and the model is trained on labelled data.
Now we do clustering. The classes and the number of classes are unknown, and we find groups based on the similarity of the observations.
(Step 7a)
data(Boston)
bos_scaled2 <- scale(Boston)
(Step 7b)
Euclidean distance matrix
dist_eu <- dist(bos_scaled2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
Manhattan distance matrix
dist_man <- dist(bos_scaled2, method = "manhattan")
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
(Step 7c)
bos_km <- kmeans(bos_scaled2, centers = 3)
Plot the dataset with clusters
pairs(bos_scaled2, col = bos_km$cluster)
pairs(bos_scaled2[,1:5], col = bos_km$cluster)
pairs(bos_scaled2[,6:10], col = bos_km$cluster)
pairs(bos_scaled2[,11:14], col = bos_km$cluster)
Three clusters were used here; the choice of the number of clusters is justified below with the total within-cluster sum of squares (TWCSS).
set.seed(123)
k_max <- 10 # the maximum number of clusters to try
twcss <- sapply(1:k_max, function(k){kmeans(bos_scaled2, k)$tot.withinss}) # calculate the total within sum of squares
qplot(x = 1:k_max, y = twcss, geom = 'line') # visualize the results
bos_km2 <- kmeans(bos_scaled2, centers = 3) # k-means clustering
pairs(bos_scaled2, col = bos_km2$cluster) # plot the data with clusters
pairs(bos_scaled2[,1:5], col = bos_km2$cluster)
pairs(bos_scaled2[,6:10], col = bos_km2$cluster)
pairs(bos_scaled2[,11:14], col = bos_km2$cluster)
The TWCSS decreases sharply up to 2-3 clusters and levels off after that, so the optimal number of clusters was 3.
(Bonus)
data(Boston)
bos_scaled3 <- scale(Boston)
bos_km3 <- kmeans(bos_scaled3, centers = 3)
cluster <- bos_km3$cluster
bos_scaled3 <- data.frame(bos_scaled3, cluster)
lda.fit_cluster <- lda(cluster ~ ., data = bos_scaled3)
lda.fit_cluster
## Call:
## lda(cluster ~ ., data = bos_scaled3)
##
## Prior probabilities of groups:
## 1 2 3
## 0.2470356 0.3260870 0.4268775
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3989700 1.2614609 -0.9791535 -0.020354653 -0.8573235 1.0090468
## 2 0.7982270 -0.4872402 1.1186734 0.014005495 1.1351215 -0.4596725
## 3 -0.3788713 -0.3578148 -0.2879024 0.001080671 -0.3709704 -0.2328004
## age dis rad tax ptratio black
## 1 -0.96130713 0.9497716 -0.5867985 -0.6709807 -0.80239137 0.3552363
## 2 0.79930921 -0.8549214 1.2113527 1.2873657 0.59162230 -0.6363367
## 3 -0.05427143 0.1034286 -0.5857564 -0.5951053 0.01241316 0.2805140
## lstat medv
## 1 -0.9571271 1.06668290
## 2 0.8622388 -0.67953738
## 3 -0.1047617 -0.09820229
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim -0.03206338 -0.19094456
## zn 0.02935900 -1.07677218
## indus 0.63347352 -0.09917524
## chas 0.02460719 0.10009606
## nox 1.11749317 -0.75995105
## rm -0.18841682 -0.57360135
## age -0.12983139 0.47226685
## dis 0.04493809 -0.34585958
## rad 0.67004295 -0.08584353
## tax 1.03992455 -0.58075025
## ptratio 0.25864960 -0.02605279
## black -0.01657236 0.01975686
## lstat 0.17365575 -0.41704235
## medv -0.06819126 -0.79098605
##
## Proportion of trace:
## LD1 LD2
## 0.8506 0.1494
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]],
         col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
classes3 <- as.numeric(bos_scaled3$cluster)
plot(lda.fit_cluster, dimen = 2, col = classes3, pch = classes3, main = "LDA biplot using three clusters")
lda.arrows(lda.fit_cluster, myscale = 2)
(Super-Bonus)
model_predictors <- dplyr::select(boston_train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
Next, install and access the plotly package. Create a 3D plot of the columns of the matrix product.
Note: After running
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = boston_train$crime)
it showed:
WebGL is not supported by your browser. Therefore I cannot see a cool plot.
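As a fallback when WebGL is unavailable (my suggestion, not part of the course material), a static 3D scatter can be drawn with the scatterplot3d package:

```r
# Static 3D scatter of the LDA scores, coloured by crime class.
# install.packages("scatterplot3d")  # if not yet installed
library(scatterplot3d)
scatterplot3d(matrix_product$LD1, matrix_product$LD2, matrix_product$LD3,
              color = as.numeric(boston_train$crime), pch = 16,
              xlab = "LD1", ylab = "LD2", zlab = "LD3")
```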
(25.11.-2.12.2019)
Data wrangling
R script of data Wrangling exercise is available here.
https://github.com/yufanyin/IODS-project/blob/master/create_human.R
Load necessary packages
library(tidyr)
library(dplyr)
library(corrplot)
library(ggplot2)
library(GGally)
library(knitr)
library(kableExtra)
library(stringr)
library(ggfortify)
library(factoextra)
human <- read.table(file = "G:/C-Open Data Science/0-191030/IODS-project-master/human_ana.txt")
(Step 1a)
The data are a combination of two data sets, the Human Development Index (HDI) and the Gender Inequality Index (GII).
The data consist of 155 observations of 8 variables.
“Country”: country name (used as row names)
“edu2F_edu2M”: female/male ratio with at least secondary education
“labF_labM”: female/male labour force participation ratio
“exp_edu”: expected years of education
“exp_life”: life expectancy at birth
“GNI”: Gross National Income per capita
“mat_mor”: maternal mortality ratio
“ado_birth”: adolescent birth rate
“F_parli”: percentage of female representatives in parliament
knitr::kable(summary(human)) %>% kable_styling(bootstrap_options = "striped", position = "center", font_size = 12)
| edu2F_edu2M | labF_labM | exp_edu | exp_life | GNI | mat_mor | ado_birth | F_parli | |
|---|---|---|---|---|---|---|---|---|
| Min. :0.1717 | Min. :0.1857 | Min. : 5.40 | Min. :49.00 | Min. : 581 | Min. : 1.0 | Min. : 0.60 | Min. : 0.00 | |
| 1st Qu.:0.7264 | 1st Qu.:0.5984 | 1st Qu.:11.25 | 1st Qu.:66.30 | 1st Qu.: 4198 | 1st Qu.: 11.5 | 1st Qu.: 12.65 | 1st Qu.:12.40 | |
| Median :0.9375 | Median :0.7535 | Median :13.50 | Median :74.20 | Median : 12040 | Median : 49.0 | Median : 33.60 | Median :19.30 | |
| Mean :0.8529 | Mean :0.7074 | Mean :13.18 | Mean :71.65 | Mean : 17628 | Mean : 149.1 | Mean : 47.16 | Mean :20.91 | |
| 3rd Qu.:0.9968 | 3rd Qu.:0.8535 | 3rd Qu.:15.20 | 3rd Qu.:77.25 | 3rd Qu.: 24512 | 3rd Qu.: 190.0 | 3rd Qu.: 71.95 | 3rd Qu.:27.95 | |
| Max. :1.4967 | Max. :1.0380 | Max. :20.20 | Max. :83.50 | Max. :123124 | Max. :1100.0 | Max. :204.80 | Max. :57.50 |
(Step 1b)
ov_human <- ggpairs(human, mapping = aes(), title ="Overview of 'human' data", lower = list(combo = wrap("facethist", bins = 20)), upper = list(continuous = wrap("cor", size = 2.8)))
ov_human
The overview plot shows the distributions of all variables and their correlations. GNI, maternal mortality ratio and adolescent birth rate have right-skewed distributions: most observations lie at low values, with a long tail towards high values. The upper panels show the correlations between the variables.
(Step 1c)
cor_human <- cor(human) %>% round(digits = 2)
cor_human %>% knitr::kable(caption = "Correlation between variables") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
| edu2F_edu2M | labF_labM | exp_edu | exp_life | GNI | mat_mor | ado_birth | F_parli | |
|---|---|---|---|---|---|---|---|---|
| edu2F_edu2M | 1.00 | 0.01 | 0.59 | 0.58 | 0.43 | -0.66 | -0.53 | 0.08 |
| labF_labM | 0.01 | 1.00 | 0.05 | -0.14 | -0.02 | 0.24 | 0.12 | 0.25 |
| exp_edu | 0.59 | 0.05 | 1.00 | 0.79 | 0.62 | -0.74 | -0.70 | 0.21 |
| exp_life | 0.58 | -0.14 | 0.79 | 1.00 | 0.63 | -0.86 | -0.73 | 0.17 |
| GNI | 0.43 | -0.02 | 0.62 | 0.63 | 1.00 | -0.50 | -0.56 | 0.09 |
| mat_mor | -0.66 | 0.24 | -0.74 | -0.86 | -0.50 | 1.00 | 0.76 | -0.09 |
| ado_birth | -0.53 | 0.12 | -0.70 | -0.73 | -0.56 | 0.76 | 1.00 | -0.07 |
| F_parli | 0.08 | 0.25 | 0.21 | 0.17 | 0.09 | -0.09 | -0.07 | 1.00 |
p.mat <- cor.mtest(human)$p # cor.mtest expects the data, not the correlation matrix
corrplot(cor_human, method="circle", type="lower", tl.cex = 0.6, p.mat = p.mat, sig.level = 0.01, tl.srt = 45, title="Correlations of 'human' data", mar=c(0,0,1,0))
Expected years of education correlated strongly positively with life expectancy and strongly negatively with maternal mortality ratio, suggesting that better education goes together with higher life expectancy and lower maternal mortality.
Another strong positive correlation was seen between adolescent birth rate and maternal mortality ratio; both may reflect limited access to health care.
In PCA, the data are first transformed to a new space with an equal or smaller number of dimensions, called the principal components. The 1st principal component captures the maximum amount of variance from the features in the original data. The 2nd principal component is orthogonal to the first and captures the maximum amount of the remaining variability.
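As a toy illustration (not part of the exercise), prcomp() on two strongly correlated variables shows PC1 absorbing most of the variance while the component scores stay uncorrelated:

```r
# Toy data: two correlated variables.
set.seed(1)
a <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.3))
pca_toy <- prcomp(toy, scale. = TRUE)
summary(pca_toy)$importance[2, ]     # proportion of variance of PC1, PC2
cor(pca_toy$x[, 1], pca_toy$x[, 2])  # ~0: the component scores are orthogonal
```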
(Step 2a)
pca_human <- prcomp(human)
round(pca_human$rotation[,1:2], digits = 4) %>% knitr::kable(caption = "PCA result on all variables") %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
| PC1 | PC2 | |
|---|---|---|
| edu2F_edu2M | 0.0000 | 0.0007 |
| labF_labM | 0.0000 | -0.0003 |
| exp_edu | -0.0001 | 0.0076 |
| exp_life | -0.0003 | 0.0283 |
| GNI | -1.0000 | -0.0058 |
| mat_mor | 0.0057 | -0.9916 |
| ado_birth | 0.0012 | -0.1256 |
| F_parli | -0.0001 | 0.0032 |
s1 <- summary(pca_human)
pca_prc1 <- round(100 * s1$importance[2, ], digits = 3)
round(100 * s1$importance[2, 1:2], digits = 5) %>% knitr::kable(caption = "Principal components") %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
| x | |
|---|---|
| PC1 | 99.99 |
| PC2 | 0.01 |
(Step 2b)
The biplot displayed the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables.
pca_label1 <- paste0("GNI (", pca_prc1[1],"%)" )
pca_label2 <- paste0("Maternal mortality (", pca_prc1[2],"%)" )
biplot(pca_human, choices = 1:2, cex = c(0.55, 0.9), col = c("grey40", "darkgreen"), xlab = pca_label1, ylab = pca_label2, main = "PCA biplot of non-standardized human data", margin = c(0,0,5,0))
autoplot(pca_human, data = human, label = TRUE, label.size = 3.0, colour = "darkgreen", loadings = TRUE, loadings.label = TRUE, loadings.colour = "red") + ggtitle("PCA biplot of non-standardized human data 2") + xlab(pca_label1) + ylab(pca_label2) + theme_bw()
Principal component 1 (GNI) represents 99.99% of the variance of the non-standardized data; PC2 (maternal mortality ratio) captures the remaining 0.01%. Because GNI has by far the largest scale, Gross National Income per capita alone dominates principal component 1.
(Step 3a)
The summary of the standardized data is as follows (the mean of every variable is 0 and the standard deviation is 1).
stzd_human <- scale(human)
knitr::kable(summary(stzd_human)) %>%
kable_styling(bootstrap_options = "striped", position = "center", font_size = 11)
| edu2F_edu2M | labF_labM | exp_edu | exp_life | GNI | mat_mor | ado_birth | F_parli | |
|---|---|---|---|---|---|---|---|---|
| Min. :-2.8189 | Min. :-2.6247 | Min. :-2.7378 | Min. :-2.7188 | Min. :-0.9193 | Min. :-0.6992 | Min. :-1.1325 | Min. :-1.8203 | |
| 1st Qu.:-0.5233 | 1st Qu.:-0.5484 | 1st Qu.:-0.6782 | 1st Qu.:-0.6425 | 1st Qu.:-0.7243 | 1st Qu.:-0.6496 | 1st Qu.:-0.8394 | 1st Qu.:-0.7409 | |
| Median : 0.3503 | Median : 0.2316 | Median : 0.1140 | Median : 0.3056 | Median :-0.3013 | Median :-0.4726 | Median :-0.3298 | Median :-0.1403 | |
| Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | |
| 3rd Qu.: 0.5958 | 3rd Qu.: 0.7350 | 3rd Qu.: 0.7126 | 3rd Qu.: 0.6717 | 3rd Qu.: 0.3712 | 3rd Qu.: 0.1932 | 3rd Qu.: 0.6030 | 3rd Qu.: 0.6127 | |
| Max. : 2.6646 | Max. : 1.6632 | Max. : 2.4730 | Max. : 1.4218 | Max. : 5.6890 | Max. : 4.4899 | Max. : 3.8344 | Max. : 3.1850 |
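scale() implements the z-score transform column by column; a quick manual check (assuming ‘human’ contains only numeric columns, as here):

```r
# z-score: subtract the column mean, divide by the column standard deviation
stzd_manual <- apply(human, 2, function(x) (x - mean(x)) / sd(x))
all.equal(as.vector(scale(human)), as.vector(stzd_manual))
```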
(Step 3b)
pca_stzd_human <- prcomp(stzd_human)
round(pca_stzd_human$rotation[, 1:2], digits = 4) %>% knitr::kable(caption = "PC1 & 2 of standardized 'human' data") %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
| PC1 | PC2 | |
|---|---|---|
| edu2F_edu2M | -0.3566 | 0.0380 |
| labF_labM | 0.0546 | 0.7243 |
| exp_edu | -0.4277 | 0.1394 |
| exp_life | -0.4437 | -0.0253 |
| GNI | -0.3505 | 0.0506 |
| mat_mor | 0.4370 | 0.1451 |
| ado_birth | 0.4113 | 0.0771 |
| F_parli | -0.0844 | 0.6514 |
s2 <- summary(pca_stzd_human)
pca_prc2 <- round(100 * s2$importance[2, ], digits = 1)
pca_label2.1 <- paste0("Education and health (",pca_prc2[1],"%) ")
pca_label2.2 <- paste0("Female social participation (",pca_prc2[2],"%) ")
round(100 * s2$importance[2, 1:2], digits = 2) %>% knitr::kable(caption = "Principal components") %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
| x | |
|---|---|
| PC1 | 53.61 |
| PC2 | 16.24 |
(Step 3c)
biplot(pca_stzd_human, choices = 1:2, cex = c(0.5, 0.9), col = c("grey40", "deeppink2"), xlab = pca_label2.1, ylab = pca_label2.2, main = "PCA biplot of standardized 'human' data", margin = c(0,0,5,0))
autoplot(prcomp(stzd_human), data = human, colour = "darkgreen", label = TRUE, label.size = 3.0, loadings = TRUE, loadings.label = TRUE, loadings.colour = "red") + ggtitle("PCA biplot of standardized 'human' data 2") + xlab(pca_label2.1) + ylab(pca_label2.2) + theme_bw()
(Step 4)
The principal component representing 16.2% of the data variation was strongly associated with the percentage of female parliament representatives (“F_parli”) and the female/male labour force participation ratio (“labF_labM”). So the share of women in the labour market and in political representation was most strongly associated with principal component 2. The variables showing the strongest association with principal component 1, which represented 53.6% of the data variation, were life expectancy (“exp_life”), expected years of education (“exp_edu”), the female/male ratio of secondary education (“edu2F_edu2M”), adolescent birth rate (“ado_birth”) and maternal mortality ratio (“mat_mor”). Hence principal component 1 described the health and education situation in these countries.
The PCA results on the non-standardized data differed from those on the standardized data. In the PCA of the non-standardized human data, principal component 1 represents almost all of the variation and is driven entirely by Gross National Income per capita. The PCA of the standardized values looks quite different: the observations spread across the whole biplot and more variables are associated with the two principal components. Before performing PCA it is therefore necessary to standardize the variables so that their values become comparable. (Before doing this exercise, I did not know why I needed to standardize first! Now I understand.)
The arrows represent the original features of the data. The angle between two arrows can be interpreted as the correlation between the features: a small angle means a positive correlation. Similarly, the angle between a feature and a principal component reflects the correlation between the two, and the length of an arrow is proportional to the standard deviation of the feature.
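This interpretation is only approximate when just two of the eight components are shown, but it can be checked roughly against the correlation matrix. A sketch on the standardized data:

```r
# Cosine of the angle between two loading vectors in the PC1-PC2 plane;
# with only 2 of 8 components this is a rough proxy for the correlation.
ld <- pca_stzd_human$rotation[c("exp_life", "mat_mor"), 1:2]
cos_angle <- sum(ld[1, ] * ld[2, ]) /
  (sqrt(sum(ld[1, ]^2)) * sqrt(sum(ld[2, ]^2)))
cos_angle  # compare with cor(human)["exp_life", "mat_mor"]
```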
(Step 5a)
The ‘tea’ data on tea consumption behaviour were loaded from the FactoMineR package. The data consist of 300 observations of 36 variables.
library(FactoMineR)
data(tea)
dim(tea)
## [1] 300 36
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
(Step 5b)
Warning in the following chunk:
Warning: attributes are not identical across measure variables;
they will be dropped
WHY? The warning occurred when knitting index.rmd; everything was fine when I knitted chapter5.rmd.
I had to add #### so that all the following code would be displayed without being run.
# column names to keep in the dataset
#### keep_tea <- c("Tea", "How", "how", "price", "where", "lunch")
#### keep_tea
# select and create a new dataset
#### tea_time <- select(tea, one_of(keep_tea))
#### gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar(fill = "darkblue") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
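A possible workaround (my assumption about the cause): gather() drops attributes because the factor columns have different level sets, so converting them to character first should avoid the warning. A sketch, assuming ‘tea_time’ has been created as above:

```r
# Convert all factor columns to character before gathering, so that
# no attributes need to be reconciled across measure variables.
tea_time_chr <- dplyr::mutate_all(tea_time, as.character)
gather(tea_time_chr) %>%
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar(fill = "darkblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
```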
(Step 5c)
#### mca_tea_time <- MCA(tea_time, graph = FALSE)
#### summary(mca_tea_time)
Visualize MCA
#### plot(mca_tea_time, invisible=c("ind"), habillage = "quali")
(Step 5d)
The individuals are coloured by their squared cosine (cos2), a value between 0 and 1: the closer to 1, the better the individual is projected onto the dimension.
#### fviz_mca_biplot(mca_tea_time, col.ind = "cos2", col.var = "darkred", label = "var", geom = c("point","text"), labelsize = 4, arrows = c(FALSE,TRUE)) + labs(title = "MCA of tea consumption behaviour") + theme_grey() + theme(axis.line = element_line(size = 0.5), panel.background = element_rect(fill = "gray93"))
(Step 5e)
“sex”: Gender of tea consumers “age_Q”: Age group of tea consumers “spirituality”: Spiritual reason “healthy”: Health reason “slimming”: To lose weight or not “relaxing”: Relaxation or not
(Same error in this chunk)
#### keep_reason <- c("sex", "age_Q", "spirituality", "healthy", "slimming", "relaxing")
#### tea_reason <- select(tea, one_of(keep_reason))
#### gather(tea_reason) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar(fill = "darkblue", alpha = 0.6) + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
MCA of ‘tea_reason’ data
#### mca_tea_reason <- MCA(tea_reason, graph = FALSE)
#### summary(mca_tea_reason)
Call: MCA(X = tea_reason, graph = FALSE)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
Variance 0.233 0.217 0.202 0.173 0.166 0.138
% of var. 15.562 14.476 13.456 11.502 11.066 9.187
Cumulative % of var. 15.562 30.039 43.495 54.997 66.063 75.251
Dim.7 Dim.8 Dim.9
Variance 0.132 0.126 0.113
% of var. 8.833 8.393 7.523
Cumulative % of var. 84.083 92.477 100.000
Individuals (the 10 first)
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
1 | 0.491 0.344 0.135 | 0.199 0.061 0.022 | 0.110
2 | 0.506 0.366 0.210 | 0.599 0.551 0.294 | -0.564
3 | 0.039 0.002 0.001 | 0.353 0.191 0.119 | -0.380
4 | -0.602 0.518 0.306 | -0.531 0.432 0.237 | 0.178
5 | 0.208 0.062 0.024 | -0.446 0.305 0.112 | -0.116
6 | -0.270 0.104 0.081 | -0.288 0.128 0.093 | 0.198
7 | 0.024 0.001 0.000 | -0.047 0.003 0.001 | 0.294
8 | -0.193 0.053 0.025 | 0.146 0.033 0.014 | -0.211
9 | 0.308 0.135 0.049 | -0.410 0.259 0.088 | 0.073
10 | 0.024 0.001 0.000 | -0.047 0.003 0.001 | 0.294
ctr cos2
1 0.020 0.007 |
2 0.526 0.261 |
3 0.239 0.138 |
4 0.052 0.027 |
5 0.022 0.008 |
6 0.065 0.044 |
7 0.143 0.054 |
8 0.074 0.030 |
9 0.009 0.003 |
10 0.143 0.054 |
Categories (the 10 first)
Dim.1 ctr cos2 v.test Dim.2 ctr
F | -0.256 2.767 0.095 -5.338 | 0.220 2.198
M | 0.373 4.037 0.095 5.338 | -0.321 3.208
15-24 | -0.834 15.237 0.308 -9.593 | -0.524 6.472
25-34 | 0.681 7.619 0.139 6.437 | -0.744 9.760
35-44 | 0.018 0.003 0.000 0.121 | 0.149 0.229
45-59 | 0.692 6.948 0.122 6.043 | 0.728 8.276
+60 | -0.346 1.086 0.017 -2.282 | 1.293 16.262
Not.spirituality | 0.302 4.458 0.199 7.719 | 0.212 2.374
spirituality | -0.661 9.770 0.199 -7.719 | -0.465 5.203
healthy | -0.247 3.050 0.142 -6.525 | 0.304 4.981
cos2 v.test Dim.3 ctr cos2 v.test
F 0.070 4.589 | -0.555 15.066 0.449 -11.582 |
M 0.070 -4.589 | 0.809 21.981 0.449 11.582 |
15-24 0.122 -6.030 | -0.522 6.893 0.120 -6.000 |
25-34 0.165 -7.027 | 0.983 18.367 0.289 9.294 |
35-44 0.003 1.014 | -0.261 0.750 0.010 -1.770 |
45-59 0.135 6.361 | -0.716 8.605 0.131 -6.254 |
+60 0.243 8.517 | 0.901 8.498 0.118 5.936 |
Not.spirituality 0.099 5.433 | 0.017 0.016 0.001 0.432 |
spirituality 0.099 -5.433 | -0.037 0.035 0.001 -0.432 |
healthy 0.216 8.043 | 0.179 1.853 0.075 4.730 |
Categorical variables (eta2)
Dim.1 Dim.2 Dim.3
sex | 0.095 0.070 0.449 |
age_Q | 0.433 0.534 0.522 |
spirituality | 0.199 0.099 0.001 |
healthy | 0.142 0.216 0.075 |
slimming | 0.100 0.272 0.107 |
relaxing | 0.431 0.111 0.058 |
MCA biplot of ‘tea_reason’ data
#### fviz_mca_biplot(mca_tea_reason, col.ind = "cos2", col.var = "darkred", alpha.var = "contrib", label = "var", geom = c("point","text"), labelsize = 3.5, arrows = c(FALSE,TRUE)) + labs(title = "MCA of tea consumption reasons") + theme_grey() + theme(axis.line = element_line(size = 0.5), panel.background = element_rect(fill = "gray93"))
Comment on the output
(Step 5f)
Variable categories close to each other show higher similarity. The plot includes two colour schemes representing the contribution of the variables to the dimensions (“contrib”) and the “cos2” of the individuals (the closer to 1, the better the projection onto the dimension).
People aged 15-24 were close to the spirituality and relaxation reasons. Men aged 25-34 tended not to drink tea for health reasons or to lose weight. People aged 45-59 tended not to drink tea for relaxation or spirituality reasons. People aged 60+ and women showed similarities with drinking tea for health reasons and weight loss.
Note: PCA and MCA biplots help to visualize the results much better than working in SPSS. Writing code in R is a bit more abstract, but it is worth learning.
(3.12.-10.12.2019)
Data wrangling
R script of data Wrangling exercise is available here.
https://github.com/yufanyin/IODS-project/blob/master/create_BPRS_rats_l.R
Reference to the data source
https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt
https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt
Note:
6.2-6.5 Analyses of chapter 8 of MABS using the RATS data (rats_l)
6.6-6.8 Analyses of chapter 9 of MABS using the BPRS data (bprs_l)
Load and check the data sets
# load necessary packages
library(tidyr)
library(dplyr)
library(corrplot)
library(ggplot2)
library(GGally)
library(knitr)
library(kableExtra)
library(ggthemes)
library(stringr)
library(lme4)
Load the wide (bprs & rats) and long (bprs_l & rats_l) data sets
bprs <- read.table("G:/C-Open Data Science/0-191030/IODS-project-master/BPRS.txt", sep = " ", header = TRUE)
rats <- read.table("G:/C-Open Data Science/0-191030/IODS-project-master/rats.txt", header = TRUE)
bprs_l <- read.table("G:/C-Open Data Science/0-191030/IODS-project-master/BPRS_l.txt")
rats_l <- read.table("G:/C-Open Data Science/0-191030/IODS-project-master/rats_l.txt")
str(bprs_l)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BPRS : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
str(rats_l)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : int 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
bprs data: factor treatment & subject
rats data: factor ID & Group
bprs_l$treatment <- factor(bprs_l$treatment)
bprs_l$subject <- factor(bprs_l$subject)
rats_l$ID <- factor(rats_l$ID)
rats_l$Group <- factor(rats_l$Group)
str(bprs_l)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BPRS : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
str(rats_l)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
The BPRS dataset
A total of 40 male subjects in 2 treatment groups were measured using the BPRS (Brief Psychiatric Rating Scale). One BPRS measurement was taken before treatment started (week 0), followed by weekly measurements for up to 8 weeks. The BPRS rates 18 symptom constructs, each on a scale from 1 (not present) to 7 (extremely severe), and is used to evaluate whether patients have schizophrenia (Vehkalahti & Everitt, 2019).
The long BPRS data ‘bprs_l’
The data consist of 360 observations of 5 variables:
“treatment”: the psychological treatment group of the male subjects (1 & 2)
“subject”: the identification number of the male subject
“weeks”: the week of the BPRS evaluation (as a factor, week0-week8)
“BPRS”: the BPRS value
“week”: the week number of the BPRS evaluation (as an integer)
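For reference, the wide-to-long reshaping done in the wrangling script follows this pattern (a sketch only; the exact code is in create_BPRS_rats_l.R, linked above):

```r
# Sketch of the wide-to-long conversion of the BPRS data
# (assumes the wide 'bprs' data with columns week0...week8).
bprs_l_sketch <- bprs %>%
  gather(key = weeks, value = BPRS, starts_with("week")) %>%
  mutate(week = as.integer(substr(weeks, 5, 5)))
```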
knitr::kable(bprs_l, caption = "The long BPRS data 'bprs_l'") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center") %>%
scroll_box(height = "200px")
| treatment | subject | weeks | BPRS | week |
|---|---|---|---|---|
| 1 | 1 | week0 | 42 | 0 |
| 1 | 2 | week0 | 58 | 0 |
| 1 | 3 | week0 | 54 | 0 |
| 1 | 4 | week0 | 55 | 0 |
| 1 | 5 | week0 | 72 | 0 |
| 1 | 6 | week0 | 48 | 0 |
| 1 | 7 | week0 | 71 | 0 |
| 1 | 8 | week0 | 30 | 0 |
| 1 | 9 | week0 | 41 | 0 |
| 1 | 10 | week0 | 57 | 0 |
| 1 | 11 | week0 | 30 | 0 |
| 1 | 12 | week0 | 55 | 0 |
| 1 | 13 | week0 | 36 | 0 |
| 1 | 14 | week0 | 38 | 0 |
| 1 | 15 | week0 | 66 | 0 |
| 1 | 16 | week0 | 41 | 0 |
| 1 | 17 | week0 | 45 | 0 |
| 1 | 18 | week0 | 39 | 0 |
| 1 | 19 | week0 | 24 | 0 |
| 1 | 20 | week0 | 38 | 0 |
| 2 | 1 | week0 | 52 | 0 |
| 2 | 2 | week0 | 30 | 0 |
| 2 | 3 | week0 | 65 | 0 |
| 2 | 4 | week0 | 37 | 0 |
| 2 | 5 | week0 | 59 | 0 |
| 2 | 6 | week0 | 30 | 0 |
| 2 | 7 | week0 | 69 | 0 |
| 2 | 8 | week0 | 62 | 0 |
| 2 | 9 | week0 | 38 | 0 |
| 2 | 10 | week0 | 65 | 0 |
| 2 | 11 | week0 | 78 | 0 |
| 2 | 12 | week0 | 38 | 0 |
| 2 | 13 | week0 | 63 | 0 |
| 2 | 14 | week0 | 40 | 0 |
| 2 | 15 | week0 | 40 | 0 |
| 2 | 16 | week0 | 54 | 0 |
| 2 | 17 | week0 | 33 | 0 |
| 2 | 18 | week0 | 28 | 0 |
| 2 | 19 | week0 | 52 | 0 |
| 2 | 20 | week0 | 47 | 0 |
| 1 | 1 | week1 | 36 | 1 |
| 1 | 2 | week1 | 68 | 1 |
| 1 | 3 | week1 | 55 | 1 |
| 1 | 4 | week1 | 77 | 1 |
| 1 | 5 | week1 | 75 | 1 |
| 1 | 6 | week1 | 43 | 1 |
| 1 | 7 | week1 | 61 | 1 |
| 1 | 8 | week1 | 36 | 1 |
| 1 | 9 | week1 | 43 | 1 |
| 1 | 10 | week1 | 51 | 1 |
| 1 | 11 | week1 | 34 | 1 |
| 1 | 12 | week1 | 52 | 1 |
| 1 | 13 | week1 | 32 | 1 |
| 1 | 14 | week1 | 35 | 1 |
| 1 | 15 | week1 | 68 | 1 |
| 1 | 16 | week1 | 35 | 1 |
| 1 | 17 | week1 | 38 | 1 |
| 1 | 18 | week1 | 35 | 1 |
| 1 | 19 | week1 | 28 | 1 |
| 1 | 20 | week1 | 34 | 1 |
| 2 | 1 | week1 | 73 | 1 |
| 2 | 2 | week1 | 23 | 1 |
| 2 | 3 | week1 | 31 | 1 |
| 2 | 4 | week1 | 31 | 1 |
| 2 | 5 | week1 | 67 | 1 |
| 2 | 6 | week1 | 33 | 1 |
| 2 | 7 | week1 | 52 | 1 |
| 2 | 8 | week1 | 54 | 1 |
| 2 | 9 | week1 | 40 | 1 |
| 2 | 10 | week1 | 44 | 1 |
| 2 | 11 | week1 | 95 | 1 |
| 2 | 12 | week1 | 41 | 1 |
| 2 | 13 | week1 | 65 | 1 |
| 2 | 14 | week1 | 37 | 1 |
| 2 | 15 | week1 | 36 | 1 |
| 2 | 16 | week1 | 45 | 1 |
| 2 | 17 | week1 | 41 | 1 |
| 2 | 18 | week1 | 30 | 1 |
| 2 | 19 | week1 | 43 | 1 |
| 2 | 20 | week1 | 36 | 1 |
| 1 | 1 | week2 | 36 | 2 |
| 1 | 2 | week2 | 61 | 2 |
| 1 | 3 | week2 | 41 | 2 |
| 1 | 4 | week2 | 49 | 2 |
| 1 | 5 | week2 | 72 | 2 |
| 1 | 6 | week2 | 41 | 2 |
| 1 | 7 | week2 | 47 | 2 |
| 1 | 8 | week2 | 38 | 2 |
| 1 | 9 | week2 | 39 | 2 |
| 1 | 10 | week2 | 51 | 2 |
| 1 | 11 | week2 | 34 | 2 |
| 1 | 12 | week2 | 49 | 2 |
| 1 | 13 | week2 | 36 | 2 |
| 1 | 14 | week2 | 36 | 2 |
| 1 | 15 | week2 | 65 | 2 |
| 1 | 16 | week2 | 45 | 2 |
| 1 | 17 | week2 | 46 | 2 |
| 1 | 18 | week2 | 27 | 2 |
| 1 | 19 | week2 | 31 | 2 |
| 1 | 20 | week2 | 27 | 2 |
| 2 | 1 | week2 | 42 | 2 |
| 2 | 2 | week2 | 32 | 2 |
| 2 | 3 | week2 | 33 | 2 |
| 2 | 4 | week2 | 27 | 2 |
| 2 | 5 | week2 | 58 | 2 |
| 2 | 6 | week2 | 37 | 2 |
| 2 | 7 | week2 | 41 | 2 |
| 2 | 8 | week2 | 49 | 2 |
| 2 | 9 | week2 | 38 | 2 |
| 2 | 10 | week2 | 31 | 2 |
| 2 | 11 | week2 | 75 | 2 |
| 2 | 12 | week2 | 36 | 2 |
| 2 | 13 | week2 | 60 | 2 |
| 2 | 14 | week2 | 31 | 2 |
| 2 | 15 | week2 | 55 | 2 |
| 2 | 16 | week2 | 35 | 2 |
| 2 | 17 | week2 | 30 | 2 |
| 2 | 18 | week2 | 29 | 2 |
| 2 | 19 | week2 | 26 | 2 |
| 2 | 20 | week2 | 32 | 2 |
| 1 | 1 | week3 | 43 | 3 |
| 1 | 2 | week3 | 55 | 3 |
| 1 | 3 | week3 | 38 | 3 |
| 1 | 4 | week3 | 54 | 3 |
| 1 | 5 | week3 | 65 | 3 |
| 1 | 6 | week3 | 38 | 3 |
| 1 | 7 | week3 | 30 | 3 |
| 1 | 8 | week3 | 38 | 3 |
| 1 | 9 | week3 | 35 | 3 |
| 1 | 10 | week3 | 55 | 3 |
| 1 | 11 | week3 | 41 | 3 |
| 1 | 12 | week3 | 54 | 3 |
| 1 | 13 | week3 | 31 | 3 |
| 1 | 14 | week3 | 34 | 3 |
| 1 | 15 | week3 | 49 | 3 |
| 1 | 16 | week3 | 42 | 3 |
| 1 | 17 | week3 | 38 | 3 |
| 1 | 18 | week3 | 25 | 3 |
| 1 | 19 | week3 | 28 | 3 |
| 1 | 20 | week3 | 25 | 3 |
| 2 | 1 | week3 | 41 | 3 |
| 2 | 2 | week3 | 24 | 3 |
| 2 | 3 | week3 | 28 | 3 |
| 2 | 4 | week3 | 31 | 3 |
| 2 | 5 | week3 | 61 | 3 |
| 2 | 6 | week3 | 33 | 3 |
| 2 | 7 | week3 | 33 | 3 |
| 2 | 8 | week3 | 39 | 3 |
| 2 | 9 | week3 | 27 | 3 |
| 2 | 10 | week3 | 34 | 3 |
| 2 | 11 | week3 | 76 | 3 |
| 2 | 12 | week3 | 27 | 3 |
| 2 | 13 | week3 | 53 | 3 |
| 2 | 14 | week3 | 38 | 3 |
| 2 | 15 | week3 | 55 | 3 |
| 2 | 16 | week3 | 27 | 3 |
| 2 | 17 | week3 | 32 | 3 |
| 2 | 18 | week3 | 33 | 3 |
| 2 | 19 | week3 | 27 | 3 |
| 2 | 20 | week3 | 29 | 3 |
| 1 | 1 | week4 | 41 | 4 |
| 1 | 2 | week4 | 43 | 4 |
| 1 | 3 | week4 | 43 | 4 |
| 1 | 4 | week4 | 56 | 4 |
| 1 | 5 | week4 | 50 | 4 |
| 1 | 6 | week4 | 36 | 4 |
| 1 | 7 | week4 | 27 | 4 |
| 1 | 8 | week4 | 31 | 4 |
| 1 | 9 | week4 | 28 | 4 |
| 1 | 10 | week4 | 53 | 4 |
| 1 | 11 | week4 | 36 | 4 |
| 1 | 12 | week4 | 48 | 4 |
| 1 | 13 | week4 | 25 | 4 |
| 1 | 14 | week4 | 25 | 4 |
| 1 | 15 | week4 | 36 | 4 |
| 1 | 16 | week4 | 31 | 4 |
| 1 | 17 | week4 | 40 | 4 |
| 1 | 18 | week4 | 29 | 4 |
| 1 | 19 | week4 | 29 | 4 |
| 1 | 20 | week4 | 25 | 4 |
| 2 | 1 | week4 | 39 | 4 |
| 2 | 2 | week4 | 20 | 4 |
| 2 | 3 | week4 | 22 | 4 |
| 2 | 4 | week4 | 31 | 4 |
| 2 | 5 | week4 | 49 | 4 |
| 2 | 6 | week4 | 28 | 4 |
| 2 | 7 | week4 | 34 | 4 |
| 2 | 8 | week4 | 55 | 4 |
| 2 | 9 | week4 | 31 | 4 |
| 2 | 10 | week4 | 39 | 4 |
| 2 | 11 | week4 | 66 | 4 |
| 2 | 12 | week4 | 29 | 4 |
| 2 | 13 | week4 | 52 | 4 |
| 2 | 14 | week4 | 35 | 4 |
| 2 | 15 | week4 | 42 | 4 |
| 2 | 16 | week4 | 25 | 4 |
| 2 | 17 | week4 | 46 | 4 |
| 2 | 18 | week4 | 30 | 4 |
| 2 | 19 | week4 | 24 | 4 |
| 2 | 20 | week4 | 25 | 4 |
| 1 | 1 | week5 | 40 | 5 |
| 1 | 2 | week5 | 34 | 5 |
| 1 | 3 | week5 | 28 | 5 |
| 1 | 4 | week5 | 50 | 5 |
| 1 | 5 | week5 | 39 | 5 |
| 1 | 6 | week5 | 29 | 5 |
| 1 | 7 | week5 | 40 | 5 |
| 1 | 8 | week5 | 26 | 5 |
| 1 | 9 | week5 | 22 | 5 |
| 1 | 10 | week5 | 43 | 5 |
| 1 | 11 | week5 | 36 | 5 |
| 1 | 12 | week5 | 43 | 5 |
| 1 | 13 | week5 | 25 | 5 |
| 1 | 14 | week5 | 27 | 5 |
| 1 | 15 | week5 | 32 | 5 |
| 1 | 16 | week5 | 31 | 5 |
| 1 | 17 | week5 | 33 | 5 |
| 1 | 18 | week5 | 28 | 5 |
| 1 | 19 | week5 | 21 | 5 |
| 1 | 20 | week5 | 27 | 5 |
| 2 | 1 | week5 | 38 | 5 |
| 2 | 2 | week5 | 20 | 5 |
| 2 | 3 | week5 | 25 | 5 |
| 2 | 4 | week5 | 26 | 5 |
| 2 | 5 | week5 | 38 | 5 |
| 2 | 6 | week5 | 26 | 5 |
| 2 | 7 | week5 | 37 | 5 |
| 2 | 8 | week5 | 51 | 5 |
| 2 | 9 | week5 | 24 | 5 |
| 2 | 10 | week5 | 34 | 5 |
| 2 | 11 | week5 | 64 | 5 |
| 2 | 12 | week5 | 27 | 5 |
| 2 | 13 | week5 | 32 | 5 |
| 2 | 14 | week5 | 30 | 5 |
| 2 | 15 | week5 | 30 | 5 |
| 2 | 16 | week5 | 22 | 5 |
| 2 | 17 | week5 | 43 | 5 |
| 2 | 18 | week5 | 26 | 5 |
| 2 | 19 | week5 | 32 | 5 |
| 2 | 20 | week5 | 23 | 5 |
| 1 | 1 | week6 | 38 | 6 |
| 1 | 2 | week6 | 28 | 6 |
| 1 | 3 | week6 | 29 | 6 |
| 1 | 4 | week6 | 47 | 6 |
| 1 | 5 | week6 | 32 | 6 |
| 1 | 6 | week6 | 33 | 6 |
| 1 | 7 | week6 | 30 | 6 |
| 1 | 8 | week6 | 26 | 6 |
| 1 | 9 | week6 | 20 | 6 |
| 1 | 10 | week6 | 43 | 6 |
| 1 | 11 | week6 | 38 | 6 |
| 1 | 12 | week6 | 37 | 6 |
| 1 | 13 | week6 | 21 | 6 |
| 1 | 14 | week6 | 25 | 6 |
| 1 | 15 | week6 | 27 | 6 |
| 1 | 16 | week6 | 29 | 6 |
| 1 | 17 | week6 | 27 | 6 |
| 1 | 18 | week6 | 21 | 6 |
| 1 | 19 | week6 | 22 | 6 |
| 1 | 20 | week6 | 21 | 6 |
| 2 | 1 | week6 | 43 | 6 |
| 2 | 2 | week6 | 19 | 6 |
| 2 | 3 | week6 | 24 | 6 |
| 2 | 4 | week6 | 24 | 6 |
| 2 | 5 | week6 | 37 | 6 |
| 2 | 6 | week6 | 27 | 6 |
| 2 | 7 | week6 | 37 | 6 |
| 2 | 8 | week6 | 55 | 6 |
| 2 | 9 | week6 | 22 | 6 |
| 2 | 10 | week6 | 41 | 6 |
| 2 | 11 | week6 | 64 | 6 |
| 2 | 12 | week6 | 21 | 6 |
| 2 | 13 | week6 | 37 | 6 |
| 2 | 14 | week6 | 33 | 6 |
| 2 | 15 | week6 | 26 | 6 |
| 2 | 16 | week6 | 22 | 6 |
| 2 | 17 | week6 | 43 | 6 |
| 2 | 18 | week6 | 36 | 6 |
| 2 | 19 | week6 | 21 | 6 |
| 2 | 20 | week6 | 23 | 6 |
| 1 | 1 | week7 | 47 | 7 |
| 1 | 2 | week7 | 28 | 7 |
| 1 | 3 | week7 | 25 | 7 |
| 1 | 4 | week7 | 42 | 7 |
| 1 | 5 | week7 | 38 | 7 |
| 1 | 6 | week7 | 27 | 7 |
| 1 | 7 | week7 | 31 | 7 |
| 1 | 8 | week7 | 25 | 7 |
| 1 | 9 | week7 | 23 | 7 |
| 1 | 10 | week7 | 39 | 7 |
| 1 | 11 | week7 | 36 | 7 |
| 1 | 12 | week7 | 36 | 7 |
| 1 | 13 | week7 | 19 | 7 |
| 1 | 14 | week7 | 26 | 7 |
| 1 | 15 | week7 | 30 | 7 |
| 1 | 16 | week7 | 26 | 7 |
| 1 | 17 | week7 | 31 | 7 |
| 1 | 18 | week7 | 25 | 7 |
| 1 | 19 | week7 | 23 | 7 |
| 1 | 20 | week7 | 19 | 7 |
| 2 | 1 | week7 | 62 | 7 |
| 2 | 2 | week7 | 18 | 7 |
| 2 | 3 | week7 | 31 | 7 |
| 2 | 4 | week7 | 26 | 7 |
| 2 | 5 | week7 | 36 | 7 |
| 2 | 6 | week7 | 23 | 7 |
| 2 | 7 | week7 | 38 | 7 |
| 2 | 8 | week7 | 59 | 7 |
| 2 | 9 | week7 | 21 | 7 |
| 2 | 10 | week7 | 42 | 7 |
| 2 | 11 | week7 | 60 | 7 |
| 2 | 12 | week7 | 22 | 7 |
| 2 | 13 | week7 | 52 | 7 |
| 2 | 14 | week7 | 30 | 7 |
| 2 | 15 | week7 | 30 | 7 |
| 2 | 16 | week7 | 22 | 7 |
| 2 | 17 | week7 | 43 | 7 |
| 2 | 18 | week7 | 33 | 7 |
| 2 | 19 | week7 | 21 | 7 |
| 2 | 20 | week7 | 23 | 7 |
| 1 | 1 | week8 | 51 | 8 |
| 1 | 2 | week8 | 28 | 8 |
| 1 | 3 | week8 | 24 | 8 |
| 1 | 4 | week8 | 46 | 8 |
| 1 | 5 | week8 | 32 | 8 |
| 1 | 6 | week8 | 25 | 8 |
| 1 | 7 | week8 | 31 | 8 |
| 1 | 8 | week8 | 24 | 8 |
| 1 | 9 | week8 | 21 | 8 |
| 1 | 10 | week8 | 32 | 8 |
| 1 | 11 | week8 | 36 | 8 |
| 1 | 12 | week8 | 31 | 8 |
| 1 | 13 | week8 | 22 | 8 |
| 1 | 14 | week8 | 26 | 8 |
| 1 | 15 | week8 | 37 | 8 |
| 1 | 16 | week8 | 30 | 8 |
| 1 | 17 | week8 | 27 | 8 |
| 1 | 18 | week8 | 20 | 8 |
| 1 | 19 | week8 | 22 | 8 |
| 1 | 20 | week8 | 21 | 8 |
| 2 | 1 | week8 | 50 | 8 |
| 2 | 2 | week8 | 20 | 8 |
| 2 | 3 | week8 | 32 | 8 |
| 2 | 4 | week8 | 23 | 8 |
| 2 | 5 | week8 | 35 | 8 |
| 2 | 6 | week8 | 21 | 8 |
| 2 | 7 | week8 | 35 | 8 |
| 2 | 8 | week8 | 66 | 8 |
| 2 | 9 | week8 | 21 | 8 |
| 2 | 10 | week8 | 39 | 8 |
| 2 | 11 | week8 | 75 | 8 |
| 2 | 12 | week8 | 23 | 8 |
| 2 | 13 | week8 | 28 | 8 |
| 2 | 14 | week8 | 27 | 8 |
| 2 | 15 | week8 | 37 | 8 |
| 2 | 16 | week8 | 22 | 8 |
| 2 | 17 | week8 | 43 | 8 |
| 2 | 18 | week8 | 30 | 8 |
| 2 | 19 | week8 | 21 | 8 |
| 2 | 20 | week8 | 23 | 8 |
The RATS dataset
The RATS data consist of measurements of rat growth. Three groups of rats received different diets, and the body weight of each animal was measured over a period of nine weeks. The research question is whether the growth profiles of the diet groups differ (Vehkalahti & Everitt, 2019).
The long RATS data ‘rats_l’
The data consist of 176 observations of 5 variables:

- “ID”: the individual rat’s identification number
- “Group”: the diet group (1, 2 or 3)
- “WD”: the measurement day as a factor (the day on which the rat’s weight was measured)
- “Weight”: the weight of the individual rat on the given day
- “Time”: the time point of the weight measurement (days)
# the long RATS data set 'rats_l'
knitr::kable(rats_l, caption = "The long RATS data 'rats_l'") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center") %>%
scroll_box(height = "200px")
| ID | Group | WD | Weight | Time |
|---|---|---|---|---|
| 1 | 1 | WD1 | 240 | 1 |
| 2 | 1 | WD1 | 225 | 1 |
| 3 | 1 | WD1 | 245 | 1 |
| 4 | 1 | WD1 | 260 | 1 |
| 5 | 1 | WD1 | 255 | 1 |
| 6 | 1 | WD1 | 260 | 1 |
| 7 | 1 | WD1 | 275 | 1 |
| 8 | 1 | WD1 | 245 | 1 |
| 9 | 2 | WD1 | 410 | 1 |
| 10 | 2 | WD1 | 405 | 1 |
| 11 | 2 | WD1 | 445 | 1 |
| 12 | 2 | WD1 | 555 | 1 |
| 13 | 3 | WD1 | 470 | 1 |
| 14 | 3 | WD1 | 535 | 1 |
| 15 | 3 | WD1 | 520 | 1 |
| 16 | 3 | WD1 | 510 | 1 |
| 1 | 1 | WD8 | 250 | 8 |
| 2 | 1 | WD8 | 230 | 8 |
| 3 | 1 | WD8 | 250 | 8 |
| 4 | 1 | WD8 | 255 | 8 |
| 5 | 1 | WD8 | 260 | 8 |
| 6 | 1 | WD8 | 265 | 8 |
| 7 | 1 | WD8 | 275 | 8 |
| 8 | 1 | WD8 | 255 | 8 |
| 9 | 2 | WD8 | 415 | 8 |
| 10 | 2 | WD8 | 420 | 8 |
| 11 | 2 | WD8 | 445 | 8 |
| 12 | 2 | WD8 | 560 | 8 |
| 13 | 3 | WD8 | 465 | 8 |
| 14 | 3 | WD8 | 525 | 8 |
| 15 | 3 | WD8 | 525 | 8 |
| 16 | 3 | WD8 | 510 | 8 |
| 1 | 1 | WD15 | 255 | 15 |
| 2 | 1 | WD15 | 230 | 15 |
| 3 | 1 | WD15 | 250 | 15 |
| 4 | 1 | WD15 | 255 | 15 |
| 5 | 1 | WD15 | 255 | 15 |
| 6 | 1 | WD15 | 270 | 15 |
| 7 | 1 | WD15 | 260 | 15 |
| 8 | 1 | WD15 | 260 | 15 |
| 9 | 2 | WD15 | 425 | 15 |
| 10 | 2 | WD15 | 430 | 15 |
| 11 | 2 | WD15 | 450 | 15 |
| 12 | 2 | WD15 | 565 | 15 |
| 13 | 3 | WD15 | 475 | 15 |
| 14 | 3 | WD15 | 530 | 15 |
| 15 | 3 | WD15 | 530 | 15 |
| 16 | 3 | WD15 | 520 | 15 |
| 1 | 1 | WD22 | 260 | 22 |
| 2 | 1 | WD22 | 232 | 22 |
| 3 | 1 | WD22 | 255 | 22 |
| 4 | 1 | WD22 | 265 | 22 |
| 5 | 1 | WD22 | 270 | 22 |
| 6 | 1 | WD22 | 275 | 22 |
| 7 | 1 | WD22 | 270 | 22 |
| 8 | 1 | WD22 | 268 | 22 |
| 9 | 2 | WD22 | 428 | 22 |
| 10 | 2 | WD22 | 440 | 22 |
| 11 | 2 | WD22 | 452 | 22 |
| 12 | 2 | WD22 | 580 | 22 |
| 13 | 3 | WD22 | 485 | 22 |
| 14 | 3 | WD22 | 533 | 22 |
| 15 | 3 | WD22 | 540 | 22 |
| 16 | 3 | WD22 | 515 | 22 |
| 1 | 1 | WD29 | 262 | 29 |
| 2 | 1 | WD29 | 240 | 29 |
| 3 | 1 | WD29 | 262 | 29 |
| 4 | 1 | WD29 | 265 | 29 |
| 5 | 1 | WD29 | 270 | 29 |
| 6 | 1 | WD29 | 275 | 29 |
| 7 | 1 | WD29 | 273 | 29 |
| 8 | 1 | WD29 | 270 | 29 |
| 9 | 2 | WD29 | 438 | 29 |
| 10 | 2 | WD29 | 448 | 29 |
| 11 | 2 | WD29 | 455 | 29 |
| 12 | 2 | WD29 | 590 | 29 |
| 13 | 3 | WD29 | 487 | 29 |
| 14 | 3 | WD29 | 535 | 29 |
| 15 | 3 | WD29 | 543 | 29 |
| 16 | 3 | WD29 | 530 | 29 |
| 1 | 1 | WD36 | 258 | 36 |
| 2 | 1 | WD36 | 240 | 36 |
| 3 | 1 | WD36 | 265 | 36 |
| 4 | 1 | WD36 | 268 | 36 |
| 5 | 1 | WD36 | 273 | 36 |
| 6 | 1 | WD36 | 277 | 36 |
| 7 | 1 | WD36 | 274 | 36 |
| 8 | 1 | WD36 | 265 | 36 |
| 9 | 2 | WD36 | 443 | 36 |
| 10 | 2 | WD36 | 460 | 36 |
| 11 | 2 | WD36 | 455 | 36 |
| 12 | 2 | WD36 | 597 | 36 |
| 13 | 3 | WD36 | 493 | 36 |
| 14 | 3 | WD36 | 540 | 36 |
| 15 | 3 | WD36 | 546 | 36 |
| 16 | 3 | WD36 | 538 | 36 |
| 1 | 1 | WD43 | 266 | 43 |
| 2 | 1 | WD43 | 243 | 43 |
| 3 | 1 | WD43 | 267 | 43 |
| 4 | 1 | WD43 | 270 | 43 |
| 5 | 1 | WD43 | 274 | 43 |
| 6 | 1 | WD43 | 278 | 43 |
| 7 | 1 | WD43 | 276 | 43 |
| 8 | 1 | WD43 | 265 | 43 |
| 9 | 2 | WD43 | 442 | 43 |
| 10 | 2 | WD43 | 458 | 43 |
| 11 | 2 | WD43 | 451 | 43 |
| 12 | 2 | WD43 | 595 | 43 |
| 13 | 3 | WD43 | 493 | 43 |
| 14 | 3 | WD43 | 525 | 43 |
| 15 | 3 | WD43 | 538 | 43 |
| 16 | 3 | WD43 | 535 | 43 |
| 1 | 1 | WD44 | 266 | 44 |
| 2 | 1 | WD44 | 244 | 44 |
| 3 | 1 | WD44 | 267 | 44 |
| 4 | 1 | WD44 | 272 | 44 |
| 5 | 1 | WD44 | 273 | 44 |
| 6 | 1 | WD44 | 278 | 44 |
| 7 | 1 | WD44 | 271 | 44 |
| 8 | 1 | WD44 | 267 | 44 |
| 9 | 2 | WD44 | 446 | 44 |
| 10 | 2 | WD44 | 464 | 44 |
| 11 | 2 | WD44 | 450 | 44 |
| 12 | 2 | WD44 | 595 | 44 |
| 13 | 3 | WD44 | 504 | 44 |
| 14 | 3 | WD44 | 530 | 44 |
| 15 | 3 | WD44 | 544 | 44 |
| 16 | 3 | WD44 | 542 | 44 |
| 1 | 1 | WD50 | 265 | 50 |
| 2 | 1 | WD50 | 238 | 50 |
| 3 | 1 | WD50 | 264 | 50 |
| 4 | 1 | WD50 | 274 | 50 |
| 5 | 1 | WD50 | 276 | 50 |
| 6 | 1 | WD50 | 284 | 50 |
| 7 | 1 | WD50 | 282 | 50 |
| 8 | 1 | WD50 | 273 | 50 |
| 9 | 2 | WD50 | 456 | 50 |
| 10 | 2 | WD50 | 475 | 50 |
| 11 | 2 | WD50 | 462 | 50 |
| 12 | 2 | WD50 | 612 | 50 |
| 13 | 3 | WD50 | 507 | 50 |
| 14 | 3 | WD50 | 543 | 50 |
| 15 | 3 | WD50 | 553 | 50 |
| 16 | 3 | WD50 | 550 | 50 |
| 1 | 1 | WD57 | 272 | 57 |
| 2 | 1 | WD57 | 247 | 57 |
| 3 | 1 | WD57 | 268 | 57 |
| 4 | 1 | WD57 | 273 | 57 |
| 5 | 1 | WD57 | 278 | 57 |
| 6 | 1 | WD57 | 279 | 57 |
| 7 | 1 | WD57 | 281 | 57 |
| 8 | 1 | WD57 | 274 | 57 |
| 9 | 2 | WD57 | 468 | 57 |
| 10 | 2 | WD57 | 484 | 57 |
| 11 | 2 | WD57 | 466 | 57 |
| 12 | 2 | WD57 | 618 | 57 |
| 13 | 3 | WD57 | 518 | 57 |
| 14 | 3 | WD57 | 544 | 57 |
| 15 | 3 | WD57 | 555 | 57 |
| 16 | 3 | WD57 | 553 | 57 |
| 1 | 1 | WD64 | 278 | 64 |
| 2 | 1 | WD64 | 245 | 64 |
| 3 | 1 | WD64 | 269 | 64 |
| 4 | 1 | WD64 | 275 | 64 |
| 5 | 1 | WD64 | 280 | 64 |
| 6 | 1 | WD64 | 281 | 64 |
| 7 | 1 | WD64 | 284 | 64 |
| 8 | 1 | WD64 | 278 | 64 |
| 9 | 2 | WD64 | 478 | 64 |
| 10 | 2 | WD64 | 496 | 64 |
| 11 | 2 | WD64 | 472 | 64 |
| 12 | 2 | WD64 | 628 | 64 |
| 13 | 3 | WD64 | 525 | 64 |
| 14 | 3 | WD64 | 559 | 64 |
| 15 | 3 | WD64 | 548 | 64 |
| 16 | 3 | WD64 | 569 | 64 |
6.2-6.5
ggplot(rats_l, aes(x = Time, y = Weight, linetype = ID)) +
geom_line(color = "darkblue") +
scale_linetype_manual(values = rep(1:6, times = 3)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "right") +
scale_y_continuous(limits = c(min(rats_l$Weight), max(rats_l$Weight))) +
ggtitle("The long RATS data 'rats_l'")
The starting weights of the rats in groups 2 and 3 were higher than in group 1, and the weights in groups 2 and 3 increased at a fairly high rate. The weights in group 1 increased at a smaller rate.
Calculate StdWeight and add it as a new variable to ‘rats_l’
rats_l <- rats_l %>%
group_by(Time) %>%
mutate(StdWeight = ((Weight - mean(Weight))/sd(Weight))) %>%
ungroup()
str(rats_l)
## Classes 'tbl_df', 'tbl' and 'data.frame': 176 obs. of 6 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Weight : int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
## $ StdWeight: num -1.001 -1.12 -0.961 -0.842 -0.882 ...
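The group-wise standardization above can be sketched in base R with made-up numbers (the weights and time points here are hypothetical, not taken from RATS): within each time point, the standardized values should end up with mean 0 and standard deviation 1.

```r
# Hypothetical mini-data: two time points, three animals each
toy <- data.frame(Time   = rep(c(1, 8), each = 3),
                  Weight = c(240, 255, 270, 250, 260, 275))

# Standardize within each time point -- a base-R equivalent of
# group_by(Time) %>% mutate(StdWeight = (Weight - mean(Weight)) / sd(Weight))
toy$StdWeight <- ave(toy$Weight, toy$Time,
                     FUN = function(w) (w - mean(w)) / sd(w))

# Within each time point the standardized values have mean 0 and sd 1
m <- tapply(toy$StdWeight, toy$Time, mean)
s <- tapply(toy$StdWeight, toy$Time, sd)
stopifnot(all(abs(m) < 1e-10), all(abs(s - 1) < 1e-10))
```

This is why the StdWeight profiles of all groups sit on a common scale, making the growth tracks comparable across diet groups.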
ggplot(rats_l, aes(x = Time, y = StdWeight, linetype = ID)) +
geom_line(color = "darkred") +
scale_linetype_manual(values = rep(1:6, times=3)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "StdWeight: standardized Weight values") +
ggtitle("Standardized 'rats_l' Weight values") +
theme(panel.grid.major.y = element_line(colour = "darkred"))
Add Time as a categorical variable “Time1”
rats_l <- rats_l %>%
mutate(Time1 = factor(Time))
str(rats_l)
## Classes 'tbl_df', 'tbl' and 'data.frame': 176 obs. of 7 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Weight : int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
## $ StdWeight: num -1.001 -1.12 -0.961 -0.842 -0.882 ...
## $ Time1 : Factor w/ 11 levels "1","8","15","22",..: 1 1 1 1 1 1 1 1 1 1 ...
An overview of the rats’ weights over the whole measurement period, shown with boxplots; the diet groups are drawn in different colors.
ggplot(data = rats_l, aes(x = Time1, y = Weight, fill = Group)) +
geom_boxplot() +
ylab("Weight") +
xlab("Measurement time [days]") +
ggtitle("Rat weights") +
scale_fill_discrete(name = "Diet group") +
theme(legend.position = "right")
Calculate the mean weight and standard error of each diet group at each time point and store them in a summary data set
n_rats <- rats_l$Time %>% unique() %>% length() # number of measurement occasions (11), used as the SE divisor below
rats_s <- rats_l %>%
group_by(Group, Time) %>%
summarise(mean = mean(Weight), se = (sd(Weight)/sqrt(n_rats))) %>%
ungroup()
str(rats_s)
## Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 4 variables:
## $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ Time : int 1 8 15 22 29 36 43 44 50 57 ...
## $ mean : num 251 255 254 262 265 ...
## $ se : num 4.59 3.95 3.46 4.1 3.33 ...
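As a quick sanity check of the first summary row (group 1, day 1), the mean and SE can be recomputed by hand from the day-1 weights in the table above. Note that the divisor sqrt(11) reflects `n_rats` as computed in the script (the number of measurement occasions), which is what the printed SEs correspond to.

```r
# Group 1 weights on day 1, read from the rats_l table above
w <- c(240, 225, 245, 260, 255, 260, 275, 245)

m  <- mean(w)           # matches the first 'mean' entry (~251)
se <- sd(w) / sqrt(11)  # matches the first 'se' entry (~4.59)
round(c(mean = m, se = se), 2)
```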
ggplot(rats_s, aes(x = Time, y = mean, linetype = Group, color = Group, shape = Group)) +
geom_line(size = 0.6) +
scale_linetype_manual(name = "Diet group", values = c(1,2,3)) +
geom_point(size=1.5) +
scale_shape_manual(name = "Diet group", values = c(16,17,18)) +
scale_color_manual(name = "Diet group", values = c("darkblue", "darkred", "darkgreen")) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.8) +
theme(legend.position = "right") +
scale_y_continuous(name = "Mean (Weight) +/- SE (Weight)") +
ggtitle("Mean weight profiles of different diet groups")
This graph shows the weight development of the different diet groups over the measurement period. Group 1 started at a lower level and its weight increased only slightly. Groups 2 and 3 started from higher values and kept increasing. The standard errors in groups 2 and 3 were larger, indicating greater weight differences between individual rats within those diet groups.
Create a summary data set by Group and ID with the mean as the summary variable
rats_l10s <- rats_l %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise(mean=mean(Weight)) %>%
ungroup()
str(rats_l10s)
## Classes 'tbl_df', 'tbl' and 'data.frame': 16 obs. of 3 variables:
## $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ mean : num 263 239 262 267 271 ...
ggplot(rats_l10s, aes(x = Group, y = mean, fill = Group)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape = 23, size = 3, fill = "black") +
scale_y_continuous(name = "Mean (Weight) / Time 8-64 [days]") +
xlab("Diet group") +
scale_fill_discrete(name = "Diet group") +
ggtitle("Mean weights of diet groups excluding the first measurement")
The boxplot revealed outliers that could distort further comparisons between the diet groups, so they were filtered out.
rats_l10s_o <- rats_l10s %>%
filter(
(mean > 250 & Group == 1) |
(mean < 550 & Group == 2) |
(mean > 500 & Group == 3)) %>%
ungroup()
str(rats_l10s_o)
## Classes 'tbl_df', 'tbl' and 'data.frame': 13 obs. of 3 variables:
## $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 2 2 2 ...
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 3 4 5 6 7 8 9 10 11 ...
## $ mean : num 263 262 267 271 276 ...
ggplot(rats_l10s_o, aes(x = Group, y = mean, fill = Group)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape = 23, size = 3, fill = "black") +
scale_y_continuous(name = "Mean (Weight) / Time 8-64 [days]") +
xlab("Diet group") +
scale_fill_discrete(name = "Diet group") +
ggtitle("Mean weights of diet groups without the outliers")
With the outliers removed, the boxplot of mean weights over days 8 to 64 no longer showed extreme values.
Note: I am more familiar with ANOVA than with the t-test, so the t-test is skipped here.
Create a new data table ‘rats_l10s_b’: the baseline weights, taken from the first weight measurement on day 1 (WD1), are added as a new variable to the summary data.
rats_l10s_b <- rats_l10s %>%
mutate(baseline = rats$WD1) # baseline taken from the wide-format data 'rats'
rats_l10s_b %>% knitr::kable() %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center") %>% scroll_box(height = "200px")
| Group | ID | mean | baseline |
|---|---|---|---|
| 1 | 1 | 263.2 | 240 |
| 1 | 2 | 238.9 | 225 |
| 1 | 3 | 261.7 | 245 |
| 1 | 4 | 267.2 | 260 |
| 1 | 5 | 270.9 | 255 |
| 1 | 6 | 276.2 | 260 |
| 1 | 7 | 274.6 | 275 |
| 1 | 8 | 267.5 | 245 |
| 2 | 9 | 443.9 | 410 |
| 2 | 10 | 457.5 | 405 |
| 2 | 11 | 455.8 | 445 |
| 2 | 12 | 594.0 | 555 |
| 3 | 13 | 495.2 | 470 |
| 3 | 14 | 536.4 | 535 |
| 3 | 15 | 542.2 | 520 |
| 3 | 16 | 536.2 | 510 |
To find out which diet group had the highest weight increase during the measurement period, ANOVA tests were run on this new data ‘rats_l10s_b’ using the mean weight values and the baseline weight values.
rats_lrm1 <- lm(mean ~ Group, data = rats_l10s_b)
round(rats_lrm1$coefficients, digits = 2) %>% knitr::kable() %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
| | x |
|---|---|
| (Intercept) | 265.02 |
| Group2 | 222.77 |
| Group3 | 262.48 |
The model showed that the mean weight of group 2 was about 223 g higher and that of group 3 about 262 g higher than that of group 1. To find out whether the diet groups differed from each other, the group means were then examined with a post-hoc test after the ANOVA.
rats_anova1 <- anova(rats_lrm1)
rats_anova1
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 238620 119310 88.525 2.679e-08 ***
## Residuals 13 17521 1348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA of the linear model (mean weight ~ diet group) showed that the diet group had a highly significant effect on the mean weight (p < 0.001).
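The F value in the ANOVA table can be reproduced directly from the printed sums of squares; it is the ratio of the two mean squares:

```r
# Sums of squares and degrees of freedom from the ANOVA table above
ms_group    <- 238620 / 2    # mean square for Group
ms_residual <- 17521 / 13    # residual mean square

f_value <- ms_group / ms_residual                        # ~88.5, as in the table
p_value <- pf(f_value, df1 = 2, df2 = 13, lower.tail = FALSE)
round(f_value, 2)
signif(p_value, 3)
```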
Perform an ANOVA and a post-hoc test (TukeyHSD) to find out whether the group means of the diet groups differed from each other. The data set ‘rats_l10s_b’ excluded the first measurement (starting value) but still included the outlier values.
rats_anova2 <- aov(mean ~ Group, data = rats_l10s_b)
summary(rats_anova2)
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 238620 119310 88.53 2.68e-08 ***
## Residuals 13 17521 1348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(rats_anova2, "Group", ordered = TRUE, conf.level = 0.95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
## factor levels have been ordered
##
## Fit: aov(formula = mean ~ Group, data = rats_l10s_b)
##
## $Group
## diff lwr upr p adj
## 2-1 222.775 163.41452 282.1355 0.0000006
## 3-1 262.475 203.11452 321.8355 0.0000001
## 3-2 39.700 -28.84358 108.2436 0.3099506
This post-hoc test gave the pairwise differences between the diet groups. Groups 1 and 2 differed significantly in mean weight, as did groups 1 and 3. Groups 2 and 3 did not differ significantly in mean weight, a result influenced by the outliers.
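The Tukey differences are consistent with the lm() coefficients printed earlier: each coefficient is a difference from group 1, so the 3-2 difference follows by subtraction:

```r
# Coefficients from the linear model output above
group2_vs_1 <- 222.77
group3_vs_1 <- 262.48

group3_vs_2 <- group3_vs_1 - group2_vs_1
round(group3_vs_2, 2)   # ~39.7, matching the 3-2 row of the TukeyHSD table
```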
The difference of each group relative to its baseline:
A linear model with the mean as the response and the baseline as a covariate
rats_lrm2 <- lm(mean ~ baseline + Group, data = rats_l10s_b)
round(rats_lrm2$coefficients, digits = 2) %>% knitr::kable() %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
| | x |
|---|---|
| (Intercept) | 33.16 |
| baseline | 0.93 |
| Group2 | 34.86 |
| Group3 | 23.68 |
This regression model related the mean weights to the baseline weights (the first weight measurement of each rat). Adjusted for baseline, the mean of group 2 was about 35 g higher and the mean of group 3 about 24 g higher than that of group 1. Group 2 therefore showed the highest weight increase relative to its starting weight.
rats_anova3 <- anova(rats_lrm2)
rats_anova3
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The baseline had a highly significant effect on the mean weight. After adjusting for baseline, the diet group effect was only marginally significant (p ≈ 0.076); diet 2 may lead to the highest weight increase, but the evidence for a group difference beyond the baseline is weak.
6.6-6.8
(10.12.2019)
str(bprs_l)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BPRS : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
pairs(bprs_l, col = bprs_l$subject)
Plot the ‘bprs_l’ data
ggplot(bprs_l, aes(x = week, y = BPRS, shape = subject, group = treatment)) +
geom_point(color = "darkblue") +
scale_shape_manual(values = rep(1:10, times = 2)) +
scale_x_continuous(name = "Weeks", breaks = seq(0,8,1)) +
scale_y_continuous(name = "BPRS value") +
theme(legend.position = "bottom") +
ggtitle("Overview of BPRS data") +
theme(legend.box.background = element_rect(),legend.box.margin = margin(2, 2, 2, 2))
ggplot(bprs_l, aes(x = week, y = BPRS, linetype = subject)) +
geom_line(color = "darkblue") +
scale_linetype_manual(values = rep(1:6, times = 4)) +
facet_grid(. ~ treatment, labeller = label_both) +
scale_x_continuous(name = "Weeks", breaks = seq(0,8,1)) +
scale_y_continuous(name = "BPRS values observed", breaks = seq(10,100,5)) +
theme(legend.position = "bottom") +
ggtitle("Overview of BPRS data by treatment") +
theme(legend.box.background = element_rect(),legend.box.margin = margin(2, 2, 2, 2))
ggplot(bprs_l, aes(x = week, y = BPRS)) +
geom_line(aes(linetype = treatment), color = "darkblue") +
scale_linetype_manual(values = rep(1:2, times = 1)) +
facet_grid(.~ subject) +
scale_x_continuous(name = "Weeks", breaks = seq(0,8,4)) +
scale_y_continuous(name = "BPRS values observed", breaks = seq(10,100,5)) +
theme(legend.position = "right") +
ggtitle("Overview of BPRS data by subject") +
theme(legend.box.background = element_rect(),legend.box.margin = margin(2, 2, 2, 2))
There were 40 subjects in 2 treatment groups (20 in each group). After the first BPRS determination (week 0), the treatment started and 8 more BPRS determinations followed.
In the first linear regression model, the BPRS value was the dependent variable, and week and treatment were the independent (explanatory) variables.
bprs_l_lrm1 <- lm(BPRS ~ week + treatment, data = bprs_l)
summary(bprs_l_lrm1)
##
## Call:
## lm(formula = BPRS ~ week + treatment, data = bprs_l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
The week of measurement had a significant effect on the BPRS values: on average, the BPRS value decreased by about 2.27 units per week. The treatment 2 coefficient was small and far from significant (p = 0.661), so this model gave no evidence of a difference between the treatments.
To explore whether the treatments helped the patients and whether the subjects themselves affected the model, the subjects were added as a random-effect term in a random intercept model.
In the random intercept model, the treatment and the measurement week were the explanatory variables, and the subjects (the 40 men undergoing the treatment and measurements) formed the random-effects term. This model allows each subject to have his own intercept, i.e. his own baseline BPRS level.
bprs_l_rim <- lmer(BPRS ~ week + treatment + (1 | subject), data = bprs_l, REML = FALSE)
summary(bprs_l_rim)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: BPRS ~ week + treatment + (1 | subject)
## Data: bprs_l
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
According to the summary, the standard error of the ‘week’ term in the random intercept model (0.2084) was smaller than in the linear regression model (0.2524). Similarly, the standard error of the treatment 2 term in the random intercept model (1.0761) was smaller than in the linear regression model (1.3034). This shows that the model now accounts for the subject effect.
In general, a mixed model takes the correlation between repeated observations on the same subject into account, so it can be more accurate than an ordinary linear regression model. Note that lmer() does not print p-values for the fixed effects; if needed, approximate p-values can be derived from the t values.
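One simple (and rough) way to get approximate p-values from the lmer() output is to treat the fixed-effect t values as standard normal. For the ‘week’ term above (t = -10.896):

```r
# Rough two-sided p-value from a fixed-effect t value, treating it
# as approximately standard normal (a common rough approximation)
t_week <- -10.896
p_week <- 2 * pnorm(abs(t_week), lower.tail = FALSE)
p_week < 0.001   # the weekly BPRS decline is clearly non-zero
```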
Fit a random intercept and random slope model to the BPRS values. This allows the regression line of every patient (‘subject’) to differ in both intercept and slope, so subject-level differences in the BPRS level and in its development over the measurement period are taken into account. In this model, the random-effects term contains both the intercept and the week slope for each subject.
Each of the 20 subject profiles thus gets its own fitted line over the nine weekly measurements (weeks 0-8).
bprs_l_rim1 <- lmer(BPRS ~ week + treatment + (week | subject), data = bprs_l, REML = FALSE)
summary(bprs_l_rim1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: BPRS ~ week + treatment + (week | subject)
## Data: bprs_l
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7977
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9803 -0.51
## Residual 97.4304 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
The fixed-effects estimates of this model differed little from those of the model above. An ANOVA was run to compare the two models.
anova(bprs_l_rim1, bprs_l_rim)
## Data: bprs_l
## Models:
## bprs_l_rim: BPRS ~ week + treatment + (1 | subject)
## bprs_l_rim1: BPRS ~ week + treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## bprs_l_rim 5 2748.7 2768.1 -1369.4 2738.7
## bprs_l_rim1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA showed that the random intercept and random slope model (allowing each subject’s profile to have its own slope over the weeks) fitted the data significantly better (p ≈ 0.026).
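The likelihood-ratio test that anova() performs here can be reproduced by hand from the printed deviances: the chi-square statistic is the deviance difference, with degrees of freedom equal to the difference in parameter counts (7 - 5 = 2):

```r
# Deviances and parameter counts from the two model summaries above
dev_rim  <- 2738.7   # random intercept model (5 parameters)
dev_rim1 <- 2731.4   # random intercept + random slope model (7 parameters)

chisq <- dev_rim - dev_rim1
p <- pchisq(chisq, df = 2, lower.tail = FALSE)
round(c(Chisq = chisq, p = p), 4)   # ~7.27 and ~0.026, as in the anova() output
```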
In the following random intercept and random slope model, the interaction of treatment and time (week) was included to see whether the effect of treatment changed over the treatment period.
bprs_l_rim2 <- lmer(BPRS ~ week * treatment + (week | subject), data = bprs_l, REML = FALSE)
summary(bprs_l_rim2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: BPRS ~ week * treatment + (week | subject)
## Data: bprs_l
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
anova(bprs_l_rim2, bprs_l_rim1)
## Data: bprs_l
## Models:
## bprs_l_rim1: BPRS ~ week + treatment + (week | subject)
## bprs_l_rim2: BPRS ~ week * treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## bprs_l_rim1 7 2745.4 2772.6 -1365.7 2731.4
## bprs_l_rim2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA gives p = 0.075 for the interaction term, so the model with the ‘week × treatment’ interaction fits only marginally better: it is significant at the 10% level but not at the conventional 5% level. The evidence for a treatment-by-time interaction is therefore weak.
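As a quick sanity check (not part of the original analysis), the chi-square statistic and p-value reported by anova() above can be reproduced by hand from the two deviances:

```r
# Likelihood ratio test by hand, from the deviances printed by anova() above.
# The two models differ by one parameter (the week:treatment2 term), so df = 1.
lrt_stat <- 2731.4 - 2728.3                       # ~3.1; anova() reports 3.1712 from unrounded logLiks
p_value  <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)
round(p_value, 3)                                 # close to the 0.07495 shown above
```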
Plot the observed BPRS values again for comparison
ggplot(bprs_l, aes(x = week, y = BPRS, linetype = subject)) +
geom_line(color = "darkblue") +
scale_linetype_manual(values = rep(1:6, times = 4)) +
facet_grid(. ~ treatment, labeller = label_both) +
scale_x_continuous(name = "Weeks", breaks = seq(0,8,1)) +
scale_y_continuous(name = "BPRS values observed", breaks = seq(10,100,5)) +
theme(legend.position = "bottom") +
ggtitle("BPRS: original values by treatment") +
theme(legend.box.background = element_rect(),legend.box.margin = margin(2, 2, 2, 2))
Add the fitted BPRS values from the random intercept and random slope model with the ‘week × treatment’ interaction to the dataset
bprs_l_fit <- fitted(bprs_l_rim2)
bprs_l$fitted_bprs <- round(bprs_l_fit, digits = 2)
library(kableExtra) # provides kable_styling() and scroll_box()
bprs_l %>% knitr::kable() %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center") %>% scroll_box(height = "300px")
| treatment | subject | weeks | BPRS | week | fitted_bprs |
|---|---|---|---|---|---|
| 1 | 1 | week0 | 42 | 0 | 49.24 |
| 1 | 2 | week0 | 58 | 0 | 46.97 |
| 1 | 3 | week0 | 54 | 0 | 47.66 |
| 1 | 4 | week0 | 55 | 0 | 49.85 |
| 1 | 5 | week0 | 72 | 0 | 66.39 |
| … | … | … | … | … | … |
| 2 | 19 | week8 | 21 | 8 | 21.77 |
| 2 | 20 | week8 | 23 | 8 | 21.21 |

(First and last rows shown; the full 360-row table is rendered in a scroll box in the HTML output.)
Draw the plot with the fitted BPRS values
ggplot(bprs_l, aes(x = week, y = fitted_bprs, linetype = subject)) +
geom_line(color = "darkblue") +
scale_linetype_manual(values = rep(1:6, times = 4)) +
facet_grid(. ~ treatment, labeller = label_both) +
scale_x_continuous(name = "Weeks", breaks = seq(0,8,1)) +
scale_y_continuous(name = "BPRS values modeled", breaks = seq(10,100,5)) +
theme(legend.position = "bottom") +
ggtitle("BPRS: fitted values by treatment") +
theme(legend.box.background = element_rect(),legend.box.margin = margin(2, 2, 2, 2))
(10.12.2019)
I will use these analyses in my own research to gain a better understanding of my results, and to produce better tables and figures.
Very tired, very necessary.
:)