Chapter 1 Start up

Introduction Warmup

It is a good beginning. The DataCamp tasks are friendly to beginners.

Clustering and classification, and dimensionality reduction are the two parts I am most interested in.

I heard about this course via an email sent to HYMY students.

The link to my github repository:

https://github.com/yufanyin/IODS-project

The link to my course diary:

https://yufanyin.github.io/IODS-project/

Work through the DataCamp exercises:

DataCamp 1 Hello R

DataCamp 2 Getting Hooked on R


Chapter 2 Regression and model validation

(5.11.-11.11.2019)

Note: ‘(Step N)’ is used for reviewers to check my work quickly.

2.1 Reading the data

Read the data file (Step 1a)

learning2014 <- 
read.table(file = "G:/C-Open Data Science/0-191030/IODS-project-master/learning2014.txt", stringsAsFactors = TRUE) 
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

Describe the dataset briefly (Step 1b)

The data ‘learning2014’ consists of 166 observations and 7 variables. It contains students’ approaches to learning (deep, stra, surf), scores of attitude, points and some background information (gender, age).

2.2 Visualizations with ggplot2

(Step 2)

Correlations between the different variables are shown below with the ggpairs() function.

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
p1 <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)), upper = list(continuous = wrap("cor", size = 2.8)))
p1

[Figure: ggpairs overview of the learning2014 variables, coloured by gender]

The plot shows the distribution of all 166 observations for each variable. The colours represent gender: female students are shown in red and male students in blue-green. The upper right part presents how each variable correlates with the others, which helps to spot potentially significant regression relationships.

2.3 Simple regression with lm()

(Step 3)

Simple regression: points ~ gender. First we choose gender as the explanatory variable and fit a regression model in which exam points is the dependent variable. Result: not significant.

gender_lm <- lm(points ~ gender, data = learning2014)
summary(gender_lm)
## 
## Call:
## lm(formula = points ~ gender, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3273  -3.3273   0.5179   4.5179  10.6727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.3273     0.5613  39.776   <2e-16 ***
## genderM       1.1549     0.9664   1.195    0.234    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.887 on 164 degrees of freedom
## Multiple R-squared:  0.008632,   Adjusted R-squared:  0.002587 
## F-statistic: 1.428 on 1 and 164 DF,  p-value: 0.2338

Simple regression: points ~ attitude

attitude_lm <- lm(points ~ attitude, data = learning2014)
summary(attitude_lm)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

There is a statistically significant relationship between points and attitude (p = 4.12e-09). Next, draw a scatter plot of points versus attitude with a fitted regression line, and print out the model summary.

qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")

[Figure: scatter plot of points versus attitude with a fitted regression line]

model_point_attitude <- lm(points ~ attitude, data = learning2014)
model_point_attitude
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Coefficients:
## (Intercept)     attitude  
##      11.637        3.525
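As a quick sanity check on the fitted line, the expected exam points for a given attitude score can be computed directly from the printed coefficients; the attitude value 4 below is a hypothetical example, not taken from the data.

```r
# Predicted exam points = intercept + slope * attitude,
# using the coefficients printed above (11.637 and 3.525)
attitude_score <- 4  # hypothetical value for illustration
predicted_points <- 11.637 + 3.525 * attitude_score
predicted_points
## [1] 25.737
```

This is the same result that predict(model_point_attitude, newdata = ...) would give.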

2.4 Multiple regression

(Step 4)

Multiple regression model: points ~ attitude + stra. This multiple regression tests whether strategic learning (stra) influences exam points. The p-value (0.089) shows that the effect is not strongly significant. The multiple R-squared is slightly higher than in the points ~ attitude model, meaning the model explains a little more of the variance when strategic learning is taken into account.

model_p_att_stra <- lm(points ~ attitude + stra, data = learning2014)
model_p_att_stra_s <- summary(model_p_att_stra)
model_p_att_stra_s
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
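Whether adding the second predictor is worthwhile can also be judged with a nested-model F-test via anova(). A minimal self-contained sketch with simulated data (the variable names x1, x2 and y are hypothetical, standing in for attitude, stra and points):

```r
set.seed(1)  # for reproducibility
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 2 + 1.5 * d$x1 + 0.3 * d$x2 + rnorm(100)

m_small <- lm(y ~ x1, data = d)       # analogous to points ~ attitude
m_large <- lm(y ~ x1 + x2, data = d)  # analogous to points ~ attitude + stra

# F-test: does adding x2 significantly reduce the residual sum of squares?
anova(m_small, m_large)
```

The comparison is valid because the smaller model is nested inside the larger one.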

2.5 Diagnostic plots

(Step 5)

The diagnostic plots are in accordance with the assumptions of the regression model.

In the Residuals vs Fitted plot, the residuals scatter around the 0 line without a clear pattern, which supports the linearity assumption. A few outliers (35, 56, 145) stand out, but they do not seriously disturb the analysis.

According to the Normal Q-Q plot, most of the values fall on the straight line, so the normality assumption of the errors is reasonable; the same outliers deviate at the tails.

plot(model_p_att_stra, which=c(1,2,5))

[Figures: Residuals vs Fitted, Normal Q-Q, and Residuals vs Leverage diagnostic plots for the fitted model]


Chapter 3 Logistic regression

(12.11.-18.11.2019)

Note: ‘(Step N)’ is used for reviewers to check my work quickly.

3.1 Reading the data

Read the data file (Step 2a)

getwd()
## [1] "G:/C-Open Data Science/0-191030/IODS-project-master"
setwd("G:/C-Open Data Science/0-191030/IODS-project-master")
alc <- read.csv(file = "G:/C-Open Data Science/0-191030/IODS-project-master/alc.csv")

Print out the names of the variables in the data (Step 2b)

dim(alc)
## [1] 382  36
colnames(alc)
##  [1] "X"          "school"     "sex"        "age"        "address"   
##  [6] "famsize"    "Pstatus"    "Medu"       "Fedu"       "Mjob"      
## [11] "Fjob"       "reason"     "nursery"    "internet"   "guardian"  
## [16] "traveltime" "studytime"  "failures"   "schoolsup"  "famsup"    
## [21] "paid"       "activities" "higher"     "romantic"   "famrel"    
## [26] "freetime"   "goout"      "Dalc"       "Walc"       "health"    
## [31] "absences"   "G1"         "G2"         "G3"         "alc_use"   
## [36] "high_use"
str(alc)
## 'data.frame':    382 obs. of  36 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

Describe the data set briefly (Step 2c)

The data consist of 382 observations of 36 variables.

Note: The first column “X” (int 1 2 3 4 5 …) is a row-index column that write.csv() added when the wrangled data were saved; it carries no information of its own, leaving 35 substantive variables.

The data were collected using school reports and questionnaires from students in two Portuguese secondary schools. They include students’ grades, demographic characteristics, and social and school-related features.

The two original data sets record students’ performance in Mathematics (mat) and Portuguese language (por). In the data wrangling part they were merged into one data set, “alc”.

To analyse alcohol consumption, two new variables were created. “alc_use” is the average of the weekday (“Dalc”) and weekend (“Walc”) alcohol consumption. “high_use” is a logical variable that is TRUE when a student’s “alc_use” is higher than 2.
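The derivation of the two variables can be sketched with a tiny toy data frame; the values below are made up for illustration (the course itself uses dplyr::mutate for this step).

```r
# Base-R sketch of the derived variables, on made-up values
toy <- data.frame(Dalc = c(1, 3), Walc = c(2, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2  # average of weekday and weekend use
toy$high_use <- toy$alc_use > 2            # TRUE when the average exceeds 2
toy
## alc_use: 1.5 and 4.0; high_use: FALSE and TRUE
```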

Data source: https://archive.ics.uci.edu/ml/datasets/Student+Performance

3.2 Hypotheses about the relationships between alcohol consumption and specific variables

Choose variables and write hypotheses (Step 3)

The purpose of the following analysis is to explore the relationships between high/low alcohol consumption and some variables. I chose the following four variables:

‘sex’: student’s sex (binary: ‘F’ - female or ‘M’ - male)

‘failures’: number of past class failures (numeric: n if 1<=n<3, else 4)

‘goout’: going out with friends (numeric: from 1 - very low to 5 - very high)

‘absences’: number of school absences (numeric: from 0 to 93)

The hypotheses are as follows.

Hypothesis 1: There is a gender difference (‘sex’) in alcohol consumption. In social science, gender affects many behaviours and preferences and is therefore a frequently tested variable.

Hypothesis 2: The number of class failures (‘failures’) associates with alcohol consumption. Students with more class failures have higher alcohol consumption.

Hypothesis 3: The frequency of going out with friends (‘goout’) associates with alcohol consumption. Students going out with friends more often have higher alcohol consumption.

Hypothesis 4: The number of school absences (‘absences’) associates with alcohol consumption. Students with more school absences have higher alcohol consumption.

3.3 Exploring the relationships numerically and graphically

(Step 4)

library(tidyr)
library(dplyr)
library(ggplot2)
library(knitr)
library(kableExtra)

3.3.1 Alcohol consumption and gender

alc %>% group_by(sex) %>% summarise(count = n()) -> fm
knitr::kable(fm, caption="Students' gender distribution") %>% kable_styling(bootstrap_options = "hover", full_width = FALSE, position = "center")
Students’ gender distribution

sex   count
F     198
M     184

There were 198 female students and 184 male students.

ggplot(data = alc, aes(x = alc_use, fill = sex)) + geom_bar() + xlab("Alcohol consumption (1: lowest / 5: highest)") + ylab("Number of students") + ggtitle("Alcohol consumption and gender") + scale_fill_discrete("Gender", labels = c("Female", "Male"))

The figure presents the distribution of alcohol consumption by gender; 1 means the lowest consumption and 5 the highest.

ggplot(data = alc, aes(x = alc_use, fill = high_use)) + facet_wrap("sex") + geom_bar() + xlab("Student's alcohol consumption") + ggtitle("High alcohol consumption by gender") + ylab("Number of students") + scale_fill_discrete("High alcohol use")

This graph shows the numbers of male and female students by alcohol consumption. High alcohol consumption (high_use = TRUE) is shown in blue-green and lower consumption in red. More male students than female students have high alcohol consumption.

3.3.2 Alcohol consumption and number of class failures

ggplot(data = alc, aes(x = failures,  fill = sex)) + geom_bar() + facet_wrap("high_use") + ggtitle("Alcohol consumption and number of class failures") + xlab("Number of class failures") + ylab("Number of students") + scale_fill_discrete("Gender", labels = c("Female", "Male"))

3.3.3 Alcohol consumption and frequency of going out with friends

ggplot(data = alc, aes(x = goout,  fill = sex)) + geom_bar() + facet_wrap("high_use") + ggtitle("Alcohol consumption and frequency of going out with friends") + xlab("Frequency of going out with friends (1: very low / 5: very high)") + ylab("Number of students") + scale_fill_discrete("Gender", labels = c("Female", "Male"))

3.3.4 Alcohol consumption and number of school absences

ggplot(data = alc, aes(x = absences,  fill = sex)) + geom_bar() + facet_wrap("high_use") + ggtitle("Alcohol consumption and number of school absences") + xlab("Number of school absences (from 0 to 93)") + ylab("Number of students") + scale_fill_discrete("Student's gender", labels = c("Female", "Male"))

3.4 Exploring the relationship using logistic regression

3.4.1 The fitted model

(Step 5a)

Fit a logistic regression model with ‘sex’, ‘failures’, ‘goout’ and ‘absences’ as explanatory variables and ‘high_use’ as the target variable.

lr_glm <- glm(high_use ~ sex + failures + goout + absences, data = alc, family = "binomial")
lr_glm_sum <- summary(lr_glm)

3.4.2 The coefficients of the model

(Step 5b)

lr_glm_sum
## 
## Call:
## glm(formula = high_use ~ sex + failures + goout + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8046  -0.7661  -0.5196   0.7383   2.3862  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.21840    0.47865  -8.813  < 2e-16 ***
## sexM         0.97670    0.25898   3.771 0.000162 ***
## failures     0.28657    0.17170   1.669 0.095111 .  
## goout        0.71559    0.12089   5.919 3.24e-09 ***
## absences     0.06550    0.01685   3.888 0.000101 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 378.67  on 377  degrees of freedom
## AIC: 388.67
## 
## Number of Fisher Scoring iterations: 4
knitr::kable(lr_glm_sum$coefficients, digits = 3, align = "l", caption="Model coefficients of chosen variables") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
Model coefficients of chosen variables

            Estimate  Std. Error  z value  Pr(>|z|)
(Intercept) -4.218    0.479       -8.813   0.000
sexM        0.977     0.259       3.771    0.000
failures    0.287     0.172       1.669    0.095
goout       0.716     0.121       5.919    0.000
absences    0.066     0.017       3.888    0.000

The output shows

  • Being a male student instead of a female student significantly (p < 0.001) increased the probability of high alcohol consumption.

  • Class failures had a positive but only marginally significant effect on high consumption (p = 0.095).

  • The frequency of going out with friends had a significant positive effect on high alcohol consumption (p < 0.001).

  • School absences had a significant positive effect on high alcohol consumption (p < 0.001).
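The coefficients above are on the log-odds scale. As an illustration, the predicted probability of high use for a hypothetical student (male, no failures, goout = 4, absences = 5; these values are made up) can be computed by hand from the printed estimates:

```r
# log-odds = intercept + sexM + b_failures*0 + b_goout*4 + b_absences*5,
# using the coefficients printed above
log_odds <- -4.21840 + 0.97670 + 0.28657 * 0 + 0.71559 * 4 + 0.06550 * 5
prob <- 1 / (1 + exp(-log_odds))  # inverse logit transforms log-odds to probability
round(prob, 2)
## [1] 0.49
```

This matches what predict(lr_glm, newdata = ..., type = "response") would return for such a student.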

3.4.3 The odds ratios and confidence intervals

(Step 5c)

Odds ratios

or_lr <- coef(lr_glm) %>% exp
or_lr 
## (Intercept)        sexM    failures       goout    absences 
##  0.01472214  2.65567230  1.33184898  2.04538456  1.06769512

Confidence intervals

ci_lr <- confint(lr_glm) %>% exp
## Waiting for profiling to be done...
ci_lr
##                   2.5 %     97.5 %
## (Intercept) 0.005520616 0.03620626
## sexM        1.608518976 4.45004345
## failures    0.949933462 1.86808335
## goout       1.623552039 2.61096941
## absences    1.035225148 1.10698949

Print out the odds ratios with their confidence intervals

od_conf <- cbind(or_lr, ci_lr) 
colnames(od_conf) <- c("odds ratios","2.5 %", "97.5 %")
knitr::kable(od_conf, digits = 3, align = "l",  caption="Odds ratios and confidence intervals") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
Odds ratios and confidence intervals

            odds ratios  2.5 %  97.5 %
(Intercept) 0.015        0.006  0.036
sexM        2.656        1.609  4.450
failures    1.332        0.950  1.868
goout       2.045        1.624  2.611
absences    1.068        1.035  1.107
  • ‘sex’ has an odds ratio of 2.656, with a confidence interval from 1.609 to 4.450. This odds ratio tells us that the odds of high alcohol consumption for male students were about 2.7 times those of female students.

  • ‘failures’ has an odds ratio of 1.332, with a confidence interval from 0.950 to 1.868. Class failures tended to increase the risk of high alcohol consumption, but since the interval includes 1 the effect is not statistically significant.

  • ‘goout’ has an odds ratio of 2.045, with a confidence interval from 1.624 to 2.611. Going out with friends significantly increased the risk of high alcohol consumption.

  • ‘absences’ has an odds ratio of 1.068, with a confidence interval from 1.035 to 1.107. School absences also increased the probability of high alcohol consumption.

3.5 The predictive power of the model

(Step 6)

probs <- predict(lr_glm, type = "response")
alc <- mutate(alc, prob = probs)
alc <- mutate(alc, pred = prob > 0.5)
select(alc, age, failures, goout, absences, high_use, prob, pred) %>% tail(20)
##     age failures goout absences high_use       prob  pred
## 363  17        0     3        8    FALSE 0.17542917 FALSE
## 364  17        0     5       14    FALSE 0.56870202  TRUE
## 365  18        0     4        0    FALSE 0.20488145 FALSE
## 366  18        0     2        2    FALSE 0.06560625 FALSE
## 367  18        0     3        4     TRUE 0.14068205 FALSE
## 368  18        0     3        0    FALSE 0.11188342 FALSE
## 369  17        0     4       17     TRUE 0.43966501 FALSE
## 370  18        0     4        4     TRUE 0.47069539 FALSE
## 371  18        0     3        5    FALSE 0.31703390 FALSE
## 372  17        0     3        2    FALSE 0.27608939 FALSE
## 373  19        1     2        0    FALSE 0.17887835 FALSE
## 374  18        1     3       14     TRUE 0.52713626  TRUE
## 375  18        0     3        2    FALSE 0.12557744 FALSE
## 376  18        0     3        7    FALSE 0.16615452 FALSE
## 377  19        1     2        0    FALSE 0.07581170 FALSE
## 378  18        0     4        0    FALSE 0.20488145 FALSE
## 379  18        1     1        0    FALSE 0.03855880 FALSE
## 380  18        1     1        0    FALSE 0.03855880 FALSE
## 381  17        0     5        3     TRUE 0.63011977  TRUE
## 382  18        0     1        0     TRUE 0.07404728 FALSE
pred22 <- table(high_use = alc$high_use, pred = alc$pred)
knitr::kable(pred22, align = "l",  caption="2 x 2 cross tabulation" ) %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "center") %>% add_header_above(c(" " = 1, "Pred" = 2)) %>%  pack_rows("High use", 1, 2)
2 x 2 cross tabulation

                  Pred
high_use     FALSE   TRUE
  FALSE      251     19
  TRUE       62      50

The model produced 251 true negatives, 50 true positives, 19 false positives and 62 false negatives. In the 62 false negative cases, the students had high alcohol consumption but the model predicted they did not; in the 19 false positive cases, the model predicted high consumption where there was none.

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$prob)
## [1] 0.2120419

This shows that 21.2% of all predictions were incorrect (the training error).
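The same error rate can be recovered directly from the 2 x 2 table: the off-diagonal cells are the wrong predictions.

```r
# false negatives (62) + false positives (19) over all 382 students
training_error <- (62 + 19) / (251 + 19 + 62 + 50)
round(training_error, 3)
## [1] 0.212
```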

3.6 The 10-fold cross-validation on the model

(Step 7)

library(boot)
cross_val <- cv.glm(data = alc, cost = loss_func, glmfit = lr_glm, K = 10)
cross_val$delta[1]
## [1] 0.2172775

The average prediction error in 10-fold cross-validation is about 21.7%, slightly higher than the training error of 21.2%, as expected when predicting new observations.

3.7 Cross-validation to compare the performance of different logistic regression models

(Step 8)
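One way to compare the performance of different logistic regression models is to run the same 10-fold cross-validation on each and compare the resulting prediction errors. A self-contained sketch with simulated data (the variable names x1, x2 and y are hypothetical, not from the alc data):

```r
library(boot)

# same loss function as above: share of predictions on the wrong side of 0.5
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

set.seed(42)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.8 * d$x1))  # only x1 truly affects y

m1 <- glm(y ~ x1,      data = d, family = "binomial")
m2 <- glm(y ~ x1 + x2, data = d, family = "binomial")

err1 <- cv.glm(data = d, cost = loss_func, glmfit = m1, K = 10)$delta[1]
err2 <- cv.glm(data = d, cost = loss_func, glmfit = m2, K = 10)$delta[1]
c(err1, err2)  # the model without the noise variable usually does no worse
```

Repeating this for models with fewer and fewer predictors traces how prediction error changes with model complexity.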


Chapter 4 Clustering and classification

(19.11.-25.11.2019)

Note: ‘(Step N)’ is used for reviewers to check my work quickly.

4.1 Reading the data

Load the Boston data from the MASS package (Step 2a)

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

Print out the names of the variables in the data (Step 2b)

dim(Boston)
## [1] 506  14
colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Describe the dataset briefly (Step 2c)

The data consist of 506 observations of 14 variables, reporting information about housing (proportion of residential land, proportion of non-retail business, rooms per dwelling, proportion of old buildings, accessibility to employment centres and highways), population (crime rate, tax, pupil-teacher ratio, demographic structure) and environment (air quality, the Charles River) in Boston suburbs.

This data frame contains the following columns:

  • “crim”: per capita crime rate by town.

  • “zn”: proportion of residential land zoned for lots over 25,000 sq.ft.

  • “indus”: proportion of non-retail business acres per town.

  • “chas”: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  • “nox”: nitrogen oxides concentration (parts per 10 million).

  • “rm”: average number of rooms per dwelling.

  • “age”: proportion of owner-occupied units built prior to 1940.

  • “dis”: weighted mean of distances to five Boston employment centres.

  • “rad”: index of accessibility to radial highways.

  • “tax”: full-value property-tax rate per $10,000.

  • “ptratio”: pupil-teacher ratio by town.

  • “black”: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

  • “lstat”: lower status of the population (percent).

  • “medv”: median value of owner-occupied homes in $1000s.

4.2 Exploring the data numerically and graphically

4.2.1 Graphical overview of the data

(Step 3a)

library(tidyr)
library(dplyr)
library(corrplot)
library(ggplot2)
library(GGally)
ov_boston <- ggpairs(Boston, mapping = aes(), title ="Overview of Boston data", lower = list(combo = wrap("facethist", bins = 20)), upper = list(continuous = wrap("cor", size = 2.8)))
ov_boston

4.2.2 Summaries of the variables

(Step 3b)

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

4.2.3 Relationships between the variables

(Step 3c)

Calculate and print the correlation matrix

cor_matrix<-cor(Boston) %>% round(digits = 2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00

Compute p-values for the correlations with cor.mtest() and visualize the correlation matrix, marking correlations that are not significant at the chosen level.

library(corrplot)
# cor.mtest() expects the raw data, not the correlation matrix
p.mat <- cor.mtest(Boston)$p
corrplot(cor_matrix, method="circle", type="upper",  tl.cex = 0.6, p.mat = p.mat, sig.level = 0.01, title="Correlations of Boston data", mar=c(0,0,1,0))

The crime rate correlated strongly positively with the accessibility-to-highways index (rad), the property-tax rate (tax) and the lower-status percentage (lstat).

The nitrogen oxides concentration (nox) correlated strongly positively with the proportion of non-retail business acres (indus) and the proportion of old owner-occupied units (age), and negatively with the distance to employment centres (dis). The weighted mean of distances to the five employment centres (dis) in turn correlated positively with the proportion of large residential lots (zn) and negatively with indus, age and nox.
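A quick way to list the strongest pairwise correlations programmatically is sketched below; the built-in mtcars data stands in so the snippet runs on its own (with the Boston data loaded, cor(Boston) would be used instead):

```r
m <- cor(mtcars)                    # correlation matrix of a stand-in data set
m[lower.tri(m, diag = TRUE)] <- NA  # keep each variable pair once, drop the diagonal
strongest <- sort(abs(m), decreasing = TRUE)  # sort() silently drops the NAs
head(strongest, 3)                  # three largest absolute correlations
```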

4.3 Standardizing the dataset

4.3.1 Standardizing and printing out summaries

(Step 4a)

The variables were standardized with scale(), which subtracts the column mean from each value and divides by the column standard deviation, so that observations of the different variables can be compared with each other. The scaled variables have a mean of 0 and a standard deviation of 1, with both positive and negative values.

bos_scaled <- scale(Boston)
summary(bos_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
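The scale() transformation is just (x - mean(x)) / sd(x) applied column-wise, which is easy to verify on a toy vector:

```r
x <- c(1, 2, 3, 10)  # arbitrary toy values
manual <- (x - mean(x)) / sd(x)   # standardize by hand
all.equal(as.vector(scale(x)), manual)
## [1] TRUE
```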

4.3.2 Creating a categorical variable of the crime rate

(Step 4b)

Create a quantile vector of crim

class(bos_scaled)
## [1] "matrix"
bos_scaled <- as.data.frame(bos_scaled)
summary(bos_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
bins <- quantile(bos_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610

Create a categorical variable ‘crime’ and look at the table

crime <- cut(bos_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127

Remove original crim and add the new categorical value to scaled data

bos_scaled <- dplyr::select(bos_scaled, -crim)
bos_scaled <- data.frame(bos_scaled, crime)

We scaled the crime rate, computed its quantiles and used them as break points to cut the values into four categories: low (0-25%), medium low (25-50%), medium high (50-75%) and high (75-100%). The new categorical variable ‘crime’ then replaced the original ‘crim’ in the scaled data.

4.3.3 Dividing the dataset to train and test sets

(Step 4c)

Randomly choose 80% of the rows for the train set; the remaining rows form the test set

n_bos <- nrow(bos_scaled)
ind_bos <- sample(n_bos,  size = n_bos * 0.8)
boston_train <- bos_scaled[ind_bos,]
boston_test <- bos_scaled[-ind_bos,]

Save the correct classes from test data

correct_classes <- boston_test$crime
correct_classes
##   [1] low      low      med_low  med_high med_high med_high med_high
##   [8] med_high med_high low      low      med_low  med_low  low     
##  [15] low      med_low  med_low  low      low      med_low  low     
##  [22] low      low      med_low  med_low  med_low  med_low  med_low 
##  [29] med_low  med_high med_high med_high med_high med_high med_high
##  [36] med_high med_high med_high med_low  med_low  low      low     
##  [43] low      med_low  med_low  low      low      low      low     
##  [50] med_low  med_low  med_low  med_low  med_low  med_low  med_low 
##  [57] low      med_high med_high med_high med_high low      med_low 
##  [64] low      med_low  low      low      med_low  low      low     
##  [71] med_high med_high med_high low      low      low      low     
##  [78] low      low      high     high     high     high     high    
##  [85] high     high     high     high     high     high     high    
##  [92] high     high     high     high     high     high     high    
##  [99] high     high     high     high    
## Levels: low med_low med_high high

Remove the crime variable from test data set

boston_test <- dplyr::select(boston_test, -crime)

4.4 Linear discriminant analysis

(Step 5)

Fit linear discriminant analysis on the ‘boston_train’ data set using the categorical ‘crime’ as the target variable.

lda.fit <- lda(crime ~ ., data = boston_train)
lda.fit
## Call:
## lda(crime ~ ., data = boston_train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2376238 0.2475248 0.2574257 0.2574257 
## 
## Group means:
##                  zn      indus        chas        nox         rm
## low       0.8671860 -0.8485969 -0.19030615 -0.8630622  0.4101940
## med_low  -0.1172105 -0.2627239 -0.07547406 -0.5410452 -0.1847441
## med_high -0.3948894  0.1789197  0.21980846  0.3465785  0.1023238
## high     -0.4872402  1.0170690 -0.04518867  1.0263404 -0.3282094
##                 age        dis        rad        tax      ptratio
## low      -0.8339186  0.8406002 -0.6971471 -0.7348633 -0.452913751
## med_low  -0.4027126  0.3202213 -0.5500476 -0.4291467 -0.009946484
## med_high  0.3592201 -0.3511215 -0.3756131 -0.2909462 -0.249941864
## high      0.7905465 -0.8503519  1.6386213  1.5144083  0.781350740
##                black        lstat        medv
## low       0.36999537 -0.715425370  0.47076706
## med_low   0.31104125 -0.150826770 -0.03683837
## med_high  0.09915259  0.003071918  0.16883256
## high     -0.79657982  0.809836268 -0.66608744
## 
## Coefficients of linear discriminants:
##                  LD1         LD2          LD3
## zn       0.064713197  0.74524778 -0.908970410
## indus    0.005855833 -0.24857245  0.091501918
## chas    -0.092210600 -0.10691445  0.060138540
## nox      0.395961661 -0.72837304 -1.144675082
## rm      -0.089262903 -0.06315898 -0.259855052
## age      0.244696221 -0.23988852 -0.261379062
## dis     -0.057066293 -0.21888082 -0.007249824
## rad      3.079385046  0.76714473 -0.417017083
## tax     -0.087455797  0.11725343  0.880884627
## ptratio  0.128269592  0.01928150 -0.123383068
## black   -0.137291621  0.01118952  0.079985876
## lstat    0.218280464 -0.20881241  0.217992605
## medv     0.173541485 -0.40568071 -0.226754787
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9522 0.0354 0.0125

The function for lda biplot arrows

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

Target classes as numeric and plot the lda results

classes <- as.numeric(boston_train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes, main = "LDA biplot of Boston train set")

The linear discriminant analysis showed the separation of different crime rates.

Note: I do not know why the arrows were not displayed.

Error in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 = myscale * : plot.new has not been called yet

4.5 Predicting the classes with the LDA model

Now we use the fitted LDA model to predict the crime rate categories of the test set.

4.5.1 Predicting classes with test data

(Step 6a)

lda.pred <- predict(lda.fit, newdata = boston_test)

4.5.2 Cross tabulate the results with the crime categories

(Step 6b)

pred_table <- table(correct = correct_classes, predicted = lda.pred$class)
pred_table
##           predicted
## correct    low med_low med_high high
##   low       20      11        0    0
##   med_low    4      18        4    0
##   med_high   0       6       16    0
##   high       0       0        0   23

The predictions on the lower crime rates were not very good: most misclassified observations in the low and med_low classes were predicted one category too high (med_low or med_high, respectively). The predictions on the high crime rate, however, were perfect in this split.
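The overall accuracy implied by the cross tabulation can be recomputed from its diagonal. The sketch below re-enters the counts from the table by hand; with a random train/test split the exact numbers vary from run to run.

```r
# Re-entry of the confusion matrix above (rows = correct classes,
# columns = predicted classes); counts depend on the random split.
conf <- matrix(c(20, 11,  0,  0,
                  4, 18,  4,  0,
                  0,  6, 16,  0,
                  0,  0,  0, 23),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
# Overall accuracy = share of observations on the diagonal
sum(diag(conf)) / sum(conf)  # about 0.755 for this split
```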

4.6 Reloading the dataset and running the k-means algorithm

In 4.5 we performed classification. In classification, the classes are defined in advance (crime rate categories from low to high) and the model is trained on labelled data.

Now we do clustering: the classes, and even the number of classes, are unknown in advance, and groups are found based on the similarity of the observations.

4.6.1 Standardizing the Boston dataset

(Step 7a)

data(Boston)
bos_scaled2 <- scale(Boston)

4.6.2 Calculating the distances

(Step 7b)

Euclidean distance matrix

dist_eu <- dist(bos_scaled2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Manhattan distance matrix

dist_man <- dist(bos_scaled2, method = "manhattan")
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

4.6.3 K-means clustering

(Step 7c)

bos_km <- kmeans(bos_scaled2, centers = 3)

Plot the dataset with clusters

pairs(bos_scaled2, col = bos_km$cluster)

pairs(bos_scaled2[,1:5], col = bos_km$cluster)

pairs(bos_scaled2[,6:10], col = bos_km$cluster)

pairs(bos_scaled2[,11:14], col = bos_km$cluster)

We got the clearest overview with three clusters; the choice of k is examined more formally in 4.6.4.

4.6.4 Determining the k

set.seed(123)
k_max <- 10 # determine the number of clusters
twcss <- sapply(1:k_max, function(k){kmeans(bos_scaled2, k)$tot.withinss}) # calculate the total within sum of squares
qplot(x = 1:k_max, y = twcss, geom = 'line') # visualize the results

bos_km2 <- kmeans(bos_scaled2, centers = 3) # k-means clustering
pairs(bos_scaled2, col = bos_km2$cluster) # plot the data with clusters

pairs(bos_scaled2[,1:5], col = bos_km2$cluster)

pairs(bos_scaled2[,6:10], col = bos_km2$cluster)

pairs(bos_scaled2[,11:14], col = bos_km2$cluster)

The TWCSS value drops sharply up to 2-3 clusters and the curve flattens after that, so the optimal number of clusters was 3.

4.7 Performing k-means on the original data

(Bonus)

data(Boston)
bos_scaled3 <- scale(Boston)
bos_km3 <-kmeans(bos_scaled3, centers = 3)
cluster <- bos_km3$cluster
bos_scaled3 <- data.frame(bos_scaled3, cluster)
lda.fit_cluster <- lda(cluster ~ ., data = bos_scaled3)
lda.fit_cluster
## Call:
## lda(cluster ~ ., data = bos_scaled3)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.2470356 0.3260870 0.4268775 
## 
## Group means:
##         crim         zn      indus         chas        nox         rm
## 1 -0.3989700  1.2614609 -0.9791535 -0.020354653 -0.8573235  1.0090468
## 2  0.7982270 -0.4872402  1.1186734  0.014005495  1.1351215 -0.4596725
## 3 -0.3788713 -0.3578148 -0.2879024  0.001080671 -0.3709704 -0.2328004
##           age        dis        rad        tax     ptratio      black
## 1 -0.96130713  0.9497716 -0.5867985 -0.6709807 -0.80239137  0.3552363
## 2  0.79930921 -0.8549214  1.2113527  1.2873657  0.59162230 -0.6363367
## 3 -0.05427143  0.1034286 -0.5857564 -0.5951053  0.01241316  0.2805140
##        lstat        medv
## 1 -0.9571271  1.06668290
## 2  0.8622388 -0.67953738
## 3 -0.1047617 -0.09820229
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## crim    -0.03206338 -0.19094456
## zn       0.02935900 -1.07677218
## indus    0.63347352 -0.09917524
## chas     0.02460719  0.10009606
## nox      1.11749317 -0.75995105
## rm      -0.18841682 -0.57360135
## age     -0.12983139  0.47226685
## dis      0.04493809 -0.34585958
## rad      0.67004295 -0.08584353
## tax      1.03992455 -0.58075025
## ptratio  0.25864960 -0.02605279
## black   -0.01657236  0.01975686
## lstat    0.17365575 -0.41704235
## medv    -0.06819126 -0.79098605
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8506 0.1494
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes3 <- as.numeric(bos_scaled3$cluster)
plot(lda.fit_cluster, dimen = 2, col = classes3, pch = classes3, main = "LDA biplot using three clusters")
lda.arrows(lda.fit_cluster, myscale = 2)

4.8 3D plot

(Super-Bonus)

model_predictors <- dplyr::select(boston_train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

Next, install and access the plotly package. Create a 3D plot of the columns of the matrix product.

Note: After running:

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = boston_train$crime)

It showed:

WebGL is not supported by your browser. Therefore I could not see the plot in the knitted document.
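As a fallback when WebGL is unavailable, a static 3D scatter can be drawn with lattice::cloud (lattice ships with R). This is a sketch assuming the matrix_product and boston_train objects from the chunk above exist:

```r
# Static 3D scatter as a WebGL-free alternative (assumes matrix_product and
# boston_train exist from the previous chunk)
library(lattice)
cloud(LD3 ~ LD1 * LD2, data = matrix_product,
      groups = boston_train$crime, pch = 19,
      main = "LDA scores of the Boston train set")
```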


Chapter 5 Dimensionality reduction techniques

(25.11.-2.12.2019)

5.1 Exploring the data numerically and graphically

Load necessary packages

library(tidyr)
library(dplyr)
library(corrplot)
library(ggplot2)
library(GGally)
library(knitr)
library(kableExtra)
library(stringr)
library(ggfortify)
library(factoextra)

5.1.1 Reading the data

human <- read.table(file = "G:/C-Open Data Science/0-191030/IODS-project-master/human_ana.txt")

5.1.2 Summaries of the variables

(Step 1a)

The data are a combination of two data sets: the Human Development Index (HDI) and the Gender Inequality Index (GII) data.

The data consist of 155 observations in 8 variables.

“Country”: country name (stored as row names, not counted in the 8 variables)
“edu2F_edu2M”: ratio of females to males with at least secondary education
“labF_labM”: ratio of females to males in the labour force
“exp_edu”: expected years of education
“exp_life”: life expectancy at birth
“GNI”: Gross National Income per capita
“mat_mor”: maternal mortality ratio
“ado_birth”: adolescent birth rate
“F_parli”: percentage of female representatives in parliament

knitr::kable(summary(human)) %>% kable_styling(bootstrap_options = "striped", position = "center", font_size = 12)
edu2F_edu2M labF_labM exp_edu exp_life GNI mat_mor ado_birth F_parli
Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00 Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
Median :0.9375 Median :0.7535 Median :13.50 Median :74.20 Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65 Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50 Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50

5.1.3 Graphical overview of the data

(Step 1b)

ov_human <- ggpairs(human, mapping = aes(), title ="Overview of 'human' data", lower = list(combo = wrap("facethist", bins = 20)), upper = list(continuous = wrap("cor", size = 2.8)))
ov_human

The overview plot shows the distributions of all variables and their pairwise correlations. GNI, the maternal mortality ratio and the adolescent birth rate are right-skewed: most observations are concentrated at low values. The upper triangle shows the correlations between the variables.

5.1.4 Relationships between the variables

(Step 1c)

cor_human <- cor(human) %>% round(digits = 2)
cor_human %>% knitr::kable(caption = "Correlation between variables") %>% 
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
Correlation between variables
edu2F_edu2M labF_labM exp_edu exp_life GNI mat_mor ado_birth F_parli
edu2F_edu2M 1.00 0.01 0.59 0.58 0.43 -0.66 -0.53 0.08
labF_labM 0.01 1.00 0.05 -0.14 -0.02 0.24 0.12 0.25
exp_edu 0.59 0.05 1.00 0.79 0.62 -0.74 -0.70 0.21
exp_life 0.58 -0.14 0.79 1.00 0.63 -0.86 -0.73 0.17
GNI 0.43 -0.02 0.62 0.63 1.00 -0.50 -0.56 0.09
mat_mor -0.66 0.24 -0.74 -0.86 -0.50 1.00 0.76 -0.09
ado_birth -0.53 0.12 -0.70 -0.73 -0.56 0.76 1.00 -0.07
F_parli 0.08 0.25 0.21 0.17 0.09 -0.09 -0.07 1.00
p.mat <- cor.mtest(human)$p # p-values are computed from the data, not from the correlation matrix
corrplot(cor_human, method="circle",  type="lower",  tl.cex = 0.6, p.mat = p.mat, sig.level = 0.01, tl.srt = 45, title="Correlations of 'human' data", mar=c(0,0,1,0))

Expected years of education correlates strongly positively with life expectancy and strongly negatively with the maternal mortality ratio. This suggests that better education goes together with higher life expectancy and lower maternal mortality.

Another strong positive correlation was seen between the adolescent birth rate and the maternal mortality ratio; both may reflect poor access to medical care.

5.2 Principal component analysis (PCA) on non-standardized data

In PCA, the data are transformed to a new space with an equal or smaller number of dimensions, called the principal components. The first principal component captures the maximum amount of variance in the original features. The second principal component is orthogonal to the first and captures the maximum amount of the remaining variance.
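The idea can be demonstrated on a small toy data frame (hypothetical data, not the ‘human’ data): two nearly identical variables share most of the variance, so PC1 should capture it.

```r
# Toy demonstration with hypothetical data: two strongly correlated variables
# and one independent one; PC1 captures the shared variance
set.seed(1)
x <- rnorm(100)
toy <- data.frame(a = x, b = x + rnorm(100, sd = 0.1), c = rnorm(100))
pca_toy <- prcomp(toy, scale. = TRUE)
summary(pca_toy)$importance[2, ]  # proportion of variance per component
```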

5.2.1 Variability captured by the principal components

(Step 2a)

pca_human <- prcomp(human)
round(pca_human$rotation[,1:2], digits = 4) %>% knitr::kable(caption = "PCA result on all variables") %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
PCA result on all variables
PC1 PC2
edu2F_edu2M 0.0000 0.0007
labF_labM 0.0000 -0.0003
exp_edu -0.0001 0.0076
exp_life -0.0003 0.0283
GNI -1.0000 -0.0058
mat_mor 0.0057 -0.9916
ado_birth 0.0012 -0.1256
F_parli -0.0001 0.0032
s1 <- summary(pca_human)
pca_prc1 <- round(100 * s1$importance[2, ], digits = 3) 
round(100 * s1$importance[2, 1:2], digits = 5) %>% knitr::kable(caption = "Principal components") %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
Principal components
x
PC1 99.99
PC2 0.01

5.2.2 PCA Biplot displaying the observations by the first two principal components

(Step 2b)

The biplot displayed the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables.

pca_label1 <-  paste0("GNI (", pca_prc1[1],"%)" )
pca_label2 <-  paste0("Maternal mortality (", pca_prc1[2],"%)" )
biplot(pca_human, choices = 1:2, cex = c(0.55, 0.9), col = c("grey40", "darkgreen"), xlab = pca_label1, ylab = pca_label2, main = "PCA biplot of non-standardized human data", margin =c(0,0,5,0)) 

autoplot(pca_human, data = human, label= TRUE, label.size = 3.0, colour = "darkgreen", loadings = TRUE, loadings.label = TRUE, loadings.colour = "red") + ggtitle("PCA biplot of non-standardized human data 2") + xlab(paste0(pca_label1))  + ylab(paste0(pca_label2)) + theme_bw()

Principal component 1 represents 99.99% of the variance of the non-standardized data and is dominated by GNI; PC2 (dominated by the maternal mortality ratio) captures the remaining 0.01%. In other words, without standardization the Gross National Income per capita alone explains almost all of the variance, because its scale is much larger than that of the other variables.

5.3 Principal component analysis (PCA) on standardized data

5.3.1 Scaling the data

(Step 3a)

The summary of the standardized data is as follows (the mean of every variable is 0 and the standard deviation is 1).
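What scale() does can be verified directly: it subtracts the column mean and divides by the column standard deviation. A quick check on a hypothetical toy vector:

```r
# scale() is equivalent to (x - mean(x)) / sd(x), column by column
x <- c(2, 4, 6, 8)
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))  # TRUE
```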

stzd_human <- scale(human)
knitr::kable(summary(stzd_human)) %>% 
  kable_styling(bootstrap_options = "striped", position = "center", font_size = 11)
edu2F_edu2M labF_labM exp_edu exp_life GNI mat_mor ado_birth F_parli
Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188 Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056 Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218 Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850

5.3.2 Variability captured by the principal components

(Step 3b)

pca_stzd_human <- prcomp(stzd_human)
round(pca_stzd_human$rotation[, 1:2], digits = 4) %>% knitr::kable(caption = "PC1 & 2 of standardized 'human' data") %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
PC1 & 2 of standardized ‘human’ data
PC1 PC2
edu2F_edu2M -0.3566 0.0380
labF_labM 0.0546 0.7243
exp_edu -0.4277 0.1394
exp_life -0.4437 -0.0253
GNI -0.3505 0.0506
mat_mor 0.4370 0.1451
ado_birth 0.4113 0.0771
F_parli -0.0844 0.6514
s2 <- summary(pca_stzd_human)
pca_prc2 <- round(100 * s2$importance[2, ], digits = 1) 

pca_label2.1 <- paste0("Education and health (",pca_prc2[1],"%) ")
pca_label2.2 <- paste0("Female social participation (",pca_prc2[2],"%) ")

round(100 * s2$importance[2, 1:2], digits = 2) %>% knitr::kable(caption = "Principal components") %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
Principal components
x
PC1 53.61
PC2 16.24

5.3.3 PCA Biplot

(Step 3c)

biplot(pca_stzd_human, choices = 1:2, cex = c(0.5, 0.9), col = c("grey40", "deeppink2"), xlab = pca_label2.1, ylab = pca_label2.2, main = "PCA biplot of standardized 'human' data", margin =c(0,0,5,0))

autoplot(prcomp(stzd_human), data = human, colour = "darkgreen", label= TRUE, label.size = 3.0, loadings = TRUE, loadings.label = TRUE, loadings.colour = "red") + ggtitle("PCA biplot of standardized 'human' data 2") + xlab(paste0(pca_label2.1))  + ylab(paste0(pca_label2.2)) + theme_bw()

5.3.4 Interpretations of the first two principal component dimensions

(Step 4)

The principal component representing 16.2% of the data variation was strongly associated with the percentage of female parliament representatives (“F_parli”) and the female/male labour force ratio (“labF_labM”). So the share of women in the labour market and in political representation was mostly associated with principal component 2. The variables showing the strongest association with principal component 1, which represented 53.6% of the data variation, were life expectancy (“exp_life”), expected years of education (“exp_edu”), the female/male ratio of secondary education (“edu2F_edu2M”), adolescent birth rate (“ado_birth”) and maternal mortality ratio (“mat_mor”). Hence principal component 1 described the health and education situation in these countries.

The PCA results on non-standardized and standardized data differed clearly. In the PCA of the non-standardized human data, principal component 1 represents almost all of the variation and is dominated by Gross National Income per capita. The PCA of the standardized values looks quite different: the observations spread over the whole biplot and more variables are associated with the first two principal components. Before performing PCA it is necessary to standardize the variables so that their values become comparable; otherwise the variable with the largest scale dominates. (Before doing this exercise, I did not know why I needed to standardize first! Now I understand.)

The arrows represent the original features of the data, and the angle between two arrows can be interpreted as the correlation between those features: a small angle means a positive correlation. Similarly, the angle between a feature arrow and a principal component axis reflects the correlation between the feature and that component. The length of an arrow is proportional to the standard deviation of the feature.
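The angle-correlation link can be checked numerically (hypothetical random data): for standardized data, the full-dimensional arrow coordinates rotation %*% diag(sdev) reproduce the correlation matrix exactly, and a 2-D biplot draws only the first two columns of this, hence the approximation.

```r
# Inner products of the full arrow coordinates equal the correlations exactly;
# a 2-D biplot keeps only the first two columns, so angles approximate them
set.seed(2)
Z <- scale(matrix(rnorm(300), ncol = 3))
p <- prcomp(Z)
arrows_full <- p$rotation %*% diag(p$sdev)
max(abs(arrows_full %*% t(arrows_full) - cor(Z)))  # practically zero
```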

5.4 Multiple Correspondence Analysis (MCA)

5.4.1 Structure and dimensions of ‘tea’ data

(Step 5a)

The ‘tea’ data on tea consumption behaviour were loaded from the FactoMineR package. The data consist of 300 observations of 36 variables.

library(FactoMineR)
data(tea)
dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

5.4.2 Visualization of ‘tea’ data

(Step 5b)

Error in the following chunk:

Warning: attributes are not identical across measure variables;

they will be dropped

WHY? The error occurred when knitting index.rmd; everything was fine when I knitted chapter5.rmd alone.

I had to comment the code out with #### so that all the following code would display without running.

# column names to keep in the dataset
#### keep_tea <- c("Tea", "How", "how", "price", "where", "lunch")
#### keep_tea

# select and create a new dataset
#### tea_time <- select(tea, one_of(keep_tea))

#### gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar(fill = "darkblue") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

5.4.3 Multiple Correspondence Analysis and its summary

(Step 5c)

#### mca_tea_time <- MCA(tea_time, graph = FALSE)
#### summary(mca_tea_time)

Visualize MCA

#### plot(mca_tea_time, invisible=c("ind"), habillage = "quali")

5.4.4 MCA biplot of tea consumption behaviour

(Step 5d)

The individuals were coloured by their squared cosine (cos2), a value between 0 and 1: the closer to 1, the better the individual is projected onto the displayed dimensions.

#### fviz_mca_biplot(mca_tea_time, col.ind = "cos2", col.var = "darkred", label = "var", geom =  c("point","text"), labelsize = 4, arrows = c(FALSE,TRUE)) + labs(title = "MCA of tea consumption behaviour") + theme_grey() + theme(axis.line = element_line(size = 0.5), panel.background = element_rect(fill = "gray93"))

5.4.5 MCA of reason for drinking tea

(Step 5e)

“sex”: Gender of tea consumers “age_Q”: Age group of tea consumers “spirituality”: Spiritual reason “healthy”: Health reason “slimming”: To lose weight or not “relaxing”: Relaxation or not

(Same error in this chunk)

#### keep_reason <- c("sex", "age_Q", "spirituality", "healthy", "slimming", "relaxing")
#### tea_reason <- select(tea, one_of(keep_reason))

#### gather(tea_reason) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar(fill = "darkblue", alpha = 0.6) + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

MCA of ‘tea_reason’ data

#### mca_tea_reason <- MCA(tea_reason, graph = FALSE)
#### summary(mca_tea_reason)

Call: MCA(X = tea_reason, graph = FALSE)

Eigenvalues

                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7   Dim.8   Dim.9
Variance               0.233   0.217   0.202   0.173   0.166   0.138   0.132   0.126   0.113
% of var.             15.562  14.476  13.456  11.502  11.066   9.187   8.833   8.393   7.523
Cumulative % of var.  15.562  30.039  43.495  54.997  66.063  75.251  84.083  92.477 100.000

Individuals (the 10 first)

        Dim.1    ctr   cos2     Dim.2    ctr   cos2     Dim.3    ctr   cos2
1  |    0.491  0.344  0.135 |   0.199  0.061  0.022 |   0.110  0.020  0.007 |
2  |    0.506  0.366  0.210 |   0.599  0.551  0.294 |  -0.564  0.526  0.261 |
3  |    0.039  0.002  0.001 |   0.353  0.191  0.119 |  -0.380  0.239  0.138 |
4  |   -0.602  0.518  0.306 |  -0.531  0.432  0.237 |   0.178  0.052  0.027 |
5  |    0.208  0.062  0.024 |  -0.446  0.305  0.112 |  -0.116  0.022  0.008 |
6  |   -0.270  0.104  0.081 |  -0.288  0.128  0.093 |   0.198  0.065  0.044 |
7  |    0.024  0.001  0.000 |  -0.047  0.003  0.001 |   0.294  0.143  0.054 |
8  |   -0.193  0.053  0.025 |   0.146  0.033  0.014 |  -0.211  0.074  0.030 |
9  |    0.308  0.135  0.049 |  -0.410  0.259  0.088 |   0.073  0.009  0.003 |
10 |    0.024  0.001  0.000 |  -0.047  0.003  0.001 |   0.294  0.143  0.054 |

Categories (the 10 first)

                      Dim.1     ctr   cos2  v.test    Dim.2     ctr   cos2  v.test    Dim.3     ctr   cos2   v.test
F                |   -0.256   2.767  0.095  -5.338 |  0.220   2.198  0.070   4.589 | -0.555  15.066  0.449  -11.582 |
M                |    0.373   4.037  0.095   5.338 | -0.321   3.208  0.070  -4.589 |  0.809  21.981  0.449   11.582 |
15-24            |   -0.834  15.237  0.308  -9.593 | -0.524   6.472  0.122  -6.030 | -0.522   6.893  0.120   -6.000 |
25-34            |    0.681   7.619  0.139   6.437 | -0.744   9.760  0.165  -7.027 |  0.983  18.367  0.289    9.294 |
35-44            |    0.018   0.003  0.000   0.121 |  0.149   0.229  0.003   1.014 | -0.261   0.750  0.010   -1.770 |
45-59            |    0.692   6.948  0.122   6.043 |  0.728   8.276  0.135   6.361 | -0.716   8.605  0.131   -6.254 |
+60              |   -0.346   1.086  0.017  -2.282 |  1.293  16.262  0.243   8.517 |  0.901   8.498  0.118    5.936 |
Not.spirituality |    0.302   4.458  0.199   7.719 |  0.212   2.374  0.099   5.433 |  0.017   0.016  0.001    0.432 |
spirituality     |   -0.661   9.770  0.199  -7.719 | -0.465   5.203  0.099  -5.433 | -0.037   0.035  0.001   -0.432 |
healthy          |   -0.247   3.050  0.142  -6.525 |  0.304   4.981  0.216   8.043 |  0.179   1.853  0.075    4.730 |

Categorical variables (eta2)

               Dim.1  Dim.2  Dim.3
sex          | 0.095  0.070  0.449 |
age_Q        | 0.433  0.534  0.522 |
spirituality | 0.199  0.099  0.001 |
healthy      | 0.142  0.216  0.075 |
slimming     | 0.100  0.272  0.107 |
relaxing     | 0.431  0.111  0.058 |

MCA biplot of ‘tea_reason’ data

#### fviz_mca_biplot(mca_tea_reason, col.ind = "cos2", col.var = "darkred", alpha.var = "contrib", label = "var", geom =  c("point","text"), labelsize = 3.5, arrows = c(FALSE,TRUE)) + labs(title = "MCA of tea consumption reasons") + theme_grey() + theme(axis.line = element_line(size = 0.5), panel.background = element_rect(fill = "gray93"))

Comment on the output

(Step 5f)

Variable categories close to each other show higher similarity. The plot uses two colour schemes: the contribution of the variables to the dimensions (“contrib”) and the cos2 of the individuals (the closer to 1, the better the projection on the dimension).

People aged 15-24 were close to the spirituality and relaxation reasons. Men aged 25-34 tended not to drink tea for health reasons or for losing weight. People aged 45-59 tended not to drink tea for relaxation or spiritual reasons. People aged 60+ and women were associated with drinking tea for health reasons and for weight loss.

Note: PCA and MCA biplots help to visualize results much better than in SPSS. Writing code in R is a bit more abstract, but it is worth learning.


Chapter 6 Analysis of longitudinal data

(3.12.-10.12.2019)

6.1 Loading the wide and long data sets

Load and check the data sets

# load necessary packages
library(tidyr)
library(dplyr)
library(corrplot)
library(ggplot2)
library(GGally)
library(knitr)
library(kableExtra)
library(ggthemes)
library(stringr)
library(lme4)

Load the wide (bprs & rats) and long (bprs_l & rats_l) data sets

bprs <- read.table("G:/C-Open Data Science/0-191030/IODS-project-master/BPRS.txt", sep = " ", header = TRUE)
rats <- read.table("G:/C-Open Data Science/0-191030/IODS-project-master/rats.txt", header = TRUE)
bprs_l <- read.table("G:/C-Open Data Science/0-191030/IODS-project-master/BPRS_l.txt")
rats_l <- read.table("G:/C-Open Data Science/0-191030/IODS-project-master/rats_l.txt")
str(bprs_l) 
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BPRS     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
str(rats_l)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...

Convert to factors: treatment & subject in the bprs data; ID & Group in the rats data

bprs_l$treatment <- factor(bprs_l$treatment)
bprs_l$subject <- factor(bprs_l$subject)
rats_l$ID <- factor(rats_l$ID)
rats_l$Group <- factor(rats_l$Group)
str(bprs_l)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BPRS     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
str(rats_l)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...

The BPRS dataset

A total of 40 male subjects in 2 treatment groups were measured with the BPRS (Brief Psychiatric Rating Scale). One BPRS measurement was taken before treatment started (week 0), and weekly measurements followed for 8 weeks. The BPRS assesses 18 symptom constructs to evaluate whether patients have schizophrenia (Vehkalahti & Everitt, 2019). Each symptom construct is rated from 1 (not present) to 7 (extremely severe).

The long BPRS data ‘bprs_l’

The data consisted of 360 observations of 5 variables:
"treatment": the psychological treatment group of the male subjects (1 & 2)
"subject": the male individual's identification number
"weeks": the week of treatment and BPRS evaluation (as a factor)
"BPRS": the BPRS value
"week": the week number of the BPRS evaluation (as an integer)
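As a quick sanity check of the long data, the weekly group means can be summarised with dplyr (a sketch using the bprs_l object loaded above; the column names follow the str() output):

```r
library(dplyr)

# Mean and standard deviation of BPRS per treatment group and week
# (a sketch; assumes the bprs_l data loaded above)
bprs_l %>%
  group_by(treatment, week) %>%
  summarise(mean_BPRS = mean(BPRS), sd_BPRS = sd(BPRS)) %>%
  ungroup()
```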

knitr::kable(bprs_l, caption = "The long BPRS data 'bprs_l'") %>% 
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center") %>% 
  scroll_box(height = "200px")
The long BPRS data ‘bprs_l’
treatment subject weeks BPRS week
1 1 week0 42 0
1 2 week0 58 0
1 3 week0 54 0
1 4 week0 55 0
1 5 week0 72 0
1 6 week0 48 0
1 7 week0 71 0
1 8 week0 30 0
1 9 week0 41 0
1 10 week0 57 0
1 11 week0 30 0
1 12 week0 55 0
1 13 week0 36 0
1 14 week0 38 0
1 15 week0 66 0
1 16 week0 41 0
1 17 week0 45 0
1 18 week0 39 0
1 19 week0 24 0
1 20 week0 38 0
2 1 week0 52 0
2 2 week0 30 0
2 3 week0 65 0
2 4 week0 37 0
2 5 week0 59 0
2 6 week0 30 0
2 7 week0 69 0
2 8 week0 62 0
2 9 week0 38 0
2 10 week0 65 0
2 11 week0 78 0
2 12 week0 38 0
2 13 week0 63 0
2 14 week0 40 0
2 15 week0 40 0
2 16 week0 54 0
2 17 week0 33 0
2 18 week0 28 0
2 19 week0 52 0
2 20 week0 47 0
1 1 week1 36 1
1 2 week1 68 1
1 3 week1 55 1
1 4 week1 77 1
1 5 week1 75 1
1 6 week1 43 1
1 7 week1 61 1
1 8 week1 36 1
1 9 week1 43 1
1 10 week1 51 1
1 11 week1 34 1
1 12 week1 52 1
1 13 week1 32 1
1 14 week1 35 1
1 15 week1 68 1
1 16 week1 35 1
1 17 week1 38 1
1 18 week1 35 1
1 19 week1 28 1
1 20 week1 34 1
2 1 week1 73 1
2 2 week1 23 1
2 3 week1 31 1
2 4 week1 31 1
2 5 week1 67 1
2 6 week1 33 1
2 7 week1 52 1
2 8 week1 54 1
2 9 week1 40 1
2 10 week1 44 1
2 11 week1 95 1
2 12 week1 41 1
2 13 week1 65 1
2 14 week1 37 1
2 15 week1 36 1
2 16 week1 45 1
2 17 week1 41 1
2 18 week1 30 1
2 19 week1 43 1
2 20 week1 36 1
1 1 week2 36 2
1 2 week2 61 2
1 3 week2 41 2
1 4 week2 49 2
1 5 week2 72 2
1 6 week2 41 2
1 7 week2 47 2
1 8 week2 38 2
1 9 week2 39 2
1 10 week2 51 2
1 11 week2 34 2
1 12 week2 49 2
1 13 week2 36 2
1 14 week2 36 2
1 15 week2 65 2
1 16 week2 45 2
1 17 week2 46 2
1 18 week2 27 2
1 19 week2 31 2
1 20 week2 27 2
2 1 week2 42 2
2 2 week2 32 2
2 3 week2 33 2
2 4 week2 27 2
2 5 week2 58 2
2 6 week2 37 2
2 7 week2 41 2
2 8 week2 49 2
2 9 week2 38 2
2 10 week2 31 2
2 11 week2 75 2
2 12 week2 36 2
2 13 week2 60 2
2 14 week2 31 2
2 15 week2 55 2
2 16 week2 35 2
2 17 week2 30 2
2 18 week2 29 2
2 19 week2 26 2
2 20 week2 32 2
1 1 week3 43 3
1 2 week3 55 3
1 3 week3 38 3
1 4 week3 54 3
1 5 week3 65 3
1 6 week3 38 3
1 7 week3 30 3
1 8 week3 38 3
1 9 week3 35 3
1 10 week3 55 3
1 11 week3 41 3
1 12 week3 54 3
1 13 week3 31 3
1 14 week3 34 3
1 15 week3 49 3
1 16 week3 42 3
1 17 week3 38 3
1 18 week3 25 3
1 19 week3 28 3
1 20 week3 25 3
2 1 week3 41 3
2 2 week3 24 3
2 3 week3 28 3
2 4 week3 31 3
2 5 week3 61 3
2 6 week3 33 3
2 7 week3 33 3
2 8 week3 39 3
2 9 week3 27 3
2 10 week3 34 3
2 11 week3 76 3
2 12 week3 27 3
2 13 week3 53 3
2 14 week3 38 3
2 15 week3 55 3
2 16 week3 27 3
2 17 week3 32 3
2 18 week3 33 3
2 19 week3 27 3
2 20 week3 29 3
1 1 week4 41 4
1 2 week4 43 4
1 3 week4 43 4
1 4 week4 56 4
1 5 week4 50 4
1 6 week4 36 4
1 7 week4 27 4
1 8 week4 31 4
1 9 week4 28 4
1 10 week4 53 4
1 11 week4 36 4
1 12 week4 48 4
1 13 week4 25 4
1 14 week4 25 4
1 15 week4 36 4
1 16 week4 31 4
1 17 week4 40 4
1 18 week4 29 4
1 19 week4 29 4
1 20 week4 25 4
2 1 week4 39 4
2 2 week4 20 4
2 3 week4 22 4
2 4 week4 31 4
2 5 week4 49 4
2 6 week4 28 4
2 7 week4 34 4
2 8 week4 55 4
2 9 week4 31 4
2 10 week4 39 4
2 11 week4 66 4
2 12 week4 29 4
2 13 week4 52 4
2 14 week4 35 4
2 15 week4 42 4
2 16 week4 25 4
2 17 week4 46 4
2 18 week4 30 4
2 19 week4 24 4
2 20 week4 25 4
1 1 week5 40 5
1 2 week5 34 5
1 3 week5 28 5
1 4 week5 50 5
1 5 week5 39 5
1 6 week5 29 5
1 7 week5 40 5
1 8 week5 26 5
1 9 week5 22 5
1 10 week5 43 5
1 11 week5 36 5
1 12 week5 43 5
1 13 week5 25 5
1 14 week5 27 5
1 15 week5 32 5
1 16 week5 31 5
1 17 week5 33 5
1 18 week5 28 5
1 19 week5 21 5
1 20 week5 27 5
2 1 week5 38 5
2 2 week5 20 5
2 3 week5 25 5
2 4 week5 26 5
2 5 week5 38 5
2 6 week5 26 5
2 7 week5 37 5
2 8 week5 51 5
2 9 week5 24 5
2 10 week5 34 5
2 11 week5 64 5
2 12 week5 27 5
2 13 week5 32 5
2 14 week5 30 5
2 15 week5 30 5
2 16 week5 22 5
2 17 week5 43 5
2 18 week5 26 5
2 19 week5 32 5
2 20 week5 23 5
1 1 week6 38 6
1 2 week6 28 6
1 3 week6 29 6
1 4 week6 47 6
1 5 week6 32 6
1 6 week6 33 6
1 7 week6 30 6
1 8 week6 26 6
1 9 week6 20 6
1 10 week6 43 6
1 11 week6 38 6
1 12 week6 37 6
1 13 week6 21 6
1 14 week6 25 6
1 15 week6 27 6
1 16 week6 29 6
1 17 week6 27 6
1 18 week6 21 6
1 19 week6 22 6
1 20 week6 21 6
2 1 week6 43 6
2 2 week6 19 6
2 3 week6 24 6
2 4 week6 24 6
2 5 week6 37 6
2 6 week6 27 6
2 7 week6 37 6
2 8 week6 55 6
2 9 week6 22 6
2 10 week6 41 6
2 11 week6 64 6
2 12 week6 21 6
2 13 week6 37 6
2 14 week6 33 6
2 15 week6 26 6
2 16 week6 22 6
2 17 week6 43 6
2 18 week6 36 6
2 19 week6 21 6
2 20 week6 23 6
1 1 week7 47 7
1 2 week7 28 7
1 3 week7 25 7
1 4 week7 42 7
1 5 week7 38 7
1 6 week7 27 7
1 7 week7 31 7
1 8 week7 25 7
1 9 week7 23 7
1 10 week7 39 7
1 11 week7 36 7
1 12 week7 36 7
1 13 week7 19 7
1 14 week7 26 7
1 15 week7 30 7
1 16 week7 26 7
1 17 week7 31 7
1 18 week7 25 7
1 19 week7 23 7
1 20 week7 19 7
2 1 week7 62 7
2 2 week7 18 7
2 3 week7 31 7
2 4 week7 26 7
2 5 week7 36 7
2 6 week7 23 7
2 7 week7 38 7
2 8 week7 59 7
2 9 week7 21 7
2 10 week7 42 7
2 11 week7 60 7
2 12 week7 22 7
2 13 week7 52 7
2 14 week7 30 7
2 15 week7 30 7
2 16 week7 22 7
2 17 week7 43 7
2 18 week7 33 7
2 19 week7 21 7
2 20 week7 23 7
1 1 week8 51 8
1 2 week8 28 8
1 3 week8 24 8
1 4 week8 46 8
1 5 week8 32 8
1 6 week8 25 8
1 7 week8 31 8
1 8 week8 24 8
1 9 week8 21 8
1 10 week8 32 8
1 11 week8 36 8
1 12 week8 31 8
1 13 week8 22 8
1 14 week8 26 8
1 15 week8 37 8
1 16 week8 30 8
1 17 week8 27 8
1 18 week8 20 8
1 19 week8 22 8
1 20 week8 21 8
2 1 week8 50 8
2 2 week8 20 8
2 3 week8 32 8
2 4 week8 23 8
2 5 week8 35 8
2 6 week8 21 8
2 7 week8 35 8
2 8 week8 66 8
2 9 week8 21 8
2 10 week8 39 8
2 11 week8 75 8
2 12 week8 23 8
2 13 week8 28 8
2 14 week8 27 8
2 15 week8 37 8
2 16 week8 22 8
2 17 week8 43 8
2 18 week8 30 8
2 19 week8 21 8
2 20 week8 23 8

The RATS dataset

The data consisted of measurements of rat growth. Three groups of rats received different diets, and the body weight of every animal was measured over a period of 9 weeks. The research question is whether the growth profiles of the different diet groups differ (Vehkalahti & Everitt, 2019).

The long RATS data 'rats_l'

The data consisted of 176 observations of 5 variables:

"ID": the individual rat's identification number
"Group": the diet group (1, 2 & 3)
"WD": the factor form of the measurement day
"Weight": the weight of the individual rat on the given day
"Time": the time point of the weight measurement (days)
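The group sizes and starting weights can be checked with a short dplyr summary (a sketch using the rats_l object loaded above):

```r
library(dplyr)

# Number of rats and mean starting weight per diet group
# (a sketch; assumes the rats_l data loaded above)
rats_l %>%
  filter(Time == 1) %>%
  group_by(Group) %>%
  summarise(n = n(), start_weight = mean(Weight))
```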

# rats_ data set
knitr::kable(rats_l, caption = "The long RATS data 'rats_l'") %>% 
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center") %>% 
  scroll_box(height = "200px")
The long RATS data ‘rats_l’
ID Group WD Weight Time
1 1 WD1 240 1
2 1 WD1 225 1
3 1 WD1 245 1
4 1 WD1 260 1
5 1 WD1 255 1
6 1 WD1 260 1
7 1 WD1 275 1
8 1 WD1 245 1
9 2 WD1 410 1
10 2 WD1 405 1
11 2 WD1 445 1
12 2 WD1 555 1
13 3 WD1 470 1
14 3 WD1 535 1
15 3 WD1 520 1
16 3 WD1 510 1
1 1 WD8 250 8
2 1 WD8 230 8
3 1 WD8 250 8
4 1 WD8 255 8
5 1 WD8 260 8
6 1 WD8 265 8
7 1 WD8 275 8
8 1 WD8 255 8
9 2 WD8 415 8
10 2 WD8 420 8
11 2 WD8 445 8
12 2 WD8 560 8
13 3 WD8 465 8
14 3 WD8 525 8
15 3 WD8 525 8
16 3 WD8 510 8
1 1 WD15 255 15
2 1 WD15 230 15
3 1 WD15 250 15
4 1 WD15 255 15
5 1 WD15 255 15
6 1 WD15 270 15
7 1 WD15 260 15
8 1 WD15 260 15
9 2 WD15 425 15
10 2 WD15 430 15
11 2 WD15 450 15
12 2 WD15 565 15
13 3 WD15 475 15
14 3 WD15 530 15
15 3 WD15 530 15
16 3 WD15 520 15
1 1 WD22 260 22
2 1 WD22 232 22
3 1 WD22 255 22
4 1 WD22 265 22
5 1 WD22 270 22
6 1 WD22 275 22
7 1 WD22 270 22
8 1 WD22 268 22
9 2 WD22 428 22
10 2 WD22 440 22
11 2 WD22 452 22
12 2 WD22 580 22
13 3 WD22 485 22
14 3 WD22 533 22
15 3 WD22 540 22
16 3 WD22 515 22
1 1 WD29 262 29
2 1 WD29 240 29
3 1 WD29 262 29
4 1 WD29 265 29
5 1 WD29 270 29
6 1 WD29 275 29
7 1 WD29 273 29
8 1 WD29 270 29
9 2 WD29 438 29
10 2 WD29 448 29
11 2 WD29 455 29
12 2 WD29 590 29
13 3 WD29 487 29
14 3 WD29 535 29
15 3 WD29 543 29
16 3 WD29 530 29
1 1 WD36 258 36
2 1 WD36 240 36
3 1 WD36 265 36
4 1 WD36 268 36
5 1 WD36 273 36
6 1 WD36 277 36
7 1 WD36 274 36
8 1 WD36 265 36
9 2 WD36 443 36
10 2 WD36 460 36
11 2 WD36 455 36
12 2 WD36 597 36
13 3 WD36 493 36
14 3 WD36 540 36
15 3 WD36 546 36
16 3 WD36 538 36
1 1 WD43 266 43
2 1 WD43 243 43
3 1 WD43 267 43
4 1 WD43 270 43
5 1 WD43 274 43
6 1 WD43 278 43
7 1 WD43 276 43
8 1 WD43 265 43
9 2 WD43 442 43
10 2 WD43 458 43
11 2 WD43 451 43
12 2 WD43 595 43
13 3 WD43 493 43
14 3 WD43 525 43
15 3 WD43 538 43
16 3 WD43 535 43
1 1 WD44 266 44
2 1 WD44 244 44
3 1 WD44 267 44
4 1 WD44 272 44
5 1 WD44 273 44
6 1 WD44 278 44
7 1 WD44 271 44
8 1 WD44 267 44
9 2 WD44 446 44
10 2 WD44 464 44
11 2 WD44 450 44
12 2 WD44 595 44
13 3 WD44 504 44
14 3 WD44 530 44
15 3 WD44 544 44
16 3 WD44 542 44
1 1 WD50 265 50
2 1 WD50 238 50
3 1 WD50 264 50
4 1 WD50 274 50
5 1 WD50 276 50
6 1 WD50 284 50
7 1 WD50 282 50
8 1 WD50 273 50
9 2 WD50 456 50
10 2 WD50 475 50
11 2 WD50 462 50
12 2 WD50 612 50
13 3 WD50 507 50
14 3 WD50 543 50
15 3 WD50 553 50
16 3 WD50 550 50
1 1 WD57 272 57
2 1 WD57 247 57
3 1 WD57 268 57
4 1 WD57 273 57
5 1 WD57 278 57
6 1 WD57 279 57
7 1 WD57 281 57
8 1 WD57 274 57
9 2 WD57 468 57
10 2 WD57 484 57
11 2 WD57 466 57
12 2 WD57 618 57
13 3 WD57 518 57
14 3 WD57 544 57
15 3 WD57 555 57
16 3 WD57 553 57
1 1 WD64 278 64
2 1 WD64 245 64
3 1 WD64 269 64
4 1 WD64 275 64
5 1 WD64 280 64
6 1 WD64 281 64
7 1 WD64 284 64
8 1 WD64 278 64
9 2 WD64 478 64
10 2 WD64 496 64
11 2 WD64 472 64
12 2 WD64 628 64
13 3 WD64 525 64
14 3 WD64 559 64
15 3 WD64 548 64
16 3 WD64 569 64

Part 1 Analyses of chapter 8 of MABS using the RATS data (rats_l)

6.2-6.5

6.2 Standardizing the dataset and exploring the data graphically

6.2.1 Exploring the data graphically

ggplot(rats_l, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line(color = "darkblue") +
  scale_linetype_manual(values = rep(1:6, times = 3)) + 
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "right") +
  scale_y_continuous(limits = c(min(rats_l$Weight), max(rats_l$Weight))) + 
  ggtitle("The long RATS data 'rats_l'")

The starting weights of the rats in groups 2 & 3 were higher than in group 1, and their weights also increased at a fairly high rate. The weights in group 1 increased at a smaller rate.

6.2.2 Standardizing the dataset

Calculate StdWeight and add it as a new variable to 'rats_l'

rats_l <- rats_l %>%
  group_by(Time) %>%
  mutate(StdWeight = ((Weight - mean(Weight))/sd(Weight))) %>%
  ungroup()
str(rats_l)
## Classes 'tbl_df', 'tbl' and 'data.frame':    176 obs. of  6 variables:
##  $ ID       : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group    : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD       : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight   : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ StdWeight: num  -1.001 -1.12 -0.961 -0.842 -0.882 ...

6.2.3 Plotting the standardised data

ggplot(rats_l, aes(x = Time, y = StdWeight, linetype = ID)) +
  geom_line(color = "darkred") +
  scale_linetype_manual(values = rep(1:6, times=3)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "StdWeight: standardized Weight values") +
  ggtitle("Standardized 'rats_l' Weight values") + 
  theme(panel.grid.major.y = element_line(colour = "darkred"))

6.3 Profiles

6.3.1 Adding categorical variable

Add the Time as a categorical variable “Time1”

rats_l <- rats_l %>%
  mutate(Time1 = factor(Time)) %>%
  ungroup()
str(rats_l)
str(rats_l)
## Classes 'tbl_df', 'tbl' and 'data.frame':    176 obs. of  7 variables:
##  $ ID       : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group    : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD       : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight   : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ StdWeight: num  -1.001 -1.12 -0.961 -0.842 -0.882 ...
##  $ Time1    : Factor w/ 11 levels "1","8","15","22",..: 1 1 1 1 1 1 1 1 1 1 ...

6.3.2 Boxplot

Boxplots give an overview of the rats' weights over the whole measurement period, with the diet groups in different colours.

ggplot(data = rats_l, aes(x = Time1, y = Weight, fill = Group)) +
  geom_boxplot() +
  ylab("Weight") + 
  xlab("Measurement time [days]") +
  ggtitle("Rat weights") +
  scale_fill_discrete(name = "Diet group") +
  theme(legend.position = "right") 

Calculate the mean weight and its standard error for each diet group at each time point, and store them in a summary data table

n_rats <- rats_l$Time %>% unique() %>% length()
rats_s <- rats_l %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = (sd(Weight)/sqrt(n_rats))) %>%
  ungroup()
str(rats_s)
## Classes 'tbl_df', 'tbl' and 'data.frame':    33 obs. of  4 variables:
##  $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Time : int  1 8 15 22 29 36 43 44 50 57 ...
##  $ mean : num  251 255 254 262 265 ...
##  $ se   : num  4.59 3.95 3.46 4.1 3.33 ...

6.3.3 Plotting the mean profiles

ggplot(rats_s, aes(x = Time, y = mean, linetype = Group,  color = Group, shape = Group)) +
  geom_line(size = 0.6) +
  scale_linetype_manual(name = "Diet group", values = c(1,2,3)) +
  geom_point(size=1.5) +
  scale_shape_manual(name = "Diet group", values = c(16,17,18)) +
  scale_color_manual(name = "Diet group", values = c("darkblue", "darkred", "darkgreen")) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.8) +
  theme(legend.position = "right") +
  scale_y_continuous(name = "Mean (Weight) +/- SE (Weight)") + 
  ggtitle("Mean weight profiles of different diet groups")

This graph showed the weight development of the different diet groups during the measurement period. Group 1 started at a lower level and its weight increased only slightly. Groups 2 and 3 started from higher values and increased more steeply. The standard errors in groups 2 & 3 were larger, indicating that the weight differences between individual rats were bigger within those diet groups.

6.4 Finding the outlier

6.4.1 Creating a summary variable: the mean weight of each rat

Create a summary data by Group and ID with mean as the summary variable

rats_l10s <- rats_l %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise(mean=mean(Weight)) %>%
  ungroup()
str(rats_l10s)
## Classes 'tbl_df', 'tbl' and 'data.frame':    16 obs. of  3 variables:
##  $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ ID   : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ mean : num  263 239 262 267 271 ...

6.4.2 Boxplot of mean weights of different diet groups excluding the first measurement

ggplot(rats_l10s, aes(x = Group, y = mean, fill = Group)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape = 23, size = 3, fill = "black") +
  scale_y_continuous(name = "Mean (Weight) / Time 8-64 [days]") + 
  xlab("Diet group") + 
  scale_fill_discrete(name = "Diet group") +
  ggtitle("Mean weights of diet groups excluding the first measurement")

The boxplots revealed outliers, which could distort further comparisons between the diet groups.
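The same outliers can also be located programmatically; a sketch using boxplot.stats() on the summary data rats_l10s created above:

```r
library(dplyr)

# Flag, per diet group, the mean weights that fall outside the boxplot
# whiskers (a sketch; assumes the summary data rats_l10s from above)
rats_l10s %>%
  group_by(Group) %>%
  mutate(outlier = mean %in% boxplot.stats(mean)$out) %>%
  filter(outlier)
```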

6.4.3 Removing outlier

rats_l10s_o <- rats_l10s %>%
    filter(
      (mean > 250 & Group == 1) | 
      (mean < 550 & Group == 2) |
      (mean > 500 & Group == 3))  %>% 
      ungroup()
str(rats_l10s_o)
## Classes 'tbl_df', 'tbl' and 'data.frame':    13 obs. of  3 variables:
##  $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 2 2 2 ...
##  $ ID   : Factor w/ 16 levels "1","2","3","4",..: 1 3 4 5 6 7 8 9 10 11 ...
##  $ mean : num  263 262 267 271 276 ...

6.4.4 Boxplot of the mean values without the outliers

ggplot(rats_l10s_o, aes(x = Group, y = mean, fill = Group)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape = 23, size = 3, fill = "black") +
  scale_y_continuous(name = "Mean (Weight) / Time 8-64 [days]") +
  xlab("Diet group") + 
  scale_fill_discrete(name = "Diet group") +
  ggtitle("Mean weights of diet groups without the outlier")

This time there were no outliers in the boxplots of the mean weights of the diet groups over days 8 to 64.

6.5 Linear regression model and ANOVA

Note: I have used ANOVA more often than the t-test, so I skipped the t-test here.

Create a new data table rats_l10s_b, adding the first weight measurement on day 1 (WD1) to the summary data as the baseline variable.

rats_l10s_b <- rats_l10s %>%
  mutate(baseline = rats$WD1)
rats_l10s_b %>% knitr::kable() %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center") %>% scroll_box(height = "200px")
Group ID mean baseline
1 1 263.2 240
1 2 238.9 225
1 3 261.7 245
1 4 267.2 260
1 5 270.9 255
1 6 276.2 260
1 7 274.6 275
1 8 267.5 245
2 9 443.9 410
2 10 457.5 405
2 11 455.8 445
2 12 594.0 555
3 13 495.2 470
3 14 536.4 535
3 15 542.2 520
3 16 536.2 510

To find which diet group had the highest weight increase during the measurement period, ANOVA tests were run on this new data 'rats_l10s_b' using the mean weights and the baseline weights.

6.5.1 Linear regression model 1 (mean weight ~ diet groups)

rats_lrm1 <- lm(mean ~ Group, data = rats_l10s_b)
round(rats_lrm1$coefficients, digits = 2) %>% knitr::kable() %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
x
(Intercept) 265.02
Group2 222.77
Group3 262.48

The model showed that the mean weight of group 2 was about 223 g higher and that of group 3 about 262 g higher than that of group 1.

To find out whether the diet groups differed from each other, the group means were examined with a post-hoc test after the ANOVA.

6.5.2 ANOVA of Linear regression model 1 (mean weight ~ diet groups)

rats_anova1 <- anova(rats_lrm1)
rats_anova1
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Group      2 238620  119310  88.525 2.679e-08 ***
## Residuals 13  17521    1348                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA of the linear model (mean weight ~ diet group) showed that the diet groups differed significantly in mean weight (p < 0.001). It does not by itself tell which groups differ from each other.

6.5.3 ANOVA with a post-hoc test

Perform an ANOVA and a post-hoc test (TukeyHSD) to find out which diet group means differed from each other. The dataset rats_l10s_b excluded the first measurement (the start value) but still included the outlier values.

rats_anova2 <- aov(mean ~ Group, data = rats_l10s_b)
summary(rats_anova2)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Group        2 238620  119310   88.53 2.68e-08 ***
## Residuals   13  17521    1348                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(rats_anova2, "Group", ordered = TRUE, conf.level = 0.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##     factor levels have been ordered
## 
## Fit: aov(formula = mean ~ Group, data = rats_l10s_b)
## 
## $Group
##        diff       lwr      upr     p adj
## 2-1 222.775 163.41452 282.1355 0.0000006
## 3-1 262.475 203.11452 321.8355 0.0000001
## 3-2  39.700 -28.84358 108.2436 0.3099506

The post-hoc test gave the pairwise differences between the diet groups. Groups 1 & 2 differed significantly in mean weight, as did groups 1 & 3. Groups 2 & 3 did not differ significantly in mean weight, a result influenced by the outliers.
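To see how much the outliers drive the group 2 vs 3 comparison, the same test can be repeated on the outlier-free summary data rats_l10s_o created in section 6.4.3 (a sketch):

```r
# Repeat the ANOVA and the Tukey post-hoc test on the data without
# the outliers (a sketch; assumes rats_l10s_o from section 6.4.3)
rats_anova_o <- aov(mean ~ Group, data = rats_l10s_o)
TukeyHSD(rats_anova_o, "Group", conf.level = 0.95)
```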

6.5.4 Linear regression model 2 (mean ~ baseline + Group)

The difference of each group to the baselines:

The linear model with the mean as the response

rats_lrm2 <- lm(mean ~ baseline + Group, data = rats_l10s_b)
round(rats_lrm2$coefficients, digits = 2) %>% knitr::kable() %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
x
(Intercept) 33.16
baseline 0.93
Group2 34.86
Group3 23.68

This regression model compared the mean weights while adjusting for the baseline (the first weight measurement of each rat). After adjusting for the baseline, the mean of group 2 was about 35 g higher and that of group 3 about 24 g higher than that of group 1. Group 2 therefore had the highest weight increase during the measurement period.

6.5.5 ANOVA of Linear regression model 2 (mean ~ baseline + Group)

rats_anova3 <- anova(rats_lrm2)
rats_anova3
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The baseline had a highly significant effect on the mean weight. After adjusting for the baseline, the difference between the diet groups was no longer significant at the 5% level (p ≈ 0.076), though the coefficients suggest that diet 2 may lead to the highest weight increase.

Part 2 Analyses of chapter 9 of MABS using the BPRS data (bprs_l)

6.6-6.8

(10.12.2019)

Overview plot

str(bprs_l)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BPRS     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
pairs(bprs_l, col = bprs_l$subject)

Plotting the bprs_l

ggplot(bprs_l, aes(x = week, y = BPRS, shape = subject, group = treatment)) +
  geom_point(color = "darkblue") + 
  scale_shape_manual(values = rep(1:10, times = 2)) +
  scale_x_continuous(name = "Weeks", breaks = seq(0,8,1)) + 
  scale_y_continuous(name = "BPRS value") + 
  theme(legend.position = "bottom") + 
  ggtitle("Overview of BPRS data") +
  theme(legend.box.background = element_rect(),legend.box.margin = margin(2, 2, 2, 2))

ggplot(bprs_l, aes(x = week, y = BPRS, linetype = subject)) +
  geom_line(color = "darkblue") +
  scale_linetype_manual(values = rep(1:6, times = 4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name = "Weeks", breaks = seq(0,8,1)) + 
  scale_y_continuous(name = "BPRS values observed", breaks = seq(10,100,5)) + 
  theme(legend.position = "bottom") +
  ggtitle("Overview of BPRS data by treatment") +
  theme(legend.box.background = element_rect(),legend.box.margin = margin(2, 2, 2, 2))

ggplot(bprs_l, aes(x = week, y = BPRS)) +
  geom_line(aes(linetype = treatment), color = "darkblue") +
  scale_linetype_manual(values = rep(1:2, times = 1)) +
  facet_grid(.~ subject) +
  scale_x_continuous(name = "Weeks", breaks = seq(0,8,4)) + 
  scale_y_continuous(name = "BPRS values observed", breaks = seq(10,100,5)) + 
  theme(legend.position = "right") +
  ggtitle("Overview of BPRS data by subject") +
  theme(legend.box.background = element_rect(),legend.box.margin = margin(2, 2, 2, 2))

There were 40 subjects in 2 treatment groups (20 in each group). After the first BPRS determination (week 0), the treatment started and 8 more BPRS determinations followed.

Linear model 1 (BPRS ~ week + treatment)

In the first linear regression model, the BPRS value was the dependent variable, and week and treatment were the independent (explanatory) variables.

bprs_l_lrm1 <- lm(BPRS ~ week + treatment, data = bprs_l)
summary(bprs_l_lrm1)
## 
## Call:
## lm(formula = BPRS ~ week + treatment, data = bprs_l)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

The measurement week had a significant effect on the BPRS values: each additional week decreased the BPRS value by about 2.27 units. The coefficient of treatment 2 was positive (0.57, i.e. slightly higher BPRS values than under treatment 1) but far from significant (p = 0.661).
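Confidence intervals for the coefficients make the (non-)significance of treatment easier to read (a sketch using the model object fitted above):

```r
# 95% confidence intervals for the coefficients of the naive linear
# model (a sketch; assumes bprs_l_lrm1 from above)
round(confint(bprs_l_lrm1), 2)
```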

To explore whether treatments helped patients and whether the subjects (patients) affected the model, the subjects were added as a random effect term to a random intercept model.

6.6 Random intercept model (BPRS ~ week + treatment + (1 | subject))

In the random intercept model, treatment and the measurement week were the explanatory variables, and the subjects (the 40 men who underwent the treatment and measurements) formed the random-effects term. This model lets the fitted regression line of each subject differ in its intercept.

bprs_l_rim <- lmer(BPRS ~ week + treatment + (1 | subject), data = bprs_l, REML = FALSE)
summary(bprs_l_rim)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: BPRS ~ week + treatment + (1 | subject)
##    Data: bprs_l
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

According to the summary, the standard error of the measurement week in the random intercept model (0.2084) was smaller than in the linear regression model (0.2524). Similarly, the standard error of treatment 2 in the random intercept model (1.0761) was smaller than in the linear regression model (1.3034). This reflects that the within-subject dependence of the observations is now taken into account.

Generally, a mixed model accounts for the correlation between repeated observations on the same subject, so it can be more appropriate than an ordinary linear regression model. Note that lmer() does not report p-values for the fixed effects; their significance can be judged approximately from the t-values.
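One rough option is to treat the t-values as approximately standard normal, which is reasonable here given the large number of residual degrees of freedom (an approximation, not part of the course material):

```r
# Approximate two-sided p-values from the fixed-effect t-values,
# treating them as standard normal (a rough sketch)
t_vals <- c(week = -10.896, treatment2 = 0.532)  # from summary(bprs_l_rim)
round(2 * pnorm(-abs(t_vals)), 4)
```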

6.7 Random Intercept and Random Slope Model (BPRS ~ week + treatment + (week | subject))

A random intercept and random slope model was then fitted to the BPRS values. It allows the fitted regression line of each patient ('subject') to differ both in intercept and in slope over the weeks, so between-subject differences in the BPRS trajectories are taken into account. In this model, (week | subject) specifies a random intercept and a random slope for week within each subject.

In effect, a separate regression line (its own intercept and slope) is estimated for each subject.

bprs_l_rim1 <- lmer(BPRS ~ week + treatment + (week | subject), data = bprs_l, REML = FALSE)
summary(bprs_l_rim1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: BPRS ~ week + treatment + (week | subject)
##    Data: bprs_l
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7977 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9803   -0.51
##  Residual             97.4304  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

The fixed effects of this model differed little from those of the model above. An ANOVA (likelihood-ratio test) was run to compare the two models.

anova(bprs_l_rim1, bprs_l_rim)
## Data: bprs_l
## Models:
## bprs_l_rim: BPRS ~ week + treatment + (1 | subject)
## bprs_l_rim1: BPRS ~ week + treatment + (week | subject)
##             Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## bprs_l_rim   5 2748.7 2768.1 -1369.4   2738.7                           
## bprs_l_rim1  7 2745.4 2772.6 -1365.7   2731.4 7.2721      2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA showed that the random intercept and random slope model (allowing the slope over weeks to vary by subject) gave a significantly better fit to the data (χ² = 7.27, p = 0.026).
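The subject-specific lines implied by the random intercept and random slope model can also be inspected directly. A minimal sketch, assuming `lme4` is loaded and the fitted object `bprs_l_rim1` from above is available:

```r
# coef() combines the fixed effects with each subject's random deviations,
# giving the subject-specific intercepts and week slopes implied by the model.
coef(bprs_l_rim1)$subject   # one row per subject: (Intercept), week, treatment2

# ranef() returns the random deviations alone, centered near zero.
ranef(bprs_l_rim1)$subject
```

The spread of the per-subject `week` slopes shows why allowing the slope to vary improved the fit.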

6.8 Random intercept and random slope model with interaction (BPRS ~ week * treatment + (week | subject))

In the following random intercept and random slope model, an interaction of treatment and time (week) was included to examine whether the effect of treatment on the BPRS values changed over the treatment period.

bprs_l_rim2 <- lmer(BPRS ~ week * treatment + (week | subject), data = bprs_l, REML = FALSE)
summary(bprs_l_rim2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: BPRS ~ week * treatment + (week | subject)
##    Data: bprs_l
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
anova(bprs_l_rim2, bprs_l_rim1)
## Data: bprs_l
## Models:
## bprs_l_rim1: BPRS ~ week + treatment + (week | subject)
## bprs_l_rim2: BPRS ~ week * treatment + (week | subject)
##             Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## bprs_l_rim1  7 2745.4 2772.6 -1365.7   2731.4                           
## bprs_l_rim2  8 2744.3 2775.4 -1364.1   2728.3 3.1712      1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA shows that adding the interaction ‘week * treatment’ improved the fit only marginally (p = 0.075): significant at the 0.1 level but not at the conventional 0.05 level, so the evidence for an interaction between treatment and time is weak.
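The ANOVA's chi-square statistic is simply the drop in deviance between the nested models. A sketch reproducing the p-value from the (rounded) deviances printed above:

```r
# deviance(simpler model) - deviance(richer model)
chisq <- 2731.4 - 2728.3   # = 3.1 (the ANOVA reports 3.1712 before rounding)

# One extra parameter (the interaction term), so df = 1
pchisq(chisq, df = 1, lower.tail = FALSE)
# approximately 0.078, matching the ANOVA's 0.075 up to rounding
```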

Check the original BPRS data again in a graph

ggplot(bprs_l, aes(x = week, y = BPRS, linetype = subject)) +
  geom_line(color = "darkblue") +
  scale_linetype_manual(values = rep(1:6, times = 4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name = "Weeks", breaks = seq(0,8,1)) + 
  scale_y_continuous(name = "BPRS values observed", breaks = seq(10,100,5)) + 
  theme(legend.position = "bottom") + 
  ggtitle("BPRS: original values by treatment") +
  theme(legend.box.background = element_rect(),legend.box.margin = margin(2, 2, 2, 2))

Add the fitted BPRS values produced by the random intercept and random slope model with the interaction ‘week * treatment’ to the dataset

bprs_l_fit <- fitted(bprs_l_rim2)
bprs_l$fitted_bprs <- round(bprs_l_fit, digits = 2)
bprs_l %>% knitr::kable() %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center") %>% scroll_box(height = "300px")
treatment subject weeks BPRS week fitted_bprs
1 1 week0 42 0 49.24
1 2 week0 58 0 46.97
1 3 week0 54 0 47.66
1 4 week0 55 0 49.85
1 5 week0 72 0 66.39
1 6 week0 48 0 42.59
1 7 week0 71 0 54.58
1 8 week0 30 0 47.93
1 9 week0 41 0 42.08
1 10 week0 57 0 53.35
1 11 week0 30 0 59.97
1 12 week0 55 0 48.42
1 13 week0 36 0 50.36
1 14 week0 38 0 40.93
1 15 week0 66 0 55.07
1 16 week0 41 0 44.43
1 17 week0 45 0 43.93
1 18 week0 39 0 36.82
1 19 week0 24 0 38.14
1 20 week0 38 0 39.01
2 1 week0 52 0 46.95
2 2 week0 30 0 44.68
2 3 week0 65 0 45.36
2 4 week0 37 0 47.56
2 5 week0 59 0 64.10
2 6 week0 30 0 40.30
2 7 week0 69 0 52.29
2 8 week0 62 0 45.64
2 9 week0 38 0 39.79
2 10 week0 65 0 51.05
2 11 week0 78 0 57.68
2 12 week0 38 0 46.13
2 13 week0 63 0 48.07
2 14 week0 40 0 38.64
2 15 week0 40 0 52.78
2 16 week0 54 0 42.14
2 17 week0 33 0 41.63
2 18 week0 28 0 34.53
2 19 week0 52 0 35.85
2 20 week0 47 0 36.71
1 1 week1 36 1 47.97
1 2 week1 68 1 43.72
1 3 week1 55 1 44.47
1 4 week1 77 1 47.41
1 5 week1 75 1 62.11
1 6 week1 43 1 40.04
1 7 week1 61 1 51.11
1 8 week1 36 1 46.36
1 9 week1 43 1 39.11
1 10 week1 51 1 50.79
1 11 week1 34 1 58.01
1 12 week1 52 1 45.55
1 13 week1 32 1 47.26
1 14 week1 35 1 38.73
1 15 week1 68 1 51.69
1 16 week1 35 1 41.50
1 17 week1 38 1 42.30
1 18 week1 35 1 35.03
1 19 week1 28 1 35.66
1 20 week1 34 1 36.35
2 1 week1 73 1 46.39
2 2 week1 23 1 42.14
2 3 week1 31 1 42.89
2 4 week1 31 1 45.83
2 5 week1 67 1 60.53
2 6 week1 33 1 38.47
2 7 week1 52 1 49.53
2 8 week1 54 1 44.79
2 9 week1 40 1 37.53
2 10 week1 44 1 49.21
2 11 week1 95 1 56.44
2 12 week1 41 1 43.97
2 13 week1 65 1 45.68
2 14 week1 37 1 37.16
2 15 week1 36 1 50.11
2 16 week1 45 1 39.92
2 17 week1 41 1 40.72
2 18 week1 30 1 33.45
2 19 week1 43 1 34.09
2 20 week1 36 1 34.78
1 1 week2 36 2 46.69
1 2 week2 61 2 40.46
1 3 week2 41 2 41.28
1 4 week2 49 2 44.96
1 5 week2 72 2 57.82
1 6 week2 41 2 37.49
1 7 week2 47 2 47.64
1 8 week2 38 2 44.79
1 9 week2 39 2 36.14
1 10 week2 51 2 48.23
1 11 week2 34 2 56.06
1 12 week2 49 2 42.68
1 13 week2 36 2 44.15
1 14 week2 36 2 36.53
1 15 week2 65 2 48.30
1 16 week2 45 2 38.56
1 17 week2 46 2 40.67
1 18 week2 27 2 33.24
1 19 week2 31 2 33.19
1 20 week2 27 2 33.70
2 1 week2 42 2 45.83
2 2 week2 32 2 39.60
2 3 week2 33 2 40.42
2 4 week2 27 2 44.10
2 5 week2 58 2 56.96
2 6 week2 37 2 36.63
2 7 week2 41 2 46.78
2 8 week2 49 2 43.93
2 9 week2 38 2 35.28
2 10 week2 31 2 47.37
2 11 week2 75 2 55.20
2 12 week2 36 2 41.82
2 13 week2 60 2 43.29
2 14 week2 31 2 35.67
2 15 week2 55 2 47.44
2 16 week2 35 2 37.70
2 17 week2 30 2 39.81
2 18 week2 29 2 32.38
2 19 week2 26 2 32.33
2 20 week2 32 2 32.84
1 1 week3 43 3 45.42
1 2 week3 55 3 37.20
1 3 week3 38 3 38.08
1 4 week3 54 3 42.52
1 5 week3 65 3 53.54
1 6 week3 38 3 34.94
1 7 week3 30 3 44.17
1 8 week3 38 3 43.22
1 9 week3 35 3 33.17
1 10 week3 55 3 45.68
1 11 week3 41 3 54.11
1 12 week3 54 3 39.81
1 13 week3 31 3 41.04
1 14 week3 34 3 34.33
1 15 week3 49 3 44.92
1 16 week3 42 3 35.63
1 17 week3 38 3 39.04
1 18 week3 25 3 31.46
1 19 week3 28 3 30.71
1 20 week3 25 3 31.04
2 1 week3 41 3 45.27
2 2 week3 24 3 37.06
2 3 week3 28 3 37.94
2 4 week3 31 3 42.37
2 5 week3 61 3 53.39
2 6 week3 33 3 34.79
2 7 week3 33 3 44.02
2 8 week3 39 3 43.08
2 9 week3 27 3 33.02
2 10 week3 34 3 45.53
2 11 week3 76 3 53.96
2 12 week3 27 3 39.66
2 13 week3 53 3 40.90
2 14 week3 38 3 34.18
2 15 week3 55 3 44.78
2 16 week3 27 3 35.48
2 17 week3 32 3 38.89
2 18 week3 33 3 31.31
2 19 week3 27 3 30.57
2 20 week3 29 3 30.90
1 1 week4 41 4 44.14
1 2 week4 43 4 33.95
1 3 week4 43 4 34.89
1 4 week4 56 4 40.07
1 5 week4 50 4 49.25
1 6 week4 36 4 32.38
1 7 week4 27 4 40.69
1 8 week4 31 4 41.65
1 9 week4 28 4 30.19
1 10 week4 53 4 43.12
1 11 week4 36 4 52.15
1 12 week4 48 4 36.94
1 13 week4 25 4 37.94
1 14 week4 25 4 32.13
1 15 week4 36 4 41.54
1 16 week4 31 4 32.69
1 17 week4 40 4 37.41
1 18 week4 29 4 29.67
1 19 week4 29 4 28.23
1 20 week4 25 4 28.39
2 1 week4 39 4 44.72
2 2 week4 20 4 34.52
2 3 week4 22 4 35.47
2 4 week4 31 4 40.65
2 5 week4 49 4 49.83
2 6 week4 28 4 32.95
2 7 week4 34 4 41.27
2 8 week4 55 4 42.23
2 9 week4 31 4 30.77
2 10 week4 39 4 43.70
2 11 week4 66 4 52.72
2 12 week4 29 4 37.51
2 13 week4 52 4 38.51
2 14 week4 35 4 32.70
2 15 week4 42 4 42.11
2 16 week4 25 4 33.26
2 17 week4 46 4 37.98
2 18 week4 30 4 30.24
2 19 week4 24 4 28.81
2 20 week4 25 4 28.96
1 1 week5 40 5 42.87
1 2 week5 34 5 30.69
1 3 week5 28 5 31.70
1 4 week5 50 5 37.63
1 5 week5 39 5 44.97
1 6 week5 29 5 29.83
1 7 week5 40 5 37.22
1 8 week5 26 5 40.08
1 9 week5 22 5 27.22
1 10 week5 43 5 40.57
1 11 week5 36 5 50.20
1 12 week5 43 5 34.06
1 13 week5 25 5 34.83
1 14 week5 27 5 29.92
1 15 week5 32 5 38.16
1 16 week5 31 5 29.76
1 17 week5 33 5 35.78
1 18 week5 28 5 27.88
1 19 week5 21 5 25.76
1 20 week5 27 5 25.73
2 1 week5 38 5 44.16
2 2 week5 20 5 31.98
2 3 week5 25 5 32.99
2 4 week5 26 5 38.92
2 5 week5 38 5 46.26
2 6 week5 26 5 31.12
2 7 week5 37 5 38.51
2 8 week5 51 5 41.37
2 9 week5 24 5 28.51
2 10 week5 34 5 41.86
2 11 week5 64 5 51.49
2 12 week5 27 5 35.35
2 13 week5 32 5 36.12
2 14 week5 30 5 31.21
2 15 week5 30 5 39.45
2 16 week5 22 5 31.04
2 17 week5 43 5 37.07
2 18 week5 26 5 29.17
2 19 week5 32 5 27.05
2 20 week5 23 5 27.02
1 1 week6 38 6 41.60
1 2 week6 28 6 27.43
1 3 week6 29 6 28.51
1 4 week6 47 6 35.18
1 5 week6 32 6 40.69
1 6 week6 33 6 27.28
1 7 week6 30 6 33.75
1 8 week6 26 6 38.51
1 9 week6 20 6 24.25
1 10 week6 43 6 38.01
1 11 week6 38 6 48.24
1 12 week6 37 6 31.19
1 13 week6 21 6 31.73
1 14 week6 25 6 27.72
1 15 week6 27 6 34.78
1 16 week6 29 6 26.82
1 17 week6 27 6 34.15
1 18 week6 21 6 26.09
1 19 week6 22 6 23.28
1 20 week6 21 6 23.08
2 1 week6 43 6 43.60
2 2 week6 19 6 29.44
2 3 week6 24 6 30.52
2 4 week6 24 6 37.19
2 5 week6 37 6 42.69
2 6 week6 27 6 29.28
2 7 week6 37 6 35.76
2 8 week6 55 6 40.52
2 9 week6 22 6 26.25
2 10 week6 41 6 40.02
2 11 week6 64 6 50.25
2 12 week6 21 6 33.20
2 13 week6 37 6 33.73
2 14 week6 33 6 29.73
2 15 week6 26 6 36.78
2 16 week6 22 6 28.82
2 17 week6 43 6 36.15
2 18 week6 36 6 28.10
2 19 week6 21 6 25.29
2 20 week6 23 6 25.08
1 1 week7 47 7 40.32
1 2 week7 28 7 24.18
1 3 week7 25 7 25.32
1 4 week7 42 7 32.74
1 5 week7 38 7 36.40
1 6 week7 27 7 24.72
1 7 week7 31 7 30.28
1 8 week7 25 7 36.94
1 9 week7 23 7 21.28
1 10 week7 39 7 35.46
1 11 week7 36 7 46.29
1 12 week7 36 7 28.32
1 13 week7 19 7 28.62
1 14 week7 26 7 25.52
1 15 week7 30 7 31.40
1 16 week7 26 7 23.88
1 17 week7 31 7 32.52
1 18 week7 25 7 24.31
1 19 week7 23 7 20.81
1 20 week7 19 7 20.43
2 1 week7 62 7 43.04
2 2 week7 18 7 26.90
2 3 week7 31 7 28.04
2 4 week7 26 7 35.46
2 5 week7 36 7 39.12
2 6 week7 23 7 27.44
2 7 week7 38 7 33.00
2 8 week7 59 7 39.66
2 9 week7 21 7 24.00
2 10 week7 42 7 38.18
2 11 week7 60 7 49.01
2 12 week7 22 7 31.04
2 13 week7 52 7 31.34
2 14 week7 30 7 28.24
2 15 week7 30 7 34.12
2 16 week7 22 7 26.60
2 17 week7 43 7 35.24
2 18 week7 33 7 27.03
2 19 week7 21 7 23.53
2 20 week7 23 7 23.15
1 1 week8 51 8 39.05
1 2 week8 28 8 20.92
1 3 week8 24 8 22.13
1 4 week8 46 8 30.29
1 5 week8 32 8 32.12
1 6 week8 25 8 22.17
1 7 week8 31 8 26.81
1 8 week8 24 8 35.37
1 9 week8 21 8 18.31
1 10 week8 32 8 32.90
1 11 week8 36 8 44.33
1 12 week8 31 8 25.45
1 13 week8 22 8 25.52
1 14 week8 26 8 23.32
1 15 week8 37 8 28.02
1 16 week8 30 8 20.95
1 17 week8 27 8 30.89
1 18 week8 20 8 22.52
1 19 week8 22 8 18.33
1 20 week8 21 8 17.77
2 1 week8 50 8 42.48
2 2 week8 20 8 24.35
2 3 week8 32 8 25.57
2 4 week8 23 8 33.73
2 5 week8 35 8 35.55
2 6 week8 21 8 25.61
2 7 week8 35 8 30.25
2 8 week8 66 8 38.81
2 9 week8 21 8 21.74
2 10 week8 39 8 36.34
2 11 week8 75 8 47.77
2 12 week8 23 8 28.89
2 13 week8 28 8 28.95
2 14 week8 27 8 26.75
2 15 week8 37 8 31.45
2 16 week8 22 8 24.38
2 17 week8 43 8 34.33
2 18 week8 30 8 25.96
2 19 week8 21 8 21.77
2 20 week8 23 8 21.21

Draw the plot with the fitted BPRS values

ggplot(bprs_l, aes(x = week, y = fitted_bprs, linetype = subject)) +
  geom_line(color = "darkblue") +
  scale_linetype_manual(values = rep(1:6, times = 4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name = "Weeks", breaks = seq(0,8,1)) + 
  scale_y_continuous(name = "BPRS values modeled", breaks = seq(10,100,5)) + 
  theme(legend.position = "bottom") + 
  ggtitle("BPRS: fitted values by treatment") +
  theme(legend.box.background = element_rect(),legend.box.margin = margin(2, 2, 2, 2))
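The observed and fitted trajectories can also be overlaid in a single plot, which makes the smoothing effect of the mixed model visible per subject. A sketch, assuming `ggplot2` is loaded and `bprs_l` contains the `fitted_bprs` column added above:

```r
# Grey lines: observed BPRS; blue lines: model-fitted BPRS, one pair per subject
ggplot(bprs_l, aes(x = week, group = subject)) +
  geom_line(aes(y = BPRS), color = "grey60") +
  geom_line(aes(y = fitted_bprs), color = "darkblue") +
  facet_grid(. ~ treatment, labeller = label_both) +
  labs(x = "Weeks", y = "BPRS",
       title = "Observed (grey) vs fitted (blue) BPRS by treatment")
```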


Chapter 7 Final note: See you later, R

(10.12.2019)

I will use these analyses in my own research to get a better understanding of my results, as well as better tables and figures.

Very tired, very necessary.

:)