Chapter 1 Start up

Introduction Warmup

It is a good beginning. The DataCamp tasks are friendly to beginners.

Clustering and classification, and dimensionality reduction are the two parts I am most interested in.

I heard about this course via an email sent to HYMY students.

The link to my github repository:

https://github.com/yufanyin/IODS-project

The link to my course diary:

https://yufanyin.github.io/IODS-project/

Work through the DataCamp exercises:

DataCamp 1 Hello R

DataCamp 2 Getting Hooked on R


Chapter 2 Regression and model validation

(5.11.-11.11.2019)

Note: ‘(Step N)’ is used for reviewers to check my work quickly.

2.1 Reading the data

Read the data file (Step 1a)

learning2014 <- 
read.table(file = "G:/C-Open Data Science/0-191030/IODS-project-master/learning2014.txt", stringsAsFactors = TRUE) 
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

Describe the dataset briefly (Step 1b)

The data ‘learning2014’ consists of 166 observations and 7 variables. It contains students’ approaches to learning (deep, stra, surf), scores of attitude, points and some background information (gender, age).

2.2 Visualizations with ggplot2

(Step 2)

Correlations between the different variables are shown below with the ggpairs() function.

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
p1 <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)), upper = list(continuous = wrap("cor", size = 2.8)))
p1

[Figure: ggpairs overview of the learning2014 variables, coloured by gender]

The plot shows the distribution of all 166 observations for each variable. The colours represent gender: female students are shown in red and male students in blue-green. The upper right part presents how each variable correlates with the others, which helps to spot potentially significant regression relationships.

2.3 Simple regression with lm()

(Step 3)

Simple regression: points ~ gender. First we choose gender as the explanatory variable and fit a regression model in which exam points is the dependent variable. Result: not significant.

gender_lm <- lm(points ~ gender, data = learning2014)
summary(gender_lm)
## 
## Call:
## lm(formula = points ~ gender, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3273  -3.3273   0.5179   4.5179  10.6727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.3273     0.5613  39.776   <2e-16 ***
## genderM       1.1549     0.9664   1.195    0.234    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.887 on 164 degrees of freedom
## Multiple R-squared:  0.008632,   Adjusted R-squared:  0.002587 
## F-statistic: 1.428 on 1 and 164 DF,  p-value: 0.2338

Simple regression: points ~ attitude

attitude_lm <- lm(points ~ attitude, data = learning2014)
summary(attitude_lm)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

There is a statistically significant relationship between points and attitude (p = 4.12e-09). Next, draw a scatter plot of points versus attitude with a fitted regression line, and print out the model summary.

qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")

[Figure: scatter plot of points versus attitude with a fitted regression line]

model_point_attitude <- lm(points ~ attitude, data = learning2014)
model_point_attitude
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Coefficients:
## (Intercept)     attitude  
##      11.637        3.525
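As a quick sanity check on the fitted line, the expected exam points for a given attitude score can be computed directly from the printed coefficients; the attitude value 4 below is a hypothetical example, not taken from the data.

```r
# Predicted exam points = intercept + slope * attitude,
# using the coefficients printed above (11.637 and 3.525)
attitude_score <- 4  # hypothetical value for illustration
predicted_points <- 11.637 + 3.525 * attitude_score
predicted_points
## [1] 25.737
```

This is the same result that predict(model_point_attitude, newdata = ...) would give.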

2.4 Multiple regression

(Step 4)

Multiple regression model: points ~ attitude + stra. This multiple regression tests whether strategic learning (stra) influences exam points. The p-value (0.089) shows that the effect is not strongly significant. The multiple R-squared is slightly higher than in the points ~ attitude model, meaning the model explains a little more of the variance when strategic learning is taken into account.

model_p_att_stra <- lm(points ~ attitude + stra, data = learning2014)
model_p_att_stra_s <- summary(model_p_att_stra)
model_p_att_stra_s
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
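Whether adding the second predictor is worthwhile can also be judged with a nested-model F-test via anova(). A minimal self-contained sketch with simulated data (the variable names x1, x2 and y are hypothetical, standing in for attitude, stra and points):

```r
set.seed(1)  # for reproducibility
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 2 + 1.5 * d$x1 + 0.3 * d$x2 + rnorm(100)

m_small <- lm(y ~ x1, data = d)       # analogous to points ~ attitude
m_large <- lm(y ~ x1 + x2, data = d)  # analogous to points ~ attitude + stra

# F-test: does adding x2 significantly reduce the residual sum of squares?
anova(m_small, m_large)
```

The comparison is valid because the smaller model is nested inside the larger one.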

2.5 Diagnostic plots

(Step 5)

The diagnostic plots are in accordance with the assumptions of the regression model.

In the Residuals vs Fitted plot, the residuals scatter around the 0 line without a clear pattern, which supports the linearity assumption. A few outliers (35, 56, 145) stand out, but they do not seriously disturb the analysis.

According to the Normal Q-Q plot, most of the values fall on the straight line, so the normality assumption of the errors is reasonable; the same outliers deviate at the tails.

plot(model_p_att_stra, which=c(1,2,5))

[Figures: Residuals vs Fitted, Normal Q-Q, and Residuals vs Leverage diagnostic plots for the fitted model]


Chapter 3 Logistic regression

(12.11.-18.11.2019)

Note: ‘(Step N)’ is used for reviewers to check my work quickly.

3.1 Reading the data

Read the data file (Step 2a)

getwd()
## [1] "G:/C-Open Data Science/0-191030/IODS-project-master"
setwd("G:/C-Open Data Science/0-191030/IODS-project-master")
alc <- read.csv(file = "G:/C-Open Data Science/0-191030/IODS-project-master/alc.csv")

Print out the names of the variables in the data (Step 2b)

dim(alc)
## [1] 382  36
colnames(alc)
##  [1] "X"          "school"     "sex"        "age"        "address"   
##  [6] "famsize"    "Pstatus"    "Medu"       "Fedu"       "Mjob"      
## [11] "Fjob"       "reason"     "nursery"    "internet"   "guardian"  
## [16] "traveltime" "studytime"  "failures"   "schoolsup"  "famsup"    
## [21] "paid"       "activities" "higher"     "romantic"   "famrel"    
## [26] "freetime"   "goout"      "Dalc"       "Walc"       "health"    
## [31] "absences"   "G1"         "G2"         "G3"         "alc_use"   
## [36] "high_use"
str(alc)
## 'data.frame':    382 obs. of  36 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

Describe the data set briefly (Step 2c)

The data consist of 382 observations of 36 variables.

Note: The first column “X” (int 1 2 3 4 5 …) is a row-index column that write.csv() added when the wrangled data were saved; it carries no information of its own, leaving 35 substantive variables.

The data were collected using school reports and questionnaires from students in two Portuguese secondary schools. They include students’ grades, demographic characteristics, and social and school-related features.

The two original data sets record students’ performance in Mathematics (mat) and Portuguese language (por). In the data wrangling part they were merged into one data set, “alc”.

To analyse alcohol consumption, two new variables were created. “alc_use” is the average of the weekday (“Dalc”) and weekend (“Walc”) alcohol consumption. “high_use” is a logical variable that is TRUE when a student’s “alc_use” is higher than 2.
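The derivation of the two variables can be sketched with a tiny toy data frame; the values below are made up for illustration (the course itself uses dplyr::mutate for this step).

```r
# Base-R sketch of the derived variables, on made-up values
toy <- data.frame(Dalc = c(1, 3), Walc = c(2, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2  # average of weekday and weekend use
toy$high_use <- toy$alc_use > 2            # TRUE when the average exceeds 2
toy
## alc_use: 1.5 and 4.0; high_use: FALSE and TRUE
```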

Data source: https://archive.ics.uci.edu/ml/datasets/Student+Performance

3.2 Hypotheses about the relationships between alcohol consumption and specific variables

Choose variables and write hypotheses (Step 3)

The purpose of the following analysis is to explore the relationships between high/low alcohol consumption and some variables. I chose the following four variables:

‘sex’: student’s sex (binary: ‘F’ - female or ‘M’ - male)

‘failures’: number of past class failures (numeric: n if 1<=n<3, else 4)

‘goout’: going out with friends (numeric: from 1 - very low to 5 - very high)

‘absences’: number of school absences (numeric: from 0 to 93)

The hypotheses are as follows.

Hypothesis 1: There is a gender difference (‘sex’) in alcohol consumption. In social science, gender affects many behaviours and preferences and is therefore a frequently tested variable.

Hypothesis 2: The number of class failures (‘failures’) associates with alcohol consumption. Students with more class failures have higher alcohol consumption.

Hypothesis 3: The frequency of going out with friends (‘goout’) associates with alcohol consumption. Students going out with friends more often have higher alcohol consumption.

Hypothesis 4: The number of school absences (‘absences’) associates with alcohol consumption. Students with more school absences have higher alcohol consumption.

3.3 Exploring the relationships numerically and graphically

(Step 4)

library(tidyr)
library(dplyr)
library(ggplot2)
library(knitr)
library(kableExtra)

3.3.1 Alcohol consumption and gender

alc %>% group_by(sex) %>% summarise(count = n()) -> fm
knitr::kable(fm, caption="Students' gender distribution") %>% kable_styling(bootstrap_options = "hover", full_width = FALSE, position = "center")
Students’ gender distribution

sex   count
F     198
M     184

There were 198 female students and 184 male students.

ggplot(data = alc, aes(x = alc_use, fill = sex)) + geom_bar() + xlab("Alcohol consumption (1: lowest / 5: highest)") + ylab("Number of students") + ggtitle("Alcohol consumption and gender") + scale_fill_discrete("Gender", labels = c("Female", "Male"))

The figure presents the distribution of alcohol consumption by gender; 1 means the lowest consumption and 5 the highest.

ggplot(data = alc, aes(x = alc_use, fill = high_use)) + facet_wrap("sex") + geom_bar() + xlab("Student's alcohol consumption") + ggtitle("High alcohol consumption by gender") + ylab("Number of students") + scale_fill_discrete("High alcohol use")

This graph shows the numbers of male and female students by alcohol consumption. High alcohol consumption (high_use = TRUE) is shown in blue-green and lower consumption in red. More male students than female students have high alcohol consumption.

3.3.2 Alcohol consumption and number of class failures

ggplot(data = alc, aes(x = failures,  fill = sex)) + geom_bar() + facet_wrap("high_use") + ggtitle("Alcohol consumption and number of class failures") + xlab("Number of class failures") + ylab("Number of students") + scale_fill_discrete("Gender", labels = c("Female", "Male"))

3.3.3 Alcohol consumption and frequency of going out with friends

ggplot(data = alc, aes(x = goout,  fill = sex)) + geom_bar() + facet_wrap("high_use") + ggtitle("Alcohol consumption and frequency of going out with friends") + xlab("Frequency of going out with friends (1: very low / 5: very high)") + ylab("Number of students") + scale_fill_discrete("Gender", labels = c("Female", "Male"))

3.3.4 Alcohol consumption and number of school absences

ggplot(data = alc, aes(x = absences,  fill = sex)) + geom_bar() + facet_wrap("high_use") + ggtitle("Alcohol consumption and number of school absences") + xlab("Number of school absences (from 0 to 93)") + ylab("Number of students") + scale_fill_discrete("Student's gender", labels = c("Female", "Male"))

3.4 Exploring the relationship using logistic regression

3.4.1 The fitted model

(Step 5a)

Fit a logistic regression model with ‘sex’, ‘failures’, ‘goout’ and ‘absences’ as explanatory variables and ‘high_use’ as the target variable.

lr_glm <- glm(high_use ~ sex + failures + goout + absences, data = alc, family = "binomial")
lr_glm_sum <- summary(lr_glm)

3.4.2 The coefficients of the model

(Step 5b)

lr_glm_sum
## 
## Call:
## glm(formula = high_use ~ sex + failures + goout + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8046  -0.7661  -0.5196   0.7383   2.3862  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.21840    0.47865  -8.813  < 2e-16 ***
## sexM         0.97670    0.25898   3.771 0.000162 ***
## failures     0.28657    0.17170   1.669 0.095111 .  
## goout        0.71559    0.12089   5.919 3.24e-09 ***
## absences     0.06550    0.01685   3.888 0.000101 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 378.67  on 377  degrees of freedom
## AIC: 388.67
## 
## Number of Fisher Scoring iterations: 4
knitr::kable(lr_glm_sum$coefficients, digits = 3, align = "l", caption="Model coefficients of chosen variables") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
Model coefficients of chosen variables

            Estimate  Std. Error  z value  Pr(>|z|)
(Intercept) -4.218    0.479       -8.813   0.000
sexM        0.977     0.259       3.771    0.000
failures    0.287     0.172       1.669    0.095
goout       0.716     0.121       5.919    0.000
absences    0.066     0.017       3.888    0.000

The output shows

  • Being a male student instead of a female student significantly (p < 0.001) increased the probability of high alcohol consumption.

  • Class failures had a positive but only marginally significant effect on high consumption (p = 0.095).

  • The frequency of going out with friends had a significant positive effect on high alcohol consumption (p < 0.001).

  • School absences had a significant positive effect on high alcohol consumption (p < 0.001).
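The coefficients above are on the log-odds scale. As an illustration, the predicted probability of high use for a hypothetical student (male, no failures, goout = 4, absences = 5; these values are made up) can be computed by hand from the printed estimates:

```r
# log-odds = intercept + sexM + b_failures*0 + b_goout*4 + b_absences*5,
# using the coefficients printed above
log_odds <- -4.21840 + 0.97670 + 0.28657 * 0 + 0.71559 * 4 + 0.06550 * 5
prob <- 1 / (1 + exp(-log_odds))  # inverse logit transforms log-odds to probability
round(prob, 2)
## [1] 0.49
```

This matches what predict(lr_glm, newdata = ..., type = "response") would return for such a student.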

3.4.3 The odds ratios and confidence intervals

(Step 5c)

Odds ratios

or_lr <- coef(lr_glm) %>% exp
or_lr 
## (Intercept)        sexM    failures       goout    absences 
##  0.01472214  2.65567230  1.33184898  2.04538456  1.06769512

Confidence intervals

ci_lr <- confint(lr_glm) %>% exp
## Waiting for profiling to be done...
ci_lr
##                   2.5 %     97.5 %
## (Intercept) 0.005520616 0.03620626
## sexM        1.608518976 4.45004345
## failures    0.949933462 1.86808335
## goout       1.623552039 2.61096941
## absences    1.035225148 1.10698949

Print out the odds ratios with their confidence intervals

od_conf <- cbind(or_lr, ci_lr) 
colnames(od_conf) <- c("odds ratios","2.5 %", "97.5 %")
knitr::kable(od_conf, digits = 3, align = "l",  caption="Odds ratios and confidence intervals") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "center")
Odds ratios and confidence intervals

            odds ratios  2.5 %  97.5 %
(Intercept) 0.015        0.006  0.036
sexM        2.656        1.609  4.450
failures    1.332        0.950  1.868
goout       2.045        1.624  2.611
absences    1.068        1.035  1.107
  • ‘sex’ has an odds ratio of 2.656, with a confidence interval from 1.609 to 4.450. This odds ratio tells us that the odds of high alcohol consumption for male students were about 2.7 times those of female students.

  • ‘failures’ has an odds ratio of 1.332, with a confidence interval from 0.950 to 1.868. Class failures tended to increase the risk of high alcohol consumption, but since the interval includes 1 the effect is not statistically significant.

  • ‘goout’ has an odds ratio of 2.045, with a confidence interval from 1.624 to 2.611. Going out with friends significantly increased the risk of high alcohol consumption.

  • ‘absences’ has an odds ratio of 1.068, with a confidence interval from 1.035 to 1.107. School absences also increased the probability of high alcohol consumption.

3.5 The predictive power of the model

(Step 6)

probs <- predict(lr_glm, type = "response")
alc <- mutate(alc, prob = probs)
alc <- mutate(alc, pred = prob > 0.5)
select(alc, age, failures, goout, absences, high_use, prob, pred) %>% tail(20)
##     age failures goout absences high_use       prob  pred
## 363  17        0     3        8    FALSE 0.17542917 FALSE
## 364  17        0     5       14    FALSE 0.56870202  TRUE
## 365  18        0     4        0    FALSE 0.20488145 FALSE
## 366  18        0     2        2    FALSE 0.06560625 FALSE
## 367  18        0     3        4     TRUE 0.14068205 FALSE
## 368  18        0     3        0    FALSE 0.11188342 FALSE
## 369  17        0     4       17     TRUE 0.43966501 FALSE
## 370  18        0     4        4     TRUE 0.47069539 FALSE
## 371  18        0     3        5    FALSE 0.31703390 FALSE
## 372  17        0     3        2    FALSE 0.27608939 FALSE
## 373  19        1     2        0    FALSE 0.17887835 FALSE
## 374  18        1     3       14     TRUE 0.52713626  TRUE
## 375  18        0     3        2    FALSE 0.12557744 FALSE
## 376  18        0     3        7    FALSE 0.16615452 FALSE
## 377  19        1     2        0    FALSE 0.07581170 FALSE
## 378  18        0     4        0    FALSE 0.20488145 FALSE
## 379  18        1     1        0    FALSE 0.03855880 FALSE
## 380  18        1     1        0    FALSE 0.03855880 FALSE
## 381  17        0     5        3     TRUE 0.63011977  TRUE
## 382  18        0     1        0     TRUE 0.07404728 FALSE
pred22 <- table(high_use = alc$high_use, pred = alc$pred)
knitr::kable(pred22, align = "l",  caption="2 x 2 cross tabulation" ) %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "center") %>% add_header_above(c(" " = 1, "Pred" = 2)) %>%  pack_rows("High use", 1, 2)
2 x 2 cross tabulation

                  Pred
high_use     FALSE   TRUE
  FALSE      251     19
  TRUE       62      50

The model produced 251 true negatives, 50 true positives, 19 false positives and 62 false negatives. In the 62 false negative cases, the students had high alcohol consumption but the model predicted they did not; in the 19 false positive cases, the model predicted high consumption where there was none.

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$prob)
## [1] 0.2120419

This shows that 21.2% of all predictions were incorrect (the training error).
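The same error rate can be recovered directly from the 2 x 2 table: the off-diagonal cells are the wrong predictions.

```r
# false negatives (62) + false positives (19) over all 382 students
training_error <- (62 + 19) / (251 + 19 + 62 + 50)
round(training_error, 3)
## [1] 0.212
```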

3.6 The 10-fold cross-validation on the model

(Step 7)

library(boot)
cross_val <- cv.glm(data = alc, cost = loss_func, glmfit = lr_glm, K = 10)
cross_val$delta[1]
## [1] 0.2172775

The average prediction error in 10-fold cross-validation is about 21.7%, slightly higher than the training error of 21.2%, as expected when predicting new observations.

3.7 Cross-validation to compare the performance of different logistic regression models

(Step 8)
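One way to compare the performance of different logistic regression models is to run the same 10-fold cross-validation on each and compare the resulting prediction errors. A self-contained sketch with simulated data (the variable names x1, x2 and y are hypothetical, not from the alc data):

```r
library(boot)

# same loss function as above: share of predictions on the wrong side of 0.5
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

set.seed(42)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.8 * d$x1))  # only x1 truly affects y

m1 <- glm(y ~ x1,      data = d, family = "binomial")
m2 <- glm(y ~ x1 + x2, data = d, family = "binomial")

err1 <- cv.glm(data = d, cost = loss_func, glmfit = m1, K = 10)$delta[1]
err2 <- cv.glm(data = d, cost = loss_func, glmfit = m2, K = 10)$delta[1]
c(err1, err2)  # the model without the noise variable usually does no worse
```

Repeating this for models with fewer and fewer predictors traces how prediction error changes with model complexity.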


Chapter 4 Clustering and classification

(19.11.-25.11.2019)

Note: ‘(Step N)’ is used for reviewers to check my work quickly.

4.1 Reading the data

Load the Boston data from the MASS package (Step 2a)

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

Print out the names of the variables in the data (Step 2b)

dim(Boston)
## [1] 506  14
colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Describe the dataset briefly (Step 2c)

The data consist of 506 observations of 14 variables, reporting information about housing (proportion of residential land, proportion of non-retail business, rooms per dwelling, proportion of old buildings, accessibility to employment centres and highways), population (crime rate, tax, pupil-teacher ratio, demographic structure) and environment (air quality, the Charles River) in Boston suburbs.

This data frame contains the following columns:

  • “crim”: per capita crime rate by town.

  • “zn”: proportion of residential land zoned for lots over 25,000 sq.ft.

  • “indus”: proportion of non-retail business acres per town.

  • “chas”: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  • “nox”: nitrogen oxides concentration (parts per 10 million).

  • “rm”: average number of rooms per dwelling.

  • “age”: proportion of owner-occupied units built prior to 1940.

  • “dis”: weighted mean of distances to five Boston employment centres.

  • “rad”: index of accessibility to radial highways.

  • “tax”: full-value property-tax rate per $10,000.

  • “ptratio”: pupil-teacher ratio by town.

  • “black”: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

  • “lstat”: lower status of the population (percent).

  • “medv”: median value of owner-occupied homes in $1000s.

4.2 Exploring the data numerically and graphically

4.2.1 Graphical overview of the data

(Step 3a)

library(tidyr)
library(dplyr)
library(corrplot)
library(ggplot2)
library(GGally)
ov_boston <- ggpairs(Boston, mapping = aes(), title ="Overview of Boston data", lower = list(combo = wrap("facethist", bins = 20)), upper = list(continuous = wrap("cor", size = 2.8)))
ov_boston

4.2.2 Summaries of the variables

(Step 3b)

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

4.2.3 Relationships between the variables

(Step 3c)

Calculate and print the correlation matrix

cor_matrix<-cor(Boston) %>% round(digits = 2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00

Compute p-values for the correlations with cor.mtest() and visualize the correlation matrix, marking correlations that are not significant at the chosen level.

library(corrplot)
# cor.mtest() expects the raw data, not the correlation matrix
p.mat <- cor.mtest(Boston)$p
corrplot(cor_matrix, method="circle", type="upper",  tl.cex = 0.6, p.mat = p.mat, sig.level = 0.01, title="Correlations of Boston data", mar=c(0,0,1,0))

The crime rate correlated strongly positively with the accessibility-to-highways index (rad), the property-tax rate (tax) and the lower-status percentage (lstat).

The nitrogen oxides concentration (nox) correlated strongly positively with the proportion of non-retail business acres (indus) and the proportion of old owner-occupied units (age), and negatively with the distance to employment centres (dis). The weighted mean of distances to the five employment centres (dis) in turn correlated positively with the proportion of large residential lots (zn) and negatively with indus, age and nox.
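A quick way to list the strongest pairwise correlations programmatically is sketched below; the built-in mtcars data stands in so the snippet runs on its own (with the Boston data loaded, cor(Boston) would be used instead):

```r
m <- cor(mtcars)                    # correlation matrix of a stand-in data set
m[lower.tri(m, diag = TRUE)] <- NA  # keep each variable pair once, drop the diagonal
strongest <- sort(abs(m), decreasing = TRUE)  # sort() silently drops the NAs
head(strongest, 3)                  # three largest absolute correlations
```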

4.3 Standardizing the dataset

4.3.1 Standardizing and printing out summaries

(Step 4a)

The variables were standardized with scale(), which subtracts the column mean from each value and divides by the column standard deviation, so that observations of the different variables can be compared with each other. The scaled variables have a mean of 0 and a standard deviation of 1, with both positive and negative values.

bos_scaled <- scale(Boston)
summary(bos_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
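The scale() transformation is just (x - mean(x)) / sd(x) applied column-wise, which is easy to verify on a toy vector:

```r
x <- c(1, 2, 3, 10)  # arbitrary toy values
manual <- (x - mean(x)) / sd(x)   # standardize by hand
all.equal(as.vector(scale(x)), manual)
## [1] TRUE
```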

4.3.2 Creating a categorical variable of the crime rate

(Step 4b)

Create a quantile vector of crim

class(bos_scaled)
## [1] "matrix"
bos_scaled <- as.data.frame(bos_scaled)
summary(bos_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
bins <- quantile(bos_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610

Create a categorical variable ‘crime’ and look at the table

crime <- cut(bos_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127

Remove original crim and add the new categorical value to scaled data

bos_scaled <- dplyr::select(bos_scaled, -crim)
bos_scaled <- data.frame(bos_scaled, crime)

We scaled the crime rate, computed its quantiles and used them as break points to cut the values into four categories: low (0-25%), medium low (25-50%), medium high (50-75%) and high (75-100%). The new categorical variable ‘crime’ then replaced the original ‘crim’ in the scaled data.

4.3.3 Dividing the dataset to train and test sets

(Step 4c)

Randomly choose 80% of the rows for the train set; the remaining rows form the test set

n_bos <- nrow(bos_scaled)
ind_bos <- sample(n_bos,  size = n_bos * 0.8)
boston_train <- bos_scaled[ind_bos,]
boston_test <- bos_scaled[-ind_bos,]

Save the correct classes from test data

correct_classes <- boston_test$crime
correct_classes
##   [1] low      low      med_low  med_high med_high med_high med_high
##   [8] med_high med_high low      low      med_low  med_low  low     
##  [15] low      med_low  med_low  low      low      med_low  low     
##  [22] low      low      med_low  med_low  med_low  med_low  med_low 
##  [29] med_low  med_high med_high med_high med_high med_high med_high
##  [36] med_high med_high med_high med_low  med_low  low      low     
##  [43] low      med_low  med_low  low      low      low      low     
##  [50] med_low  med_low  med_low  med_low  med_low  med_low  med_low 
##  [57] low      med_high med_high med_high med_high low      med_low 
##  [64] low      med_low  low      low      med_low  low      low     
##  [71] med_high med_high med_high low      low      low      low     
##  [78] low      low      high     high     high     high     high    
##  [85] high     high     high     high     high     high     high    
##  [92] high     high     high     high     high     high     high    
##  [99] high     high     high     high    
## Levels: low med_low med_high high

Remove the crime variable from test data set

boston_test <- dplyr::select(boston_test, -crime)

4.4 Linear discriminant analysis

(Step 5)

Fit linear discriminant analysis on the ‘boston_train’ data set using the categorical ‘crime’ as the target variable.

lda.fit <- lda(crime ~ ., data = boston_train)
lda.fit
## Call:
## lda(crime ~ ., data = boston_train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2376238 0.2475248 0.2574257 0.2574257 
## 
## Group means:
##                  zn      indus        chas        nox         rm
## low       0.8671860 -0.8485969 -0.19030615 -0.8630622  0.4101940
## med_low  -0.1172105 -0.2627239 -0.07547406 -0.5410452 -0.1847441
## med_high -0.3948894  0.1789197  0.21980846  0.3465785  0.1023238
## high     -0.4872402  1.0170690 -0.04518867  1.0263404 -0.3282094
##                 age        dis        rad        tax      ptratio
## low      -0.8339186  0.8406002 -0.6971471 -0.7348633 -0.452913751
## med_low  -0.4027126  0.3202213 -0.5500476 -0.4291467 -0.009946484
## med_high  0.3592201 -0.3511215 -0.3756131 -0.2909462 -0.249941864
## high      0.7905465 -0.8503519  1.6386213  1.5144083  0.781350740
##                black        lstat        medv
## low       0.36999537 -0.715425370  0.47076706
## med_low   0.31104125 -0.150826770 -0.03683837
## med_high  0.09915259  0.003071918  0.16883256
## high     -0.79657982  0.809836268 -0.66608744
## 
## Coefficients of linear discriminants:
##                  LD1         LD2          LD3
## zn       0.064713197  0.74524778 -0.908970410
## indus    0.005855833 -0.24857245  0.091501918
## chas    -0.092210600 -0.10691445  0.060138540
## nox      0.395961661 -0.72837304 -1.144675082
## rm      -0.089262903 -0.06315898 -0.259855052
## age      0.244696221 -0.23988852 -0.261379062
## dis     -0.057066293 -0.21888082 -0.007249824
## rad      3.079385046  0.76714473 -0.417017083
## tax     -0.087455797  0.11725343  0.880884627
## ptratio  0.128269592  0.01928150 -0.123383068
## black   -0.137291621  0.01118952  0.079985876
## lstat    0.218280464 -0.20881241  0.217992605
## medv     0.173541485 -0.40568071 -0.226754787
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9522 0.0354 0.0125

The function for lda biplot arrows

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

Target classes as numeric and plot the lda results

classes <- as.numeric(boston_train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes, main = "LDA biplot of Boston train set")

The linear discriminant analysis showed the separation of different crime rates.

Note: I do not know why the arrows were not displayed.

Error in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 = myscale * : plot.new has not been called yet

4.5 Predicting the classes with the LDA model

Now we use the fitted LDA model to predict the crime rate categories of the test set.

4.5.1 Predicting classes with test data

(Step 6a)

lda.pred <- predict(lda.fit, newdata = boston_test)

4.5.2 Cross tabulate the results with the crime categories

(Step 6b)

pred_table <- table(correct = correct_classes, predicted = lda.pred$class)
pred_table
##           predicted
## correct    low med_low med_high high
##   low       20      11        0    0
##   med_low    4      18        4    0
##   med_high   0       6       16    0
##   high       0       0        0   23

The predictions on the lower crime rates were not very good: most misclassified observations in the low and med_low classes were predicted one category too high (med_low or med_high, respectively). The predictions on the high crime rate, however, were perfect in this split.
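The overall accuracy implied by the cross tabulation can be recomputed from its diagonal. The sketch below re-enters the counts from the table by hand; with a random train/test split the exact numbers vary from run to run.

```r
# Re-entry of the confusion matrix above (rows = correct classes,
# columns = predicted classes); counts depend on the random split.
conf <- matrix(c(20, 11,  0,  0,
                  4, 18,  4,  0,
                  0,  6, 16,  0,
                  0,  0,  0, 23),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
# Overall accuracy = share of observations on the diagonal
sum(diag(conf)) / sum(conf)  # about 0.755 for this split
```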

4.6 Reloading the dataset and running the k-means algorithm

In 4.5 we performed classification. In classification, the classes are defined in advance (crime rate categories from low to high) and the model is trained on labelled data.

Now we do clustering: the classes, and even the number of classes, are unknown in advance, and groups are found based on the similarity of the observations.

4.6.1 Standardizing the Boston dataset

(Step 7a)

data(Boston)
bos_scaled2 <- scale(Boston)

4.6.2 Calculating the distances

(Step 7b)

Euclidean distance matrix

dist_eu <- dist(bos_scaled2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Manhattan distance matrix

dist_man <- dist(bos_scaled2, method = "manhattan")
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

4.6.3 K-means clustering

(Step 7c)

bos_km <- kmeans(bos_scaled2, centers = 3)

Plot the dataset with clusters

pairs(bos_scaled2, col = bos_km$cluster)

pairs(bos_scaled2[,1:5], col = bos_km$cluster)

pairs(bos_scaled2[,6:10], col = bos_km$cluster)

pairs(bos_scaled2[,11:14], col = bos_km$cluster)

We got the clearest overview with three clusters; the choice of k is examined more formally in 4.6.4.

4.6.4 Determining the k

set.seed(123)
k_max <- 10 # determine the number of clusters
twcss <- sapply(1:k_max, function(k){kmeans(bos_scaled2, k)$tot.withinss}) # calculate the total within sum of squares
qplot(x = 1:k_max, y = twcss, geom = 'line') # visualize the results

bos_km2 <- kmeans(bos_scaled2, centers = 3) # k-means clustering
pairs(bos_scaled2, col = bos_km2$cluster) # plot the data with clusters

pairs(bos_scaled2[,1:5], col = bos_km2$cluster)

pairs(bos_scaled2[,6:10], col = bos_km2$cluster)

pairs(bos_scaled2[,11:14], col = bos_km2$cluster)

The TWCSS value drops sharply up to 2-3 clusters and the curve flattens after that, so the optimal number of clusters was 3.

4.7 Performing k-means on the original data

(Bonus)

data(Boston)
bos_scaled3 <- scale(Boston)
bos_km3 <-kmeans(bos_scaled3, centers = 3)
cluster <- bos_km3$cluster
bos_scaled3 <- data.frame(bos_scaled3, cluster)
lda.fit_cluster <- lda(cluster ~ ., data = bos_scaled3)
lda.fit_cluster
## Call:
## lda(cluster ~ ., data = bos_scaled3)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.2470356 0.3260870 0.4268775 
## 
## Group means:
##         crim         zn      indus         chas        nox         rm
## 1 -0.3989700  1.2614609 -0.9791535 -0.020354653 -0.8573235  1.0090468
## 2  0.7982270 -0.4872402  1.1186734  0.014005495  1.1351215 -0.4596725
## 3 -0.3788713 -0.3578148 -0.2879024  0.001080671 -0.3709704 -0.2328004
##           age        dis        rad        tax     ptratio      black
## 1 -0.96130713  0.9497716 -0.5867985 -0.6709807 -0.80239137  0.3552363
## 2  0.79930921 -0.8549214  1.2113527  1.2873657  0.59162230 -0.6363367
## 3 -0.05427143  0.1034286 -0.5857564 -0.5951053  0.01241316  0.2805140
##        lstat        medv
## 1 -0.9571271  1.06668290
## 2  0.8622388 -0.67953738
## 3 -0.1047617 -0.09820229
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## crim    -0.03206338 -0.19094456
## zn       0.02935900 -1.07677218
## indus    0.63347352 -0.09917524
## chas     0.02460719  0.10009606
## nox      1.11749317 -0.75995105
## rm      -0.18841682 -0.57360135
## age     -0.12983139  0.47226685
## dis      0.04493809 -0.34585958
## rad      0.67004295 -0.08584353
## tax      1.03992455 -0.58075025
## ptratio  0.25864960 -0.02605279
## black   -0.01657236  0.01975686
## lstat    0.17365575 -0.41704235
## medv    -0.06819126 -0.79098605
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8506 0.1494
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes3 <- as.numeric(bos_scaled3$cluster)
plot(lda.fit_cluster, dimen = 2, col = classes3, pch = classes3, main = "LDA biplot using three clusters")
lda.arrows(lda.fit_cluster, myscale = 2)

4.8 3D plot

(Super-Bonus)

model_predictors <- dplyr::select(boston_train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

Next, install and access the plotly package. Create a 3D plot of the columns of the matrix product.

Note: After running:

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = boston_train$crime)

It showed:

WebGL is not supported by your browser. Therefore I could not see the plot in the knitted document.
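As a fallback when WebGL is unavailable, a static 3D scatter can be drawn with lattice::cloud (lattice ships with R). This is a sketch assuming the matrix_product and boston_train objects from the chunk above exist:

```r
# Static 3D scatter as a WebGL-free alternative (assumes matrix_product and
# boston_train exist from the previous chunk)
library(lattice)
cloud(LD3 ~ LD1 * LD2, data = matrix_product,
      groups = boston_train$crime, pch = 19,
      main = "LDA scores of the Boston train set")
```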


Chapter 5 Dimensionality reduction techniques

(25.11.-2.12.2019)

5.1 Exploring the data numerically and graphically

Load necessary packages

library(tidyr)
library(dplyr)
library(corrplot)
library(ggplot2)
library(GGally)
library(knitr)
library(kableExtra)
library(stringr)
library(ggfortify)
library(factoextra)

5.1.1 Reading the data

human <- read.table(file = "G:/C-Open Data Science/0-191030/IODS-project-master/human_ana.txt")

5.1.2 Summaries of the variables

(Step 1a)

The data are a combination of two data sets: the Human Development Index (HDI) and the Gender Inequality Index (GII) data.

The data consist of 155 observations in 8 variables.

“Country”: country name (stored as row names, not counted in the 8 variables)
“edu2F_edu2M”: ratio of females to males with at least secondary education
“labF_labM”: ratio of females to males in the labour force
“exp_edu”: expected years of education
“exp_life”: life expectancy at birth
“GNI”: Gross National Income per capita
“mat_mor”: maternal mortality ratio
“ado_birth”: adolescent birth rate
“F_parli”: percentage of female representatives in parliament

knitr::kable(summary(human)) %>% kable_styling(bootstrap_options = "striped", position = "center", font_size = 12)
edu2F_edu2M labF_labM exp_edu exp_life GNI mat_mor ado_birth F_parli
Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00 Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
Median :0.9375 Median :0.7535 Median :13.50 Median :74.20 Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65 Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50 Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50

5.1.3 Graphical overview of the data

(Step 1b)

ov_human <- ggpairs(human, mapping = aes(), title ="Overview of 'human' data", lower = list(combo = wrap("facethist", bins = 20)), upper = list(continuous = wrap("cor", size = 2.8)))
ov_human

The overview plot shows the distributions of all variables and their pairwise correlations. GNI, the maternal mortality ratio and the adolescent birth rate are right-skewed: most observations are concentrated at low values. The upper triangle shows the correlations between the variables.

5.1.4 Relationships between the variables

(Step 1c)

cor_human <- cor(human) %>% round(digits = 2)
cor_human %>% knitr::kable(caption = "Correlation between variables") %>% 
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
Correlation between variables
edu2F_edu2M labF_labM exp_edu exp_life GNI mat_mor ado_birth F_parli
edu2F_edu2M 1.00 0.01 0.59 0.58 0.43 -0.66 -0.53 0.08
labF_labM 0.01 1.00 0.05 -0.14 -0.02 0.24 0.12 0.25
exp_edu 0.59 0.05 1.00 0.79 0.62 -0.74 -0.70 0.21
exp_life 0.58 -0.14 0.79 1.00 0.63 -0.86 -0.73 0.17
GNI 0.43 -0.02 0.62 0.63 1.00 -0.50 -0.56 0.09
mat_mor -0.66 0.24 -0.74 -0.86 -0.50 1.00 0.76 -0.09
ado_birth -0.53 0.12 -0.70 -0.73 -0.56 0.76 1.00 -0.07
F_parli 0.08 0.25 0.21 0.17 0.09 -0.09 -0.07 1.00
p.mat <- cor.mtest(human)$p # p-values are computed from the data, not from the correlation matrix
corrplot(cor_human, method="circle",  type="lower",  tl.cex = 0.6, p.mat = p.mat, sig.level = 0.01, tl.srt = 45, title="Correlations of 'human' data", mar=c(0,0,1,0))

Expected years of education correlates strongly positively with life expectancy and strongly negatively with the maternal mortality ratio. This suggests that better education goes together with higher life expectancy and lower maternal mortality.

Another strong positive correlation was seen between the adolescent birth rate and the maternal mortality ratio; both may reflect poor access to medical care.

5.2 Principal component analysis (PCA) on non-standardized data

In PCA, the data are transformed to a new space with an equal or smaller number of dimensions, called the principal components. The first principal component captures the maximum amount of variance in the original features. The second principal component is orthogonal to the first and captures the maximum amount of the remaining variance.
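The idea can be demonstrated on a small toy data frame (hypothetical data, not the ‘human’ data): two nearly identical variables share most of the variance, so PC1 should capture it.

```r
# Toy demonstration with hypothetical data: two strongly correlated variables
# and one independent one; PC1 captures the shared variance
set.seed(1)
x <- rnorm(100)
toy <- data.frame(a = x, b = x + rnorm(100, sd = 0.1), c = rnorm(100))
pca_toy <- prcomp(toy, scale. = TRUE)
summary(pca_toy)$importance[2, ]  # proportion of variance per component
```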

5.2.1 Variability captured by the principal components

(Step 2a)

pca_human <- prcomp(human)
round(pca_human$rotation[,1:2], digits = 4) %>% knitr::kable(caption = "PCA result on all variables") %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
PCA result on all variables
PC1 PC2
edu2F_edu2M 0.0000 0.0007
labF_labM 0.0000 -0.0003
exp_edu -0.0001 0.0076
exp_life -0.0003 0.0283
GNI -1.0000 -0.0058
mat_mor 0.0057 -0.9916
ado_birth 0.0012 -0.1256
F_parli -0.0001 0.0032
s1 <- summary(pca_human)
pca_prc1 <- round(100 * s1$importance[2, ], digits = 3) 
round(100 * s1$importance[2, 1:2], digits = 5) %>% knitr::kable(caption = "Principal components") %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
Principal components
x
PC1 99.99
PC2 0.01

5.2.2 PCA Biplot displaying the observations by the first two principal components

(Step 2b)

The biplot displayed the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables.

pca_label1 <-  paste0("GNI (", pca_prc1[1],"%)" )
pca_label2 <-  paste0("Maternal mortality (", pca_prc1[2],"%)" )
biplot(pca_human, choices = 1:2, cex = c(0.55, 0.9), col = c("grey40", "darkgreen"), xlab = pca_label1, ylab = pca_label2, main = "PCA biplot of non-standardized human data", margin =c(0,0,5,0)) 

autoplot(pca_human, data = human, label= TRUE, label.size = 3.0, colour = "darkgreen", loadings = TRUE, loadings.label = TRUE, loadings.colour = "red") + ggtitle("PCA biplot of non-standardized human data 2") + xlab(paste0(pca_label1))  + ylab(paste0(pca_label2)) + theme_bw()

Principal component 1 represents 99.99% of the variance of the non-standardized data and is dominated by GNI; PC2 (dominated by the maternal mortality ratio) captures the remaining 0.01%. In other words, without standardization the Gross National Income per capita alone explains almost all of the variance, because its scale is much larger than that of the other variables.

5.3 Principal component analysis (PCA) on standardized data

5.3.1 Scaling the data

(Step 3a)

The summary of the standardized data is as follows (the mean of every variable is 0 and the standard deviation is 1).
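What scale() does can be verified directly: it subtracts the column mean and divides by the column standard deviation. A quick check on a hypothetical toy vector:

```r
# scale() is equivalent to (x - mean(x)) / sd(x), column by column
x <- c(2, 4, 6, 8)
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))  # TRUE
```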

stzd_human <- scale(human)
knitr::kable(summary(stzd_human)) %>% 
  kable_styling(bootstrap_options = "striped", position = "center", font_size = 11)
edu2F_edu2M labF_labM exp_edu exp_life GNI mat_mor ado_birth F_parli
Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188 Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056 Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218 Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850

5.3.2 Variability captured by the principal components

(Step 3b)

pca_stzd_human <- prcomp(stzd_human)
round(pca_stzd_human$rotation[, 1:2], digits = 4) %>% knitr::kable(caption = "PC1 & 2 of standardized 'human' data") %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
PC1 & 2 of standardized ‘human’ data
PC1 PC2
edu2F_edu2M -0.3566 0.0380
labF_labM 0.0546 0.7243
exp_edu -0.4277 0.1394
exp_life -0.4437 -0.0253
GNI -0.3505 0.0506
mat_mor 0.4370 0.1451
ado_birth 0.4113 0.0771
F_parli -0.0844 0.6514
s2 <- summary(pca_stzd_human)
pca_prc2 <- round(100 * s2$importance[2, ], digits = 1) 

pca_label2.1 <- paste0("Education and health (",pca_prc2[1],"%) ")
pca_label2.2 <- paste0("Female social participation (",pca_prc2[2],"%) ")

round(100 * s2$importance[2, 1:2], digits = 2) %>% knitr::kable(caption = "Principal components") %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
Principal components
x
PC1 53.61
PC2 16.24

5.3.3 PCA Biplot

(Step 3c)

biplot(pca_stzd_human, choices = 1:2, cex = c(0.5, 0.9), col = c("grey40", "deeppink2"), xlab = pca_label2.1, ylab = pca_label2.2, main = "PCA biplot of standardized 'human' data", margin =c(0,0,5,0))

autoplot(prcomp(stzd_human), data = human, colour = "darkgreen", label= TRUE, label.size = 3.0, loadings = TRUE, loadings.label = TRUE, loadings.colour = "red") + ggtitle("PCA biplot of standardized 'human' data 2") + xlab(paste0(pca_label2.1))  + ylab(paste0(pca_label2.2)) + theme_bw()

5.3.4 Interpretations of the first two principal component dimensions

(Step 4)

The principal component representing 16.2% of the data variation was strongly associated with the percentage of female parliament representatives (“F_parli”) and the female/male labour force ratio (“labF_labM”). So the share of women in the labour market and in political representation was mostly associated with principal component 2. The variables showing the strongest association with principal component 1, which represented 53.6% of the data variation, were life expectancy (“exp_life”), expected years of education (“exp_edu”), the female/male ratio of secondary education (“edu2F_edu2M”), adolescent birth rate (“ado_birth”) and maternal mortality ratio (“mat_mor”). Hence principal component 1 described the health and education situation in these countries.

The PCA results on non-standardized and standardized data differed clearly. In the PCA of the non-standardized human data, principal component 1 represents almost all of the variation and is dominated by Gross National Income per capita. The PCA of the standardized values looks quite different: the observations spread over the whole biplot and more variables are associated with the first two principal components. Before performing PCA it is necessary to standardize the variables so that their values become comparable; otherwise the variable with the largest scale dominates. (Before doing this exercise, I did not know why I needed to standardize first! Now I understand.)

The arrows represent the original features of the data, and the angle between two arrows can be interpreted as the correlation between those features: a small angle means a positive correlation. Similarly, the angle between a feature arrow and a principal component axis reflects the correlation between the feature and that component. The length of an arrow is proportional to the standard deviation of the feature.
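The angle-correlation link can be checked numerically (hypothetical random data): for standardized data, the full-dimensional arrow coordinates rotation %*% diag(sdev) reproduce the correlation matrix exactly, and a 2-D biplot draws only the first two columns of this, hence the approximation.

```r
# Inner products of the full arrow coordinates equal the correlations exactly;
# a 2-D biplot keeps only the first two columns, so angles approximate them
set.seed(2)
Z <- scale(matrix(rnorm(300), ncol = 3))
p <- prcomp(Z)
arrows_full <- p$rotation %*% diag(p$sdev)
max(abs(arrows_full %*% t(arrows_full) - cor(Z)))  # practically zero
```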

5.4 Multiple Correspondence Analysis (MCA)

5.4.1 Structure and dimensions of ‘tea’ data

(Step 5a)

The ‘tea’ data on tea consumption behaviour were loaded from the FactoMineR package. The data consist of 300 observations of 36 variables.

library(FactoMineR)
data(tea)
dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

5.4.2 Visualization of ‘tea’ data

(Step 5b)

Error in the following chunk:

Warning: attributes are not identical across measure variables;

they will be dropped

WHY? The error occurred when knitting index.rmd; everything was fine when I knitted chapter5.rmd alone.

I had to comment the code out with #### so that all the following code would display without running.

# column names to keep in the dataset
#### keep_tea <- c("Tea", "How", "how", "price", "where", "lunch")
#### keep_tea

# select and create a new dataset
#### tea_time <- select(tea, one_of(keep_tea))

#### gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar(fill = "darkblue") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

5.4.3 Multiple Correspondence Analysis and its summary

(Step 5c)

#### mca_tea_time <- MCA(tea_time, graph = FALSE)
#### summary(mca_tea_time)

Visualize MCA

#### plot(mca_tea_time, invisible=c("ind"), habillage = "quali")

5.4.4 MCA biplot of tea consumption behaviour

(Step 5d)

The individuals were coloured by their squared cosine (cos2), a value between 0 and 1: the closer to 1, the better the individual is projected onto the displayed dimensions.

#### fviz_mca_biplot(mca_tea_time, col.ind = "cos2", col.var = "darkred", label = "var", geom =  c("point","text"), labelsize = 4, arrows = c(FALSE,TRUE)) + labs(title = "MCA of tea consumption behaviour") + theme_grey() + theme(axis.line = element_line(size = 0.5), panel.background = element_rect(fill = "gray93"))

5.4.5 MCA of reason for drinking tea

(Step 5e)

“sex”: Gender of tea consumers “age_Q”: Age group of tea consumers “spirituality”: Spiritual reason “healthy”: Health reason “slimming”: To lose weight or not “relaxing”: Relaxation or not

(Same error in this chunk)

#### keep_reason <- c("sex", "age_Q", "spirituality", "healthy", "slimming", "relaxing")
#### tea_reason <- select(tea, one_of(keep_reason))

#### gather(tea_reason) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar(fill = "darkblue", alpha = 0.6) + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

MCA of ‘tea_reason’ data

#### mca_tea_reason <- MCA(tea_reason, graph = FALSE)
#### summary(mca_tea_reason)

Call: MCA(X = tea_reason, graph = FALSE)

Eigenvalues

                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7   Dim.8   Dim.9
Variance               0.233   0.217   0.202   0.173   0.166   0.138   0.132   0.126   0.113
% of var.             15.562  14.476  13.456  11.502  11.066   9.187   8.833   8.393   7.523
Cumulative % of var.  15.562  30.039  43.495  54.997  66.063  75.251  84.083  92.477 100.000

Individuals (the 10 first)

        Dim.1    ctr   cos2     Dim.2    ctr   cos2     Dim.3    ctr   cos2
1  |    0.491  0.344  0.135 |   0.199  0.061  0.022 |   0.110  0.020  0.007 |
2  |    0.506  0.366  0.210 |   0.599  0.551  0.294 |  -0.564  0.526  0.261 |
3  |    0.039  0.002  0.001 |   0.353  0.191  0.119 |  -0.380  0.239  0.138 |
4  |   -0.602  0.518  0.306 |  -0.531  0.432  0.237 |   0.178  0.052  0.027 |
5  |    0.208  0.062  0.024 |  -0.446  0.305  0.112 |  -0.116  0.022  0.008 |
6  |   -0.270  0.104  0.081 |  -0.288  0.128  0.093 |   0.198  0.065  0.044 |
7  |    0.024  0.001  0.000 |  -0.047  0.003  0.001 |   0.294  0.143  0.054 |
8  |   -0.193  0.053  0.025 |   0.146  0.033  0.014 |  -0.211  0.074  0.030 |
9  |    0.308  0.135  0.049 |  -0.410  0.259  0.088 |   0.073  0.009  0.003 |
10 |    0.024  0.001  0.000 |  -0.047  0.003  0.001 |   0.294  0.143  0.054 |

Categories (the 10 first)

                      Dim.1     ctr   cos2  v.test    Dim.2     ctr   cos2  v.test    Dim.3     ctr   cos2   v.test
F                |   -0.256   2.767  0.095  -5.338 |  0.220   2.198  0.070   4.589 | -0.555  15.066  0.449  -11.582 |
M                |    0.373   4.037  0.095   5.338 | -0.321   3.208  0.070  -4.589 |  0.809  21.981  0.449   11.582 |
15-24            |   -0.834  15.237  0.308  -9.593 | -0.524   6.472  0.122  -6.030 | -0.522   6.893  0.120   -6.000 |
25-34            |    0.681   7.619  0.139   6.437 | -0.744   9.760  0.165  -7.027 |  0.983  18.367  0.289    9.294 |
35-44            |    0.018   0.003  0.000   0.121 |  0.149   0.229  0.003   1.014 | -0.261   0.750  0.010   -1.770 |
45-59            |    0.692   6.948  0.122   6.043 |  0.728   8.276  0.135   6.361 | -0.716   8.605  0.131   -6.254 |
+60              |   -0.346   1.086  0.017  -2.282 |  1.293  16.262  0.243   8.517 |  0.901   8.498  0.118    5.936 |
Not.spirituality |    0.302   4.458  0.199   7.719 |  0.212   2.374  0.099   5.433 |  0.017   0.016  0.001    0.432 |
spirituality     |   -0.661   9.770  0.199  -7.719 | -0.465   5.203  0.099  -5.433 | -0.037   0.035  0.001   -0.432 |
healthy          |   -0.247   3.050  0.142  -6.525 |  0.304   4.981  0.216   8.043 |  0.179   1.853  0.075    4.730 |

Categorical variables (eta2)

               Dim.1  Dim.2  Dim.3
sex          | 0.095  0.070  0.449 |
age_Q        | 0.433  0.534  0.522 |
spirituality | 0.199  0.099  0.001 |
healthy      | 0.142  0.216  0.075 |
slimming     | 0.100  0.272  0.107 |
relaxing     | 0.431  0.111  0.058 |

MCA biplot of ‘tea_reason’ data

#### fviz_mca_biplot(mca_tea_reason, col.ind = "cos2", col.var = "darkred", alpha.var = "contrib", label = "var", geom =  c("point","text"), labelsize = 3.5, arrows = c(FALSE,TRUE)) + labs(title = "MCA of tea consumption reasons") + theme_grey() + theme(axis.line = element_line(size = 0.5), panel.background = element_rect(fill = "gray93"))

Comment on the output

(Step 5f)

Variable categories close to each other show higher similarity. The plot uses two colour schemes: the contribution of the variables to the dimensions (“contrib”) and the cos2 of the individuals (the closer to 1, the better the projection on the dimension).

People aged 15-24 were close to the spirituality and relaxation reasons. Men aged 25-34 tended not to drink tea for health reasons or for losing weight. People aged 45-59 tended not to drink tea for relaxation or spiritual reasons. People aged 60+ and women were associated with drinking tea for health reasons and for weight loss.

Note: PCA and MCA biplots help to visualize results much better than in SPSS. Writing code in R is a bit more abstract, but it is worth learning.


Chapter 6 Analysis of longitudinal data

(3.12.-10.12.2019)

6.1 Loading the wide and long data sets

Load and check the data sets

# load necessary packages
library(tidyr)
library(dplyr)
library(corrplot)
library(ggplot2)
library(GGally)
library(knitr)
library(kableExtra)
library(ggthemes)
library(stringr)
library(lme4)

Load the wide (bprs & rats) and long (bprs_l & rats_l) data sets

bprs <- read.table("G:/C-Open Data Science/0-191030/IODS-project-master/BPRS.txt", sep = " ", header = TRUE)
rats <- read.table("G:/C-Open Data Science/0-191030/IODS-project-master/rats.txt", header = TRUE)
bprs_l <- read.table("G:/C-Open Data Science/0-191030/IODS-project-master/BPRS_l.txt")
rats_l <- read.table("G:/C-Open Data Science/0-191030/IODS-project-master/rats_l.txt")
str(bprs_l) 
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BPRS     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
str(rats_l)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...

Convert to factors: treatment & subject in the bprs data; ID & Group in the rats data

bprs_l$treatment <- factor(bprs_l$treatment)
bprs_l$subject <- factor(bprs_l$subject)
rats_l$ID <- factor(rats_l$ID)
rats_l$Group <- factor(rats_l$Group)
str(bprs_l)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BPRS     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
str(rats_l)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...

The BPRS dataset

A total of 40 male subjects in 2 treatment groups were measured with the BPRS (Brief Psychiatric Rating Scale). One BPRS measurement was taken before treatment started (week 0), and weekly measurements followed for 8 weeks. The BPRS assesses 18 symptom constructs to evaluate whether patients have schizophrenia (Vehkalahti & Everitt, 2019). Each symptom construct is rated from 1 (not present) to 7 (extremely severe).

The long BPRS data ‘bprs_l’

The data consisted of 360 observations of 5 variables:
"treatment": the psychological treatment group of the male subjects (1 & 2)
"subject": the male individual's identification number
"weeks": the week of treatment and BPRS evaluation (as a factor)
"BPRS": the BPRS value
"week": the week number of the BPRS evaluation (as an integer)
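As a quick sanity check of the long data, the weekly group means can be summarised with dplyr (a sketch using the bprs_l object loaded above; the column names follow the str() output):

```r
library(dplyr)

# Mean and standard deviation of BPRS per treatment group and week
# (a sketch; assumes the bprs_l data loaded above)
bprs_l %>%
  group_by(treatment, week) %>%
  summarise(mean_BPRS = mean(BPRS), sd_BPRS = sd(BPRS)) %>%
  ungroup()
```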

knitr::kable(bprs_l, caption = "The long BPRS data 'bprs_l'") %>% 
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center") %>% 
  scroll_box(height = "200px")
The long BPRS data ‘bprs_l’
treatment subject weeks BPRS week
1 1 week0 42 0
1 2 week0 58 0
1 3 week0 54 0
1 4 week0 55 0
1 5 week0 72 0
1 6 week0 48 0
1 7 week0 71 0
1 8 week0 30 0
1 9 week0 41 0
1 10 week0 57 0
1 11 week0 30 0
1 12 week0 55 0
1 13 week0 36 0
1 14 week0 38 0
1 15 week0 66 0
1 16 week0 41 0
1 17 week0 45 0
1 18 week0 39 0
1 19 week0 24 0
1 20 week0 38 0
2 1 week0 52 0
2 2 week0 30 0
2 3 week0 65 0
2 4 week0 37 0
2 5 week0 59 0
2 6 week0 30 0
2 7 week0 69 0
2 8 week0 62 0
2 9 week0 38 0
2 10 week0 65 0
2 11 week0 78 0
2 12 week0 38 0
2 13 week0 63 0
2 14 week0 40 0
2 15 week0 40 0
2 16 week0 54 0
2 17 week0 33 0
2 18 week0 28 0
2 19 week0 52 0
2 20 week0 47 0
1 1 week1 36 1
1 2 week1 68 1
1 3 week1 55 1
1 4 week1 77 1
1 5 week1 75 1
1 6 week1 43 1
1 7 week1 61 1
1 8 week1 36 1
1 9 week1 43 1
1 10 week1 51 1
1 11 week1 34 1
1 12 week1 52 1
1 13 week1 32 1
1 14 week1 35 1
1 15 week1 68 1
1 16 week1 35 1
1 17 week1 38 1
1 18 week1 35 1
1 19 week1 28 1
1 20 week1 34 1
2 1 week1 73 1
2 2 week1 23 1
2 3 week1 31 1
2 4 week1 31 1
2 5 week1 67 1
2 6 week1 33 1
2 7 week1 52 1
2 8 week1 54 1
2 9 week1 40 1
2 10 week1 44 1
2 11 week1 95 1
2 12 week1 41 1
2 13 week1 65 1
2 14 week1 37 1
2 15 week1 36 1
2 16 week1 45 1
2 17 week1 41 1
2 18 week1 30 1
2 19 week1 43 1
2 20 week1 36 1
1 1 week2 36 2
1 2 week2 61 2
1 3 week2 41 2
1 4 week2 49 2
1 5 week2 72 2
1 6 week2 41 2
1 7 week2 47 2
1 8 week2 38 2
1 9 week2 39 2
1 10 week2 51 2
1 11 week2 34 2
1 12 week2 49 2
1 13 week2 36 2
1 14 week2 36 2
1 15 week2 65 2
1 16 week2 45 2
1 17 week2 46 2
1 18 week2 27 2
1 19 week2 31 2
1 20 week2 27 2
2 1 week2 42 2
2 2 week2 32 2
2 3 week2 33 2
2 4 week2 27 2
2 5 week2 58 2
2 6 week2 37 2
2 7 week2 41 2
2 8 week2 49 2
2 9 week2 38 2
2 10 week2 31 2
2 11 week2 75 2
2 12 week2 36 2
2 13 week2 60 2
2 14 week2 31 2
2 15 week2 55 2
2 16 week2 35 2
2 17 week2 30 2
2 18 week2 29 2
2 19 week2 26 2
2 20 week2 32 2
1 1 week3 43 3
1 2 week3 55 3
1 3 week3 38 3
1 4 week3 54 3
1 5 week3 65 3
1 6 week3 38 3
1 7 week3 30 3
1 8 week3 38 3
1 9 week3 35 3
1 10 week3 55 3
1 11 week3 41 3
1 12 week3 54 3
1 13 week3 31 3
1 14 week3 34 3
1 15 week3 49 3
1 16 week3 42 3
1 17 week3 38 3
1 18 week3 25 3
1 19 week3 28 3
1 20 week3 25 3
2 1 week3 41 3
2 2 week3 24 3
2 3 week3 28 3
2 4 week3 31 3
2 5 week3 61 3
2 6 week3 33 3
2 7 week3 33 3
2 8 week3 39 3
2 9 week3 27 3
2 10 week3 34 3
2 11 week3 76 3
2 12 week3 27 3
2 13 week3 53 3
2 14 week3 38 3
2 15 week3 55 3
2 16 week3 27 3
2 17 week3 32 3
2 18 week3 33 3
2 19 week3 27 3
2 20 week3 29 3
1 1 week4 41 4
1 2 week4 43 4
1 3 week4 43 4
1 4 week4 56 4
1 5 week4 50 4
1 6 week4 36 4
1 7 week4 27 4
1 8 week4 31 4
1 9 week4 28 4
1 10 week4 53 4
1 11 week4 36 4
1 12 week4 48 4
1 13 week4 25 4
1 14 week4 25 4
1 15 week4 36 4
1 16 week4 31 4
1 17 week4 40 4
1 18 week4 29 4
1 19 week4 29 4
1 20 week4 25 4
2 1 week4 39 4
2 2 week4 20 4
2 3 week4 22 4
2 4 week4 31 4
2 5 week4 49 4
2 6 week4 28 4
2 7 week4 34 4
2 8 week4 55 4
2 9 week4 31 4
2 10 week4 39 4
2 11 week4 66 4
2 12 week4 29 4
2 13 week4 52 4
2 14 week4 35 4
2 15 week4 42 4
2 16 week4 25 4
2 17 week4 46 4
2 18 week4 30 4
2 19 week4 24 4
2 20 week4 25 4
1 1 week5 40 5
1 2 week5 34 5
1 3 week5 28 5
1 4 week5 50 5
1 5 week5 39 5
1 6 week5 29 5
1 7 week5 40 5
1 8 week5 26 5
1 9 week5 22 5
1 10 week5 43 5
1 11 week5 36 5
1 12 week5 43 5
1 13 week5 25 5
1 14 week5 27 5
1 15 week5 32 5
1 16 week5 31 5
1 17 week5 33 5
1 18 week5 28 5
1 19 week5 21 5
1 20 week5 27 5
2 1 week5 38 5
2 2 week5 20 5
2 3 week5 25 5
2 4 week5 26 5
2 5 week5 38 5
2 6 week5 26 5
2 7 week5 37 5
2 8 week5 51 5
2 9 week5 24 5
2 10 week5 34 5
2 11 week5 64 5
2 12 week5 27 5
2 13 week5 32 5
2 14 week5 30 5
2 15 week5 30 5
2 16 week5 22 5
2 17 week5 43 5
2 18 week5 26 5
2 19 week5 32 5
2 20 week5 23 5
1 1 week6 38 6
1 2 week6 28 6
1 3 week6 29 6
1 4 week6 47 6
1 5 week6 32 6
1 6 week6 33 6
1 7 week6 30 6
1 8 week6 26 6
1 9 week6 20 6
1 10 week6 43 6
1 11 week6 38 6
1 12 week6 37 6
1 13 week6 21 6
1 14 week6 25 6
1 15 week6 27 6
1 16 week6 29 6
1 17 week6 27 6
1 18 week6 21 6
1 19 week6 22 6
1 20 week6 21 6
2 1 week6 43 6
2 2 week6 19 6
2 3 week6 24 6
2 4 week6 24 6
2 5 week6 37 6
2 6 week6 27 6
2 7 week6 37 6
2 8 week6 55 6
2 9 week6 22 6
2 10 week6 41 6
2 11 week6 64 6
2 12 week6 21 6
2 13 week6 37 6
2 14 week6 33 6
2 15 week6 26 6
2 16 week6 22 6
2 17 week6 43 6
2 18 week6 36 6
2 19 week6 21 6
2 20 week6 23 6
1 1 week7 47 7
1 2 week7 28 7
1 3 week7 25 7
1 4 week7 42 7
1 5 week7 38 7
1 6 week7 27 7
1 7 week7 31 7
1 8 week7 25 7
1 9 week7 23 7
1 10 week7 39 7
1 11 week7 36 7
1 12 week7 36 7
1 13 week7 19 7
1 14 week7 26 7
1 15 week7 30 7
1 16 week7 26 7
1 17 week7 31 7
1 18 week7 25 7
1 19 week7 23 7
1 20 week7 19 7
2 1 week7 62 7
2 2 week7 18 7
2 3 week7 31 7
2 4 week7 26 7
2 5 week7 36 7
2 6 week7 23 7
2 7 week7 38 7
2 8 week7 59 7
2 9 week7 21 7
2 10 week7 42 7
2 11 week7 60 7
2 12 week7 22 7
2 13 week7 52 7
2 14 week7 30 7
2 15 week7 30 7
2 16 week7 22 7
2 17 week7 43 7
2 18 week7 33 7
2 19 week7 21 7
2 20 week7 23 7
1 1 week8 51 8
1 2 week8 28 8
1 3 week8 24 8
1 4 week8 46 8
1 5 week8 32 8
1 6 week8 25 8
1 7 week8 31 8
1 8 week8 24 8
1 9 week8 21 8
1 10 week8 32 8
1 11 week8 36 8
1 12 week8 31 8
1 13 week8 22 8
1 14 week8 26 8
1 15 week8 37 8
1 16 week8 30 8
1 17 week8 27 8
1 18 week8 20 8
1 19 week8 22 8
1 20 week8 21 8
2 1 week8 50 8
2 2 week8 20 8
2 3 week8 32 8
2 4 week8 23 8
2 5 week8 35 8
2 6 week8 21 8
2 7 week8 35 8
2 8 week8 66 8
2 9 week8 21 8
2 10 week8 39 8
2 11 week8 75 8
2 12 week8 23 8
2 13 week8 28 8
2 14 week8 27 8
2 15 week8 37 8
2 16 week8 22 8
2 17 week8 43 8
2 18 week8 30 8
2 19 week8 21 8
2 20 week8 23 8

The RATS dataset

The data consisted of measurements of rat growth. Three groups of rats received different diets, and the body weight of every animal was measured over a period of 9 weeks. The research question is whether the growth profiles of the different diet groups differ (Vehkalahti & Everitt, 2019).

The long RATS data 'rats_l'

The data consisted of 176 observations of 5 variables:

"ID": the individual rat's identification number
"Group": the diet group (1, 2 & 3)
"WD": the factor form of the measurement day
"Weight": the weight of the individual rat on the given day
"Time": the time point of the weight measurement (days)
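The group sizes and starting weights can be checked with a short dplyr summary (a sketch using the rats_l object loaded above):

```r
library(dplyr)

# Number of rats and mean starting weight per diet group
# (a sketch; assumes the rats_l data loaded above)
rats_l %>%
  filter(Time == 1) %>%
  group_by(Group) %>%
  summarise(n = n(), start_weight = mean(Weight))
```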

# rats_ data set
knitr::kable(rats_l, caption = "The long RATS data 'rats_l'") %>% 
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center") %>% 
  scroll_box(height = "200px")
The long RATS data ‘rats_l’
ID Group WD Weight Time
1 1 WD1 240 1
2 1 WD1 225 1
3 1 WD1 245 1
4 1 WD1 260 1
5 1 WD1 255 1
6 1 WD1 260 1
7 1 WD1 275 1
8 1 WD1 245 1
9 2 WD1 410 1
10 2 WD1 405 1
11 2 WD1 445 1
12 2 WD1 555 1
13 3 WD1 470 1
14 3 WD1 535 1
15 3 WD1 520 1
16 3 WD1 510 1
1 1 WD8 250 8
2 1 WD8 230 8
3 1 WD8 250 8
4 1 WD8 255 8
5 1 WD8 260 8
6 1 WD8 265 8
7 1 WD8 275 8
8 1 WD8 255 8
9 2 WD8 415 8
10 2 WD8 420 8
11 2 WD8 445 8
12 2 WD8 560 8
13 3 WD8 465 8
14 3 WD8 525 8
15 3 WD8 525 8
16 3 WD8 510 8
1 1 WD15 255 15
2 1 WD15 230 15
3 1 WD15 250 15
4 1 WD15 255 15
5 1 WD15 255 15
6 1 WD15 270 15
7 1 WD15 260 15
8 1 WD15 260 15
9 2 WD15 425 15
10 2 WD15 430 15
11 2 WD15 450 15
12 2 WD15 565 15
13 3 WD15 475 15
14 3 WD15 530 15
15 3 WD15 530 15
16 3 WD15 520 15
1 1 WD22 260 22
2 1 WD22 232 22
3 1 WD22 255 22
4 1 WD22 265 22
5 1 WD22 270 22
6 1 WD22 275 22
7 1 WD22 270 22
8 1 WD22 268 22
9 2 WD22 428 22
10 2 WD22 440 22
11 2 WD22 452 22
12 2 WD22 580 22
13 3 WD22 485 22
14 3 WD22 533 22
15 3 WD22 540 22
16 3 WD22 515 22
1 1 WD29 262 29
2 1 WD29 240 29
3 1 WD29 262 29
4 1 WD29 265 29
5 1 WD29 270 29
6 1 WD29 275 29
7 1 WD29 273 29
8 1 WD29 270 29
9 2 WD29 438 29
10 2 WD29 448 29
11 2 WD29 455 29
12 2 WD29 590 29
13 3 WD29 487 29
14 3 WD29 535 29
15 3 WD29 543 29
16 3 WD29 530 29
1 1 WD36 258 36
2 1 WD36 240 36
3 1 WD36 265 36
4 1 WD36 268 36
5 1 WD36 273 36
6 1 WD36 277 36
7 1 WD36 274 36
8 1 WD36 265 36
9 2 WD36 443 36
10 2 WD36 460 36
11 2 WD36 455 36
12 2 WD36 597 36
13 3 WD36 493 36
14 3 WD36 540 36
15 3 WD36 546 36
16 3 WD36 538 36
1 1 WD43 266 43
2 1 WD43 243 43
3 1 WD43 267 43
4 1 WD43 270 43
5 1 WD43 274 43
6 1 WD43 278 43
7 1 WD43 276 43
8 1 WD43 265 43
9 2 WD43 442 43
10 2 WD43 458 43
11 2 WD43 451 43
12 2 WD43 595 43
13 3 WD43 493 43
14 3 WD43 525 43
15 3 WD43 538 43
16 3 WD43 535 43
1 1 WD44 266 44
2 1 WD44 244 44
3 1 WD44 267 44
4 1 WD44 272 44
5 1 WD44 273 44
6 1 WD44 278 44
7 1 WD44 271 44
8 1 WD44 267 44
9 2 WD44 446 44
10 2 WD44 464 44
11 2 WD44 450 44
12 2 WD44 595 44
13 3 WD44 504 44
14 3 WD44 530 44
15 3 WD44 544 44
16 3 WD44 542 44
1 1 WD50 265 50
2 1 WD50 238 50
3 1 WD50 264 50
4 1 WD50 274 50
5 1 WD50 276 50
6 1 WD50 284 50
7 1 WD50 282 50
8 1 WD50 273 50
9 2 WD50 456 50
10 2 WD50 475 50
11 2 WD50 462 50
12 2 WD50 612 50
13 3 WD50 507 50
14 3 WD50 543 50
15 3 WD50 553 50
16 3 WD50 550 50
1 1 WD57 272 57
2 1 WD57 247 57
3 1 WD57 268 57
4 1 WD57 273 57
5 1 WD57 278 57
6 1 WD57 279 57
7 1 WD57 281 57
8 1 WD57 274 57
9 2 WD57 468 57
10 2 WD57 484 57
11 2 WD57 466 57
12 2 WD57 618 57
13 3 WD57 518 57
14 3 WD57 544 57
15 3 WD57 555 57
16 3 WD57 553 57
1 1 WD64 278 64
2 1 WD64 245 64
3 1 WD64 269 64
4 1 WD64 275 64
5 1 WD64 280 64
6 1 WD64 281 64
7 1 WD64 284 64
8 1 WD64 278 64
9 2 WD64 478 64
10 2 WD64 496 64
11 2 WD64 472 64
12 2 WD64 628 64
13 3 WD64 525 64
14 3 WD64 559 64
15 3 WD64 548 64
16 3 WD64 569 64

Part 1 Analyses of chapter 8 of MABS using the RATS data (rats_l)

6.2-6.5

6.2 Standardizing the dataset and exploring the data graphically

6.2.1 Exploring the data graphically

ggplot(rats_l, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line(color = "darkblue") +
  scale_linetype_manual(values = rep(1:6, times = 3)) + 
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "right") +
  scale_y_continuous(limits = c(min(rats_l$Weight), max(rats_l$Weight))) + 
  ggtitle("The long RATS data 'rats_l'")

The starting weights of the rats in groups 2 & 3 were higher than in group 1, and their weights also increased at a fairly high rate. The weights in group 1 increased at a smaller rate.

6.2.2 Standardizing the dataset

Calculate StdWeight and add it as a new variable to 'rats_l'

rats_l <- rats_l %>%
  group_by(Time) %>%
  mutate(StdWeight = ((Weight - mean(Weight))/sd(Weight))) %>%
  ungroup()
str(rats_l)
## Classes 'tbl_df', 'tbl' and 'data.frame':    176 obs. of  6 variables:
##  $ ID       : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group    : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD       : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight   : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ StdWeight: num  -1.001 -1.12 -0.961 -0.842 -0.882 ...

6.2.3 Plotting the standardised data

ggplot(rats_l, aes(x = Time, y = StdWeight, linetype = ID)) +
  geom_line(color = "darkred") +
  scale_linetype_manual(values = rep(1:6, times=3)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "StdWeight: standardized Weight values") +
  ggtitle("Standardized 'rats_l' Weight values") + 
  theme(panel.grid.major.y = element_line(colour = "darkred"))

6.3 Profiles

6.3.1 Adding categorical variable

Add the Time as a categorical variable “Time1”

rats_l <- rats_l %>%
  mutate(Time1 = factor(Time)) %>%
  ungroup()
str(rats_l)
str(rats_l)
## Classes 'tbl_df', 'tbl' and 'data.frame':    176 obs. of  7 variables:
##  $ ID       : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group    : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD       : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight   : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ StdWeight: num  -1.001 -1.12 -0.961 -0.842 -0.882 ...
##  $ Time1    : Factor w/ 11 levels "1","8","15","22",..: 1 1 1 1 1 1 1 1 1 1 ...

6.3.2 Boxplot

Boxplots give an overview of the rats' weights over the whole measurement period, with the diet groups in different colours.

ggplot(data = rats_l, aes(x = Time1, y = Weight, fill = Group)) +
  geom_boxplot() +
  ylab("Weight") + 
  xlab("Measurement time [days]") +
  ggtitle("Rat weights") +
  scale_fill_discrete(name = "Diet group") +
  theme(legend.position = "right") 

Calculate the mean weight and its standard error for each diet group at each time point, and store them in a summary data table

n_rats <- rats_l$Time %>% unique() %>% length()
rats_s <- rats_l %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = (sd(Weight)/sqrt(n_rats))) %>%
  ungroup()
str(rats_s)
## Classes 'tbl_df', 'tbl' and 'data.frame':    33 obs. of  4 variables:
##  $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Time : int  1 8 15 22 29 36 43 44 50 57 ...
##  $ mean : num  251 255 254 262 265 ...
##  $ se   : num  4.59 3.95 3.46 4.1 3.33 ...

6.3.3 Plotting the mean profiles

ggplot(rats_s, aes(x = Time, y = mean, linetype = Group,  color = Group, shape = Group)) +
  geom_line(size = 0.6) +
  scale_linetype_manual(name = "Diet group", values = c(1,2,3)) +
  geom_point(size=1.5) +
  scale_shape_manual(name = "Diet group", values = c(16,17,18)) +
  scale_color_manual(name = "Diet group", values = c("darkblue", "darkred", "darkgreen")) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.8) +
  theme(legend.position = "right") +
  scale_y_continuous(name = "Mean (Weight) +/- SE (Weight)") + 
  ggtitle("Mean weight profiles of different diet groups")

This graph showed the weight development of the different diet groups during the measurement period. Group 1 started at a lower level and its weight increased only slightly. Groups 2 and 3 started from higher values and increased more steeply. The standard errors in groups 2 & 3 were larger, indicating that the weight differences between individual rats were bigger within those diet groups.

6.4 Finding the outlier

6.4.1 Creating a summary variable: the mean weight of each rat

Create a summary data by Group and ID with mean as the summary variable

rats_l10s <- rats_l %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise(mean=mean(Weight)) %>%
  ungroup()
str(rats_l10s)
## Classes 'tbl_df', 'tbl' and 'data.frame':    16 obs. of  3 variables:
##  $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ ID   : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ mean : num  263 239 262 267 271 ...

6.4.2 Boxplot of mean weights of different diet groups excluding the first measurement

ggplot(rats_l10s, aes(x = Group, y = mean, fill = Group)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape = 23, size = 3, fill = "black") +
  scale_y_continuous(name = "Mean (Weight) / Time 8-64 [days]") + 
  xlab("Diet group") + 
  scale_fill_discrete(name = "Diet group") +
  ggtitle("Mean weights of diet groups excluding the first measurement")

The boxplots revealed outliers, which could distort further comparisons between the diet groups.
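The same outliers can also be located programmatically; a sketch using boxplot.stats() on the summary data rats_l10s created above:

```r
library(dplyr)

# Flag, per diet group, the mean weights that fall outside the boxplot
# whiskers (a sketch; assumes the summary data rats_l10s from above)
rats_l10s %>%
  group_by(Group) %>%
  mutate(outlier = mean %in% boxplot.stats(mean)$out) %>%
  filter(outlier)
```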

6.4.3 Removing outlier

rats_l10s_o <- rats_l10s %>%
    filter(
      (mean > 250 & Group == 1) | 
      (mean < 550 & Group == 2) |
      (mean > 500 & Group == 3))  %>% 
      ungroup()
str(rats_l10s_o)
## Classes 'tbl_df', 'tbl' and 'data.frame':    13 obs. of  3 variables:
##  $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 2 2 2 ...
##  $ ID   : Factor w/ 16 levels "1","2","3","4",..: 1 3 4 5 6 7 8 9 10 11 ...
##  $ mean : num  263 262 267 271 276 ...

6.4.4 Boxplot of the mean values without the outliers

ggplot(rats_l10s_o, aes(x = Group, y = mean, fill = Group)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape = 23, size = 3, fill = "black") +
  scale_y_continuous(name = "Mean (Weight) / Time 8-64 [days]") +
  xlab("Diet group") + 
  scale_fill_discrete(name = "Diet group") +
  ggtitle("Mean weights of diet groups without the outlier")

This time there were no outliers in the boxplots of the mean weights of the diet groups over days 8 to 64.

6.5 Linear regression model and ANOVA

Note: I have used ANOVA more often than the t-test, so I skipped the t-test here.

Create a new data table rats_l10s_b, adding the first weight measurement on day 1 (WD1) to the summary data as the baseline variable.

rats_l10s_b <- rats_l10s %>%
  mutate(baseline = rats$WD1)
rats_l10s_b %>% knitr::kable() %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center") %>% scroll_box(height = "200px")
Group ID mean baseline
1 1 263.2 240
1 2 238.9 225
1 3 261.7 245
1 4 267.2 260
1 5 270.9 255
1 6 276.2 260
1 7 274.6 275
1 8 267.5 245
2 9 443.9 410
2 10 457.5 405
2 11 455.8 445
2 12 594.0 555
3 13 495.2 470
3 14 536.4 535
3 15 542.2 520
3 16 536.2 510

To find which diet group had the highest weight increase during the measurement period, ANOVA tests were run on this new data 'rats_l10s_b' using the mean weights and the baseline weights.

6.5.1 Linear regression model 1 (mean weight ~ diet groups)

rats_lrm1 <- lm(mean ~ Group, data = rats_l10s_b)
round(rats_lrm1$coefficients, digits = 2) %>% knitr::kable() %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
x
(Intercept) 265.02
Group2 222.77
Group3 262.48

The model showed that the mean weight of group 2 was about 223 g higher and that of group 3 about 262 g higher than that of group 1.

To find out whether the diet groups differed from each other, the group means were examined with a post-hoc test after the ANOVA.

6.5.2 ANOVA of Linear regression model 1 (mean weight ~ diet groups)

rats_anova1 <- anova(rats_lrm1)
rats_anova1
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Group      2 238620  119310  88.525 2.679e-08 ***
## Residuals 13  17521    1348                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA of the linear model (mean weight ~ diet group) showed that the diet groups differed significantly in mean weight (p < 0.001). It does not by itself tell which groups differ from each other.

6.5.3 ANOVA with a post-hoc test

Perform an ANOVA and a post-hoc test (TukeyHSD) to find out which diet group means differed from each other. The dataset rats_l10s_b excluded the first measurement (the start value) but still included the outlier values.

rats_anova2 <- aov(mean ~ Group, data = rats_l10s_b)
summary(rats_anova2)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Group        2 238620  119310   88.53 2.68e-08 ***
## Residuals   13  17521    1348                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(rats_anova2, "Group", ordered = TRUE, conf.level = 0.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##     factor levels have been ordered
## 
## Fit: aov(formula = mean ~ Group, data = rats_l10s_b)
## 
## $Group
##        diff       lwr      upr     p adj
## 2-1 222.775 163.41452 282.1355 0.0000006
## 3-1 262.475 203.11452 321.8355 0.0000001
## 3-2  39.700 -28.84358 108.2436 0.3099506

The post-hoc test gave the pairwise differences between the diet groups. Groups 1 & 2 differed significantly in mean weight, as did groups 1 & 3. Groups 2 & 3 did not differ significantly in mean weight, a result influenced by the outliers.
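To see how much the outliers drive the group 2 vs 3 comparison, the same test can be repeated on the outlier-free summary data rats_l10s_o created in section 6.4.3 (a sketch):

```r
# Repeat the ANOVA and the Tukey post-hoc test on the data without
# the outliers (a sketch; assumes rats_l10s_o from section 6.4.3)
rats_anova_o <- aov(mean ~ Group, data = rats_l10s_o)
TukeyHSD(rats_anova_o, "Group", conf.level = 0.95)
```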

6.5.4 Linear regression model 2 (mean ~ baseline + Group)

The difference of each group to the baselines:

The linear model with the mean as the response

rats_lrm2 <- lm(mean ~ baseline + Group, data = rats_l10s_b)
round(rats_lrm2$coefficients, digits = 2) %>% knitr::kable() %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center")
x
(Intercept) 33.16
baseline 0.93
Group2 34.86
Group3 23.68

This regression model compared the mean weights while adjusting for the baseline (the first weight measurement of each rat). After adjusting for the baseline, the mean of group 2 was about 35 g higher and that of group 3 about 24 g higher than that of group 1. Group 2 therefore had the highest weight increase during the measurement period.

6.5.5 ANOVA of Linear regression model 2 (mean ~ baseline + Group)

rats_anova3 <- anova(rats_lrm2)
rats_anova3
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The baseline had a highly significant effect on the mean weight. After adjusting for the baseline, the difference between the diet groups was no longer significant at the 5% level (p ≈ 0.076), though the coefficients suggest that diet 2 may lead to the highest weight increase.

Part 2 Analyses of chapter 9 of MABS using the BPRS data (bprs_l)

6.6-6.8

(10.12.2019)

Overview plot

str(bprs_l)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BPRS     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
pairs(bprs_l, col = bprs_l$subject)

Plotting the bprs_l

ggplot(bprs_l, aes(x = week, y = BPRS, shape = subject, group = treatment)) +
  geom_point(color = "darkblue") + 
  scale_shape_manual(values = rep(1:10, times = 2)) +
  scale_x_continuous(name = "Weeks", breaks = seq(0,8,1)) + 
  scale_y_continuous(name = "BPRS value") + 
  theme(legend.position = "bottom") + 
  ggtitle("Overview of BPRS data") +
  theme(legend.box.background = element_rect(),legend.box.margin = margin(2, 2, 2, 2))

ggplot(bprs_l, aes(x = week, y = BPRS, linetype = subject)) +
  geom_line(color = "darkblue") +
  scale_linetype_manual(values = rep(1:6, times = 4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name = "Weeks", breaks = seq(0,8,1)) + 
  scale_y_continuous(name = "BPRS values observed", breaks = seq(10,100,5)) + 
  theme(legend.position = "bottom") +
  ggtitle("Overview of BPRS data by treatment") +
  theme(legend.box.background = element_rect(),legend.box.margin = margin(2, 2, 2, 2))

ggplot(bprs_l, aes(x = week, y = BPRS)) +
  geom_line(aes(linetype = treatment), color = "darkblue") +
  scale_linetype_manual(values = rep(1:2, times = 1)) +
  facet_grid(.~ subject) +
  scale_x_continuous(name = "Weeks", breaks = seq(0,8,4)) + 
  scale_y_continuous(name = "BPRS values observed", breaks = seq(10,100,5)) + 
  theme(legend.position = "right") +
  ggtitle("Overview of BPRS data by subject") +
  theme(legend.box.background = element_rect(),legend.box.margin = margin(2, 2, 2, 2))

There were 40 subjects in 2 treatment groups (20 in each group). After the first BPRS determination (week 0), the treatment started and 8 more BPRS determinations followed.

Linear model 1 (BPRS ~ week + treatment)

In the first linear regression model, the BPRS value was the dependent variable, and week and treatment were the independent (explanatory) variables.

bprs_l_lrm1 <- lm(BPRS ~ week + treatment, data = bprs_l)
summary(bprs_l_lrm1)
## 
## Call:
## lm(formula = BPRS ~ week + treatment, data = bprs_l)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

The measurement week had a significant effect on the BPRS values: each additional week decreased the BPRS value by about 2.27 units. The coefficient of treatment 2 was positive (0.57, i.e. slightly higher BPRS values than under treatment 1) but far from significant (p = 0.661).
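Confidence intervals for the coefficients make the (non-)significance of treatment easier to read (a sketch using the model object fitted above):

```r
# 95% confidence intervals for the coefficients of the naive linear
# model (a sketch; assumes bprs_l_lrm1 from above)
round(confint(bprs_l_lrm1), 2)
```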

To explore whether treatments helped patients and whether the subjects (patients) affected the model, the subjects were added as a random effect term to a random intercept model.

6.6 Random intercept model (BPRS ~ week + treatment + (1 | subject))

In the random intercept model, treatment and the measurement week were the explanatory variables, and the subjects (the 40 men who underwent the treatment and measurements) formed the random-effects term. This model lets the fitted regression line of each subject differ in its intercept.

bprs_l_rim <- lmer(BPRS ~ week + treatment + (1 | subject), data = bprs_l, REML = FALSE)
summary(bprs_l_rim)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: BPRS ~ week + treatment + (1 | subject)
##    Data: bprs_l
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

According to the summary, the standard error of the measurement week in the random intercept model (0.2084) was smaller than in the linear regression model (0.2524). Similarly, the standard error of treatment 2 in the random intercept model (1.0761) was smaller than in the linear regression model (1.3034). This reflects that the within-subject dependence of the observations is now taken into account.

Generally, a mixed model accounts for the correlation between repeated observations on the same subject, so it can be more appropriate than an ordinary linear regression model. Note that lmer() does not report p-values for the fixed effects; their significance can be judged approximately from the t-values.
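One rough option is to treat the t-values as approximately standard normal, which is reasonable here given the large number of residual degrees of freedom (an approximation, not part of the course material):

```r
# Approximate two-sided p-values from the fixed-effect t-values,
# treating them as standard normal (a rough sketch)
t_vals <- c(week = -10.896, treatment2 = 0.532)  # from summary(bprs_l_rim)
round(2 * pnorm(-abs(t_vals)), 4)
```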

6.7 Random Intercept and Random Slope Model (BPRS ~ week + treatment + (week | subject))

A random intercept and random slope model was then fitted to the BPRS values. It allows the fitted regression line of each patient ('subject') to differ both in intercept and in slope over the weeks, so between-subject differences in the BPRS trajectories are taken into account. In this model, (week | subject) specifies a random intercept and a random slope for week within each subject.

In effect, a separate regression line (its own intercept and slope) is estimated for each subject.

bprs_l_rim1 <- lmer(BPRS ~ week + treatment + (week | subject), data = bprs_l, REML = FALSE)
summary(bprs_l_rim1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: BPRS ~ week + treatment + (week | subject)
##    Data: bprs_l
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7977 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9803   -0.51
##  Residual             97.4304  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

The fixed effects of this model differed little from those of the model above. An ANOVA (likelihood-ratio test) was run to compare the two models.

anova(bprs_l_rim1, bprs_l_rim)
## Data: bprs_l
## Models:
## bprs_l_rim: BPRS ~ week + treatment + (1 | subject)
## bprs_l_rim1: BPRS ~ week + treatment + (week | subject)
##             Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## bprs_l_rim   5 2748.7 2768.1 -1369.4   2738.7                           
## bprs_l_rim1  7 2745.4 2772.6 -1365.7   2731.4 7.2721      2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA showed that the random intercept and random slope model (allowing the slope over weeks to vary by subject) gave a significantly better fit to the data (χ² = 7.27, p = 0.026).
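The subject-specific lines implied by the random intercept and random slope model can also be inspected directly. A minimal sketch, assuming `lme4` is loaded and the fitted object `bprs_l_rim1` from above is available:

```r
# coef() combines the fixed effects with each subject's random deviations,
# giving the subject-specific intercepts and week slopes implied by the model.
coef(bprs_l_rim1)$subject   # one row per subject: (Intercept), week, treatment2

# ranef() returns the random deviations alone, centered near zero.
ranef(bprs_l_rim1)$subject
```

The spread of the per-subject `week` slopes shows why allowing the slope to vary improved the fit.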

6.8 Random intercept and random slope model with interaction (BPRS ~ week * treatment + (week | subject))

In the following random intercept and random slope model, an interaction of treatment and time (week) was included to examine whether the effect of treatment on the BPRS values changed over the treatment period.

bprs_l_rim2 <- lmer(BPRS ~ week * treatment + (week | subject), data = bprs_l, REML = FALSE)
summary(bprs_l_rim2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: BPRS ~ week * treatment + (week | subject)
##    Data: bprs_l
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
anova(bprs_l_rim2, bprs_l_rim1)
## Data: bprs_l
## Models:
## bprs_l_rim1: BPRS ~ week + treatment + (week | subject)
## bprs_l_rim2: BPRS ~ week * treatment + (week | subject)
##             Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## bprs_l_rim1  7 2745.4 2772.6 -1365.7   2731.4                           
## bprs_l_rim2  8 2744.3 2775.4 -1364.1   2728.3 3.1712      1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA shows that adding the interaction ‘week * treatment’ improved the fit only marginally (p = 0.075): significant at the 0.1 level but not at the conventional 0.05 level, so the evidence for an interaction between treatment and time is weak.
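The ANOVA's chi-square statistic is simply the drop in deviance between the nested models. A sketch reproducing the p-value from the (rounded) deviances printed above:

```r
# deviance(simpler model) - deviance(richer model)
chisq <- 2731.4 - 2728.3   # = 3.1 (the ANOVA reports 3.1712 before rounding)

# One extra parameter (the interaction term), so df = 1
pchisq(chisq, df = 1, lower.tail = FALSE)
# approximately 0.078, matching the ANOVA's 0.075 up to rounding
```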

Check the original BPRS data again in a graph

ggplot(bprs_l, aes(x = week, y = BPRS, linetype = subject)) +
  geom_line(color = "darkblue") +
  scale_linetype_manual(values = rep(1:6, times = 4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name = "Weeks", breaks = seq(0,8,1)) + 
  scale_y_continuous(name = "BPRS values observed", breaks = seq(10,100,5)) + 
  theme(legend.position = "bottom") + 
  ggtitle("BPRS: original values by treatment") +
  theme(legend.box.background = element_rect(),legend.box.margin = margin(2, 2, 2, 2))

Add the fitted BPRS values produced by the random intercept and random slope model with the interaction ‘week * treatment’ to the dataset

bprs_l_fit <- fitted(bprs_l_rim2)
bprs_l$fitted_bprs <- round(bprs_l_fit, digits = 2)
bprs_l %>% knitr::kable() %>% kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center") %>% scroll_box(height = "300px")
treatment subject weeks BPRS week fitted_bprs
1 1 week0 42 0 49.24
1 2 week0 58 0 46.97
1 3 week0 54 0 47.66
1 4 week0 55 0 49.85
1 5 week0 72 0 66.39
1 6 week0 48 0 42.59
1 7 week0 71 0 54.58
1 8 week0 30 0 47.93
1 9 week0 41 0 42.08
1 10 week0 57 0 53.35
1 11 week0 30 0 59.97
1 12 week0 55 0 48.42
1 13 week0 36 0 50.36
1 14 week0 38 0 40.93
1 15 week0 66 0 55.07
1 16 week0 41 0 44.43
1 17 week0 45 0 43.93
1 18 week0 39 0 36.82
1 19 week0 24 0 38.14
1 20 week0 38 0 39.01
2 1 week0 52 0 46.95
2 2 week0 30 0 44.68
2 3 week0 65 0 45.36
2 4 week0 37 0 47.56
2 5 week0 59 0 64.10
2 6 week0 30 0 40.30
2 7 week0 69 0 52.29
2 8 week0 62 0 45.64
2 9 week0 38 0 39.79
2 10 week0 65 0 51.05
2 11 week0 78 0 57.68
2 12 week0 38 0 46.13
2 13 week0 63 0 48.07
2 14 week0 40 0 38.64
2 15 week0 40 0 52.78
2 16 week0 54 0 42.14
2 17 week0 33 0 41.63
2 18 week0 28 0 34.53
2 19 week0 52 0 35.85
2 20 week0 47 0 36.71
1 1 week1 36 1 47.97
1 2 week1 68 1 43.72
1 3 week1 55 1 44.47
1 4 week1 77 1 47.41
1 5 week1 75 1 62.11
1 6 week1 43 1 40.04
1 7 week1 61 1 51.11
1 8 week1 36 1 46.36
1 9 week1 43 1 39.11
1 10 week1 51 1 50.79
1 11 week1 34 1 58.01
1 12 week1 52 1 45.55
1 13 week1 32 1 47.26
1 14 week1 35 1 38.73
1 15 week1 68 1 51.69
1 16 week1 35 1 41.50
1 17 week1 38 1 42.30
1 18 week1 35 1 35.03
1 19 week1 28 1 35.66
1 20 week1 34 1 36.35
2 1 week1 73 1 46.39
2 2 week1 23 1 42.14
2 3 week1 31 1 42.89
2 4 week1 31 1 45.83
2 5 week1 67 1 60.53
2 6 week1 33 1 38.47
2 7 week1 52 1 49.53
2 8 week1 54 1 44.79
2 9 week1 40 1 37.53
2 10 week1 44 1 49.21
2 11 week1 95 1 56.44
2 12 week1 41 1 43.97
2 13 week1 65 1 45.68
2 14 week1 37 1 37.16
2 15 week1 36 1 50.11
2 16 week1 45 1 39.92
2 17 week1 41 1 40.72
2 18 week1 30 1 33.45
2 19 week1 43 1 34.09
2 20 week1 36 1 34.78
1 1 week2 36 2 46.69
1 2 week2 61 2 40.46
1 3 week2 41 2 41.28
1 4 week2 49 2 44.96
1 5 week2 72 2 57.82
1 6 week2 41 2 37.49
1 7 week2 47 2 47.64
1 8 week2 38 2 44.79
1 9 week2 39 2 36.14
1 10 week2 51 2 48.23
1 11 week2 34 2 56.06
1 12 week2 49 2 42.68
1 13 week2 36 2 44.15
1 14 week2 36 2 36.53
1 15 week2 65 2 48.30
1 16 week2 45 2 38.56
1 17 week2 46 2 40.67
1 18 week2 27 2 33.24
1 19 week2 31 2 33.19
1 20 week2 27 2 33.70
2 1 week2 42 2 45.83
2 2 week2 32 2 39.60
2 3 week2 33 2 40.42
2 4 week2 27 2 44.10
2 5 week2 58 2 56.96
2 6 week2 37 2 36.63
2 7 week2 41 2 46.78
2 8 week2 49 2 43.93
2 9 week2 38 2 35.28
2 10 week2 31 2 47.37
2 11 week2 75 2 55.20
2 12 week2 36 2 41.82
2 13 week2 60 2 43.29
2 14 week2 31 2 35.67
2 15 week2 55 2 47.44
2 16 week2 35 2 37.70
2 17 week2 30 2 39.81
2 18 week2 29 2 32.38
2 19 week2 26 2 32.33
2 20 week2 32 2 32.84
1 1 week3 43 3 45.42
1 2 week3 55 3 37.20
1 3 week3 38 3 38.08
1 4 week3 54 3 42.52
1 5 week3 65 3 53.54
1 6 week3 38 3 34.94
1 7 week3 30 3 44.17
1 8 week3 38 3 43.22
1 9 week3 35 3 33.17
1 10 week3 55 3 45.68
1 11 week3 41 3 54.11
1 12 week3 54 3 39.81
1 13 week3 31 3 41.04
1 14 week3 34 3 34.33
1 15 week3 49 3 44.92
1 16 week3 42 3 35.63
1 17 week3 38 3 39.04
1 18 week3 25 3 31.46
1 19 week3 28 3 30.71
1 20 week3 25 3 31.04
2 1 week3 41 3 45.27
2 2 week3 24 3 37.06
2 3 week3 28 3 37.94
2 4 week3 31 3 42.37
2 5 week3 61 3 53.39
2 6 week3 33 3 34.79
2 7 week3 33 3 44.02
2 8 week3 39 3 43.08
2 9 week3 27 3 33.02
2 10 week3 34 3 45.53
2 11 week3 76 3 53.96
2 12 week3 27 3 39.66
2 13 week3 53 3 40.90
2 14 week3 38 3 34.18
2 15 week3 55 3 44.78
2 16 week3 27 3 35.48
2 17 week3 32 3 38.89
2 18 week3 33 3 31.31
2 19 week3 27 3 30.57
2 20 week3 29 3 30.90
1 1 week4 41 4 44.14
1 2 week4 43 4 33.95
1 3 week4 43 4 34.89
1 4 week4 56 4 40.07
1 5 week4 50 4 49.25
1 6 week4 36 4 32.38
1 7 week4 27 4 40.69
1 8 week4 31 4 41.65
1 9 week4 28 4 30.19
1 10 week4 53 4 43.12
1 11 week4 36 4 52.15
1 12 week4 48 4 36.94
1 13 week4 25 4 37.94
1 14 week4 25 4 32.13
1 15 week4 36 4 41.54
1 16 week4 31 4 32.69
1 17 week4 40 4 37.41
1 18 week4 29 4 29.67
1 19 week4 29 4 28.23
1 20 week4 25 4 28.39
2 1 week4 39 4 44.72
2 2 week4 20 4 34.52
2 3 week4 22 4 35.47
2 4 week4 31 4 40.65
2 5 week4 49 4 49.83
2 6 week4 28 4 32.95
2 7 week4 34 4 41.27
2 8 week4 55 4 42.23
2 9 week4 31 4 30.77
2 10 week4 39 4 43.70
2 11 week4 66 4 52.72
2 12 week4 29 4 37.51
2 13 week4 52 4 38.51
2 14 week4 35 4 32.70
2 15 week4 42 4 42.11
2 16 week4 25 4 33.26
2 17 week4 46 4 37.98
2 18 week4 30 4 30.24
2 19 week4 24 4 28.81
2 20 week4 25 4 28.96
1 1 week5 40 5 42.87
1 2 week5 34 5 30.69
1 3 week5 28 5 31.70
1 4 week5 50 5 37.63
1 5 week5 39 5 44.97
1 6 week5 29 5 29.83
1 7 week5 40 5 37.22
1 8 week5 26 5 40.08
1 9 week5 22 5 27.22
1 10 week5 43 5 40.57
1 11 week5 36 5 50.20
1 12 week5 43 5 34.06
1 13 week5 25 5 34.83
1 14 week5 27 5 29.92
1 15 week5 32 5 38.16
1 16 week5 31 5 29.76
1 17 week5 33 5 35.78
1 18 week5 28 5 27.88
1 19 week5 21 5 25.76
1 20 week5 27 5 25.73
2 1 week5 38 5 44.16
2 2 week5 20 5 31.98
2 3 week5 25 5 32.99
2 4 week5 26 5 38.92
2 5 week5 38 5 46.26
2 6 week5 26 5 31.12
2 7 week5 37 5 38.51
2 8 week5 51 5 41.37
2 9 week5 24 5 28.51
2 10 week5 34 5 41.86
2 11 week5 64 5 51.49
2 12 week5 27 5 35.35
2 13 week5 32 5 36.12
2 14 week5 30 5 31.21
2 15 week5 30 5 39.45
2 16 week5 22 5 31.04
2 17 week5 43 5 37.07
2 18 week5 26 5 29.17
2 19 week5 32 5 27.05
2 20 week5 23 5 27.02
1 1 week6 38 6 41.60
1 2 week6 28 6 27.43
1 3 week6 29 6 28.51
1 4 week6 47 6 35.18
1 5 week6 32 6 40.69
1 6 week6 33 6 27.28
1 7 week6 30 6 33.75
1 8 week6 26 6 38.51
1 9 week6 20 6 24.25
1 10 week6 43 6 38.01
1 11 week6 38 6 48.24
1 12 week6 37 6 31.19
1 13 week6 21 6 31.73
1 14 week6 25 6 27.72
1 15 week6 27 6 34.78
1 16 week6 29 6 26.82
1 17 week6 27 6 34.15
1 18 week6 21 6 26.09
1 19 week6 22 6 23.28
1 20 week6 21 6 23.08
2 1 week6 43 6 43.60
2 2 week6 19 6 29.44
2 3 week6 24 6 30.52
2 4 week6 24 6 37.19
2 5 week6 37 6 42.69
2 6 week6 27 6 29.28
2 7 week6 37 6 35.76
2 8 week6 55 6 40.52
2 9 week6 22 6 26.25
2 10 week6 41 6 40.02
2 11 week6 64 6 50.25
2 12 week6 21 6 33.20
2 13 week6 37 6 33.73
2 14 week6 33 6 29.73
2 15 week6 26 6 36.78
2 16 week6 22 6 28.82
2 17 week6 43 6 36.15
2 18 week6 36 6 28.10
2 19 week6 21 6 25.29
2 20 week6 23 6 25.08
1 1 week7 47 7 40.32
1 2 week7 28 7 24.18
1 3 week7 25 7 25.32
1 4 week7 42 7 32.74
1 5 week7 38 7 36.40
1 6 week7 27 7 24.72
1 7 week7 31 7 30.28
1 8 week7 25 7 36.94
1 9 week7 23 7 21.28
1 10 week7 39 7 35.46
1 11 week7 36 7 46.29
1 12 week7 36 7 28.32
1 13 week7 19 7 28.62
1 14 week7 26 7 25.52
1 15 week7 30 7 31.40
1 16 week7 26 7 23.88
1 17 week7 31 7 32.52
1 18 week7 25 7 24.31
1 19 week7 23 7 20.81
1 20 week7 19 7 20.43
2 1 week7 62 7 43.04
2 2 week7 18 7 26.90
2 3 week7 31 7 28.04
2 4 week7 26 7 35.46
2 5 week7 36 7 39.12
2 6 week7 23 7 27.44
2 7 week7 38 7 33.00
2 8 week7 59 7 39.66
2 9 week7 21 7 24.00
2 10 week7 42 7 38.18
2 11 week7 60 7 49.01
2 12 week7 22 7 31.04
2 13 week7 52 7 31.34
2 14 week7 30 7 28.24
2 15 week7 30 7 34.12
2 16 week7 22 7 26.60
2 17 week7 43 7 35.24
2 18 week7 33 7 27.03
2 19 week7 21 7 23.53
2 20 week7 23 7 23.15
1 1 week8 51 8 39.05
1 2 week8 28 8 20.92
1 3 week8 24 8 22.13
1 4 week8 46 8 30.29
1 5 week8 32 8 32.12
1 6 week8 25 8 22.17
1 7 week8 31 8 26.81
1 8 week8 24 8 35.37
1 9 week8 21 8 18.31
1 10 week8 32 8 32.90
1 11 week8 36 8 44.33
1 12 week8 31 8 25.45
1 13 week8 22 8 25.52
1 14 week8 26 8 23.32
1 15 week8 37 8 28.02
1 16 week8 30 8 20.95
1 17 week8 27 8 30.89
1 18 week8 20 8 22.52
1 19 week8 22 8 18.33
1 20 week8 21 8 17.77
2 1 week8 50 8 42.48
2 2 week8 20 8 24.35
2 3 week8 32 8 25.57
2 4 week8 23 8 33.73
2 5 week8 35 8 35.55
2 6 week8 21 8 25.61
2 7 week8 35 8 30.25
2 8 week8 66 8 38.81
2 9 week8 21 8 21.74
2 10 week8 39 8 36.34
2 11 week8 75 8 47.77
2 12 week8 23 8 28.89
2 13 week8 28 8 28.95
2 14 week8 27 8 26.75
2 15 week8 37 8 31.45
2 16 week8 22 8 24.38
2 17 week8 43 8 34.33
2 18 week8 30 8 25.96
2 19 week8 21 8 21.77
2 20 week8 23 8 21.21

Draw the plot with the fitted BPRS values

ggplot(bprs_l, aes(x = week, y = fitted_bprs, linetype = subject)) +
  geom_line(color = "darkblue") +
  scale_linetype_manual(values = rep(1:6, times = 4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name = "Weeks", breaks = seq(0,8,1)) + 
  scale_y_continuous(name = "BPRS values modeled", breaks = seq(10,100,5)) + 
  theme(legend.position = "bottom") + 
  ggtitle("BPRS: fitted values by treatment") +
  theme(legend.box.background = element_rect(),legend.box.margin = margin(2, 2, 2, 2))
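The observed and fitted trajectories can also be overlaid in a single plot, which makes the smoothing effect of the mixed model visible per subject. A sketch, assuming `ggplot2` is loaded and `bprs_l` contains the `fitted_bprs` column added above:

```r
# Grey lines: observed BPRS; blue lines: model-fitted BPRS, one pair per subject
ggplot(bprs_l, aes(x = week, group = subject)) +
  geom_line(aes(y = BPRS), color = "grey60") +
  geom_line(aes(y = fitted_bprs), color = "darkblue") +
  facet_grid(. ~ treatment, labeller = label_both) +
  labs(x = "Weeks", y = "BPRS",
       title = "Observed (grey) vs fitted (blue) BPRS by treatment")
```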


Chapter 7 Final note: See you later, R

(10.12.2019)

I will use these analyses in my own research to get a better understanding of my results, as well as better tables and figures.

Very tired, very necessary.

:)