Fitness Tracker Modeling: ML
The original paper was written on 12/18/2020

Executive Summary
This report analyzes data collected from users performing barbell lifts at different levels of quality. A machine learning algorithm was used to build a model that determines the user's rating from the readings of multiple accelerometers. More information on the project can be found here.

Analysis

Initialization
The following code was used to load the required R libraries, download the required data, and read it into memory. Some columns of the data were not required for modelling and were excluded (e.g., usernames).

library(caret)
library(gbm)
library(dplyr)
library(randomForest)
library(ggplot2)
set.seed(90210)
Ntree <- 200

download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "training.csv")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "testing.csv")
train <- read.csv2("training.csv", sep = ",")[,-c(1:7)]
test <- read.csv2("testing.csv", sep = ",")[,-c(1:7)]
Reducing predictors
The data contains far too many predictors (153 in total) to produce models that are both accurate and simple, so some trimming is required. The first trim is performed with the near zero variance function from the caret library, which finds the predictors that exhibit near zero variation; including these predictors in the models would add little benefit.
nz <- nearZeroVar(train)
train <- train[,-nz]
test <- test[,-nz]
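As an aside, nearZeroVar can also report the diagnostics behind each flag when run on the data before these columns are dropped. A minimal sketch (the nzMetrics name is illustrative):

nzMetrics <- nearZeroVar(train, saveMetrics = TRUE)  # frequency ratio and percent-unique per column
head(nzMetrics[nzMetrics$nzv, ])                     # the columns flagged as near zero variance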
This near zero variance step reduces the number of predictors to 94. Numerous NA values remain in the data; these are examined in the next chunk of code.
maxi <- length(train) - 1
valna <- 1:maxi
for (i in 1:maxi) {
    train[,i] <- as.numeric(train[,i])
    test[,i] <- as.numeric(test[,i])
    valna[i] <- mean(is.na(train[,i]))
}
table(valna)
valna
                0 0.979308938946081
               52                41
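As an aside, the same NA fractions could be computed without an explicit loop; an equivalent vectorised sketch (valnaAlt is an illustrative name, and unlike the loop it does not coerce the columns in place):

valnaAlt <- sapply(train[, 1:maxi], function(col) mean(is.na(as.numeric(col))))
table(valnaAlt)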
The output shows that 52 predictors have no missing data and 41 predictors are almost entirely (about 98%) missing values. The latter would add little value to the modelling and are removed with the following code:
train <- train[, valna == 0]
test <- test[, valna == 0]
The training data was then divided to create a validation set, which will be used for cross validation. Note that the random forest algorithm has built-in cross validation through the out-of-bag error: each tree is grown on a bootstrap sample, leaving out about 1/3 of the data on which that tree can be tested.
Valid <- createDataPartition(train$classe, p = 0.3)[[1]]
valid <- train[Valid,]
train <- train[-Valid,]
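For reference, that out-of-bag estimate can be read directly from a fitted randomForest object. A minimal sketch (rfOob is an illustrative name and 50 trees is an arbitrary small number):

rfOob <- randomForest(x = train[, -ncol(train)], y = as.factor(train$classe), ntree = 50)
tail(rfOob$err.rate[, "OOB"], 1)  # out-of-bag error estimate after all trees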
The next step is to use the variable importance function (varImp) from the caret library to reduce the number of predictors even further. The train data is still very large, but by drawing a sample set from the training data and modelling on that, we can get a reasonable approximation of the variables of importance.
strain <- rbind(sample_n(train[train$classe == "A",], round(mean(train$classe == "A")*200, 0)),
                sample_n(train[train$classe == "B",], round(mean(train$classe == "B")*200, 0)),
                sample_n(train[train$classe == "C",], round(mean(train$classe == "C")*200, 0)),
                sample_n(train[train$classe == "D",], round(mean(train$classe == "D")*200, 0)),
                sample_n(train[train$classe == "E",], round(mean(train$classe == "E")*200, 0)))
The sample was drawn to preserve the proportions of the 'classe' variable found in the training data. Two models were fitted and their variables of importance were combined.
mdl1 <- train(classe~., data = strain, method = "rf", ntree = Ntree)
mdl2 <- train(classe~., data = strain, method = "gbm", verbose = FALSE)
var <- varImp(mdl1)$importance > 50 | varImp(mdl2)$importance > 50
varorder <- order(varImp(mdl1)$importance, decreasing = TRUE)
Varimp <- row.names(varImp(mdl1)$importance)[varorder[1:2]]
A value of 50 was used as the importance cut-off, which reduced the total number of predictors to 4.
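The effect of the cut-off can be confirmed directly; a quick illustrative check:

sum(var)   # number of predictors that cleared the cut-off (4)
Varimp     # the two top-ranked predictors, used later for plotting

The data sets are then restricted to these predictors, and the training set is further reduced to a 75% random sample: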
valid <- valid[,var]
train <- train[,var] %>% slice_sample(prop = 0.75)
test <- test[,var]
Modelling
With the reduced predictors, the models can now be trained. Since these models will use the entire training data set, training will require a lot of time. The models include:
- Random forest
- Generalized Boosted
- Linear Discriminant
- Combined
The randomForest function is used directly, as it is more efficient than the caret train function. Supplying the data as x and y (the data method) is also more efficient than using the formula method, as illustrated after the model fits below.
mdl11 <- randomForest(x = train[,1:(ncol(train) - 1)], y = as.factor(train[,ncol(train)]), ntree = Ntree, proximity = TRUE)
mdl21 <- train(classe~., data = train, method = "gbm", verbose = FALSE)
mdl31 <- train(classe~., data = train, method = "lda")
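As a rough illustration of that efficiency claim, the two interfaces can be timed side by side. A sketch only (50 trees is an arbitrary small number, and timings vary by machine):

system.time(randomForest(as.factor(classe) ~ ., data = train, ntree = 50))  # formula method
system.time(randomForest(x = train[, 1:(ncol(train) - 1)],
                         y = as.factor(train[, ncol(train)]), ntree = 50))  # data method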
The following code constructs the combined model:
pmdl11 <- predict(mdl11, valid)
pmdl21 <- predict(mdl21, valid)
pmdl31 <- predict(mdl31, valid)
join <- data.frame(pmdl11, pmdl21, pmdl31, classe = valid$classe)
jmdl <- randomForest(x = join[,1:3], y = as.factor(join$classe), ntree = Ntree)
Model Evaluation
The new models need to be examined against the validation data set. The out-of-bag error for the random forest models was not used, as the validation data provides a uniform comparison across all models. The following function was used to test the models:
Exacc <- function(mdl, test){
    mean(predict(mdl, test) == test$classe)
}
The models' accuracies on the validation set are summarized in the following data frame:
  Model  accuracy
1 mdl11 0.8935303
2 mdl21 0.7989472
3 mdl31 0.3625403
4 joint 0.8897945
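For reference, a sketch of how this summary could be assembled with the Exacc function (the joint model is evaluated on the data frame of component predictions it was trained from):

data.frame(Model = c("mdl11", "mdl21", "mdl31", "joint"),
           accuracy = c(Exacc(mdl11, valid), Exacc(mdl21, valid),
                        Exacc(mdl31, valid), Exacc(jmdl, join)))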
Conclusion
The results of the model testing show that the random forest and the combined models are the most accurate on the validation set. The combined model has approximately the same accuracy as the random forest, indicating that the random forest is its most heavily weighted component and that the inclusion of the boosted and linear discriminant models does not add to its accuracy; this can be checked directly, as sketched below. For simplicity, the random forest model is therefore the best model.
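A minimal sketch of that check, using the combined forest's variable importance:

importance(jmdl)  # if the claim holds, pmdl11 should show by far the largest mean decrease in Gini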
Plot
The class centres of the random forest model can be found from the proximity data. The centres are plotted against the two most important predictors. Because the two most significant predictors can vary between runs, the code was written to be flexible about which they are.
index <- names(train) %in% Varimp
mdlp <- classCenter(train[index], train$classe, mdl11$proximity)
mdlp <- as.data.frame(mdlp)
mdlp$classe <- rownames(mdlp)
This centre data can be overlaid on the training data. There are distinctly different regions based on the different classe values, but the other predictors also contribute to model accuracy.
index <- names(train) %in% Varimp
names <- names(train[index])

f <- function(name1, name2){
    xval <- sym(name1)
    yval <- sym(name2)
    ggplot(data = train, aes_string(x = xval, y = yval, col = "classe")) +
        geom_point() +
        geom_point(aes_string(x = xval, y = yval, col = "classe"), size = 10, shape = 4, data = mdlp) +
        labs(title = "Model centers on two variables of importance")
}
f(names[1], names[2])
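Finally, the chosen random forest model can be applied to the trimmed test set to produce the final predictions; a minimal sketch:

predict(mdl11, test)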