Human Activity Recognition Machine Learning

Author: Brad Cable

Illinois State University

For Practical Machine Learning - Johns Hopkins University (Coursera)

Data Description

The data analyzed here was previously collected and studied by Wallace Ugulino, Eduardo Velloso, and Hugo Fuks; their paper and data are available here:

http://groupware.les.inf.puc-rio.br/har

The main thrust of their research, and the purpose of machine learning in their context, is the following:

This human activity recognition research has traditionally focused on discriminating between different activities, i.e. to predict “which” activity was performed at a specific point in time (like with the Daily Living Activities dataset above). The approach we propose for the Weight Lifting Exercises dataset is to investigate “how (well)” an activity was performed by the wearer. The “how (well)” investigation has only received little attention so far, even though it potentially provides useful information for a large variety of applications, such as sports training.

This report attempts to create a classifier that performs a similar analysis, classifying how well an individual performs an action rather than which action the individual is performing.

Preparation

Loading Libraries

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(rpart)
library(ggplot2)

Loading Data

training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")

Setting Seed for Reproducibility

set.seed(43121)

Data Cleaning and Variable Analysis

Proportion of Missing Values

This function calculates the proportion of values in a given column that are missing (NA or an empty string):

propMissing <- function(df, name){
    # Proportion of values in column `name` that are NA or empty strings
    vect <- df[[name]]
    sum(is.na(vect) | vect == "")/length(vect)
}
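As a quick sanity check on a toy data frame (illustrative, not the real dataset), the function counts both NA values and empty strings as missing:

```r
# Redefined here so the snippet runs standalone; identical in behavior
# to propMissing above.
propMissing <- function(df, name){
    vect <- df[[name]]
    sum(is.na(vect) | vect == "")/length(vect)
}

# Toy data frame: column a has two NAs, column b has one empty string
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c("x", "", "y", "z"),
                  stringsAsFactors = FALSE)
propMissing(toy, "a")  # 0.5
propMissing(toy, "b")  # 0.25
```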

Calculating the proportion of missing values for each field:

fieldsMissing <- sapply(names(training), function(name){ propMissing(training, name)})

Plotting the proportion of missing values shows that every variable falls into one of only two levels of missingness.

plot(fieldsMissing, ylab="Proportion Missing")

[Plot: proportion of missing values for each field]

Looking at the levels:

lvls <- levels(factor(fieldsMissing))
lvls
## [1] "0"                 "0.979308938946081"

Either a particular variable has one hundred percent of its values present, or it has a missing value proportion of 0.9793, which is quite high.

My hypothesis is that these fields are all missing in the same records, with the same values absent in each case, which I will test next.

Testing Hypothesis About Missing Values

Now to test the hypothesis that all the values that are missing are the same records.

First we need to subset the original training set and view only the fields that are theorized to be missing under the same circumstances:

whichFieldsMissing <- fieldsMissing
names(whichFieldsMissing) <- NULL
whichFieldsMissing <- whichFieldsMissing != 0
sketchyTrainingFields <- training[whichFieldsMissing]

Now we need to calculate per column which records are present.

simpleSketchyTrainingFields <- sapply(names(sketchyTrainingFields), function(name){
    is.na(sketchyTrainingFields[[name]]) |
    sketchyTrainingFields[[name]] == ""
})

And loop through to see if they are all equal or not:

allSame <- TRUE
for (i in 2:ncol(simpleSketchyTrainingFields)){
    if(sum((
        simpleSketchyTrainingFields[,1] !=
        simpleSketchyTrainingFields[,i]
    )*1) > 0){
        allSame <- FALSE
        break
    }
}
allSame
## [1] TRUE

The result confirms that all of these fields are missing in exactly the same records.
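The loop above can also be expressed as a one-line vectorized check; a sketch on a toy logical matrix standing in for simpleSketchyTrainingFields:

```r
# Toy logical matrix: three columns with identical missingness patterns
m <- cbind(c(TRUE, FALSE, TRUE),
           c(TRUE, FALSE, TRUE),
           c(TRUE, FALSE, TRUE))

# Compare every column against the first; matrix comparison recycles
# m[, 1] down each column, so this tests whether all columns match.
allSame <- all(m == m[, 1])
allSame  # TRUE
```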

Converting to Presence

With such a low quantity of actual data in these fields, a model would likely train only on the presence of the values, not the values themselves. It may therefore be useful to provide the presence of these values to aid training, but not the values themselves, as they could produce strange results.

To do this, we can take any one of the fields following this missing-value pattern and convert it into a single boolean field indicating the presence of data in those fields. This lets the model learn from presence without skewing toward values that occur too infrequently to train on.

First we need to remove those fields from the training and test sets.

cleanTraining <- training[!whichFieldsMissing]

Now we can add a new field that is the presence of these fields or not:

cleanTraining$presence <- !is.na(training$max_roll_belt)

Timestamps

Taking a look at the timestamps provided, there are some strange aspects to them. The fields in question are “raw_timestamp_part_1”, “raw_timestamp_part_2”, and “cvtd_timestamp”.

cvtd_timestamp

Looking at this variable, you can see that there is very little variability and that the rows can be grouped into a small number of timeframes.

lvls <- levels(training$cvtd_timestamp)
lvls
##  [1] "02/12/2011 13:32" "02/12/2011 13:33" "02/12/2011 13:34"
##  [4] "02/12/2011 13:35" "02/12/2011 14:56" "02/12/2011 14:57"
##  [7] "02/12/2011 14:58" "02/12/2011 14:59" "05/12/2011 11:23"
## [10] "05/12/2011 11:24" "05/12/2011 11:25" "05/12/2011 14:22"
## [13] "05/12/2011 14:23" "05/12/2011 14:24" "28/11/2011 14:13"
## [16] "28/11/2011 14:14" "28/11/2011 14:15" "30/11/2011 17:10"
## [19] "30/11/2011 17:11" "30/11/2011 17:12"

With only 20 levels, each detailed down to the minute, far more distinct values would be expected here.

raw_timestamp_part_1

lvls <- length(levels(factor(training$raw_timestamp_part_1)))
lvls
## [1] 837

This field has a bit more variability, with 837 levels, but the format appears to have a strange property: the first half of each value appears to be related in some way to the cvtd_timestamp value, while the second half appears to be a counter. It would be difficult to display and analyze all 837 values here, and reverse engineering this field is not worthwhile when the remaining fields have even stranger aspects.

raw_timestamp_part_2

lvls <- length(levels(factor(training$raw_timestamp_part_2)))
lvls
## [1] 16783

With 16783 levels, this field has too much variability to predict anything of significance. Moreover, if the value is microseconds, it can do little to predict how well the individual is performing the action, since the information would essentially be random noise.

What to do?

Because the time information is of either too low or too high a precision, all time data is stripped out of the model.

cleanTraining$cvtd_timestamp <- NULL
cleanTraining$raw_timestamp_part_1 <- NULL
cleanTraining$raw_timestamp_part_2 <- NULL

X

X appears to be a simple row counter, which does us no good since this isn't exactly timeseries data. Timestamps exist, but neither their order nor their values are particularly relevant, so we'll strip this variable out as well.

head(cleanTraining$X, 100)
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
cleanTraining$X <- NULL

Variable Importance of Exploratory Random Forest

By calculating variable importance from a random forest trained on a small subset of the data, we can gather more information about which variables are useful and which are not.

To start, we partition off 1% of the training data for a quick analysis:

quickPartition <- createDataPartition(cleanTraining$classe, p=0.01, list=FALSE)

Now we generate a random forest model on the smaller dataset:

subsetTest <- cleanTraining[quickPartition,]
subsetModelRF <- train(classe ~ ., data=subsetTest, method="rf")

Calculating variable importance, cleaning it up, and graphing the results:

importance_graph <- function(model){
    subsetImportance <- varImp(model)
    importanceGraphData <- subsetImportance$importance
    importanceGraphData$Name <- row.names(importanceGraphData)
    importanceGraphData <- importanceGraphData[order(-importanceGraphData[[1]]),]
    importanceGraphData$Name <- factor(importanceGraphData$Name, levels=importanceGraphData$Name)
    importanceGraphData <- importanceGraphData[importanceGraphData$Overall > 0,]

    g <- ggplot(data.frame(x=importanceGraphData$Overall, y=importanceGraphData$Name), aes(y=y, x=x))
    g + geom_point() + xlab("Overall Importance Factor") + ylab("Name")
}
importance_graph(subsetModelRF)

[Plot: variable importance from the exploratory random forest]

From this analysis we can see that the user, specifically who is exercising, has very little predictive value. The new_window variable likewise has very little predictive value, as does our calculated presence variable for the missing data. This should have been expected given the proportion of missing values to actual information, so the hypothesis that presence could improve a prediction model is false.

These variables, whose importance factors drop off sharply and fall below 1, can also be stripped out of the training data.

cleanTraining$new_window <- NULL
cleanTraining$user_name <- NULL
cleanTraining$presence <- NULL

CART Exploratory Model

While a random forest worked, it would be interesting to see how a CART model performs on this data. Using the same subset dataset, we can take a quick look before committing larger portions of the training data to the final models.

Keep in mind that since we are re-using the subsetTest variable, this model still includes the user_name, new_window, and presence variables, but this should not affect the overall exploratory conclusions.

subsetModelCART <- train(classe ~ ., data=subsetTest, method="rpart")

The CART model itself looks like the following, with noticeably fewer variables and only a handful of leaves. For CART-based classification trees this is actually a fairly good sign, as too many variables can produce large amounts of overfitting:

subsetModelCART$finalModel
## n= 199 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 199 143 A (0.28 0.19 0.18 0.17 0.19)  
##    2) roll_belt< 130.5 181 125 A (0.31 0.21 0.19 0.18 0.1)  
##      4) magnet_arm_x< 71 73  31 A (0.58 0.14 0.18 0.068 0.041) *
##      5) magnet_arm_x>=71 108  80 B (0.13 0.26 0.2 0.26 0.15)  
##       10) pitch_forearm< 24.9 74  51 B (0.15 0.31 0.3 0.11 0.14) *
##       11) pitch_forearm>=24.9 34  14 D (0.088 0.15 0 0.59 0.18) *
##    3) roll_belt>=130.5 18   0 E (0 0 0 0 1) *

An importance graph, as used above, of the same data:

importance_graph(subsetModelCART)

[Plot: variable importance from the exploratory CART model]

Creating Test Models

Structure of Final Clean Training Data

Taking a look at the final clean training data to verify the cleaning and to see exactly what we are using:

str(cleanTraining)
## 'data.frame':    19622 obs. of  54 variables:
##  $ num_window          : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt           : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt          : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt            : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ gyros_belt_x        : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y        : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z        : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x        : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y        : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z        : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x       : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y       : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z       : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm            : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm           : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm             : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm     : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ gyros_arm_x         : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y         : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z         : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x         : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y         : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z         : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x        : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y        : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z        : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ roll_dumbbell       : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell      : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell        : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ total_accel_dumbbell: int  37 37 37 37 37 37 37 37 37 37 ...
##  $ gyros_dumbbell_x    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ gyros_dumbbell_y    : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
##  $ gyros_dumbbell_z    : num  0 0 0 -0.02 0 0 0 0 0 0 ...
##  $ accel_dumbbell_x    : int  -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
##  $ accel_dumbbell_y    : int  47 47 46 48 48 48 47 46 47 48 ...
##  $ accel_dumbbell_z    : int  -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
##  $ magnet_dumbbell_x   : int  -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
##  $ magnet_dumbbell_y   : int  293 296 298 303 292 294 295 300 292 291 ...
##  $ magnet_dumbbell_z   : num  -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
##  $ roll_forearm        : num  28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
##  $ pitch_forearm       : num  -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
##  $ yaw_forearm         : num  -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
##  $ total_accel_forearm : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ gyros_forearm_x     : num  0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
##  $ gyros_forearm_y     : num  0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
##  $ gyros_forearm_z     : num  -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
##  $ accel_forearm_x     : int  192 192 196 189 189 193 195 193 193 190 ...
##  $ accel_forearm_y     : int  203 203 204 206 206 203 205 205 204 205 ...
##  $ accel_forearm_z     : int  -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
##  $ magnet_forearm_x    : int  -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
##  $ magnet_forearm_y    : num  654 661 658 658 655 660 659 660 653 656 ...
##  $ magnet_forearm_z    : num  476 473 469 469 473 478 470 474 476 473 ...
##  $ classe              : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

K-Folds

We will be using k-folds (6 folds) to carve training and validation sets out of the original training dataset, generating accuracy estimates before touching the final testing data provided.

finalModelFolds <- createFolds(cleanTraining$classe, k=6, list=FALSE)
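For intuition, createFolds(..., list=FALSE) returns one fold label per row, with folds roughly equal in size; a base-R sketch of that behavior (illustrative only — caret additionally stratifies on classe, which this sketch omits):

```r
# Base-R sketch of fold assignment: n rows shuffled into k equal folds.
set.seed(43121)
n <- 30
k <- 6
folds <- sample(rep(1:k, length.out = n))
table(folds)  # each of the 6 folds gets n/k = 5 rows
```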

Random Forest

Data

rfTrainData <- cleanTraining[finalModelFolds==1,]
rfPredictData <- cleanTraining[finalModelFolds==2,]

Building Model

rfModel <- train(classe ~ ., data=rfTrainData, method="rf")

Generating Predictions

rfPredictions <- predict(rfModel, newdata=rfPredictData)

Accuracy Checks

rfConfusion <- confusionMatrix(rfPredictions, reference=rfPredictData$classe)
rfConfusion
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 929   9   0   1   1
##          B   0 609  17   0   4
##          C   1  14 551  12   5
##          D   0   0   2 522   6
##          E   0   0   0   1 585
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9777         
##                  95% CI : (0.972, 0.9825)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9717         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9989   0.9636   0.9667   0.9739   0.9734
## Specificity            0.9953   0.9920   0.9881   0.9971   0.9996
## Pos Pred Value         0.9883   0.9667   0.9451   0.9849   0.9983
## Neg Pred Value         0.9996   0.9913   0.9929   0.9949   0.9940
## Prevalence             0.2845   0.1933   0.1744   0.1640   0.1838
## Detection Rate         0.2842   0.1863   0.1686   0.1597   0.1790
## Detection Prevalence   0.2875   0.1927   0.1783   0.1621   0.1793
## Balanced Accuracy      0.9971   0.9778   0.9774   0.9855   0.9865

CART

Data

cartTrainData <- cleanTraining[finalModelFolds==3,]
cartPredictData <- cleanTraining[finalModelFolds==4,]

Building Model

cartModel <- train(classe ~ ., data=cartTrainData, method="rpart")

Generating Predictions

cartPredictions <- predict(cartModel, newdata=cartPredictData)

Accuracy Checks

cartConfusion <- confusionMatrix(cartPredictions, reference=cartPredictData$classe)
cartConfusion
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 858 340 275 307  74
##          B   0 129  10  10   9
##          C  66 164 285 192 166
##          D   0   0   0   0   0
##          E   6   0   0  27 352
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4966          
##                  95% CI : (0.4794, 0.5139)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3387          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9226  0.20379  0.50000   0.0000   0.5857
## Specificity            0.5744  0.98900  0.78222   1.0000   0.9876
## Pos Pred Value         0.4628  0.81646  0.32646      NaN   0.9143
## Neg Pred Value         0.9492  0.83805  0.88110   0.8361   0.9137
## Prevalence             0.2844  0.19358  0.17431   0.1639   0.1838
## Detection Rate         0.2624  0.03945  0.08716   0.0000   0.1076
## Detection Prevalence   0.5670  0.04832  0.26697   0.0000   0.1177
## Balanced Accuracy      0.7485  0.59640  0.64111   0.5000   0.7867

Final Predictions and Analysis

The random forest is the clear choice: its accuracy of 0.9777 is much better than the CART model's 0.4966. Now it is time to predict on the testing data, which has not been touched until now.

Final Random Forest

Building Model

finalRFModel <- train(classe ~ ., data=cleanTraining, method="rf")

Generating Predictions

finalPredictions <- predict(finalRFModel, newdata=testing)
finalPredictions
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Coursera Submission

Predictions submitted to Coursera with 20/20 resulting in correct values.

Hardware and OS Information

Memory: 4GB RAM
Processor: Intel® Core™ i5-3317U CPU @ 1.70GHz
Processor Core Count: 4 Logical Threads
OS: Linux 4.0.8-200.fc21.x86_64
R version: R version 3.2.2 (2015-08-14)
R Packages Versions: caret 6.0.58
randomForest 4.6.12
rpart 4.1.10
ggplot2 1.0.1
knitr 1.11

Report Generation

Report was generated using R Markdown and knitr in 137.360304 minutes at Sun 22 Nov 2015 05:56:48 PM CST

Citation/Reference

The data used is available here: http://groupware.les.inf.puc-rio.br/har

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13) . Stuttgart, Germany: ACM SIGCHI, 2013.