Classification: Logistic Regression and Random Forest
In this tutorial-style note, I will demonstrate how to use the Logistic Regression and Random Forest algorithms to predict the sex of a penguin. The data penguins
comes from the palmerpenguins
package in R. It was collected by Dr. Kristen Gorman on three species of penguins at the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
The goal is to build a classifier that predicts the sex of a penguin given its physical characteristics.
The palmerpenguins package can be installed from CRAN.
# If you don't have palmerpenguins package, first install it.
# install.packages("palmerpenguins")
# Loading Libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)
# setting my personal theme
theme_set(theme_h())
# Loading data
data("penguins")
penguins
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
We see that there are missing values. Let’s count how many entries are missing.
sum(is.na(penguins))
## [1] 19
So, nineteen entries are missing. I will exclude those observations from the analysis, but before that, I want to explore the data as it is.
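To see which columns the missing values fall in, we can count the NAs per column; a quick sketch using across():

```r
# Count missing values in each column of penguins
penguins %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
```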
Exploration
One of the easiest ways to do this is with count()
from dplyr.
penguins %>%
count(species)
## # A tibble: 3 × 2
## species n
## <fct> <int>
## 1 Adelie 152
## 2 Chinstrap 68
## 3 Gentoo 124
penguins %>%
count(island)
## # A tibble: 3 × 2
## island n
## <fct> <int>
## 1 Biscoe 168
## 2 Dream 124
## 3 Torgersen 52
penguins %>%
count(species, island)
## # A tibble: 5 × 3
## species island n
## <fct> <fct> <int>
## 1 Adelie Biscoe 44
## 2 Adelie Dream 56
## 3 Adelie Torgersen 52
## 4 Chinstrap Dream 68
## 5 Gentoo Biscoe 124
penguins %>%
count(sex)
## # A tibble: 3 × 2
## sex n
## <fct> <int>
## 1 female 165
## 2 male 168
## 3 <NA> 11
penguins %>%
count(year)
## # A tibble: 3 × 2
## year n
## <int> <int>
## 1 2007 110
## 2 2008 114
## 3 2009 120
Cool. It looks pretty balanced.
penguins %>%
filter(!is.na(sex)) %>%
ggplot(aes(flipper_length_mm, bill_length_mm, colour = sex, size = body_mass_g)) +
geom_point(alpha = 0.6) +
facet_wrap(~species)
In general, there is a clear difference in bill_length_mm
between the sexes. Also, the bill length of Adelie penguins is shorter than that of the other two species, for both sexes.
There are also packages like DataExplorer that can aid in the exploration process. Since this is a simple case and I want to focus on building a classification model, I will skip further exploration here.
The dataset has missing values, as I noted earlier, so I will remove the observations containing them. I also do not need year and island for my classification model; there is no logical reason why they should affect the sex of a penguin.
penguins_df = penguins %>%
filter(!is.na(sex)) %>%
select(-year, -island)
Modelling
Let’s start by loading tidymodels
and setting the seed for randomisation.
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.4 ✔ workflows 1.1.3
## ✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.2.0
## ✔ recipes 1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
set.seed(1)
The first step of modelling is to create the training and testing split. The test set will never be exposed to us during the modelling process. Once our final model is built, we can evaluate it against the test set.
In tidymodels, this split is created using the initial_split()
function. I also have the option to stratify the split, which is worth doing because our data is not perfectly balanced; here I stratify by sex so that the male-to-female ratio is similar in both sets. Let’s explore the help file for initial_split()
too.
?initial_split
# Specify the split
penguins_split = initial_split(penguins_df, strata = sex)
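As an aside, the split proportion can be set explicitly with the prop argument (a sketch, not run here):

```r
# An 80/20 split, still stratified by sex
penguins_split_80 = initial_split(penguins_df, prop = 0.8, strata = sex)
```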
The proportion is set by default to 75% in training and 25% in testing. The functions training()
and testing()
will give me the resulting datasets.
penguins_train = training(penguins_split)
penguins_test = testing(penguins_split)
penguins_train
## # A tibble: 249 × 6
## species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie 39.5 17.4 186 3800 female
## 2 Adelie 40.3 18 195 3250 female
## 3 Adelie 36.7 19.3 193 3450 female
## 4 Adelie 36.6 17.8 185 3700 female
## 5 Adelie 38.7 19 195 3450 female
## 6 Adelie 35.9 19.2 189 3800 female
## 7 Adelie 37.9 18.6 172 3150 female
## 8 Adelie 39.5 16.7 178 3250 female
## 9 Adelie 39.5 17.8 188 3300 female
## 10 Adelie 42.2 18.5 180 3550 female
## # ℹ 239 more rows
penguins_test
## # A tibble: 84 × 6
## species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie 38.9 17.8 181 3625 female
## 2 Adelie 39.2 19.6 195 4675 male
## 3 Adelie 41.1 17.6 182 3200 female
## 4 Adelie 34.4 18.4 184 3325 female
## 5 Adelie 46 21.5 194 4200 male
## 6 Adelie 37.8 18.3 174 3400 female
## 7 Adelie 37.7 18.7 180 3600 male
## 8 Adelie 35.3 18.9 187 3800 female
## 9 Adelie 40.6 18.6 183 3550 male
## 10 Adelie 40.5 17.9 187 3200 female
## # ℹ 74 more rows
The datasets look good; our split worked well.
Bootstrapping Samples
Note that our training sample has only 249 rows. This is not a huge dataset, and we cannot be sure that a model built from it will generalise.
One simple way to address this is to use bootstrapped samples. Each bootstrap sample has the same size as the original sample but differs from the others, because observations are drawn with replacement. Again, see the help file to understand the function.
?bootstraps
penguins_boot = bootstraps(penguins_train)
penguins_boot
## # Bootstrap sampling
## # A tibble: 25 × 2
## splits id
## <list> <chr>
## 1 <split [249/85]> Bootstrap01
## 2 <split [249/93]> Bootstrap02
## 3 <split [249/85]> Bootstrap03
## 4 <split [249/93]> Bootstrap04
## 5 <split [249/84]> Bootstrap05
## 6 <split [249/87]> Bootstrap06
## 7 <split [249/88]> Bootstrap07
## 8 <split [249/92]> Bootstrap08
## 9 <split [249/93]> Bootstrap09
## 10 <split [249/86]> Bootstrap10
## # ℹ 15 more rows
So, we have 25 bootstrapped samples, each a different resample of the training data. Of course, if you have enough data, or are confident enough about your data, you can skip this step.
Logistic Regression Pipeline
Logistic regression is one of the simplest classification models. It is also the basic building block of neural networks; it dictates how a single node behaves. Before neural networks and support vector machines gained popularity around 2010, logistic regression was the dominant classification model.
Even today, the model is widely used in a variety of real-world applications. The biggest benefits of logistic regression are its interpretability and its simple linear form.
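To make the idea concrete: logistic regression models the log-odds of a class as a linear function of the predictors, and the inverse link is the logistic (sigmoid) function, which squeezes any real number into (0, 1). A minimal base-R sketch:

```r
# The logistic (inverse-logit) function maps the linear predictor to a probability
sigmoid = function(x) 1 / (1 + exp(-x))
sigmoid(0)    # 0.5: a linear predictor of zero means even odds
plogis(0)     # base R's built-in equivalent
```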
The first step is to set up the model specification. This only defines how the model will work; neither training nor testing has happened yet.
# Simple Logistic Regression
glm_spec = logistic_reg() %>%
set_engine("glm")
glm_spec
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
There are other alternatives too. We can use lasso regression (yes, the lasso can be used for classification as well; it “estimate[s] the parameters of the binomial GLM by optimising the binomial likelihood whilst imposing the lasso penalty on the parameter estimates”). Or we can use a regularised classification model more generally.
# regularised regression
glm_spec = logistic_reg() %>%
set_engine("glmnet")
# LASSO regression
glm_spec = logistic_reg(mixture = 1) %>%
set_engine("glmnet")
But for this tutorial, I will stick to the plain logistic regression model.
# Simple Logistic Regression
glm_spec = logistic_reg() %>%
set_engine("glm")
Random Forest Pipeline
Let’s set up a specification for the random forest model as well. The good part about random forest models is that they do not require the extensive tuning that neural networks do.
Random forest models can be used for classification as well as regression. Furthermore, there are many implementations (packages) in R to choose from. randomForest
is probably the best-known one; ranger
is a fast implementation of random forests in R. I will use ranger
for this model.
# Engine could be spark
rand_forest() %>%
set_mode("classification") %>%
set_engine("spark")
## Random Forest Model Specification (classification)
##
## Computational engine: spark
# Or it could be randomForest
rand_forest() %>%
set_mode("classification") %>%
set_engine("randomForest")
## Random Forest Model Specification (classification)
##
## Computational engine: randomForest
# Or ranger
rf_spec = rand_forest() %>%
set_mode("classification") %>%
set_engine("ranger")
rf_spec
## Random Forest Model Specification (classification)
##
## Computational engine: ranger
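As an aside, if you want variable importance scores later, ranger can record them at fit time via an engine argument; importance = "impurity" is one of ranger’s options. A sketch:

```r
# Optional: record impurity-based variable importance while fitting
rf_spec_vi = rand_forest() %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "impurity")
```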
Workflow
The next step in the modelling pipeline is setting up a workflow with the formula, model and data, in that order. Because I have multiple models that I want to compare, I will only set up the formula in my workflow.
penguin_wf = workflow() %>%
add_formula(sex ~ .)
penguin_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: None
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## sex ~ .
As we can see, no model has been set yet.
Training Logistic Regression
Let’s add logistic regression model. I can fit it directly to the training sample.
penguin_wf %>%
add_model(glm_spec) %>%
fit(data = penguins_train)
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## sex ~ .
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) speciesChinstrap speciesGentoo bill_length_mm
## -95.852333 -6.932255 -8.535185 0.633832
## bill_depth_mm flipper_length_mm body_mass_g
## 2.014378 0.056401 0.006365
##
## Degrees of Freedom: 248 Total (i.e. Null); 242 Residual
## Null Deviance: 345.2
## Residual Deviance: 85.49 AIC: 99.49
I get the coefficients and other details for my model, which is great.
However, as I said before, I can’t be absolutely sure of my model right away because of the small sample. So, I will use the bootstrapped samples that I created earlier. Setting verbose = TRUE
in control_resamples() would print every step; I leave it off here.
glm_rs = penguin_wf %>%
add_model(glm_spec) %>%
fit_resamples(resamples = penguins_boot,
control = control_resamples(save_pred = TRUE, verbose = F))
## → A | warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## There were issues with some computations A: x1
## There were issues with some computations A: x1
##
glm_rs
## # Resampling results
## # Bootstrap sampling
## # A tibble: 25 × 5
## splits id .metrics .notes .predictions
## <list> <chr> <list> <list> <list>
## 1 <split [249/85]> Bootstrap01 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 2 <split [249/93]> Bootstrap02 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 3 <split [249/85]> Bootstrap03 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 4 <split [249/93]> Bootstrap04 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 5 <split [249/84]> Bootstrap05 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 6 <split [249/87]> Bootstrap06 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 7 <split [249/88]> Bootstrap07 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 8 <split [249/92]> Bootstrap08 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 9 <split [249/93]> Bootstrap09 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 10 <split [249/86]> Bootstrap10 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## # ℹ 15 more rows
##
## There were issues with some computations:
##
## - Warning(s) x1: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Run `show_notes(.Last.tune.result)` for more information.
One bootstrapped sample had fitting issues: the fitted probabilities were numerically 0 or 1, which signals near-perfect separation in that resample. To reduce the chance of this, I could have specified strata = sex
in bootstraps()
. In this case it is acceptable because the other 24 resamples worked well.
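For reference, the stratified version would look like this; it keeps the male-to-female ratio similar in every resample:

```r
# Stratified bootstraps reduce the chance of a badly unbalanced resample
penguins_boot_strat = bootstraps(penguins_train, strata = sex)
```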
Training Random Forest
The process is almost the same as that for logistic regression.
rf_rs = penguin_wf %>%
add_model(rf_spec) %>%
fit_resamples(resamples = penguins_boot,
control = control_resamples(save_pred = TRUE, verbose = F))
rf_rs
## # Resampling results
## # Bootstrap sampling
## # A tibble: 25 × 5
## splits id .metrics .notes .predictions
## <list> <chr> <list> <list> <list>
## 1 <split [249/85]> Bootstrap01 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 2 <split [249/93]> Bootstrap02 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 3 <split [249/85]> Bootstrap03 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 4 <split [249/93]> Bootstrap04 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 5 <split [249/84]> Bootstrap05 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 6 <split [249/87]> Bootstrap06 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 7 <split [249/88]> Bootstrap07 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 8 <split [249/92]> Bootstrap08 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 9 <split [249/93]> Bootstrap09 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## 10 <split [249/86]> Bootstrap10 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
## # ℹ 15 more rows
Notice that I did not get the same warning from the random forest model. Why? Because random forests are not probabilistic in the same way. Tree-based models do not require balanced samples; they will simply give (possibly biased) results, and it is up to the researcher to investigate the flaws. That’s why they can be a little tricky.
Evaluation
How well do they compare against each other? The metrics to compare can be obtained using collect_metrics()
.
Logistic Regression Metrics
collect_metrics(glm_rs)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.905 25 0.00695 Preprocessor1_Model1
## 2 roc_auc binary 0.971 25 0.00291 Preprocessor1_Model1
Random Forest Metrics
collect_metrics(rf_rs)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.901 25 0.00629 Preprocessor1_Model1
## 2 roc_auc binary 0.970 25 0.00255 Preprocessor1_Model1
Logistic regression performs slightly better on both metrics: accuracy and AUC. Even if they were nearly equal and I had to choose, I would choose the linear model. It is faster to implement, scalable and, most importantly, explainable.
Predictions
The predictions can be found using collect_predictions()
function.
glm_rs %>%
collect_predictions()
## # A tibble: 2,250 × 7
## id .pred_female .pred_male .row .pred_class sex .config
## <chr> <dbl> <dbl> <int> <fct> <fct> <chr>
## 1 Bootstrap01 0.580 0.420 6 female female Preprocessor1_M…
## 2 Bootstrap01 0.987 0.0125 7 female female Preprocessor1_M…
## 3 Bootstrap01 0.978 0.0219 9 female female Preprocessor1_M…
## 4 Bootstrap01 0.0277 0.972 10 male female Preprocessor1_M…
## 5 Bootstrap01 0.842 0.158 11 female female Preprocessor1_M…
## 6 Bootstrap01 1.00 0.000350 12 female female Preprocessor1_M…
## 7 Bootstrap01 0.999 0.000525 13 female female Preprocessor1_M…
## 8 Bootstrap01 1.00 0.00000806 14 female female Preprocessor1_M…
## 9 Bootstrap01 0.918 0.0824 16 female female Preprocessor1_M…
## 10 Bootstrap01 0.966 0.0341 19 female female Preprocessor1_M…
## # ℹ 2,240 more rows
The two important columns here are .pred_female
and .pred_male
. These are the predicted probabilities that a penguin belongs to each class. .pred_class
gives the class that our model predicts a penguin to be in; sex
is its true class.
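As a sanity check, the yardstick metrics can also be recomputed directly from the pooled predictions (a sketch):

```r
# Accuracy computed from the pooled out-of-bag predictions
glm_rs %>%
  collect_predictions() %>%
  accuracy(truth = sex, estimate = .pred_class)
```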
Confusion Matrix
conf_mat_resampled()
with no arguments gives confusion matrix in tidy format.
glm_rs %>%
conf_mat_resampled()
## # A tibble: 4 × 3
## Prediction Truth Freq
## <fct> <fct> <dbl>
## 1 female female 39
## 2 female male 4.04
## 3 male female 4.44
## 4 male male 42.5
Let’s see it in conventional format.
glm_rs %>%
collect_predictions() %>%
conf_mat(sex, .pred_class)
## Truth
## Prediction female male
## female 975 101
## male 111 1063
The model looks pretty good.
ROC Curve
The ROC curve can be produced by roc_curve()
function. autoplot()
uses default settings.
glm_rs %>%
collect_predictions() %>%
group_by(id) %>% # -- to get 25 ROC curves, for each bootstrapped sample
roc_curve(sex, .pred_female) %>%
autoplot()
How does autoplot()
work?
glm_rs %>%
collect_predictions() %>%
group_by(id) %>% # -- to get 25 ROC curves, for each bootstrapped sample
roc_curve(sex, .pred_female)
## # A tibble: 2,296 × 4
## # Groups: id [25]
## id .threshold specificity sensitivity
## <chr> <dbl> <dbl> <dbl>
## 1 Bootstrap01 -Inf 0 1
## 2 Bootstrap01 6.67e-10 0 1
## 3 Bootstrap01 7.64e- 8 0.0250 1
## 4 Bootstrap01 8.21e- 8 0.0500 1
## 5 Bootstrap01 1.28e- 7 0.0750 1
## 6 Bootstrap01 1.58e- 7 0.1 1
## 7 Bootstrap01 3.33e- 6 0.125 1
## 8 Bootstrap01 1.42e- 5 0.15 1
## 9 Bootstrap01 1.49e- 5 0.175 1
## 10 Bootstrap01 2.19e- 5 0.2 1
## # ℹ 2,286 more rows
Let’s beautify our ROC curve.
glm_rs %>%
collect_predictions() %>%
group_by(id) %>% # -- to get 25 ROC curves, for each bootstrapped sample
roc_curve(sex, .pred_female) %>%
ggplot(aes(1 - specificity, sensitivity, col = id)) +
geom_abline(lty = 2, colour = "grey80", size = 1.5) +
geom_path(show.legend = FALSE, alpha = 0.6, size = 1.2) +
coord_equal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
I’m using geom_path
instead of geom_line
because I want to see the discrete jumps. geom_line
would give me a continuous plot, as it connects the points in the order of the variable on the x-axis. Another option is geom_step
, which only highlights changes, i.e. when a variable steps to another value.
# Using geom_line()
glm_rs %>%
collect_predictions() %>%
group_by(id) %>% # -- to get 25 ROC curves, for each bootstrapped sample
roc_curve(sex, .pred_female) %>%
ggplot(aes(1 - specificity, sensitivity, col = id)) +
geom_abline(lty = 2, colour = "grey80", size = 1.5) +
geom_line(show.legend = FALSE, alpha = 0.6, size = 1.2) +
coord_equal()
# Using geom_step()
glm_rs %>%
collect_predictions() %>%
group_by(id) %>% # -- to get 25 ROC curves, for each bootstrapped sample
roc_curve(sex, .pred_female) %>%
ggplot(aes(1 - specificity, sensitivity, col = id)) +
geom_abline(lty = 2, colour = "grey80", size = 1.5) +
geom_step(show.legend = FALSE, alpha = 0.6, size = 1.2) +
coord_equal()
geom_abline()
can take two arguments: intercept and slope. If you provide neither, it plots \(y = x\). So, the best plot is from geom_path()
.
glm_rs %>%
collect_predictions() %>%
group_by(id) %>% # -- to get 25 ROC curves, for each bootstrapped sample
roc_curve(sex, .pred_female) %>%
ggplot(aes(1 - specificity, sensitivity, col = id)) +
geom_abline(lty = 2, colour = "grey80", size = 1.5) +
geom_path(show.legend = FALSE, alpha = 0.6, size = 1.2) +
coord_equal()
Okay, enough on ggplot2
. From the above ROC curves, we can see that the model performs well on the resampled training data.
Testing Samples
The above metrics used only the training resamples. We need to check our model’s performance on the test dataset too.
Let’s fit the model using last_fit()
. last_fit()
fits the final model to the full training set and evaluates it on the test set.
penguins_final = penguin_wf %>%
add_model(glm_spec) %>%
last_fit(penguins_split)
penguins_final
## # Resampling results
## # Manual resampling
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [249/84]> train/test split <tibble> <tibble> <tibble> <workflow>
Let’s check how good our final model is.
penguins_final %>%
collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.905 Preprocessor1_Model1
## 2 roc_auc binary 0.966 Preprocessor1_Model1
penguins_final %>%
collect_predictions()
## # A tibble: 84 × 7
## id .pred_female .pred_male .row .pred_class sex .config
## <chr> <dbl> <dbl> <int> <fct> <fct> <chr>
## 1 train/test split 0.887 0.113 6 female female Preprocess…
## 2 train/test split 0.0000979 1.00 7 male male Preprocess…
## 3 train/test split 0.976 0.0238 8 female female Preprocess…
## 4 train/test split 0.996 0.00431 14 female female Preprocess…
## 5 train/test split 0.000000623 1.00 15 male male Preprocess…
## 6 train/test split 0.973 0.0273 16 female female Preprocess…
## 7 train/test split 0.772 0.228 17 female male Preprocess…
## 8 train/test split 0.662 0.338 21 female female Preprocess…
## 9 train/test split 0.434 0.566 22 male male Preprocess…
## 10 train/test split 0.961 0.0388 23 female female Preprocess…
## # ℹ 74 more rows
Our model is 90.5% accurate and has an AUC of 0.966 on the test set. These are very good results.
Confusion Matrix for Test Set
penguins_final %>%
collect_predictions() %>%
conf_mat(sex, .pred_class)
## Truth
## Prediction female male
## female 39 5
## male 3 37
This looks pretty good, to be honest.
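For completeness, the test-set ROC curve can be drawn the same way as before:

```r
# ROC curve for the held-out test set
penguins_final %>%
  collect_predictions() %>%
  roc_curve(sex, .pred_female) %>%
  autoplot()
```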
Final Workflow
Let’s see the final model.
penguins_final$.workflow[[1]]
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## sex ~ .
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) speciesChinstrap speciesGentoo bill_length_mm
## -95.852333 -6.932255 -8.535185 0.633832
## bill_depth_mm flipper_length_mm body_mass_g
## 2.014378 0.056401 0.006365
##
## Degrees of Freedom: 248 Total (i.e. Null); 242 Residual
## Null Deviance: 345.2
## Residual Deviance: 85.49 AIC: 99.49
Can we tidy it up? (We need [[1]]
to get the element out, as .workflow
is a list.)
penguins_final$.workflow[[1]] %>%
tidy()
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -95.9 18.1 -5.30 0.000000115
## 2 speciesChinstrap -6.93 1.91 -3.62 0.000295
## 3 speciesGentoo -8.54 3.25 -2.63 0.00866
## 4 bill_length_mm 0.634 0.159 4.00 0.0000639
## 5 bill_depth_mm 2.01 0.455 4.43 0.00000940
## 6 flipper_length_mm 0.0564 0.0653 0.863 0.388
## 7 body_mass_g 0.00637 0.00138 4.63 0.00000374
The coefficients can be exponentiated to find the odds ratio.
penguins_final$.workflow[[1]] %>%
tidy(exponentiate = TRUE) %>%
arrange(estimate)
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.35e-42 18.1 -5.30 0.000000115
## 2 speciesGentoo 1.96e- 4 3.25 -2.63 0.00866
## 3 speciesChinstrap 9.76e- 4 1.91 -3.62 0.000295
## 4 body_mass_g 1.01e+ 0 0.00138 4.63 0.00000374
## 5 flipper_length_mm 1.06e+ 0 0.0653 0.863 0.388
## 6 bill_length_mm 1.88e+ 0 0.159 4.00 0.0000639
## 7 bill_depth_mm 7.50e+ 0 0.455 4.43 0.00000940
The odds ratio of 7.50 means that for every one-mm increase in bill depth, the odds of a penguin being male increase almost eightfold. So, bill depth is very important.
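The odds ratio is just the exponentiated raw coefficient, so we can verify it by hand:

```r
# exp() of the bill_depth_mm coefficient reproduces the odds ratio in the table
exp(2.014378)   # roughly 7.50
```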
Flipper length is not very important (it has a high p-value). Remember that previously we explored the relationship between sex and flipper length. Since flipper length is not important, let’s see how the graph looks with bill depth, which is apparently very important.
penguins %>%
filter(!is.na(sex)) %>%
ggplot(aes(bill_depth_mm, bill_length_mm, colour = sex, size = body_mass_g)) +
geom_point(alpha = 0.6) +
facet_wrap(~species)
Conclusion
In creating this short tutorial, I learnt how to classify data using the tidymodels workflow with logistic regression and random forest. An important lesson was that logistic regression can outperform more complicated tree-based models like random forests.
If you are interested in data science and R, check out my free weekly newsletter Next.
Next — Today I Learnt About R
A short and sweet curated collection of R-related works. Five stories. Four packages. Three jargons. Two tweets. One Meme.
You can read the past editions and subscribe to it here.