Here are some notes on machine learning models.

## Concepts Behind Decision Trees

• Bagging (boostrap aggregation): Randomly sample with replacement, and average the results.
• Majority vote: The most commonly-occuring prediction.
• Internal node: Where the splits occur.
• Branches: Segments that connect the nodes.
• Terminal node (leafs, regions): Where the observations end up. The average of the responses (or majority vote) is the prediction for future observations.
• Gini index: where $$m$$ is the leaf and $$k$$ is the class (0 or 1 for binary classification, but can be extedned for multiple classes). $$\hat{p}_{mk}$$ is the proportion of observations in leaf $$m$$, class $$k$$ that belong. Gini index will reduce if $$\hat{p}_{mk}$$ is close to 0 or 1. (Variance will be low). Worst-case variance = .25 (.5*.5). Best-case variance is 0 (all observations are of one class).

$G = \sum_{k \in K} \hat{p}_{mk} (1-\hat{p}_{mk})$

• Node purity: A node is pure if most of the observations come from one class. Gini index is a measure of purity.
library(ggplot2)
df = data.frame(p = seq(0, 1, length=100))

ggplot(df) +
geom_point(aes(x = p, y = p*(1-p), color='blue')) +
geom_point(aes(x = p, y = -p*log(p)))
## Warning: Removed 1 rows containing missing values (geom_point). ## Linear regression

Variance-bias: Low variance.

## Random forest

Why? * Decision trees suffer from high variance

How: * Randomly samples columns too (to de-correlate the trees). * Uses Bagging. Takes $$b$$ bootstrapped samples, fits a tree on each one, and averages the results. * For classification, the majority vote is use.

Concepts:

• OOB (out-of-bag) observations are those not selected during boostrapping. (For each tree). Use these observations (about 1/3) to explore. Keep track of each observation and whether it was bagged in a given tree. You estimate the prediction error across all the trees in which an observation was out-of-bag.

Questions I have:

• For classification, each leaf has $$n$$ observations, and the mean of those observations is the prediction. Since each tree is constructed differently, the algorithm likely has to yield a score fore each tree. So the average for each leaf depends on $$n$$. My guess is that majority vote is taking the 0-1 prediction from each tree (which would be a majority vote within a leaf), and then take the majority vote across the $$b$$ trees.