Machine Learning Notes
1 Statistical Decision Theory
Let \(X \in \mathbb{R}^p\) denote a real-valued random input vector, and \(Y \in \mathbb{R}\) a real-valued random output variable, with joint distribution \(Pr(X, Y)\). We seek a function \(f(X)\) for predicting \(Y\) given values of \(X\). To do so we define a loss function \(L(Y, f(X))\); the most common choice is squared error loss.
\[L(Y, f(X)) = (Y - f(X))^2\]
Expected (squared) Prediction Error \[EPE(f) = E(Y - f(X))^2\]
By conditioning on \(X\), \[EPE(f) = E_X E_{Y|X}([Y - f(X)]^2|X)\]
We can minimize EPE pointwise at each \(x\). \[f(x) = argmin_c E_{Y|X} ([Y-c]^2|X = x)\]
The solution is \[f(x) = E(Y|X=x)\]
- K nearest neighbors is trying to approximate this function directly. It assumes \(f(x)\) is well approximated by a locally constant function.
- If the training-sample size \(N \to \infty\), the number of neighbors \(k \to \infty\), and \(N/k \to \infty\), then the k-nearest-neighbor estimate approximates the conditional mean well.
- Linear regression assumes \(f(x)\) is linear and then finds the optimal parameters. It is a model-based approach: it assumes \(f(x)\) is well approximated by a globally linear function.
We can also replace the \(L_2\) loss with the \(L_1\) loss: \(E|Y - f(X)|\). The solution in this case is the conditional median \[\hat{f}(x) = median(Y|X=x)\]
It can be more robust to outliers than the conditional mean. However, the \(L_1\) loss has discontinuous derivatives.
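As a small illustrative sketch (not from the original notes), the conditional mean and conditional median at a query point can both be estimated locally with nearest neighbors; the 1-D data-generating function below is made up purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=500)            # 1-D inputs
Y = np.sin(X) + rng.normal(0, 0.3, 500)     # noisy targets

def knn_estimate(x0, k=30):
    """Approximate E[Y|X=x0] (L2 solution) and median(Y|X=x0) (L1 solution)
    from the k nearest neighbors of x0."""
    idx = np.argsort(np.abs(X - x0))[:k]     # indices of the k closest points
    return Y[idx].mean(), np.median(Y[idx])

cond_mean, cond_median = knn_estimate(1.0)
```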
Similarly, for classification problems, the expected prediction error is \[EPE = E[L(G, \hat{G}(X))]\]
where the loss function can be represented as a \(K \times K\) matrix \(\mathbf{L}\), where \(L(k,l)\) is the price paid for classifying an observation belonging to class \(k\) as \(l\). Most often we use the zero-one loss function, where all the misclassifications are charged 1 unit.
Given the joint distribution \(Pr(G, X)\), we can write EPE as \[EPE = E_X \sum_{k=1}^{K}L[G_k, \hat{G}(X)]Pr(G_k|X)\]
The solution is \[\hat{G}(x) = argmin_{g \in G} \sum_{k=1}^{K}L(G_k, g)Pr(G_k|X=x)\]
With 0-1 loss, \[\hat{G}(x) = argmin_{g \in G} [1 - Pr(g|X=x)]\]
This solution is known as the Bayes classifier. The error rate of the Bayes classifier is called the Bayes Rate.
2 Evaluation Metrics
3 Entropy
Very good visual explanation of information entropy, cross entropy, and mutual information. https://colah.github.io/posts/2015-09-Visual-Information/
3.1 Information Entropy
\[H(p) = \sum_{x}p(x)log_2(\frac{1}{p(x)})\]
- Information entropy can be interpreted as a lower bound of the average message length that is necessary to communicate events from a certain probability distribution, \(p\).
- Consider we are trying to use a varying length of bits to encode the events in the probability distribution.
- To avoid ambiguity, once we choose a code, e.g. 01 for event 1, no other code may use 01 as a prefix; otherwise decoding would be ambiguous. This is called the prefix property, and codes generated this way are called prefix codes.
- The cost of choosing a code of length \(L(x)\) for event x is \(\frac{1}{2^{L(x)}}\), because a \(\frac{1}{2^{L(x)}}\) fraction of all possible codes can no longer be used (they would share this prefix).
- The optimal strategy is to make the cost spent on event x equal to its probability: more frequent events get shorter codes. That is, \(\frac{1}{2^{L(x)}} = p(x)\), i.e. \(L(x) = log_2(\frac{1}{p(x)})\).
- Therefore, the average length of the code under this optimal strategy is \(\sum_{x}p(x)log_2(\frac{1}{p(x)})\), which is the information entropy.
- It also quantifies how uncertain I am.
- If I know something is definitely going to happen, then its entropy is 0 (I don’t even need to send out a message).
- The more uncertain the outcome, the more information you will get once you know what happened.
3.2 Cross Entropy
\[H_p(q) = \sum_{x}q(x)log_2(\frac{1}{p(x)})\] Using the code that is optimized for \(p\) to communicate \(q\).
- The more different the distributions \(p\) and \(q\) are, the higher the cross entropy.
- The cross entropy is not symmetric. \(H_p(q) \neq H_q(p)\).
- This is a common loss function in Machine Learning. We would like to have the predicted distribution as close as possible to the ground truth distribution.
3.3 Kullback-Leibler Divergence
\[D_q(p) = H_q(p) - H(p)\] It is a distance measure between the distributions \(p\) and \(q\). It is not symmetric either.
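A minimal numeric sketch of these three quantities for two discrete distributions (the distributions below are made up for illustration):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # "true" distribution
q = np.array([0.25, 0.25, 0.25, 0.25])    # approximating distribution

H_p      = np.sum(p * np.log2(1 / p))     # entropy H(p)
H_p_of_q = np.sum(q * np.log2(1 / p))     # cross entropy H_p(q): code optimized for p, events from q
D_q_of_p = np.sum(p * np.log2(p / q))     # KL divergence D_q(p) = H_q(p) - H(p)
```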
3.4 Mutual Information
Before diving into mutual information, let's take a look at the entropy of a joint distribution. The formula is still the same: \[H(X, Y) = \sum_{x, y}p(x, y)log(\frac{1}{p(x, y)})\]
What is the averaged extra information you need to communicate on event x given that the listener already knows y?
The answer is the conditional entropy: \[H(X|Y) = \sum_{y}p(y) \sum_{x} p(x|y) log(\frac{1}{p(x|y)}) = \sum_{x, y} p(x, y) log(\frac{1}{p(x|y)})\]
There is part of the information that is shared by X and Y, which is quantified by the mutual information \(I(X, Y)\). \[I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)\]
The joint information entropy: \[H(X, Y) = H(X) + H(Y) - I(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)\]
- NOTE: difference between cross entropy and mutual information.
- Cross entropy is for the same random variable but different distributions. How much more entropy do I need if I use the coding strategy optimized for \(p\) to communicate \(q\)?
- Mutual information is for different random variables. How much information is shared between X and Y?
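A short sketch computing these quantities from a joint distribution table (the numbers are made up for illustration):

```python
import numpy as np

# joint distribution p(x, y); rows index x, columns index y
p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

H_xy = np.sum(p_xy * np.log2(1 / p_xy))   # joint entropy H(X, Y)
H_x  = np.sum(p_x * np.log2(1 / p_x))     # H(X)
H_y  = np.sum(p_y * np.log2(1 / p_y))     # H(Y)
H_x_given_y = H_xy - H_y                  # conditional entropy H(X|Y) = H(X, Y) - H(Y)
I_xy = H_x + H_y - H_xy                   # mutual information I(X, Y)
```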
4 Cost functions
4.2 Regression
- L2 loss (squared error loss)
- Less robust to outliers
- Behaves well when the prediction and the target are close: the gradient shrinks with the residual, giving stable updates near the optimum.
- L1 loss (absolute loss)
- More robust to outliers
- Less stable when the prediction and the target are close: the gradient magnitude stays constant, so updates can oscillate around the optimum.
- Huber loss
- Combines L1 and L2: when the residual is small, it behaves like L2; when the residual is large, it behaves like L1.
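A minimal sketch of these three regression losses; the threshold `delta` below is a free parameter chosen for illustration, not something from the notes.

```python
import numpy as np

def l2_loss(y, f):
    return (y - f) ** 2

def l1_loss(y, f):
    return np.abs(y - f)

def huber_loss(y, f, delta=1.0):
    """Quadratic for small residuals, linear for large ones."""
    r = np.abs(y - f)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))
```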
5 Linear Regression
5.1 Least Squares
The prediction function \[f(X) = \beta_0 + \sum_{j = 1}^{p} X_j \beta_j\] It can be vectorized to \[f(X) = \mathbf{X\beta}\] where \(\mathbf{X}\) contains a column with all elements equal to 1 for the bias term.
Least square tries to minimize the residual sum of squares (RSS) \[RSS(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta)\]
\[\frac{\partial RSS}{\partial \beta} = -2\mathbf{X}^T(y - \mathbf{X}\beta)\] \[\frac{\partial^2 RSS}{\partial \beta \partial \beta^T} = 2\mathbf{X}^T\mathbf{X}\]
If \(\mathbf{X}\) has full column rank, then \(\mathbf{X^TX}\) is positive definite. Then the solution is \[\hat{\beta} = (\mathbf{X^TX})^{-1}\mathbf{X^Ty}\]
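A small sketch of the closed-form least-squares fit, solving the normal equations with a linear solver rather than an explicit inverse; the random data below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])  # first column = bias term
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(0, 0.1, 100)

# beta_hat = (X^T X)^{-1} X^T y, computed by solving the linear system X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```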
Proof that the above estimator \(\hat{\beta}\) is unbiased, and calculation of its variance. Assume the model
\[\mathbf{y} = \mathbf{X}\beta + \epsilon\] where \(\epsilon \sim N(0, \sigma^2 \mathbf{I})\).
Assuming \(\mathbf{X}\) is fixed (conditioning on \(\mathbf{X}\)),
\[E[\hat{\beta}|X] = (\mathbf{X^TX})^{-1}\mathbf{X^T}E[\mathbf{y}|X] = (\mathbf{X^TX})^{-1}\mathbf{X^T}\mathbf{X}\beta = \beta\] \[Var[\hat{\beta} | X] = Var[(\mathbf{X^TX})^{-1}\mathbf{X^Ty}|X] = (\mathbf{X^TX})^{-1}\mathbf{X^T} Var[\mathbf{y}|X] \mathbf{X}(\mathbf{X^TX})^{-1} = (\mathbf{X^TX})^{-1} \sigma^2\]
- NOTE: the least squares estimator has the smallest variance among all linear unbiased estimators (Gauss-Markov). Note, however, that it might not be wise to restrict attention to unbiased estimators.
- We can use a t-test to check whether a particular coefficient is significant.
- We can do forward- or backward- stepwise predictor selection.
5.2 Ridge Regression
\[\hat{\beta}^{ridge} = argmin_{\beta} ||\mathbf{y} - \mathbf{X}\beta||_2^2 + \lambda ||\beta||_2^2\] \[\hat{\beta}^{ridge} = (\mathbf{X^TX + \lambda I})^{-1}\mathbf{X^Ty}\]
Given the SVD of matrix \(\mathbf{X}\), \[X = UDV^T\]
Then \[\mathbf{X}\hat{\beta}^{ridge} = \mathbf{UD(D^2 + \lambda I)^{-1}DU^Ty} = \sum_{j=1}^{p}\mathbf{u_j} \frac{d_j^2}{d_j^2 + \lambda} \mathbf{u_j}^T y\]
Ridge shrinks the coefficients more along the principal components of \(\mathbf{X}\) with smaller variance.
- NOTE: Ridge shrinks the coefficients continuously.
- In the equivalent constrained formulation, the feasible region is a sphere (ball), and the optimal solution typically has nonzero values in all dimensions.
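A sketch of the ridge fit via the SVD, exposing the per-component shrinkage factors \(d_j^2/(d_j^2+\lambda)\); it assumes the inputs are centered (intercept handled separately) and `lam` is an illustrative penalty value.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients via the SVD: beta = V (D^2 + lam I)^{-1} D U^T y."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = d / (d ** 2 + lam)                  # (D^2 + lam I)^{-1} D on the diagonal
    beta = Vt.T @ (shrink * (U.T @ y))           # ridge coefficients
    factors = d ** 2 / (d ** 2 + lam)            # shrinkage applied to each principal component
    return beta, factors
```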
5.3 Lasso
\[\hat{\beta}^{lasso} = argmin_{\beta} ||\mathbf{y} - \mathbf{X}\beta||_2^2 + \lambda ||\beta||_1\]
- Lasso will select the features automatically. It tends to make the features used in prediction more sparse.
- In the equivalent constrained formulation, the feasible region is a diamond, and the optimal solution often lies at a vertex of the diamond, where some of the coefficients are exactly zero.
5.4 ElasticNet
A compromise between the L1 and L2 penalties. It uses the following penalty \[\lambda \sum_{j=1}^{p}(\alpha \beta_j^2 + (1 - \alpha)|\beta_j|)\]
6 Linear Methods for Classification
6.1 Discriminant Analysis
Bayes rule: \[Pr(G=k|X = x) = \frac{Pr(X=x|G=k) P(G=k)}{\sum_{l=1}^{K} Pr(X=x|G=l) Pr(G=l)} = \frac{f_k(x) \pi_k}{\sum_{l=1}^{K} f_l(x) \pi_l}\]
Assuming the distribution of \(x\) in each class is Gaussian, \[f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} e^{-\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}\] If all classes share a common covariance matrix \(\Sigma_k = \Sigma\), this is Linear Discriminant Analysis (LDA) and the decision boundaries are linear in \(x\); with class-specific covariances the boundaries are quadratic (QDA).
6.2 Logistic Regression
- Sigmoid Function
\[f(x) = \frac{1}{1 + e^{-x}}\]
- Cross Entropy
\[H(p, q) = -\sum_{i=1}^{K}p_{i}log(q_i)\] where \(p_i\) is the true distribution, \(q_i\) is the predicted distribution.
Assume the conditional probability follows a sigmoid function: \[Pr(Y=1|X = x) = \frac{1}{1 + e^{-\beta^T x}}\]
Then we try to maximize the log likelihood, or equivalently minimize the cross entropy. With \(p(x_i; \beta) = Pr(Y=1|X=x_i)\), \[l(\beta) = \sum_{i=1}^N \{ y_i \log p(x_i; \beta) + (1 - y_i) \log (1 - p(x_i; \beta)) \} = \sum_{i=1}^N \{ y_i \beta^T x_i - \log (1 + e^{\beta^T x_i}) \}\]
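A minimal sketch of fitting logistic regression by gradient ascent on this log likelihood; the learning rate and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Gradient ascent on l(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ].
    X includes a column of ones for the intercept; y holds labels in {0, 1}."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # Pr(Y=1 | X = x_i)
        beta += lr * X.T @ (y - p) / len(y)     # gradient of the log likelihood: sum_i x_i (y_i - p_i)
    return beta
```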
7 Bayesian Model
7.1 Naive Bayes
- Conditional independence (between features) assumption
\[P(Y = k|X) = \frac{P(X|Y=k) P(Y=k)}{P(X)} = \frac{\prod_{i=1}^{p}P(X_i|Y=k) P(Y=k)}{P(X)}\]
- For a discrete feature \(X_i\), directly use the empirical probability mass.
- For a continuous feature, fit a Gaussian distribution to the data within each class. If the per-feature variances are shared across classes, this makes the decision boundary linear in the features.
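A sketch of Gaussian Naive Bayes for continuous features, assuming one Gaussian per feature per class (the small variance floor is an illustrative numerical safeguard):

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate per-class priors, per-feature means, and per-feature variances."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (len(Xk) / len(X), Xk.mean(axis=0), Xk.var(axis=0) + 1e-9)
    return params

def predict_gnb(params, x):
    """Pick the class maximizing log prior + sum of per-feature Gaussian log likelihoods."""
    def score(k):
        prior, mu, var = params[k]
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=score)
```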
8 Nearest Neighbors
Find k nearest neighbors under certain distance metric, and output the prediction based on the majority vote.
- NOTE: as the dimension of feature space becomes higher, the nearest neighbors of a point can be very far away, which causes bias.
- So we need to consider adapting the distance metric.
- The neighborhood could stretch out in directions for which the class probabilities don’t change much.
- In a high-dimensional space, the class probabilities might change only in a low-dimensional subspace, so adapting the distance metric can help.
- Discriminant Adaptive Nearest Neighbor (DANN) adapts the neighborhood into an ellipsoid that stretches along the decision boundary.
9 KKT Condition
Primal problem: \[\min_{x} f(x)\] \[\text{s.t. } g_i(x) \le 0\]
Lagrangian: \[L(x, \mu) = f(x) + \sum_{i} \mu_i g_i(x)\]
KKT Condition:
- Stationarity
\[\nabla_x L(x^*, \mu) = 0\] i.e. \[\nabla_x f(x^*) + \sum_{i} \mu_i \nabla_x g_i(x^*) = 0\]
- Primal Feasibility \[g_i(x^*) \le 0\]
- Dual Feasibility \[\mu_i \ge 0\]
- Complementary Slackness \[\mu_i g_i(x^*) = 0\]
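As a small worked example (not in the original notes): minimize \(f(x) = x^2\) subject to \(g(x) = 1 - x \le 0\). The Lagrangian is \(L(x, \mu) = x^2 + \mu(1 - x)\). Stationarity gives \(2x^* - \mu = 0\); complementary slackness gives \(\mu(1 - x^*) = 0\). The unconstrained minimum \(x = 0\) violates primal feasibility, so the constraint must be active: \(x^* = 1\) and \(\mu = 2 \ge 0\), satisfying all four conditions.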
11 Regression and Decision Tree
Consider \(N\) observations \((x_i, y_i)\), \(i = 1,2,...,N\), with \(x_i = (x_{i1}, x_{i2}, ..., x_{ip})\); that is, we have \(p\) features for each observation.
Suppose we split the whole input space into \(M\) regions \(R_1, R_2, \ldots, R_M\), and we model the response as a constant in each region. \[f(x) = \sum_{m=1}^M c_m I(x \in R_m)\]
For regression case, we can minimize the sum of squares \[\min \sum(y_i - f(x_i))^2\]
It is generally computationally infeasible to find the globally optimal binary partition, so we proceed with a greedy algorithm. Considering a splitting variable \(j\) and splitting point \(s\), we split the input space into two half-spaces \[R_1(j, s) = \{X|X_j \leq s\}, R_2(j, s) = \{X|X_j > s\}\]
We seek the splitting variable \(j\) and split point \(s\) that \[\min_{j, s} [\min_{c_1} \sum_{x_i \in R_1(j, s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2]\]
Having found the best split variable \(j\) and split point \(s\), we partition the data into the two regions and repeat the process within each region.
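A sketch of the greedy search for the best split variable \(j\) and split point \(s\) under squared error, brute-forcing over observed feature values for illustration:

```python
import numpy as np

def best_split(X, y):
    """Return (j, s) minimizing RSS(left) + RSS(right) over all features and split points."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s
```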
- Tree size is a tuning parameter. Large tree could overfit the data, while small tree might not capture the structure. It governs the model complexity.
- A good strategy is to prune the tree using cost-complexity pruning techniques. Let \(|T|\) denote the number of terminal nodes (leaf nodes) in \(T\). Regularize the tree size (terminal nodes). Here is the cost-complexity criterion. \[\min \sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2 + \alpha |T|\]
For classification problems, a different cost function is used to decide the split variable \(j\) and split point \(s\). Consider a node \(m\) with \(N_m\) observations in the region \(R_m\); the proportion of class \(k\) in the region is \[\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)\] We classify the observations in node \(m\) to class \(k(m) = \arg \max_k \hat{p}_{mk}\). There are different cost functions (measures of impurity of a node):
- Misclassification error
\[\frac{1}{N_m} \sum_{x_i \in R_m} I(y_i \neq k(m)) = 1 - \hat{p}_{mk(m)}\]
- Gini index: how often a randomly chosen point in the set is incorrectly labeled if we randomly label that point according to the distribution of labels in the set.
\[\sum_{k=1}^K \hat{p}_{mk}(1 - \hat{p}_{mk})\]
- Mutual information (or information gain): conditioning on the region \(R_m\), how much information about the label \(Y\) do we gain? Recall the mutual information \(I(X, Y) = H(Y) - H(Y|X)\). To maximize the information gain, we need to minimize the conditional entropy \(H(Y|X)\), i.e. minimize the following
\[-\sum_{k=1}^K \hat{p}_{mk} \log \hat{p}_{mk}\]
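The three node-impurity measures as a short sketch, where `p` is assumed to hold the class proportions \(\hat{p}_{mk}\) within a node:

```python
import numpy as np

def misclassification(p):
    return 1.0 - p.max()

def gini(p):
    return np.sum(p * (1.0 - p))

def entropy(p):
    p = p[p > 0]                  # convention: 0 * log 0 = 0
    return -np.sum(p * np.log(p))
```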
13 Bagging
Bootstrap aggregation averages the prediction over a collection of bootstrap samples, thereby reducing its variance.
For each bootstrap sample \(Z^{*b}, b = 1, 2, ..., B\), we fit our model, giving the prediction \(\hat{f}^{*b}(x)\). The bagging estimate is defined by \[\hat{f}_{bag} (x) = \frac{1}{B}\sum_{b=1}^B\hat{f}^{*b}(x)\]
For classification problems, we can do a majority vote instead, or average the class probabilities (this can be more robust).
- Why does bagging work?
- It reduces the variance while keeping the bias unchanged.
- Bagging loses interpretability.
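A sketch of bagging a regression model; `fit_model` is a hypothetical placeholder for any base learner that, given training data, returns a callable predictor.

```python
import numpy as np

def bagging_predict(X_train, y_train, X_test, fit_model, B=100, seed=0):
    """Average the predictions of B models, each fit on a bootstrap resample."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(B):
        idx = rng.integers(0, len(y_train), size=len(y_train))  # sample with replacement
        model = fit_model(X_train[idx], y_train[idx])            # fit base learner on the resample
        preds.append(model(X_test))
    return np.mean(preds, axis=0)
```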
14 Stacking
Let \(\hat{f}_m^{-i}(x)\) be the prediction at \(x\) using model \(m\), applied to the dataset with the $i$th training observation removed (leave-one-out). In fact, we could also leave one partition out in a cross-validation setup. Then we find the stacking weights by minimizing a cost function, for example least squares. \[\hat{w}^{st} = argmin_w \sum_{i=1}^N \{ y_i - \sum_{m=1}^M w_m \hat{f}_m^{-i}(x_i) \}^2\]
The final prediction is \(\sum_m \hat{w}_m^{st} \hat{f}_m(x)\). In fact, you can use other models for the final stacking instead of least squares. That’s how you stack models together.
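A minimal sketch of solving for the stacking weights by least squares, assuming `P` is a matrix of leave-one-out predictions where row \(i\), column \(m\) holds \(\hat{f}_m^{-i}(x_i)\):

```python
import numpy as np

def stacking_weights(P, y):
    """Least-squares stacking weights: argmin_w || y - P w ||^2."""
    w, *_ = np.linalg.lstsq(P, y, rcond=None)
    return w
```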
15 Boosting
15.1 AdaBoost
- Combines the outputs of many “weak” classifiers to produce a powerful “committee”.
- Sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers \(G_m(x), m=1,2,...M\)
- For a two-class problem, combine the predictions from all the weak classifiers through a weighted majority vote to produce the final prediction:
\[G(x) = sign(\sum_{m=1}^M \alpha_m G_m(x))\] The weights \(\alpha_1, ..., \alpha_M\) are computed by the boosting algorithm. They try to give higher influence to the more accurate classifiers in the sequence.
- The data modification step consists of applying weights \(w_1, ..., w_N\) to each of the training observations \((x_i, y_i), i=1,2,...,N\).
- Initially, the weights are equal \(1/N\).
- At each step, those observations that were misclassified by the previous classifier will have their weights increased, while the weights are decreased for those that were classified correctly.
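A compact sketch of this procedure, using the error rate \(err_m\) and classifier weight \(\alpha_m\) whose explicit forms are derived in the "AdaBoost revisit" subsection below; `fit_weak` is a hypothetical weak learner that accepts observation weights and returns a classifier with outputs in \(\{-1, +1\}\).

```python
import numpy as np

def adaboost(X, y, fit_weak, M=50):
    """AdaBoost for labels y in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                          # observation weights, initially equal
    learners, alphas = [], []
    for _ in range(M):
        G = fit_weak(X, y, w)                        # weak classifier fit with weights w
        miss = (G(X) != y)
        err = np.sum(w * miss) / np.sum(w)           # weighted error rate err_m
        alpha = np.log((1 - err) / err)              # classifier weight alpha_m
        w = w * np.exp(alpha * miss)                 # up-weight the misclassified observations
        learners.append(G)
        alphas.append(alpha)
    def predict(Xnew):
        return np.sign(sum(a * G(Xnew) for a, G in zip(alphas, learners)))
    return predict
```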
15.2 General Boosting Idea and Additive Model
- Boosting is a way of fitting an additive expansion in a set of elementary “basis” functions. Here the basis functions are the individual classifiers \(G_m(x) \in \{1, -1\}\). More generally, basis function expansions take the form
\[f(x) = \sum_{m=1}^M \beta_m b(x; \gamma_m)\]
- In single-hidden-layer neural networks, \(b(x; \gamma) = \sigma(\gamma_0 + \gamma_1^Tx)\)
- In signal processing, wavelets with \(\gamma\) parameterizing the location and scale shift of a “mother” wavelet.
- For trees, \(\gamma\) parameterizes the split variables and split points at the internal nodes, and the predictions at the terminal nodes.
Generally, we want to minimize a loss function averaged over the training data. \[\min_{\beta_1, \gamma_1, ..., \beta_M, \gamma_M} \sum_{i=1}^{N} L(y_i, \sum_{m=1}^M \beta_m b(x_i; \gamma_m))\]
It is usually computationally intractable to minimize this loss globally. We can instead use forward stagewise fitting, a greedy approximation.
- For \(m=1\) to \(M\), \[(\beta_m, \gamma_m) = \arg \min_{\beta, \gamma} \sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma))\]
15.3 AdaBoost revisit
- AdaBoost is in fact a forward stagewise additive modeling with exponential loss. \[L(y, f(x)) = exp(-yf(x))\]
- Consider a two-class problem, \(G_m(x) \in \{1, -1\}\). \[\beta_m, G_m = \arg \min_{\beta, G} \sum_{i=1}^N \exp [-y_i(f_{m-1}(x_i) + \beta G(x_i))]\] \[\beta_m, G_m = \arg \min_{\beta, G} \sum_{i=1}^N w_i^{(m)}\exp [-y_i\beta G(x_i)]\] where \(w_i^{(m)} = \exp (-y_i f_{m-1}(x_i))\) are the weights applied to each observation; they change from iteration to iteration since they depend on \(f_{m-1}(x_i)\). For \(\beta > 0\), \[G_m = \arg \min_{G} \sum_{i=1}^{N} w_i^{(m)} I(y_i \neq G(x_i))\] Then, plugging in \(G_m\), we can solve for \(\beta_m\): \[\beta_m = \frac{1}{2} \log \frac{1 - err_m}{err_m}\] \[err_m = \frac{\sum_{i = 1}^{N} w_i^{(m)} I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i^{(m)}}\] Update the weights \[w_i^{(m+1)} = w_i^{(m)} e^{-\beta_m y_i G_m(x_i)}\] Note that \[-y_i G_m(x_i) = 2 I(y_i \neq G_m(x_i)) - 1\] Hence \[w_i^{(m+1)} = w_i^{(m)} \cdot e^{\alpha_m I(y_i \neq G_m(x_i))} \cdot e^{-\beta_m}\] where \[\alpha_m = 2\beta_m = \log \frac{1 - err_m}{err_m}\]
- The classification rule is \[G(x) = sign[f_m(x)]\]
- Disadvantages of AdaBoost
- Exponential Loss is not robust to noise. It is not robust to overlapping class distribution and mislabeling of the training data.
- It penalizes exponentially for misclassified data as the data’s negative margin increases (margin is represented as \(yf(x)\), where \(y\) is the label, \(f(x)\) is the prediction, \(sign(f(x))\) is the class prediction).
- Hence it is not robust to noise: for classification problems with a large Bayes error rate, or for data with mislabeled classes, AdaBoost degrades dramatically.
- For classification problems, cross-entropy and the SVM hinge loss are more robust loss functions. They penalize negative margins linearly, and there is little loss for correct classifications.
- Directly using the misclassification rate as cost is not robust. The prediction is unstable. We are more interested in the probability than the actual predicted class label.
15.4 Gradient Boosting Trees
15.4.1 General Idea
Gradient Boosting Trees can incorporate any differentiable cost function, which overcomes the main disadvantage of AdaBoost.
- Consider optimizing the loss in a space of functions (represented by decision trees). \[L(f) = \sum_{i=1}^N L(y_i, f(x_i))\]
- We can approach the optimal \(f\) by gradient descent. In fact, here we are approaching the optimum in a Steepest Descent fashion. \[f_m = f_{m - 1} - \rho_m g_m\] where the step size \(\rho_m\) satisfies \[\rho_m = \arg \min_{\rho} L(f_{m - 1} - \rho g_m)\] \(g_m\) is the gradient of \(L(f)\) evaluated at \(f_{m-1}\) \[g_{im} = [\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}]_{f(x_i)=f_{m-1}(x_i)}\]
- Note that we can only evaluate the values of the gradient at the given training data points \(x_i\). However, for the prediction, we need the values in the whole feature space. Hence, we can fit a tree on the negative gradient to make the prediction.
- Then we can figure out the optimal step size that minimizes the loss.
- Then we repeat the process.
15.4.2 Algorithm
The overall algorithm is as follows:
- Initialize \(f_0(x) = \arg \min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)\).
- For \(m\) = 1 to \(M\)
- For \(i=1,2,...,N\) compute the negative gradient (pseudo residuals). \[r_{im} = - [\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}]_{f(x_i)=f_{m-1}(x_i)}\]
- Fit a regression tree to the targets \(r_{im}\). The regression tree gives terminal regions \(R_{jm}, j=1,2,...,J_m\)
- For \(j=1,2,...,J_m\) compute the following (steepest descent): \[\gamma_{jm} = \arg \min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma)\]
- Update \(f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} I(x\in R_{jm})\)
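A sketch of this algorithm specialized to squared-error loss, where the pseudo-residuals are simply \(y_i - f_{m-1}(x_i)\) and the leaf means of the fitted tree already solve the per-region minimization; `fit_tree` is a hypothetical placeholder that fits a regression tree and returns a callable predictor, and `nu` is the learning rate discussed below.

```python
import numpy as np

def gradient_boost(X, y, fit_tree, M=100, nu=0.1):
    """Gradient boosting for squared-error loss."""
    f0 = y.mean()                        # argmin_gamma sum_i (y_i - gamma)^2
    f = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        r = y - f                        # negative gradient (pseudo-residuals) of squared loss
        tree = fit_tree(X, r)            # regression tree fit to the pseudo-residuals
        f = f + nu * tree(X)             # shrunken update of the current fit
        trees.append(tree)
    def predict(Xnew):
        return f0 + nu * sum(t(Xnew) for t in trees)
    return predict
```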
15.4.3 Tuning Parameters
- Tree Size (number of leaf nodes) \(J\)
- Restrict the tree size to be the same across all the trees.
- If the tree size is set to \(J\), interaction effects among the features are limited to order at most \(J-1\).
- In most cases, tree sizes of 4 to 8 work well; \(J\) rarely needs to be larger than 10.
- Boosting Iteration \(M\)
- Could use an early stopping strategy.
- Learning rate \(\nu\) \[f_m(x) = f_{m-1}(x) + \nu \cdot \sum_{j=1}^{J_m} \gamma_{jm} I(x\in R_{jm})\]
- A good strategy is to set \(\nu\) very small (\(\nu < 0.1\)) and choose \(M\) by early stopping.
- Subsampling, sampling portion \(\eta\)
- Stochastic gradient boosting: sample the training data to fit the tree.
15.4.4 Relative Importance of Features
- Feature Importance for a single decision tree
The feature importance of feature \(X_l\) in a decision tree \(T\) is \[Z_l^2(T) = \sum_{t=1}^{J-1} z_t^2 I(v(t)=l)\] The sum is over the \(J - 1\) internal nodes (for a tree with \(J\) leaf nodes). At internal node \(t\), the feature \(X_{v(t)}\) is used as the split variable, and \[z_t^2 = SquaredErrorBeforeSplit - SquaredErrorAfterSplit\] is the improvement in squared error due to the split.
- For additive trees
- Just average the score over all the trees.
16 Random Forest
The two major improvements made upon the basic decision tree:
- Bootstrap sampling (bagging)
- When training a tree, randomly select \(m\) features from the total \(p\) features to make each split. This is the major difference between Random Forest and bagged trees.
16.0.1 How do these two improvements help?
Consider the following bagging procedure \[\hat{f}(x) = \frac{1}{B} \sum_{b=1}^B T_b(x)\] Notice that the bias is not changed. Bagging tries to reduce the variance.
- NOTE that the trees generated from bagging are identically distributed (i.d.).
- If the trees were independent and identically distributed (i.i.d.), with the variance of each tree equal to \(\sigma^2\), the variance of the average would be \[Var(\hat{f}(x)) = \frac{1}{B} \sigma^2\]
- However, the trees are in fact not independent. Assuming the correlation between any two trees is \(\rho\), the variance of the bagged trees is \[Var(\hat{f}(x)) = \frac{1}{B^2} Var(\sum_{b=1}^B T_b) = \frac{1}{B^2} (B\sigma^2 + B(B-1) Cov(T_i, T_j)) = \frac{1}{B^2} (B\sigma^2 + B(B-1) \rho \sigma^2) \] \[=\rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2\]
- When \(B\) is large enough, the variance is dominated by the correlation \(\rho\). We need to think about ways to reduce the correlation between trees.
- That is achieved by random selection of features when growing the trees.
16.0.2 Out of Bag Error
For each observation \((x_i, y_i)\), construct its predictor by averaging those trees corresponding to bootstrap samples which do not contain the observation \((x_i, y_i)\).
OOB error is almost identical to \(N\) fold Cross Validation. Hence Random Forest can do the training and cross-validation at the same time. Once OOB error stabilizes (as the algorithm increases the number of trees), the training can be terminated.
17 Clustering
17.1 K means
- Pick initial centroids.
- Assign each observation to the closest centroid.
- Compute the mean of each cluster, and treat these means as the new centroids.
Repeat step 2 and step 3 until the cost converges.
The cost is the sum of squared distances of each observation to its cluster mean.
- K-means can converge to a local minimum.
- Try random initial points several times and choose the clusters with the lowest cost.
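A minimal K-means sketch following the three steps above; the initialization (random data points) and stopping rule are illustrative choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # step 1: pick initial centroids
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                              # step 2: assign to closest centroid
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])                                                         # step 3: recompute cluster means
        if np.allclose(new_centroids, centroids):                  # stop once the centroids stabilize
            break
        centroids = new_centroids
    return labels, centroids
```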
17.2 Gaussian Mixture
17.2.1 EM algorithm
Consider a training dataset \(\{x_1, x_2, ..., x_N\}\); there is a latent random variable \(z_i\) associated with each training example \(x_i\) indicating which cluster it belongs to. Note that \(z_i\) is unobservable. We can try to maximize the log likelihood \[l(\theta) = \sum_{i=1}^N log(p(x_i|\theta))= \sum_{i=1}^N log(\sum_z p(x_i, z | \theta))\]
Directly optimizing a log of a sum is difficult, so let's instead maximize a lower bound. The lower bound can be found via Jensen's Inequality: \[E(f(X)) \geq f(E(X))\] where \(f\) is strictly convex and \(X\) is a random variable; equality holds if and only if \(X = E(X)\) with probability 1. Since \(log\) is concave, the inequality reverses, \(log(E(X)) \geq E(log(X))\), which is what gives the lower bound below.
Now we can derive a lower bound for the maximum likelihood \[l(\theta) = \sum_{i=1}^N log(\sum_z p(x^{(i)}, z^{(i)} | \theta)) = \sum_{i=1}^N log(\sum_z q(z^{(i)}) \frac{p(x^{(i)}, z^{(i)} | \theta)}{q(z^{(i)})})\] \[\geq \sum_{i=1}^N \sum_z q(z^{(i)}) log(\frac{p(x^{(i)}, z^{(i)} | \theta)}{q(z^{(i)})})\] for any given distribution \(q(z)\).
If \(\theta\) is known (we can make an initial guess of \(\theta\)), we want the lower bound to equal the original log likelihood \(l(\theta)\) at the guessed point \(\theta\). We therefore require \(\frac{p(x^{(i)}, z^{(i)} | \theta)}{q(z^{(i)})}\) to be constant in \(z\). Then we have \[q(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}|\theta)}{\sum_{z} p(x^{(i)}, z|\theta)} = p(z^{(i)}|x^{(i)}, \theta)\] This is known as the E-step.
The M-step maximizes the lower bound over \(\theta\), holding \(q\) fixed. \[\theta = argmax_\theta \sum_{i=1}^N \sum_z q(z^{(i)}) log(\frac{p(x^{(i)}, z^{(i)} | \theta)}{q(z^{(i)})})\]
The overall algorithm:
- Make an initial guess of \(\theta\).
- E-step, compute the distribution \(q\) based on the guess \(\theta\)
\[q(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}|\theta)}{\sum_{z} p(x^{(i)}, z|\theta)} = p(z^{(i)}|x^{(i)}, \theta)\]
- M-step: maximize the lower bound over \(\theta\), using the computed distribution \(q\).
\[\theta = argmax_\theta \sum_{i=1}^N \sum_z q(z^{(i)}) log(\frac{p(x^{(i)}, z^{(i)} | \theta)}{q(z^{(i)})})\]
- Iterate step 2 and step 3 until convergence.
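A sketch of these EM steps for a one-dimensional Gaussian mixture: the E-step computes the responsibilities \(q(z^{(i)})\), and the M-step re-estimates the mixing weights, means, and standard deviations. The initialization and iteration count are illustrative choices.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(x, k=2, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)                       # mixing weights
    mu = rng.choice(x, size=k, replace=False)      # initial guess of the component means
    sigma = np.full(k, x.std())
    for _ in range(n_iter):
        # E-step: q(z_i = j) = p(z_i = j | x_i, theta)
        dens = np.array([pi[j] * gauss_pdf(x, mu[j], sigma[j]) for j in range(k)])  # shape (k, N)
        q = dens / dens.sum(axis=0)
        # M-step: maximize the lower bound given q
        Nk = q.sum(axis=1)
        pi = Nk / len(x)
        mu = (q @ x) / Nk
        sigma = np.sqrt(np.array([q[j] @ (x - mu[j]) ** 2 for j in range(k)]) / Nk)
    return pi, mu, sigma
```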
19 Deep Learning
19.2 Stochastic Gradient Descent (mini-batch)
In batch gradient descent, the gradient is computed from the whole training dataset. Instead, we shuffle the training data and take \(k\) samples (a mini-batch) to approximate the true gradient. \[\theta_{n + 1} = \theta_n - \eta \hat{G}_k(\theta_n)\]
19.3 Momentum Stochastic Gradient Descent
Keep a portion of the previous velocity (step difference), and adjust it with the latest gradient. \[v_{n + 1} = \gamma v_{n} - \eta \nabla_\theta J(\theta_n)\] \[\theta_{n+1} = \theta_n + v_{n + 1}\]
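A sketch of the mini-batch SGD update with momentum; `grad_fn(theta, X_batch, y_batch)` is a hypothetical placeholder for the gradient of the loss on one mini-batch, and the hyperparameter values are illustrative.

```python
import numpy as np

def sgd_momentum(theta, grad_fn, X, y, lr=0.01, gamma=0.9, batch_size=32, n_epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    v = np.zeros_like(theta)
    for _ in range(n_epochs):
        idx = rng.permutation(len(y))                            # shuffle the training data
        for start in range(0, len(y), batch_size):
            b = idx[start:start + batch_size]                    # mini-batch indices
            v = gamma * v - lr * grad_fn(theta, X[b], y[b])      # keep a fraction of the previous velocity
            theta = theta + v                                    # take the step
    return theta
```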
19.4 Why Rectified Linear Unit (ReLU)?
- Advantages:
- No vanishing-gradient problem for positive inputs, unlike sigmoid and tanh, which saturate
- Promotes sparse representations (outputs 0 for negative inputs), which induces a form of regularization
- Efficient in computation
- Disadvantages:
- Asymmetric
- Unbounded; this can be mitigated by batch normalization
- Nondifferentiable at 0