One of the most terrible textbooks I have read

Textbook for the course: EHSC-GA 2306 Data Mining (Fall 2011, Spring 2013)

- Title: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.
- Authors: Trevor Hastie, Robert Tibshirani, Jerome Friedman
- Edition: 2
- Rating: 1
- Language: English
- Genres: Data Science, Data Mining
- Level: Advanced
- Publisher: Springer
- Publication Date: 2016
- ISBN: 978-0387848570
- Format: Pdf
- Pages: 745

## Ch 1: introduction

I have always been interested in interpretation. "There is no true interpretation of anything; interpretation is a vehicle in the service of human comprehension. The value of interpretation is in enabling others to fruitfully think about an idea" - Andreas Buja

I could not find the original source of this quote online.

Interpretations of a mathematical formula can be correct or incorrect; the same goes for interpretations of a p-value.

What is learning?

**a learner**: a prediction model built from a training set of data

## Ch 2: overview of supervised learning

### 2.1 introduction

**supervised learning**: use the inputs to predict the values of the outputs

machine learning

- inputs
- outputs

statistics

- predictors/independent variables
- dependent variables

pattern recognition

- features
- responses

### 2.2 variable types

variable types

- quantitative
- qualitative (also called categorical/discrete variables/factors)
- ordered categorical

qualitative variables are typically represented by numeric codes, **targets**, e.g.

- 0 and 1
- -1 and 1

dummy variables (2 types)

- a K-level qualitative variable is represented by a vector of K binary variables
- a K-level qualitative variable is represented by a vector of K-1 binary variables
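The two dummy-variable encodings above can be sketched in plain Python (a minimal illustration, not from the book; the `dummy_encode` helper and the color levels are invented for the example):

```python
def dummy_encode(levels, value, drop_first=False):
    """Encode a K-level qualitative value as a binary vector.

    drop_first=False -> K dummies (one indicator per level)
    drop_first=True  -> K-1 dummies (the first level becomes the all-zero baseline)
    """
    vec = [1 if value == lev else 0 for lev in levels]
    return vec[1:] if drop_first else vec


levels = ["red", "green", "blue"]  # K = 3 levels

print(dummy_encode(levels, "green"))                   # [0, 1, 0]  (K dummies)
print(dummy_encode(levels, "green", drop_first=True))  # [1, 0]     (K-1 dummies)
print(dummy_encode(levels, "red", drop_first=True))    # [0, 0]     (baseline level)
```

With K-1 dummies the encoding stays full-rank when an intercept column is present, which is why regression software usually drops one level.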

distinction in output type

- **classification**: predict qualitative outputs
- **regression**: predict quantitative outputs

distinction in input type => distinction in the types of methods used for prediction

- qualitative
- quantitative

### terminology

$X$: input variable(s)

if $X$ is vector

- $X_j$: components

outputs

- $Y$: quantitative outputs
- $G$: qualitative outputs

observed values are written in lower case

- matrices are represented by bold uppercase letters
- **vectors will be bold only if they have N components**, such as $\textbf{y}$

input

. | a vector of variables | a variable in the vector of variables
---|---|---
variable | $X$ | $X_j$
observed | $x_i$: the $i$th observation of $X$ (a $p$-vector) | $\mbox{x}_j$ (an $N$-vector)

observed matrix: $\mbox{X}$ ($N \times p$)

quantitative output

variable | observed | predicted
---|---|---
$Y$ | $y$ | $\widehat{Y}$

qualitative output

variable | observed | predicted
---|---|---
$G$ | $g$ | $\widehat{G}$

($x_i$, $y_i$) or ($x_i$, $g_i$), $i = 1, \ldots, N$: **training data**

### 2.3 two simple approaches to prediction: least squares and k-nearest neighbors

$X^T = \begin{bmatrix} X_1 & X_2 & \cdots & X_p \end{bmatrix}$

$f(X)=X^T \beta$ including the intercept, i.e.,

- $\mbox{x}_1 = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}$ (the first column of $\mbox{X}$ is all 1s)
- the first element of $\beta$ is $\beta_0$

**least squares**: pick the coefficients $\beta$ to minimize the residual sum of squares $\mbox{RSS}(\beta) = \sum\limits_{i=1}^N(y_i-x_i^T \beta)^2 = (\textbf{y}-\textbf{X}\beta)^T(\textbf{y}-\textbf{X}\beta)$

RSS($\beta$) is a quadratic function of the parameters, and hence its minimum always exists, but may not be unique.
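Setting the gradient of $\mbox{RSS}(\beta)$ to zero gives the normal equations $\textbf{X}^T\textbf{X}\beta = \textbf{X}^T\textbf{y}$; when $\textbf{X}^T\textbf{X}$ is nonsingular, the unique minimizer is $\hat{\beta} = (\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\textbf{y}$. A minimal NumPy sketch (the data here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 2

# Design matrix with a leading column of 1s, so the first coefficient is beta_0.
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + 0.1 * rng.normal(size=N)

# Normal equations: solve X^T X beta = X^T y
# (solving the linear system is more stable than forming the explicit inverse).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residual sum of squares at the minimizer.
rss = float((y - X @ beta_hat) @ (y - X @ beta_hat))

print(beta_hat)  # close to [1, 2, -3]
print(rss)
```

In practice `np.linalg.lstsq` (which uses an orthogonal factorization) is preferred over forming $\textbf{X}^T\textbf{X}$ explicitly, and it also handles the non-unique case where $\textbf{X}^T\textbf{X}$ is singular.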

# examples used throughout the book

## example 1 supervised learning of classification: Email Spam

page 2

- 4601 email messages
- outcome: `email` or `spam`
- inputs: the relative frequencies of 57 of the most commonly occurring words and punctuation marks in the email messages

goal: design an automatic spam detector to predict whether an email is junk email, or "spam", and then filter out spam

- decide which features to use and how

## example 2 supervised learning of regression: prostate cancer

- 97 men who were about to receive a radical prostatectomy

goal: predict the log of PSA (`lpsa`) from a number of measurements including

- log cancer volume `lcavol`
- log prostate weight `lweight`
- age
- log of benign prostatic hyperplasia amount `lbph`
- seminal vesicle invasion `svi`
- log of capsular penetration `lcp`
- Gleason score `gleason`
- percent of Gleason scores 4 or 5 `pgg45`

## example 3 supervised learning of classification & keep the error rates very low: handwritten digit recognition

- data is from the handwritten ZIP codes on envelopes from U.S. postal mail
- each image is a segment from a five digit ZIP code, isolating a single digit
- 16 $\times$ 16 eight-bit grayscale maps
- each pixel ranging in intensity from 0 to 255
- normalized to have approximately the same size and orientation

goal: predict the identity of each image (0, 1, …, 9, or "don't know") from the 16 $\times$ 16 matrix of pixel intensities, quickly and accurately

- the method must keep the error rate very low; images labeled `don't know` will be sorted by hand

## example 4: DNA Expression Microarrays

- 64 cancer tumors from different patients
- 6830 genes

goal

(a) which samples are most similar to each other, in terms of their expression profiles across genes?

```
unsupervised
```

(b) which genes are most similar to each other, in terms of their expression profiles across samples?

```
unsupervised
```

(c) do certain genes show very high (or low) expression for certain cancer samples?

I think this one is descriptive.