One of the most terrible textbooks I have read

Textbook for the course: EHSC-GA 2306 Data Mining (Fall 2011)

Textbook for the course: EHSC-GA 2306 Data Mining (Spring 2013)

• Title: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.
• Authors: Trevor Hastie, Robert Tibshirani, Jerome Friedman
• Edition: 2
• Finished Date:
• Rating: 1
• Language: English
• Genres: Data Science, Data Mining
• Level: Advanced
• Publisher: Springer
• Publication Date: 2016
• ISBN: 978-0387848570
• Format: PDF
• Pages: 745

## Ch 1: introduction

“There is no true interpretation of anything; interpretation is a vehicle in the service of human comprehension. The value of interpretation is in enabling others to fruitfully think about an idea” - Andreas Buja

a learner: a prediction model built from a training set of data

## Ch 2: overview of supervised learning

### 2.1 introduction

supervised learning: use the inputs to predict the values of the outputs

• machine learning

  • inputs
  • outputs
• statistics

  • predictors / independent variables
  • dependent variables
• pattern recognition

  • features
  • responses

### 2.2 variable types

• variable type

1. qualitative / categorical / discrete variables / factors

  • represented numerically by target codes, e.g.

    • 0 and 1
    • -1 and 1
2. quantitative

3. ordered categorical

• dummy variables (2 types)

1. K-level qualitative variable is represented by a vector of K binary variables
2. K-level qualitative variable is represented by a vector of K-1 binary variables
• distinction in output type

  • classification: predict qualitative outputs
  • regression: predict quantitative outputs
• distinction in input type => distinction in the types of methods that are used for prediction

  • qualitative
  • quantitative
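The two dummy-variable codings above can be sketched in plain Python. This is a hypothetical three-level example (the level names and function names are my own, not from the book): the first coding uses K binary variables, one per level; the second uses K-1 variables with one level serving as the baseline.

```python
# Sketch of the two dummy-variable codings for a K-level qualitative variable.
# Hypothetical 3-level factor for illustration; no libraries assumed.
levels = ["red", "green", "blue"]  # K = 3 levels
K = len(levels)

def one_hot(value):
    """Coding 1: K binary variables, one per level."""
    return [1 if value == lvl else 0 for lvl in levels]

def reference_coding(value):
    """Coding 2: K-1 binary variables; the first level is the
    baseline and is encoded as all zeros."""
    return [1 if value == lvl else 0 for lvl in levels[1:]]
```

For example, `one_hot("green")` gives `[0, 1, 0]`, while `reference_coding("red")` gives `[0, 0]` because the baseline level carries no indicator of its own.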

### terminology

• $X$: input variable(s)

• if $X$ is a vector

• $X_j$: components
• outputs

• Y: quantitative outputs
• G: qualitative outputs
• observed values are written in lower case

• matrices are represented by bold uppercase letters
• vectors will be bold, if they have N components, such as, $\textbf{y}$

input

|  | variable | observed |
|---|---|---|
| a single variable | $X$ | $x_i$ |
| a vector of variables | $X$ | $x_i$: the $i$th observed $X$ (a $p$-vector) |
| one variable in the vector | $X_j$ | $\mbox{x}_j$: all $N$ observations of $X_j$ (an $N$-vector) |
| all inputs |  | $\mbox{X}$: the $N \times p$ observed matrix |

quantitative output

| variable | observed | predicted |
|---|---|---|
| $Y$ | $y$ | $\widehat{Y}$ |

qualitative output

| variable | observed | predicted |
|---|---|---|
| $G$ | $g$ | $\widehat{G}$ |

($x_i$, $y_i$) or ($x_i$, $g_i$), $i = 1, \ldots, N$: training data

### 2.3 two simple approaches to prediction: least squares and k-nearest neighbors

$X^T = (X_1, X_2, \ldots, X_p)$

$f(X)=X^T \beta$ including intercept, i.e.,

• $\mbox{x}_1 = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}$, i.e., a constant column of ones is included in the inputs
• the first component of $\beta$ is the intercept $\beta_0$

least squares: pick the coefficients $\beta$ to minimize the residual sum of squares $\mbox{RSS}(\beta) = \sum\limits_{i=1}^N(y_i-x_i^T \beta)^2 = (\textbf{y}-\textbf{X}\beta)^T(\textbf{y}-\textbf{X}\beta)$

RSS($\beta$) is a quadratic function of the parameters, and hence its minimum always exists, but may not be unique.
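A minimal numeric sketch of this least-squares fit, on synthetic data with coefficients I made up for illustration: prepend a column of ones so $\beta_0$ acts as the intercept, then minimize $\mbox{RSS}(\beta)$ (here via `numpy.linalg.lstsq`, which solves the same problem stably).

```python
import numpy as np

# Synthetic data with known coefficients (an assumption for illustration).
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
beta_true = np.array([2.0, -1.0, 0.5, 3.0])  # [beta_0, beta_1, beta_2, beta_3]

# Prepend x_1 = (1, ..., 1)^T so the first coefficient is the intercept beta_0.
X1 = np.hstack([np.ones((N, 1)), X])
y = X1 @ beta_true + rng.normal(scale=0.1, size=N)

# Minimize RSS(beta) = ||y - X beta||^2; lstsq solves the normal equations stably.
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

rss = np.sum((y - X1 @ beta_hat) ** 2)
```

With low noise, `beta_hat` recovers `beta_true` closely; the quadratic form of RSS is why a minimizer always exists, and it is unique here because the columns of `X1` are linearly independent with high probability.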

# examples used throughout the book

## example 1 supervised learning of classification: Email Spam

page 2

• 4601 email messages
• outcome: email or spam
• the relative frequencies of 57 of the most commonly occurring words and punctuation marks in the email message

goal: design an automatic spam detector to predict whether the email was junk email, or “spam”, and then filter out spam

• decide which features to use and how

## example 2 supervised learning of regression: prostate cancer

• goal: predict the log of PSA (lpsa) from a number of measurements including

• log cancer volume lcavol
• log prostate weight lweight
• age
• log of benign prostatic hyperplasia amount lbph
• seminal vesicle invasion svi
• log of capsular penetration lcp
• Gleason score gleason
• percent of Gleason scores 4 or 5 pgg45

## example 3 supervised learning of classification & keep the error rates very low: handwritten digit recognition

• data is from the handwritten ZIP codes on envelopes from U.S. postal mail
• each image is a segment from a five digit ZIP code, isolating a single digit
• 16 $\times$ 16 eight-bit grayscale maps
• each pixel ranging in intensity from 0 to 255
• normalized to have approximately the same size and orientation

goal: predict the identity of each image (0, 1, …, 9, don’t know) from the 16 $\times$ 16 matrix of pixel intensities quickly and accurately

• the method must keep the error rate very low
• envelopes whose digits are classified as “don’t know” are sorted by hand

## example 4: DNA Expression Microarrays

• 64 cancer tumors from different patients
• 6830 genes

goal

(a) which samples are most similar to each other, in terms of their expression profiles across genes?

unsupervised


(b) which genes are most similar to each other, in terms of their expression profiles across samples?

unsupervised


(c) do certain genes show very high (or low) expression for certain cancer samples?
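Question (a) above can be sketched numerically: treat each sample as a column of the expression matrix and compare samples by the correlation of their profiles across genes. This is a hypothetical illustration on random stand-in data (the real dataset is 6830 genes $\times$ 64 samples); correlation is one of several reasonable similarity measures, not necessarily the book's choice.

```python
import numpy as np

# Stand-in expression matrix: rows are genes, columns are samples.
rng = np.random.default_rng(1)
expr = rng.normal(size=(200, 8))  # 200 genes x 8 samples (toy sizes)

# Sample-sample similarity: correlate columns across genes.
corr = np.corrcoef(expr, rowvar=False)  # 8 x 8 correlation matrix
np.fill_diagonal(corr, -np.inf)         # ignore each sample's self-similarity

# The most similar pair of samples under this measure.
i, j = np.unravel_index(np.argmax(corr), corr.shape)
```

Question (b) is the same computation with `rowvar=True` (correlating genes across samples); both are unsupervised, since no outcome variable is used.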