The Elements of Statistical Learning: Data Mining, Inference, and Prediction 2nd ed

One of the most terrible textbooks I have read

Textbook for the course: EHSC-GA 2306 Data Mining (Fall 2011)

Textbook for the course: EHSC-GA 2306 Data Mining (Spring 2013)

Title: The Elements of Statistical Learning: Data Mining, Inference, and Prediction 2nd ed
Authors: Trevor Hastie, Robert Tibshirani, Jerome Friedman
Edition: 2
Finished Date
Rating: 1
Language: English
Genres: Data Science, Data Mining
Level: Advanced
Publishers: Springer
Publication Date: 2016
ISBN: 978-0387848570
Format: Pdf
Pages: 745
Download: Pdf

Ch 1: introduction

“There is no true interpretation of anything; interpretation is a vehicle in the service of human comprehension. The value of interpretation is in enabling others to fruitfully think about an idea” - Andreas Buja



Interpretations of mathematical formulas can be correct or incorrect; likewise, interpretations of p-values can be correct or incorrect.

a learner: a prediction model built from a training set of data

Ch 2: overview of supervised learning

2.1 introduction

supervised learning: use the inputs to predict the values of the outputs

  • machine learning

    • inputs
    • outputs
  • statistics

    • predictors/independent variables
    • responses/dependent variables
  • pattern recognition

    • features (term for the inputs)

2.2 variable types

  • variable type

    1. qualitative/categorical/discrete variables/factors

      • typically represented by numeric codes, referred to as targets

        • 0 and 1
        • -1 and 1
      • dummy variables (2 variants)

        1. a K-level qualitative variable is represented by a vector of K binary variables
        2. a K-level qualitative variable is represented by a vector of K-1 binary variables
    2. quantitative

    3. ordered categorical (e.g., small, medium, large): the values are ordered, but there is no metric notion of distance between them
  • distinction in output type

    • classification: predict qualitative outputs
    • regression: predict quantitative outputs
  • distinction in input type => distinction in the types of methods that are used for prediction

    • qualitative
    • quantitative
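The two dummy-variable codings above can be sketched in plain numpy (a minimal illustration of my own, not code from the book; the names `dummy_full` and `dummy_reference` are made up):

```python
import numpy as np

def dummy_full(g, K):
    """Code a K-level qualitative variable as K binary indicator columns."""
    Z = np.zeros((len(g), K), dtype=int)
    Z[np.arange(len(g)), g] = 1
    return Z

def dummy_reference(g, K):
    """Code a K-level qualitative variable as K-1 binary columns;
    level 0 is the reference level, encoded as all zeros."""
    return dummy_full(g, K)[:, 1:]

g = np.array([0, 2, 1, 2])   # four observations of a 3-level factor
print(dummy_full(g, 3))      # 4 x 3: exactly one 1 per row
print(dummy_reference(g, 3)) # 4 x 2: level 0 becomes the row (0, 0)
```

The K-column coding is redundant (each row sums to one), which is why the K-1 coding is the usual choice when an intercept is already in the model.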


  • $X$: input variable(s)

    • if $X$ is vector

      • $X_j$: components
  • outputs

    • Y: quantitative outputs
    • G: qualitative outputs
  • observed values are written in lower case

  • matrices are represented by bold uppercase letters
  • vectors are written in bold only when they have $N$ components, e.g., $\textbf{y}$


|          | a variable                                   | the $j$th variable in the input vector                        |
| -------- | -------------------------------------------- | ------------------------------------------------------------- |
| generic  | $X$                                          | $X_j$                                                          |
| observed | $x_i$: the $i$th observed $X$ (a $p$-vector) | $\textbf{x}_j$: all $N$ observations of $X_j$ (an $N$-vector) |

observed matrix: $\textbf{X}$ (an $N \times p$ matrix)

quantitative output

| variable | observed | predicted     |
| -------- | -------- | ------------- |
| $Y$      | $y$      | $\widehat{Y}$ |

qualitative output

| variable | observed | predicted     |
| -------- | -------- | ------------- |
| $G$      | $g$      | $\widehat{G}$ |

$(x_i, y_i)$ or $(x_i, g_i)$, $i = 1, \ldots, N$: training data

2.3 two simple approaches to prediction: least squares and k-nearest neighbors

$X^T = (X_1, X_2, \ldots, X_p)$: a vector of $p$ inputs
$f(X)=X^T \beta$, with the intercept included in $\beta$, i.e.,

  • the first column of $\textbf{X}$ is a column of ones: $\textbf{x}_1 = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}$
  • the first element of $\beta$ is the intercept $\beta_0$

least squares: pick the coefficients $\beta$ to minimize the residual sum of squares $\mbox{RSS}(\beta) = \sum\limits_{i=1}^N(y_i-x_i^T \beta)^2 = (\textbf{y}-\textbf{X}\beta)^T(\textbf{y}-\textbf{X}\beta)$

RSS($\beta$) is a quadratic function of the parameters, and hence its minimum always exists, but may not be unique.
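When $\textbf{X}^T\textbf{X}$ is nonsingular, the minimizer is unique and can be found by solving the normal equations $\textbf{X}^T\textbf{X}\beta = \textbf{X}^T\textbf{y}$. A minimal numpy sketch on made-up data (my own illustration, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
# first column of ones carries the intercept beta_0
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

def rss(beta):
    """Residual sum of squares (y - X beta)^T (y - X beta)."""
    r = y - X @ beta
    return r @ r

# normal equations: (X^T X) beta_hat = X^T y, assuming X^T X is nonsingular
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)       # close to beta_true for this low-noise data
print(rss(beta_hat))  # no other beta achieves a smaller RSS
```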

examples used throughout the book

example 1 supervised learning of classification: Email Spam

page 2

  • 4601 email messages
  • outcome: email or spam
  • the relative frequencies of 57 of the most commonly occurring words and punctuation marks in the email message

goal: design an automatic spam detector to predict whether the email was junk email, or “spam”, and then filter out spam

  • decide which features to use and how

example 2 supervised learning of regression: prostate cancer

  • 97 men who were about to receive a radical prostatectomy
  • goal: predict the log of PSA (lpsa) from a number of measurements including

    • log cancer volume lcavol
    • log prostate weight lweight
    • age
    • log of benign prostatic hyperplasia amount lbph
    • seminal vesicle invasion svi
    • log of capsular penetration lcp
    • Gleason score gleason
    • percent of Gleason scores 4 or 5 pgg45

example 3 supervised learning of classification with a very low error-rate requirement: handwritten digit recognition

  • data is from the handwritten ZIP codes on envelopes from U.S. postal mail
  • each image is a segment from a five digit ZIP code, isolating a single digit
    • 16 $\times$ 16 eight-bit grayscale maps
    • each pixel ranging in intensity from 0 to 255
    • normalized to have approximately the same size and orientation

goal: predict the identity of each image (0, 1, …, 9, "don't know") from the 16 $\times$ 16 matrix of pixel intensities quickly and accurately

  • method to keep the error rates very low
    • images classified as "don't know" are sorted by hand

example 4: DNA Expression Microarrays

  • 64 cancer tumors from different patients
  • 6830 genes


(a) which samples are most similar to each other, in terms of their expression profiles across genes?


(b) which genes are most similar to each other, in terms of their expression profiles across samples?


(c) do certain genes show very high (or low) expression for certain cancer samples?