The Elements of Statistical Learning

Trevor Hastie, Robert Tibshirani, Jerome Friedman


Tags: Machine Learning, Statistical Learning, Statistics, Data Mining, Mathematics

December 2008

Springer

Contents
1 Introduction
2 Overview of Supervised Learning 2.1 Introduction 2.2 Variable Types and Terminology 2.3 Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors 2.3.1 Linear Models and Least Squares 2.3.2 Nearest-Neighbor Methods 2.3.3 From Least Squares to Nearest Neighbors 2.4 Statistical Decision Theory 2.5 Local Methods in High Dimensions 2.6 Statistical Models, Supervised Learning and Function Approximation 2.6.1 A Statistical Model for the Joint Distribution Pr(X, Y) 2.6.2 Supervised Learning 2.6.3 Function Approximation 2.7 Structured Regression Models 2.7.1 Difficulty of the Problem 2.8 Classes of Restricted Estimators 2.8.1 Roughness Penalty and Bayesian Methods 2.8.2 Kernel Methods and Local Regression 2.8.3 Basis Functions and Dictionary Methods 2.9 Model Selection and the Bias–Variance Tradeoff Bibliographic Notes Exercises
3 Linear Methods for Regression 3.1 Introduction 3.2 Linear Regression Models and Least Squares 3.2.1 Example: Prostate Cancer 3.2.2 The Gauss–Markov Theorem 3.2.3 Multiple Regression from Simple Univariate Regression 3.2.4 Multiple Outputs 3.3 Subset Selection 3.3.1 Best-Subset Selection 3.3.2 Forward- and Backward-Stepwise Selection 3.3.3 Forward-Stagewise Regression 3.3.4 Prostate Cancer Data Example (Continued) 3.4 Shrinkage Methods 3.4.1 Ridge Regression 3.4.2 The Lasso 3.4.3 Discussion: Subset Selection, Ridge Regression and the Lasso 3.4.4 Least Angle Regression 3.5 Methods Using Derived Input Directions 3.5.1 Principal Components Regression 3.5.2 Partial Least Squares 3.6 Discussion: A Comparison of the Selection and Shrinkage Methods 3.7 Multiple Outcome Shrinkage and Selection 3.8 More on the Lasso and Related Path Algorithms 3.8.1 Incremental Forward Stagewise Regression 3.8.2 Piecewise-Linear Path Algorithms 3.8.3 The Dantzig Selector 3.8.4 The Grouped Lasso 3.8.5 Further Properties of the Lasso 3.8.6 Pathwise Coordinate Optimization 3.9 Computational Considerations Bibliographic Notes Exercises
4 Linear Methods for Classification 4.1 Introduction 4.2 Linear Regression of an Indicator Matrix 4.3 Linear Discriminant Analysis 4.3.1 Regularized Discriminant Analysis 4.3.2 Computations for LDA 4.3.3 Reduced-Rank Linear Discriminant Analysis 4.4 Logistic Regression 4.4.1 Fitting Logistic Regression Models 4.4.2 Example: South African Heart Disease 4.4.3 Quadratic Approximations and Inference 4.4.4 L1 Regularized Logistic Regression 4.4.5 Logistic Regression or LDA? 4.5 Separating Hyperplanes 4.5.1 Rosenblatt's Perceptron Learning Algorithm 4.5.2 Optimal Separating Hyperplanes Bibliographic Notes Exercises
5 Basis Expansions and Regularization 5.1 Introduction 5.2 Piecewise Polynomials and Splines 5.2.1 Natural Cubic Splines 5.2.2 Example: South African Heart Disease (Continued) 5.2.3 Example: Phoneme Recognition 5.3 Filtering and Feature Extraction 5.4 Smoothing Splines 5.4.1 Degrees of Freedom and Smoother Matrices 5.5 Automatic Selection of the Smoothing Parameters 5.5.1 Fixing the Degrees of Freedom 5.5.2 The Bias–Variance Tradeoff 5.6 Nonparametric Logistic Regression 5.7 Multidimensional Splines 5.8 Regularization and Reproducing Kernel Hilbert Spaces 5.8.1 Spaces of Functions Generated by Kernels 5.8.2 Examples of RKHS 5.9 Wavelet Smoothing 5.9.1 Wavelet Bases and the Wavelet Transform 5.9.2 Adaptive Wavelet Filtering Bibliographic Notes Exercises Appendix: Computational Considerations for Splines Appendix: B-splines Appendix: Computations for Smoothing Splines
6 Kernel Smoothing Methods 6.1 One-Dimensional Kernel Smoothers 6.1.1 Local Linear Regression 6.1.2 Local Polynomial Regression 6.2 Selecting the Width of the Kernel 6.3 Local Regression in ℝ^p 6.4 Structured Local Regression Models in ℝ^p 6.4.1 Structured Kernels 6.4.2 Structured Regression Functions 6.5 Local Likelihood and Other Models 6.6 Kernel Density Estimation and Classification 6.6.1 Kernel Density Estimation 6.6.2 Kernel Density Classification 6.6.3 The Naive Bayes Classifier 6.7 Radial Basis Functions and Kernels 6.8 Mixture Models for Density Estimation and Classification 6.9 Computational Considerations Bibliographic Notes Exercises
7 Model Assessment and Selection 7.1 Introduction 7.2 Bias, Variance and Model Complexity 7.3 The Bias–Variance Decomposition 7.3.1 Example: Bias–Variance Tradeoff 7.4 Optimism of the Training Error Rate 7.5 Estimates of In-Sample Prediction Error 7.6 The Effective Number of Parameters 7.7 The Bayesian Approach and BIC 7.8 Minimum Description Length 7.9 Vapnik–Chervonenkis Dimension 7.9.1 Example (Continued) 7.10 Cross-Validation 7.10.1 K-Fold Cross-Validation 7.10.2 The Wrong and Right Way to Do Cross-validation 7.10.3 Does Cross-Validation Really Work? 7.11 Bootstrap Methods 7.11.1 Example (Continued) 7.12 Conditional or Expected Test Error? Bibliographic Notes Exercises
8 Model Inference and Averaging 8.1 Introduction 8.2 The Bootstrap and Maximum Likelihood Methods 8.2.1 A Smoothing Example 8.2.2 Maximum Likelihood Inference 8.2.3 Bootstrap versus Maximum Likelihood 8.3 Bayesian Methods 8.4 Relationship Between the Bootstrap and Bayesian Inference 8.5 The EM Algorithm 8.5.1 Two-Component Mixture Model 8.5.2 The EM Algorithm in General 8.5.3 EM as a Maximization–Maximization Procedure 8.6 MCMC for Sampling from the Posterior 8.7 Bagging 8.7.1 Example: Trees with Simulated Data 8.8 Model Averaging and Stacking 8.9 Stochastic Search: Bumping Bibliographic Notes Exercises
9 Additive Models, Trees, and Related Methods 9.1 Generalized Additive Models 9.1.1 Fitting Additive Models 9.1.2 Example: Additive Logistic Regression 9.1.3 Summary 9.2 Tree-Based Methods 9.2.1 Background 9.2.2 Regression Trees 9.2.3 Classification Trees 9.2.4 Other Issues 9.2.5 Spam Example (Continued) 9.3 PRIM: Bump Hunting 9.3.1 Spam Example (Continued) 9.4 MARS: Multivariate Adaptive Regression Splines 9.4.1 Spam Example (Continued) 9.4.2 Example (Simulated Data) 9.4.3 Other Issues 9.5 Hierarchical Mixtures of Experts 9.6 Missing Data 9.7 Computational Considerations Bibliographic Notes Exercises
10 Boosting and Additive Trees 10.1 Boosting Methods 10.1.1 Outline of This Chapter 10.2 Boosting Fits an Additive Model 10.3 Forward Stagewise Additive Modeling 10.4 Exponential Loss and AdaBoost 10.5 Why Exponential Loss? 10.6 Loss Functions and Robustness 10.7 “Off-the-Shelf” Procedures for Data Mining 10.8 Example: Spam Data 10.9 Boosting Trees 10.10 Numerical Optimization via Gradient Boosting 10.10.1 Steepest Descent 10.10.2 Gradient Boosting 10.10.3 Implementations of Gradient Boosting 10.11 Right-Sized Trees for Boosting 10.12 Regularization 10.12.1 Shrinkage 10.12.2 Subsampling 10.13 Interpretation 10.13.1 Relative Importance of Predictor Variables 10.13.2 Partial Dependence Plots 10.14 Illustrations 10.14.1 California Housing 10.14.2 New Zealand Fish 10.14.3 Demographics Data Bibliographic Notes Exercises
11 Neural Networks 11.1 Introduction 11.2 Projection Pursuit Regression 11.3 Neural Networks 11.4 Fitting Neural Networks 11.5 Some Issues in Training Neural Networks 11.5.1 Starting Values 11.5.2 Overfitting 11.5.3 Scaling of the Inputs 11.5.4 Number of Hidden Units and Layers 11.5.5 Multiple Minima 11.6 Example: Simulated Data 11.7 Example: ZIP Code Data 11.8 Discussion 11.9 Bayesian Neural Nets and the NIPS 2003 Challenge 11.9.1 Bayes, Boosting and Bagging 11.9.2 Performance Comparisons 11.10 Computational Considerations Bibliographic Notes Exercises
12 Support Vector Machines and Flexible Discriminants 12.1 Introduction 12.2 The Support Vector Classifier 12.2.1 Computing the Support Vector Classifier 12.2.2 Mixture Example (Continued) 12.3 Support Vector Machines and Kernels 12.3.1 Computing the SVM for Classification 12.3.2 The SVM as a Penalization Method 12.3.3 Function Estimation and Reproducing Kernels 12.3.4 SVMs and the Curse of Dimensionality 12.3.5 A Path Algorithm for the SVM Classifier 12.3.6 Support Vector Machines for Regression 12.3.7 Regression and Kernels 12.3.8 Discussion 12.4 Generalizing Linear Discriminant Analysis 12.5 Flexible Discriminant Analysis 12.5.1 Computing the FDA Estimates 12.6 Penalized Discriminant Analysis 12.7 Mixture Discriminant Analysis 12.7.1 Example: Waveform Data Bibliographic Notes Exercises
13 Prototype Methods and Nearest-Neighbors 13.1 Introduction 13.2 Prototype Methods 13.2.1 K-means Clustering 13.2.2 Learning Vector Quantization 13.2.3 Gaussian Mixtures 13.3 k-Nearest-Neighbor Classifiers 13.3.1 Example: A Comparative Study 13.3.2 Example: k-Nearest-Neighbors and Image Scene Classification 13.3.3 Invariant Metrics and Tangent Distance 13.4 Adaptive Nearest-Neighbor Methods 13.4.1 Example 13.4.2 Global Dimension Reduction for Nearest-Neighbors 13.5 Computational Considerations Bibliographic Notes Exercises
14 Unsupervised Learning 14.1 Introduction 14.2 Association Rules 14.2.1 Market Basket Analysis 14.2.2 The Apriori Algorithm 14.2.3 Example: Market Basket Analysis 14.2.4 Unsupervised as Supervised Learning 14.2.5 Generalized Association Rules 14.2.6 Choice of Supervised Learning Method 14.2.7 Example: Market Basket Analysis (Continued) 14.3 Cluster Analysis 14.3.1 Proximity Matrices 14.3.2 Dissimilarities Based on Attributes 14.3.3 Object Dissimilarity 14.3.4 Clustering Algorithms 14.3.5 Combinatorial Algorithms 14.3.6 K-means 14.3.7 Gaussian Mixtures as Soft K-means Clustering 14.3.8 Example: Human Tumor Microarray Data 14.3.9 Vector Quantization 14.3.10 K-medoids 14.3.11 Practical Issues 14.3.12 Hierarchical Clustering 14.4 Self-Organizing Maps 14.5 Principal Components, Curves and Surfaces 14.5.1 Principal Components 14.5.2 Principal Curves and Surfaces 14.5.3 Spectral Clustering 14.5.4 Kernel Principal Components 14.5.5 Sparse Principal Components 14.6 Non-negative Matrix Factorization 14.6.1 Archetypal Analysis 14.7 Independent Component Analysis and Exploratory Projection Pursuit 14.7.1 Latent Variables and Factor Analysis 14.7.2 Independent Component Analysis 14.7.3 Exploratory Projection Pursuit 14.7.4 A Direct Approach to ICA 14.8 Multidimensional Scaling 14.9 Nonlinear Dimension Reduction and Local Multidimensional Scaling 14.10 The Google PageRank Algorithm Bibliographic Notes Exercises
15 Random Forests 15.1 Introduction 15.2 Definition of Random Forests 15.3 Details of Random Forests 15.3.1 Out of Bag Samples 15.3.2 Variable Importance 15.3.3 Proximity Plots 15.3.4 Random Forests and Overfitting 15.4 Analysis of Random Forests 15.4.1 Variance and the De-Correlation Effect 15.4.2 Bias 15.4.3 Adaptive Nearest Neighbors Bibliographic Notes Exercises
16 Ensemble Learning 16.1 Introduction 16.2 Boosting and Regularization Paths 16.2.1 Penalized Regression 16.2.2 The “Bet on Sparsity” Principle 16.2.3 Regularization Paths, Over-fitting and Margins 16.3 Learning Ensembles 16.3.1 Learning a Good Ensemble 16.3.2 Rule Ensembles Bibliographic Notes Exercises
17 Undirected Graphical Models 17.1 Introduction 17.2 Markov Graphs and Their Properties 17.3 Undirected Graphical Models for Continuous Variables 17.3.1 Estimation of the Parameters when the Graph Structure is Known 17.3.2 Estimation of the Graph Structure 17.4 Undirected Graphical Models for Discrete Variables 17.4.1 Estimation of the Parameters when the Graph Structure is Known 17.4.2 Hidden Nodes 17.4.3 Estimation of the Graph Structure 17.4.4 Restricted Boltzmann Machines Exercises
18 High-Dimensional Problems: p ≫ N 18.1 When p is Much Bigger than N 18.2 Diagonal Linear Discriminant Analysis and Nearest Shrunken Centroids 18.3 Linear Classifiers with Quadratic Regularization 18.3.1 Regularized Discriminant Analysis 18.3.2 Logistic Regression with Quadratic Regularization 18.3.3 The Support Vector Classifier 18.3.4 Feature Selection 18.3.5 Computational Shortcuts When p ≫ N 18.4 Linear Classifiers with L1 Regularization 18.4.1 Application of Lasso to Protein Mass Spectroscopy 18.4.2 The Fused Lasso for Functional Data 18.5 Classification When Features are Unavailable 18.5.1 Example: String Kernels and Protein Classification 18.5.2 Classification and Other Models Using Inner-Product Kernels and Pairwise Distances 18.5.3 Example: Abstracts Classification 18.6 High-Dimensional Regression: Supervised Principal Components 18.6.1 Connection to Latent-Variable Modeling 18.6.2 Relationship with Partial Least Squares 18.6.3 Pre-Conditioning for Feature Selection 18.7 Feature Assessment and the Multiple-Testing Problem 18.7.1 The False Discovery Rate 18.7.2 Asymmetric Cutpoints and the SAM Procedure 18.7.3 A Bayesian Interpretation of the FDR 18.8 Bibliographic Notes Exercises
About the Book
During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting---the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression & path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates.