Lecture Applied data science: Validation
Pages: 23
File type: PDF
Size: 624.88 KB
Document information:
Lecture "Applied data science: Validation" includes content: validation set approach; overfitting; cross-validation; data leakage; nested cross-validation; bootstrapping;... We invite you to consult!
Content extracted from the document:
Lecture Applied data science: Validation

Overview
1. Introduction
2. Application
3. EDA
4. Learning Process
5. Bias-Variance Tradeoff
6. Regression (review)
7. Classification
8. Validation
9. Regularisation
10. Clustering
11. Evaluation
12. Deployment
13. Ethics

Lecture outline
- Validation set approach
- Overfitting
- Cross-validation
- Data leakage
- Nested cross-validation
- Bootstrapping

Validation set approach
- Randomly split the original data into two parts, a training set and a test (validation) set.
- Fit the OLS model on the training set and predict the responses in the validation set.
- Calculate the test MSE (the MSE from applying the model to the test set).

Overfitting
… is the tendency of data mining procedures to tailor models to the training data, at the expense of generalisation to previously unseen data.
… as a model gets more complex it is allowed to pick up harmful false correlations (noise). The harm occurs when these false correlations produce incorrect generalisations in the model.

Validation set approach (continued)
Pros:
- Simple and easy to implement.
Cons:
- Highly variable (across multiple runs).
- Tends to overestimate the test error (because only roughly half of the original dataset is used for training).

Cross-validation
Cross-validation is a resampling method, which
- Repeatedly and randomly draws subsets of data from a sample.
- Refits a model (e.g. OLS) on these subsets to reveal information that would stay hidden if the model were fitted only once, e.g. the variability of the fitted model.
- Is computationally expensive.
Two resampling methods are common:
- Cross-validation: model selection and model evaluation.
- Bootstrapping: evaluating the variability of a parameter estimate.

Leave-one-out cross-validation (LOOCV)
- Use only 1 observation for testing, and fit the OLS regression on the remainder of the original data.
- Repeat the procedure n times so that each observation is used for testing once.
- Calculate the CV error.
Pros:
- Approximately unbiased estimate of the test error (because almost all data points are used for training).
- Very stable (identical CV error across multiple runs).
Cons:
- Very time consuming (especially when n is large and/or the model is complex to fit).

K-fold cross-validation
- Randomly divide the original data into k groups (folds).
- Train the model on k-1 folds and use the remaining fold for testing.
- Repeat the procedure k times, so that each fold is used for testing once.
- (Optionally repeat the above steps multiple times.)
- Calculate the CV error.
Pros:
- Less computationally demanding than LOOCV.
Cons:
- The test error estimate tends to be more biased than with LOOCV (but much less so than with the validation set approach).

Cross-validation for model selection
Model selection chooses the model with the smallest test MSE, so LOOCV (being the most stable) allows for an unambiguous choice.
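To make the strategies above concrete, the following sketch (not from the lecture; scikit-learn is assumed, and the horsepower/mpg arrays are synthetic stand-ins generated inside the script) estimates the test MSE of a quadratic regression in the spirit of mpg ~ horsepower + horsepower^2 using the validation set approach, LOOCV and 10-fold CV.

# A minimal sketch comparing three test-error estimates for one model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import (KFold, LeaveOneOut, cross_val_score,
                                     train_test_split)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 230, size=300).reshape(-1, 1)       # hypothetical predictor
mpg = (40 - 0.15 * horsepower[:, 0] + 0.0002 * horsepower[:, 0] ** 2
       + rng.normal(scale=3, size=300))                          # hypothetical response

# mpg ~ horsepower + horsepower^2, fitted by OLS
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())

# 1) Validation set approach: one random 50/50 split, a single (highly variable) estimate
X_tr, X_te, y_tr, y_te = train_test_split(horsepower, mpg, test_size=0.5, random_state=1)
val_mse = np.mean((model.fit(X_tr, y_tr).predict(X_te) - y_te) ** 2)

# 2) LOOCV: n fits, each holding out one observation; stable but expensive
loocv_mse = -cross_val_score(model, horsepower, mpg, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error").mean()

# 3) 10-fold CV: the usual compromise between bias, variance and cost
kfold_mse = -cross_val_score(model, horsepower, mpg,
                             cv=KFold(n_splits=10, shuffle=True, random_state=1),
                             scoring="neg_mean_squared_error").mean()

print(f"validation set MSE: {val_mse:.2f}")
print(f"LOOCV MSE:          {loocv_mse:.2f}")
print(f"10-fold CV MSE:     {kfold_mse:.2f}")

Re-running the split with a different random_state illustrates the point above: the single-split estimate moves around from run to run, while the two CV estimates stay much more stable.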
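Continuing the same hypothetical data, a minimal sketch of model selection by LOOCV: fit polynomials of degree 1 to 5 in horsepower and keep the degree with the smallest CV error.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

cv_errors = {}
for degree in range(1, 6):
    candidate = make_pipeline(PolynomialFeatures(degree=degree, include_bias=False),
                              LinearRegression())
    scores = cross_val_score(candidate, horsepower, mpg, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    cv_errors[degree] = -scores.mean()           # LOOCV error for this candidate model

best_degree = min(cv_errors, key=cv_errors.get)  # smallest CV error wins
print(cv_errors, "-> selected degree:", best_degree)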
Alternatively, we can use the one-standard-error rule (Occam's razor principle): choose the simplest model whose CV error is within one standard error of the smallest CV error.

Cross-validation for model evaluation
Model evaluation estimates the expected range of error in real-life applications.
- LOOCV: least biased error estimate, but with the largest variance.
- Validation set approach: most biased error estimate, but with the smallest variance (a single split gives only one value of the test error).
- K-fold CV: a balance between the bias and the variance of the error estimate.
The lecture compares test error estimates from the different CV strategies for the regression model mpg ~ horsepower + horsepower^2.

Cross-validation for time series data
Must preserve the chronological order of the data…
https://medium.com/@soumyachess1496/cross-validation-in-time-series-566ae4981ce4
https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9

Data leakage
… is when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know, and in turn invalidate the estimated performance of the model being constructed.
https://machinelearningmastery.com/data-leakage-machine-learning/

Column-wise leakage
Any feature whose value would not actually be available in practice at the time you would want to use the model to make a prediction can introduce leakage into your model.
https://en.wikipedia.org/wiki/Leakage_(machine_learning)
Example: 'MinutesLate' is included in the training of a model that predicts 'IsLate'.

Row-wise leakage
… is caused by improper sharing of information between rows of data.
https://en.wikipedia.org/wiki/Leakage_(machine_learning)
Examples: data rows used for both training and validation/testing, or premature featurisation.

Minimising the risks of leakage
- Perform data preparations within each CV fold:
  - Column-wise: centering, standardising, one-hot encoding, etc. of the columns in each fold.
  - Row-wise: oversampling and undersampling.
- Completely separate a hold-out dataset for final testing of the model.
https://stats.stackexchange.com/questions/351638/random-sampling-methods-for-handling-class-imbalance

Nested cross-validation
In order to overcome the bias in performance evaluation, model selection should be viewed as an integral part of the model fitting procedure, and should be conducted independently in each trial in order to p ...
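A minimal sketch of nested cross-validation, assuming scikit-learn and a generic synthetic regression dataset (none of the names below come from the lecture). The StandardScaler sits inside the pipeline, so column-wise preparation is fitted on each training fold only; the inner loop selects the hyperparameter and the outer loop evaluates the whole fit-and-select procedure on data it never saw during selection.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Scaling inside the pipeline is refitted within every training fold (no column-wise leakage).
pipeline = make_pipeline(StandardScaler(), Ridge())
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # model selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)   # model evaluation

# Inner loop: GridSearchCV picks alpha. Outer loop: estimates the error of the
# complete procedure, so the selection step does not bias the reported error.
search = GridSearchCV(pipeline, param_grid, cv=inner_cv,
                      scoring="neg_mean_squared_error")
nested_scores = cross_val_score(search, X, y, cv=outer_cv,
                                scoring="neg_mean_squared_error")
print("nested CV MSE estimate:", -nested_scores.mean())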
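The outline also lists bootstrapping, described earlier as a way to evaluate the variability of a parameter estimate. A minimal sketch, again on synthetic stand-in data rather than anything from the lecture:

# Bootstrap the slope of a simple regression: resample rows with replacement,
# refit, and look at the spread of the fitted coefficient.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(50, 230, size=(300, 1))                   # hypothetical predictor
y = 40 - 0.1 * x[:, 0] + rng.normal(scale=3, size=300)    # hypothetical response

slopes = []
for _ in range(1000):                            # 1,000 bootstrap replicates
    idx = rng.integers(0, len(y), size=len(y))   # sample row indices with replacement
    fit = LinearRegression().fit(x[idx], y[idx])
    slopes.append(fit.coef_[0])

# The standard deviation of the replicates estimates the standard error of the slope.
print("bootstrap SE of the slope:", np.std(slopes, ddof=1))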
Related keywords: Lecture Applied data science; Applied data science; Validation set approach; Cross-validation; Nested cross-validation; Data leakage