IE 7275 Data Mining in Engineering • Department of Mechanical and Industrial Engineering • Northeastern University • © 2020 • 3/19/2024
Variable Selection Using Multiple Linear Regression
Sagar Kamarthi
Northeastern University
Example problem
We are building a model for car price prediction.
Let our training data contain 4 predictors (d = 4):
• Miles driven (X1)
• Number of seats (X2)
• Miles per gallon (X3)
• Age of the car (X4)
The target variable (y) is the price of the car.
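For reference, the full model that uses all four predictors can be written in the usual multiple linear regression form. The coefficient symbols $\beta_0, \ldots, \beta_4$ and the error term $\varepsilon$ are notation assumed here for illustration; they do not appear on the slides.

```latex
y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon
```

Variable selection then amounts to deciding which of the terms $\beta_j X_j$ to keep in the model.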
Exhaustive Search
Exhaustive search
Step 1: Let $S_0$ be the null model with no predictors. It simply returns the sample mean of the response variable over the observations in the dataset.
Step 2: For k = 1, 2, ..., d: fit all possible models $S_k^r$ with k predictors, where $r = 1, 2, \ldots, \binom{d}{k}$ (that is, ${}^dC_k$ models); then select the model $S_k^*$ with the lowest SSE or, equivalently, the highest $R^2$.
Step 3: Using cross-validation test error, adjusted $R^2$, Mallows' $C_p$, AIC, or BIC, select the winning model among $\{S_k^* \mid k = 0, 1, 2, \ldots, d\}$.
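The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the data are synthetic (randomly generated, with only columns 0 and 3 actually driving the response), the helper names `sse_of_fit` and `exhaustive_search` are hypothetical, and adjusted $R^2$ is used for Step 3 simply as one of the criteria listed.

```python
from itertools import combinations

import numpy as np


def sse_of_fit(X, y, cols):
    """SSE of an ordinary least squares fit (with intercept) on columns `cols`."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def exhaustive_search(X, y):
    """Return ({k: (best columns, SSE)}, winning columns by adjusted R^2)."""
    n, d = X.shape
    sst = float(((y - y.mean()) ** 2).sum())

    # Step 1: the null model S_0 predicts the sample mean, so its SSE equals SST.
    best = {0: ((), sst)}

    # Step 2: for each size k, fit all C(d, k) subsets and keep the lowest-SSE one.
    for k in range(1, d + 1):
        fits = [(cols, sse_of_fit(X, y, cols)) for cols in combinations(range(d), k)]
        best[k] = min(fits, key=lambda item: item[1])

    # Step 3: compare the d + 1 size-wise winners with adjusted R^2.
    def adjusted_r2(k):
        return 1.0 - (best[k][1] / (n - k - 1)) / (sst / (n - 1))

    winner_k = max(best, key=adjusted_r2)
    return best, best[winner_k][0]


# Synthetic stand-in for the car data (assumption): 200 rows, d = 4 predictors,
# where only X1 (column 0) and X4 (column 3) influence the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 5.0 - 3.0 * X[:, 0] + 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

best, winner = exhaustive_search(X, y)
for k, (cols, sse) in best.items():
    print(k, cols, round(sse, 2))
print("selected predictors (column indices):", winner)
```

With d = 4 this fits 15 regression models plus the null model, 16 candidates in total. The count doubles with every added predictor, which is why exhaustive search is practical only for small d.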