https://www.kaggle.com/c/dsi-us-5-project-2-regression-challenge/data
For this project we are tasked with predicting the sale prices of houses in Ames, Iowa, based on their characteristics.
As with all data science projects, the first step is to manage the data that we are given.
# Let's get the administrative stuff done first
# import all the libraries and set up the plotting
import datetime
import pandas as pd
import numpy as np
import scipy.stats as stats
from itertools import combinations
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn')
sns.set(style="white", color_codes=True)
colors_palette = sns.color_palette("GnBu_d")
sns.set_palette(colors_palette)
# "GnBu_d" color palette
colors_str = ['#37535e', '#3b748a', '#4095b5', '#52aec9', '#72bfc4', '#93d0bf', '#c5e5eaff']
# Seaborn lags behind matplotlib on updates, so silence the resulting warnings
import warnings
warnings.filterwarnings('ignore')
These functions, when called, check the consistency of the data and determine how many NaNs we have to deal with.
# Check that sale prices are strictly positive
def check_y(y):
    y_min = y.min()
    if y_min <= 0:
        print("y should be strictly positive.")
    n_nan = y.isnull().sum()
    if n_nan > 0:
        print("y has {} NaNs".format(n_nan))
    print("y: min: {} mean: {} max: {} NaNs: {}".format(y_min, y.mean(), y.max(), n_nan))
    return
# Check that all the PIDs and Ids are unique
def check_ids(df):
    # Unique PIDs
    n_rows = df.shape[0]
    n_pids = len(df['PID'].unique())
    if n_pids != n_rows:
        print("# PIDs {} but {} rows.".format(n_pids, n_rows))
    # Unique Ids
    n_ids = len(df['Id'].unique())
    if n_ids != n_rows:
        print("# Ids {} but {} rows.".format(n_ids, n_rows))
    return
# Check which non-numeric columns are missing values and what the possible values are for each object column
def check_cols(df):
    check_cols = df.select_dtypes([np.object]).columns
    for col in check_cols:
        # print("Values of {} are {}.".format(col, df[col].unique()))
        n_nan = df[col].isnull().sum()
        if n_nan > 0:
            print("{} has {} NaNs".format(col, n_nan))
    return
# Check which numeric columns are missing values
def check_data(df):
    s = df.shape
    print("Rows: {} Cols: {}".format(s[0], s[1]))
    # Check for null values
    null_data = df.isnull().sum()
    null_data_count = sum(df.isnull().sum())
    if null_data_count > 0:
        print("There are {} null data.".format(null_data_count))
        print("Columns with NaN: {}".format(list(null_data[null_data > 0].index)))
        check_cols(df)
    return
These functions fill the NaNs with appropriate values. Given more time, we could make these functions more robust.
# Change the NaNs in numeric columns to a specified value
def clean_data(df, val):
    numeric_cols = df.select_dtypes([np.int64, np.float64, np.uint64]).columns
    df[numeric_cols] = df[numeric_cols].fillna(val)
    return
# Check the values of a categorical column and fill the NaNs with a specified value
def clean_check_col(df, col, possible_values, fill_value):
    df[col] = df[col].fillna(fill_value)
    values = set(df[col])
    if not values.issubset(possible_values):
        print("Problem with {}.".format(col))
    return
# Turn "quality" columns into numeric ones
def clean_quality_col(df, col, dict_values,fill_value):
possible_values = dict_values.keys()
clean_check_col(df, col,possible_values,fill_value)
for key,value in dict_values.items():
df[col] = df[col].map(lambda cell: cell.replace(key,str(value)))
df[col] = df[col].astype(int)
return
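As a quick sanity check of the quality mapping, here is what clean_quality_col does to a small throwaway DataFrame (df_demo is purely illustrative and not part of the pipeline):
df_demo = pd.DataFrame({'Exter Qual': ['Gd', 'TA', np.nan]})
clean_quality_col(df_demo, 'Exter Qual', {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NA': -999}, 'NA')
print(list(df_demo['Exter Qual']))  # [4, 3, -999]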
The training data is in ../data/train.csv and the test data is in ../data/test.csv.
# Read in the data from the csv files
# Check the Ids and set Id as the index
train_csv = '../data/train.csv'
df_train = pd.read_csv(train_csv)
check_ids(df_train)
df_train.set_index("Id", inplace=True)
kaggle_csv = '../data/test.csv'
df_kaggle = pd.read_csv(kaggle_csv)
check_ids(df_kaggle)
df_kaggle.set_index("Id", inplace=True)
# For the training data:
# Check y: warn if y has negative values or NaNs
y_label = 'SalePrice'
y = df_train[y_label]
check_y(y)
df_train.drop(columns=y_label, inplace=True)
print("Training data:")
check_data(df_train)
print("")
print("Target Kaggle data:")
check_data(df_kaggle)
df_train.describe()
df_kaggle.describe()
We see that there are lots of NaNs to deal with as well as some outliers in the data.
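Before cleaning, a one-off snippet like the following (not part of the pipeline itself) is a quick way to rank the columns by how many NaNs they contain:
nan_counts = df_train.isnull().sum().sort_values(ascending=False)
print(nan_counts[nan_counts > 0].head(15))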
# Keep track of columns that have been cleaned
cleaned_cols = []
na_val = -999
# These could probably be even more generalized ...
quality_cols = ['Exter Qual', 'Exter Cond', 'Kitchen Qual', 'Bsmt Qual', 'Heating QC', 'Fireplace Qu',
                'Garage Qual', 'Garage Cond', 'Bsmt Cond', 'Pool QC']
dict_values = {'Ex': 5,
               'Gd': 4,
               'TA': 3,
               'Fa': 2,
               'Po': 1,
               'NA': na_val}
for col in quality_cols:
    clean_quality_col(df_train, col, dict_values, "NA")
    clean_quality_col(df_kaggle, col, dict_values, "NA")
cleaned_cols = cleaned_cols + quality_cols
col = 'Land Slope'
dict_values = {'Gtl': 3,
               'Mod': 2,
               'Sev': 1,
               'NA': na_val}
clean_quality_col(df_train, col, dict_values, "NA")
clean_quality_col(df_kaggle, col, dict_values, "NA")
cleaned_cols.append(col)
col = 'Bsmt Exposure'
dict_values = {'Gd': 4,
               'Av': 3,
               'Mn': 2,
               'No': 1,
               'NA': na_val}
clean_quality_col(df_train, col, dict_values, "NA")
clean_quality_col(df_kaggle, col, dict_values, "NA")
cleaned_cols.append(col)
col = 'Garage Finish'
dict_values = {'Fin': 3,
               'RFn': 2,
               'Unf': 1,
               'NA': na_val}
clean_quality_col(df_train, col, dict_values, "NA")
clean_quality_col(df_kaggle, col, dict_values, "NA")
cleaned_cols.append(col)
col = 'Central Air'
dict_values = {'Y': 1,
               'N': na_val}
clean_quality_col(df_train, col, dict_values, "N")
clean_quality_col(df_kaggle, col, dict_values, "N")
cleaned_cols.append(col)
# Check categories
# Could be expanded to all categorical columns
col = 'Mas Vnr Type'
clean_check_col(df_train,col, {'None', 'BrkCmn', 'BrkFace', 'Stone', 'CBlock'},'None')
clean_check_col(df_kaggle,col, {'None', 'BrkCmn', 'BrkFace', 'Stone', 'CBlock'},'None')
cleaned_cols.append(col)
col = 'Alley'
clean_check_col(df_train,col,{'None', 'Pave', 'Grvl'},'None')
clean_check_col(df_kaggle,col,{'None', 'Pave', 'Grvl'},'None')
cleaned_cols.append(col)
col = 'BsmtFin Type 1'
clean_check_col(df_train,'BsmtFin Type 1',{'None', 'ALQ', 'BLQ', 'GLQ', 'Rec', 'LwQ', 'Unf'},'None')
clean_check_col(df_kaggle,'BsmtFin Type 1',{'None', 'ALQ', 'BLQ', 'GLQ', 'Rec', 'LwQ', 'Unf'},'None')
cleaned_cols.append(col)
col = 'BsmtFin Type 2'
clean_check_col(df_train,col,{'None', 'ALQ', 'BLQ', 'GLQ', 'Rec', 'LwQ', 'Unf'},'None')
clean_check_col(df_kaggle,col,{'None', 'ALQ', 'BLQ', 'GLQ', 'Rec', 'LwQ', 'Unf'},'None')
cleaned_cols.append(col)
col = 'Garage Type'
clean_check_col(df_train,col,{'None', 'Detchd', 'Attchd', 'BuiltIn', 'Basment', '2Types', 'CarPort'},'None')
clean_check_col(df_kaggle,col,{'None', 'Detchd', 'Attchd', 'BuiltIn', 'Basment', '2Types', 'CarPort'},'None')
cleaned_cols.append(col)
col = 'Fence'
clean_check_col(df_train,col,{'None', 'MnWw', 'MnPrv', 'GdWo', 'GdPrv'},'None')
clean_check_col(df_kaggle,col,{'None', 'MnWw', 'MnPrv', 'GdWo', 'GdPrv'},'None')
cleaned_cols.append(col)
col = 'Electrical'
clean_check_col(df_train,col,{'SBrkr', 'FuseP', 'FuseF', 'Mix', 'FuseA','NA'},'NA')
clean_check_col(df_kaggle,col,{'SBrkr', 'FuseP', 'FuseF', 'Mix', 'FuseA','NA'},'NA')
cleaned_cols.append(col)
col = 'Misc Feature'
clean_check_col(df_train,col,{'Elev', 'Gar2', 'Othr', 'Shed', 'TenC'},'Othr')
clean_check_col(df_kaggle,col,{'Elev', 'Gar2', 'Othr', 'Shed', 'TenC'},'Othr')
cleaned_cols.append(col)
# Update Lot Frontage
# Probably should have a better way to impute the value for missing data
col = 'Lot Frontage'
lf_mean = df_train[col].mean()
df_train.loc[:, col] = df_train.loc[:, col].fillna(lf_mean)
df_kaggle.loc[:, col] = df_kaggle.loc[:, col].fillna(lf_mean)
cleaned_cols.append(col)
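Given more time, a better imputation might use each house's Neighborhood; the sketch below is hypothetical (it would replace the mean fill above) and falls back to the overall mean for neighborhoods with no known frontage:
# Sketch only - impute Lot Frontage from the neighborhood median instead of the overall mean
nbhd_medians = df_train.groupby('Neighborhood')['Lot Frontage'].median()
df_train[col] = df_train[col].fillna(df_train['Neighborhood'].map(nbhd_medians)).fillna(lf_mean)
df_kaggle[col] = df_kaggle[col].fillna(df_kaggle['Neighborhood'].map(nbhd_medians)).fillna(lf_mean)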
# Change year, SF to float - should generalize this
cols = ['Gr Liv Area', '1st Flr SF', 'Year Built', 'Year Remod/Add']
for c in cols:
    df_train[c] = df_train[c].astype(float)
# Determine which columns haven't been cleaned and make sure that's ok
obj_cols = set(df_train.select_dtypes([np.object]).columns)
all_cleaned_cols = set(cleaned_cols)
not_cleaned = list(obj_cols - all_cleaned_cols)
print("You worked hard on the cleaning, but not hard enough. Here's what's left....\n {}".format(not_cleaned))
# Drop Utilities, MS SubClass, Garage Yr Blt, and PID
cols_drop = ['Utilities', 'MS SubClass', 'Garage Yr Blt','PID']
df_train.drop(cols_drop, axis=1, inplace=True)
df_kaggle.drop(cols_drop, axis=1, inplace=True)
check_data(df_train)
check_data(df_kaggle)
clean_data(df_train, 0) # or 0
clean_data(df_kaggle, 0) # or 0
check_data(df_train)
check_data(df_kaggle)
df_train.head()
# check_cols = df.select_dtypes([np.object]).columns
# df_train.select_dtypes([np.object]).columns
print("There are {} columns and {} categorial columns."
.format(df_train.shape[1],
df_train.select_dtypes([np.object]).shape[1]))
# Let's make dummies for the categorical data
df_train = pd.get_dummies(df_train)
df_kaggle = pd.get_dummies(df_kaggle)
# get_dummies could leave us with mismatched columns
# Make sure we have matching columns in the train and Kaggle DataFrames
cols_train = df_train.columns
cols_kaggle = df_kaggle.columns
for c in cols_train:
    if c not in cols_kaggle:
        df_kaggle[c] = 0
for c in cols_kaggle:
    if c not in cols_train:
        df_train[c] = 0
columns = sorted(df_train.columns)
df_train = df_train[columns]
df_kaggle = df_kaggle[columns]
if df_train.shape[1] != df_kaggle.shape[1]:
    print("Train and Kaggle don't have same # columns: {} {}".format(df_train.shape[1], df_kaggle.shape[1]))
# Having made dummies, we now have lots and lots of columns
print("There are {} columns and {} categorial columns."
.format(df_train.shape[1],
df_train.select_dtypes([np.object]).shape[1]))
# Let's find columns that are most highly correlated with price
top_corr = list(pd.concat([df_train, y], axis=1).corr()['SalePrice'].sort_values(ascending=False).index[1:16])
# Heat map of the columns most highly correlated with SalePrice
plt.figure(figsize=(10, 10))
sns.heatmap(pd.concat([df_train, y], axis=1)[['SalePrice'] + top_corr].corr().abs().sort_values(by='SalePrice', ascending=False),
            vmin=-1, vmax=1, cmap=colors_str, annot=True);
# Pair plot of features to see any trends
top_corr_sub = list(pd.concat([df_train, y], axis=1).corr()['SalePrice'].sort_values(ascending=False).index[1:8])
sns.pairplot(df_train[top_corr_sub]);
We can plot some of the columns to see if there are any trends in the distributions of values.
# Plots for data analysis
df_y = pd.concat([y, df_train[top_corr]], axis=1)
cols_y = ['SalePrice'] + top_corr
# Set up the matplotlib figure
f, axes = plt.subplots(5, 2, figsize=(15, 10))  # , sharex=True)
f.suptitle('Distributions of features correlated with Sales Price')
sns.despine(left=True)
plt.subplots_adjust(hspace=.5)
color = colors_str[3]
n_rows = 5
n_cols = 2
for r in range(n_rows):
    for c in range(n_cols):
        idx = r * 2 + c
        ax = axes[r, c]
        ax.set_title(cols_y[idx])
        sns.distplot(df_y.iloc[:, idx], kde=False, color=color, ax=ax)
        ax.set_xlabel('')
        ax.set_ylabel('');
What are the trends? Are there any outliers?
# Set up the matplotlib figure
f, axes = plt.subplots(5, 2)  # , sharex=True)
f.set_figheight(15)
f.set_figwidth(20)
f.suptitle('Comparisons')
sns.despine(left=True)
plt.subplots_adjust(hspace=.5)
color = colors_str[3]
n_rows = 5
n_cols = 2
for r in range(n_rows):
    for c in range(n_cols):
        idx = r * 2 + c
        ax = axes[r, c]
        ax.set_title(top_corr[idx] + " / Price")
        # Scatter or box plot based on the column type
        if df_train[top_corr[idx]].dtype == "float64":
            sns.regplot(x=top_corr[idx], y=y, data=df_train, color=color, ax=ax)
        else:
            sns.boxplot(x=top_corr[idx], y=y, data=df_train, color=color, ax=ax)
            sns.regplot(x=top_corr[idx], y=y, data=df_train, scatter=False, color=color, ax=ax)
        ax.set_xlabel(top_corr[idx])
        ax.set_ylabel('Price')
# Let's drop the outliers
df_train[y_label] = y
df_train = df_train[df_train['1st Flr SF'] <= 4000]
y = df_train[y_label]
df_train.drop(columns=y_label, inplace=True)
# Total square footage
df_train["Total SF"] = df_train["1st Flr SF"] + df_train["2nd Flr SF"] + df_train["BsmtFin SF 1"] + df_train["BsmtFin SF 2"]
df_kaggle["Total SF"] = df_kaggle["1st Flr SF"] + df_kaggle["2nd Flr SF"] + df_kaggle["BsmtFin SF 1"] + df_kaggle["BsmtFin SF 2"]
# Log of year built?
df_train["Log year"] = np.log(df_train["Year Built"])
df_kaggle["Log year"] = np.log(df_kaggle["Year Built"])
# Train the regression models on a subset of columns
def model_pipeline(df_train, df_kaggle, X_cols, y, reg_type, poly=False, plot=False):
    X = df_train[X_cols]
    # Transform the prices?
    transform_y = False
    if transform_y:
        y = np.log1p(y)
    # Test/train split of the "full training" data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    # Set up the Pipeline based on input arguments
    pf = ('pf', PolynomialFeatures())
    ss = ('ss', StandardScaler())
    lr = ('lr', LinearRegression())
    rcv = ('rcv', RidgeCV(alphas=np.logspace(0, 5, 200), cv=8))
    lcv = ('lcv', LassoCV(n_alphas=500, max_iter=100000))
    ecv = ('ecv', ElasticNetCV(n_alphas=100, l1_ratio=[.1, .5, .7, .9, .95, .99, 1], cv=10))
    pipe_list = []
    # Polynomial features?
    if poly:
        pipe_list.append(pf)
    # Scale the features
    pipe_list.append(ss)
    # Choose a regression algorithm
    if reg_type == 'lr':
        print("LinearRegression:")
        model = lr
    elif reg_type == 'rcv':
        print("RidgeCV:")
        model = rcv
    elif reg_type == 'lcv':
        print("LassoCV:")
        model = lcv
    else:  # ElasticNet
        print("ElasticNetCV:")
        model = ecv
    pipe_list.append(model)
    # Fit the pipeline to the training data
    pipe = Pipeline(pipe_list)
    pipe.fit(X_train, y_train)
    train_score = pipe.score(X_train, y_train)
    # Score the pipeline model on the test set
    test_score = pipe.score(X_test, y_test)
    y_test_hat = pipe.predict(X_test)
    # Refit on the FULL training set
    pipe.fit(X, y)
    full_score = pipe.score(X, y)
    y_hat = pipe.predict(X)
    y_hat_train = pipe.predict(X_train)
    y_hat_test = pipe.predict(X_test)
    print(" Train: {} Test: {} Full: {}".format(train_score, test_score, full_score))
    # Predict the sale price of the Kaggle data
    X_kaggle = df_kaggle[X_cols]
    y_kaggle = pipe.predict(X_kaggle)
    # "Un"-transform y
    if transform_y:
        y_kaggle = np.expm1(y_kaggle)
    # Create a DataFrame for the Kaggle submission
    df_soln = pd.DataFrame(df_kaggle.index)
    df_soln['SalePrice'] = y_kaggle
    df_soln.set_index(['Id'], inplace=True)
    if plot:
        # Plot predictions
        plt.figure(figsize=(10, 8))
        plt.title("Predictions for training and test sets")
        if transform_y:
            sns.regplot(x=np.expm1(y_hat_train), y=np.expm1(y_train), data=X_train, color=colors_str[0], label="Training data")
            sns.regplot(x=np.expm1(y_hat_test), y=np.expm1(y_test), data=X_test, color=colors_str[5], label="Testing data")
        else:
            sns.regplot(x=y_hat_train, y=y_train, data=X_train, color=colors_str[0], label="Training data")
            sns.regplot(x=y_hat_test, y=y_test, data=X_test, color=colors_str[5], label="Testing data")
        plt.xlabel("Predicted prices")
        plt.ylabel("Actual sale price")
        plt.legend()
    return pipe, df_soln
# Feature choices - all the columns, or the ones most highly correlated with price
X_cols = df_train.columns
X_cols_corr_16 = list(pd.concat([df_train[X_cols], y], axis=1).corr()['SalePrice'].sort_values(ascending=False).index[1:16])
X_cols_corr_32 = list(pd.concat([df_train[X_cols], y], axis=1).corr()['SalePrice'].sort_values(ascending=False).index[1:32])
# Try the pipeline with four types of models
reg_type = ['lr', 'rcv', 'lcv', 'ecv']
print("Test with 16 correlated columns and PolynomialFeatures.")
poly = True
for reg in reg_type:
    pipe, df_soln = model_pipeline(df_train, df_kaggle, X_cols_corr_16, y, reg, poly)
print("")
print("Test with 32 correlated columns but not PolynomialFeatures.")
poly = False
for reg in reg_type:
    pipe, df_soln = model_pipeline(df_train, df_kaggle, X_cols_corr_32, y, reg, poly)
# Test with 16 correlated columns and PolynomialFeatures.
# LinearRegression:
# Train: 0.9237060276826285 Test: 0.9104119554144706 Full: 0.9240686101171751
# RidgeCV:
# Train: 0.9082049558489937 Test: 0.9182713100564817 Full: 0.9126412891489258
# LassoCV:
# Train: 0.9075836962668186 Test: 0.9174018718403385 Full: 0.9119122701289253
# ElasticNetCV:
# Train: 0.9085340145571048 Test: 0.9176748695103848 Full: 0.9128014954661423
# Test with 32 correlated columns but not PolynomialFeatures.
# LinearRegression:
# Train: 0.887891014167355 Test: 0.8837176230519301 Full: 0.8887715167205408
# RidgeCV:
# Train: 0.8871598323113536 Test: 0.8854053254428426 Full: 0.8883266576770845
# LassoCV:
# Train: 0.8874196067096932 Test: 0.8846977558669055 Full: 0.8883275258306575
# ElasticNetCV:
# Train: 0.8874917522365515 Test: 0.8846394212576741 Full: 0.8884179846426734
# Run Lasso and Plot the residuals
poly=False
pipe, df_soln = model_pipeline(df_train, df_kaggle, X_cols_corr_32, y, "lcv", poly, True)
# LassoCV:
# Train: 0.8874196067096932 Test: 0.8846977558669055 Full: 0.8883275258306575
# Drop the columns that Lasso zeros out and run all the models again
df_lcv_coefs = pd.DataFrame({
    'coefs': X_cols_corr_32,
    'vals': pipe.named_steps['lcv'].coef_
}).set_index('coefs').sort_values('vals', ascending=False)
# Resample the correlated columns without the ones we are dropping
X_cols_drop = list(df_lcv_coefs[abs(df_lcv_coefs["vals"]) <= .001].index)
print("Dropping {} cols.".format(len(X_cols_drop)))
X_cols = [c for c in X_cols if c not in X_cols_drop]
X_cols_corr_16 = list(pd.concat([df_train[X_cols], y], axis=1).corr()['SalePrice'].sort_values(ascending=False).index[1:16])
X_cols_corr_32 = list(pd.concat([df_train[X_cols], y], axis=1).corr()['SalePrice'].sort_values(ascending=False).index[1:32])
print("Test with 16 correlated columns and PolynomialFeatures.")
poly = True
for reg in reg_type:
pipe, df_soln = model_pipeline(df_train, df_kaggle, X_cols_corr_16, y, reg, poly)
print("")
print("Test with 32 correlated columns but not PolynomialFeatures.")
poly = False
for reg in reg_type:
pipe, df_soln = model_pipeline(df_train, df_kaggle, X_cols_corr_32, y, reg, poly)
# Dropping 5 cols.
# Test with 16 correlated columns and PolynomialFeatures.
# LinearRegression:
# Train: 0.924656172862715 Test: 0.9102131066565704 Full: 0.9247377205942611
# RidgeCV:
# Train: 0.9126125651209006 Test: 0.9226946127601622 Full: 0.9161536786862323
# LassoCV:
# Train: 0.9007090632975836 Test: 0.9168593529560066 Full: 0.915073748596963
# ElasticNetCV:
# Train: 0.9124750143765084 Test: 0.9212219388967573 Full: 0.9162140013589439
# Test with 32 correlated columns but not PolynomialFeatures.
# LinearRegression:
# Train: 0.8878501960725624 Test: 0.8846577925136798 Full: 0.8887389940166835
# RidgeCV:
# Train: 0.8875644474569013 Test: 0.8860700482417241 Full: 0.8884603811274605
# LassoCV:
# Train: 0.887497488209082 Test: 0.8850722158098991 Full: 0.8886017100179591
# ElasticNetCV:
# Train: 0.887582287156061 Test: 0.8850357397814944 Full: 0.8886518360157969
# Run Lasso again
poly=False
pipe, df_soln = model_pipeline(df_train, df_kaggle, X_cols_corr_32, y, "lcv", poly)
df_lcv_coefs = pd.DataFrame({
    'coefs': X_cols_corr_32,
    'vals': pipe.named_steps['lcv'].coef_
}).set_index('coefs').sort_values('vals', ascending=False)
# Resample the correlated columns without the ones we are dropping
X_cols_drop = list(df_lcv_coefs[abs(df_lcv_coefs["vals"]) <= .001].index)
print("Dropping {} cols.".format(len(X_cols_drop)))
X_cols = [c for c in X_cols if c not in X_cols_drop]
X_cols_corr_16 = list(pd.concat([df_train[X_cols], y], axis=1).corr()['SalePrice'].sort_values(ascending=False).index[1:16])
X_cols_corr_32 = list(pd.concat([df_train[X_cols], y], axis=1).corr()['SalePrice'].sort_values(ascending=False).index[1:32])
# LassoCV:
# Train: 0.887497488209082 Test: 0.8850722158098991 Full: 0.8886017100179591
# Dropping 2 cols.
# Run one really big model now.
print("Test with 32 columns, LassoCV, and Polynomial Features.")
poly = True
pipe, df_soln = model_pipeline(df_train, df_kaggle, X_cols_corr_32, y, "lcv", poly, True)
# Test with 32 columns, LassoCV, and Polynomial Features.
# LassoCV:
# Train: 0.9295135127811229 Test: 0.9280722123044919 Full: 0.9303993065460205
print("The columns used were: {}.".format(X_cols_corr_32))
# Write the predicted Kaggle solution out to a file
now = str(datetime.datetime.now())
df_soln.to_csv(f'../data/model_preds_{now}.csv')
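Note that str(datetime.datetime.now()) puts spaces and colons into the filename; a filename-friendlier variant (just a sketch, not run here) would be:
# now = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
# df_soln.to_csv(f'../data/model_preds_{now}.csv')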
For some reason, my Kaggle scores were never very good with this model, and I am not sure why. I should investigate this further, maybe by entering the public Kaggle competition for the Ames housing data.
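One way to start investigating would be to look at the cross-validated RMSE of the final pipeline rather than just the R² scores printed above (a sketch, assuming the leaderboard scores on RMSE):
scores = cross_val_score(pipe, df_train[X_cols_corr_32], y, cv=5, scoring='neg_mean_squared_error')
print("CV RMSE: {:.0f}".format(np.sqrt(-scores).mean()))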
The best model I had used 32 columns, PolynomialFeatures, and LassoCV. The columns were:
['Overall Qual', 'Total SF', 'Gr Liv Area', 'Exter Qual', 'Kitchen Qual', 'Total Bsmt SF', 'Garage Area', '1st Flr SF', 'Log year', 'Year Remod/Add', 'Full Bath', 'Foundation_PConc', 'Mas Vnr Area', 'Fireplaces', 'BsmtFin Type 1_GLQ', 'Heating QC', 'Neighborhood_NridgHt', 'BsmtFin SF 1', 'Sale Type_New', 'Garage Type_Attchd', 'Lot Frontage', 'Open Porch SF', 'Wood Deck SF', 'Mas Vnr Type_Stone', 'Lot Area', 'Lot Shape_IR1', 'Roof Style_Hip', 'Neighborhood_NoRidge', 'Mas Vnr Type_BrkFace', 'Neighborhood_StoneBr', 'Electrical_SBrkr']