Background:
Two Sigma created a Kaggle competition featuring rental listing data from RentHop. In this project, we predict the interest level (low, medium, or high) of a new listing, a proxy for the number of inquiries it receives, based on the listing’s creation date and other features. Doing so will help RentHop better handle fraud control, identify potential listing quality issues, and allow owners and agents to better understand renters’ needs and preferences.
This project was completed for George Washington University, Machine Learning I, with Lee Eyler, Jacob McKay, and Mikko He.
Part One: Load Data and Libraries
In [1]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# inline plotting
%matplotlib inline
# block scientific notation; round numeric variables to two decimal places
pd.set_option('display.float_format', lambda x: '%.2f' % x)
# set display options
pd.set_option('display.max_columns', 400)
pd.set_option('display.max_rows',400)
In [3]:
# read in data: train, test, and sample submission files
train_df = pd.read_json('train.json')
test_df = pd.read_json('test.json')
sample_sub_df = pd.read_csv('sample_submission.csv')
train_df.shape, test_df.shape
Out[3]:
In [4]:
# combine test and train, look at the data
full_df = pd.concat([train_df,test_df], axis=0, ignore_index=True).copy()
full_df.shape
full_df.info()
Part Two: Feature Engineering
In [5]:
# convert to datetime data type
# break data into individual columns
from datetime import datetime
full_df['created'] = pd.to_datetime(full_df['created'], format='%Y-%m-%d %H:%M:%S')
full_df['year'] = full_df['created'].dt.year
full_df['month'] = full_df['created'].dt.month
full_df['day'] = full_df['created'].dt.day
full_df['hour'] = full_df['created'].dt.hour
full_df['weekday'] = full_df['created'].dt.weekday
In [6]:
# links to photos are stored in array, taking the length of the array provides the count of photos per observation
full_df['photos_count'] = full_df.photos.apply(len)
In [7]:
# feature names are stored in array, taking the length of the array provides the count of features per observation
full_df['features_count'] = full_df.features.apply(len)
In [9]:
# calculate price divided by bedrooms
full_df['price_per_bed'] = full_df.apply(lambda x: x.price if x.bedrooms == 0\
else x.price / x.bedrooms, axis=1)
In [10]:
# calculate price divided by (bedrooms + bathrooms)
full_df['price_per_bedbath'] = full_df.apply(lambda x: x.price if x.bedrooms + x.bathrooms == 0\
else x.price / (x.bedrooms + x.bathrooms), axis=1)
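The two ratio features above are computed with a row-wise `apply`, which can be slow on the full data set. A minimal vectorized sketch, assuming the same `price`, `bedrooms`, and `bathrooms` columns; the toy frame `demo_df` is hypothetical and only illustrates the zero-room fallback:

```python
import pandas as pd

# hypothetical toy listings; the second row is a studio (0 bedrooms)
demo_df = pd.DataFrame({'price': [3000, 2400, 4500],
                        'bedrooms': [2, 0, 3],
                        'bathrooms': [1.0, 1.0, 1.5]})

# replacing a zero divisor with 1 leaves the price unchanged,
# which matches the "if rooms == 0, use price" rule in the apply version
demo_df['price_per_bed'] = demo_df['price'] / demo_df['bedrooms'].replace(0, 1)

rooms = demo_df['bedrooms'] + demo_df['bathrooms']
demo_df['price_per_bedbath'] = demo_df['price'] / rooms.replace(0, 1)

print(demo_df)
```

The division runs on whole columns at once, so no Python-level loop over rows is needed.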
In [11]:
# dummy variable for elevator value in 'features' column
elevator_list = ['elevator', 'elevators', 'Elevators', 'Elevator']
# flag is 1 when any listed feature exactly matches one of the spellings above
# (.ix was removed in pandas 1.0, so a vectorized apply replaces the row loop)
full_df['Elevator'] = full_df['features'].apply(
    lambda feats: int(any(f in elevator_list for f in feats)))
In [12]:
# dummy variable for doorman value in 'features' column
door_list = ['Doorman','doorman','door man', 'Door man', 'Door Man']
# flag is 1 when any listed feature exactly matches one of the spellings above
# (.ix was removed in pandas 1.0, so a vectorized apply replaces the row loop)
full_df['Doorman'] = full_df['features'].apply(
    lambda feats: int(any(f in door_list for f in feats)))
In [13]:
# dummy variable for hardwood value in 'features' column
hardwood_list = ['Hardwood','hardwood','hard wood', 'Hard wood', 'Hard Wood']
# flag is 1 when any listed feature exactly matches one of the spellings above
# (.ix was removed in pandas 1.0, so a vectorized apply replaces the row loop)
full_df['Hardwood'] = full_df['features'].apply(
    lambda feats: int(any(f in hardwood_list for f in feats)))
In [14]:
# dummy variable for laundry value in 'features' column
laundry_list = ['Laundry','In Unit Laundry', 'laundry']
# flag is 1 when any listed feature exactly matches one of the spellings above
# (.ix was removed in pandas 1.0, so a vectorized apply replaces the row loop)
full_df['Laundry'] = full_df['features'].apply(
    lambda feats: int(any(f in laundry_list for f in feats)))
In [15]:
# dummy variable for dishwasher value in 'features' column
dish_list = ['Dishwasher', 'dishwasher']
# flag is 1 when any listed feature exactly matches one of the spellings above
# (.ix was removed in pandas 1.0, so a vectorized apply replaces the row loop)
full_df['Dishwasher'] = full_df['features'].apply(
    lambda feats: int(any(f in dish_list for f in feats)))
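The dummy-variable cells above only match exact spellings, so a longer value that merely contains the keyword would not be flagged. One possible generalization is a case-insensitive substring match. The helper `add_feature_flag` and the toy frame below are hypothetical and not part of the original notebook:

```python
import pandas as pd

def add_feature_flag(df, column, keywords):
    """Return a 0/1 Series: 1 if any entry in df[column] contains a keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return df[column].apply(
        lambda feats: int(any(k in f.lower() for f in feats for k in lowered)))

# hypothetical toy data in the same shape as the 'features' column
demo_df = pd.DataFrame({'features': [['Elevator', 'Hardwood Floors'],
                                     ['laundry in building'],
                                     []]})
demo_df['Elevator'] = add_feature_flag(demo_df, 'features', ['elevator'])
demo_df['Laundry'] = add_feature_flag(demo_df, 'features', ['laundry'])
print(demo_df[['Elevator', 'Laundry']])
```

Substring matching is broader than the exact lists, so it may flag more listings than the original cells do; whether that helps the model would need to be checked.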
In [16]:
# reorder columns to make response variable the first column
col_order = full_df.columns.tolist()
col_order
Out[16]:
In [17]:
new_col_order = [ 'interest_level','bathrooms',
'bedrooms',
'building_id',
'created',
'description',
'display_address',
'features',
'latitude',
'listing_id',
'longitude',
'manager_id',
'photos',
'price',
'street_address',
'year',
'month',
'day',
'hour',
'photos_count',
'weekday',
'price_per_bed',
'price_per_bedbath',
'Elevator',
'Doorman',
'Hardwood',
'Laundry',
'Dishwasher',
'features_count']
In [18]:
full_df = full_df[new_col_order]
In [19]:
#look at data again!
full_df.info()
In [20]:
# once data cleaning and feature engineering is complete, use this code to split the combined dataset
# back into train and test; the indices indicate the lengths of the original dataframes
train_df = full_df.iloc[:49352].copy()
test_df = full_df.iloc[49352:].copy()
# encode string labels as numeric values
# (LabelEncoder sorts labels alphabetically: high=0, low=1, medium=2)
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
X_train = train_df.iloc[:,1:].values # data only; no labels
y_train = class_le.fit_transform(train_df.iloc[:,:1].values.ravel())# no data; labels only
X_test = test_df.iloc[:,1:].values # data only; no labels
In [21]:
X_train.shape, y_train.shape
Out[21]:
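The integer codes produced by `LabelEncoder` follow the sorted order of the labels, not their semantic order. A short sketch on a made-up label list (the `demo_le` name is an assumption for illustration) shows the mapping that later cells depend on:

```python
from sklearn.preprocessing import LabelEncoder

demo_le = LabelEncoder()
codes = demo_le.fit_transform(['low', 'medium', 'high', 'low'])

# classes_ is sorted, so index 0 is 'high', 1 is 'low', 2 is 'medium'
print(list(demo_le.classes_))
print(list(codes))
# inverse_transform recovers the original strings from the integer codes
print(list(demo_le.inverse_transform([0, 1, 2])))
```

Any column of `predict_proba` output can therefore be matched to its label through `classes_` rather than by position.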
Part Three: Visualizing the Data
In [22]:
# the "low" interest category accounts for a majority of the class labels.
sns.countplot(x=train_df.interest_level, order=['low', 'medium', 'high']);
In [23]:
# import random forest algorithm
from sklearn.ensemble import RandomForestClassifier
# select variables; leaving out categorical data for now
X_train_rf_features = X_train[:,[0,1,7,9,12,15,16,17,18,19,20,21,22,23,24,25,26,27]]
y_train_rf_features = y_train
X_test_rf_features = X_test[:,[0,1,7,9,12,15,16,17,18,19,20,21,22,23,24,25,26,27]]
feature_labels = full_df.columns[[1,2,8,10,13,16,17,18,19,20,21,22,23,24,25,26,27,28]]
# set features for RF
forest = RandomForestClassifier(n_estimators=250,
random_state=1,
n_jobs=-1)
# fit the model
forest.fit(X_train_rf_features,y_train_rf_features)
# obtain feature importances
feat_importance = forest.feature_importances_
# create indices
indices = np.argsort(feat_importance)[::-1]
# print variables and standardized importance score
for f in range(X_train_rf_features.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feature_labels[indices[f]], feat_importance[indices[f]]))
In [24]:
# this is a visual representation of the results from above
sns.set(style="whitegrid")
sns.set_color_codes("pastel")
sns.set_context('talk')
plt.bar(range(X_train_rf_features.shape[1]), feat_importance[indices],
color='r',
align='center')
plt.xticks(range(X_train_rf_features.shape[1]), feature_labels[indices],
rotation=90)
plt.xlim([-1, X_train_rf_features.shape[1]])
plt.title('Feature Importance')
plt.tight_layout()
plt.show()
Part Six: Exploring dimensionality reduction using Principal Component Analysis
Principal Component Analysis (unsupervised dimensionality reduction) and Linear Discriminant Analysis (supervised feature extraction) both suggest that projecting the existing variables onto a lower-dimensional subspace can still capture most of the variation in the original data set.
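One way to make "a smaller subspace" concrete is to pick the fewest principal components whose cumulative explained variance ratio reaches a chosen threshold. The sketch below uses made-up data and an assumed 95% threshold; neither comes from the original analysis:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(1)
# hypothetical data: 200 rows, 6 columns, where the last 3 are noisy copies of the first 3
base = rng.normal(size=(200, 3))
X_demo = np.hstack([base, base + 0.1 * rng.normal(size=(200, 3))])

X_demo_std = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=None).fit(X_demo_std)

cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
# smallest number of components whose cumulative ratio reaches the assumed 95% threshold
n_keep = int(np.argmax(cumulative >= 0.95) + 1)
print(cumulative.round(3), n_keep)
```

Because half of the toy columns are near-duplicates, far fewer components than columns are needed here; real data will usually be less clear-cut.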
In [25]:
from sklearn.decomposition import PCA
pca = PCA(n_components=None)
In [26]:
# only using numeric and categorical data w/ dummy variables
X_train_for_pca = X_train[:,[0,1,7,9,12,15,16,17,18,19,20,21,22,23,24,25,26,27]]
# standardize the data
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_train_for_pca_std = stdsc.fit_transform(X_train_for_pca)
# fitting pca to the data
X_train_pca = pca.fit_transform(X_train_for_pca_std)
In [27]:
# obtain the explained variance ratio of each principal component
pca_var_explained = pca.explained_variance_ratio_
feature_labels = full_df.columns[[1,2,8,10,13,16,17,18,19,20,21,22,23,24,25,26,27,28]]
# cumulative sum of explained variance ratio
cumulative_var_explained = np.cumsum(pca_var_explained)
# cumulative explained variance across the principal components
# based on explained variance ratio
sns.set(style="whitegrid")
sns.set_color_codes("pastel")
sns.set_context('talk')
plt.plot(range(X_train_for_pca.shape[1]), cumulative_var_explained)
plt.xlim([-1, X_train_for_pca.shape[1]])
plt.title('Principal Components: Cumulative Explained Variance Ratio')
Out[27]:
Part Seven: Exploring feature extraction with Linear Discriminant Analysis
In [28]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(solver='eigen',n_components=None)
In [29]:
# only using numeric and categorical data w/ dummy variables
X_train_for_lda = X_train[:,[0,1,7,9,12,15,16,17,18,19,20,21,22,23,24,25,26,27]]
y_train_for_lda = y_train
# standardize the data
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_train_for_lda_std = stdsc.fit_transform(X_train_for_lda)
# fitting lda to the data
X_train_lda = lda.fit_transform(X_train_for_lda_std,y_train_for_lda)
In [30]:
lda.explained_variance_ratio_
Out[30]:
In [31]:
# obtain the explained variance ratio of each linear discriminant
lda_var_explained = lda.explained_variance_ratio_
# cumulative sum of explained variance ratio
lda_cumulative_var_explained = np.cumsum(lda_var_explained)
# cumulative explained variance across the linear discriminants
# based on explained variance ratio
sns.set(style="whitegrid")
sns.set_color_codes("pastel")
sns.set_context('talk')
plt.plot(range(1,3), lda_cumulative_var_explained)
plt.xlim(0,3)
plt.title('Linear Discriminants: Cumulative Explained Variance Ratio')
Out[31]:
In [32]:
# import random forest classifier and cross val score from scikit-learn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
In [33]:
# initiate random forest
# 'sqrt' is what 'auto' meant for classifiers; 'auto' is no longer accepted in recent scikit-learn
forest = RandomForestClassifier(criterion='gini', n_estimators=250, max_features='sqrt', random_state=1, n_jobs=-1)
In [34]:
# 10-fold cross validation; report mean - std, mean, and mean + std of the accuracy
rf_scores = cross_val_score(forest, X_train_rf_features, y_train_rf_features, scoring='accuracy', cv=10)
rf_scores.mean() - rf_scores.std(), rf_scores.mean(), rf_scores.mean() + rf_scores.std()
Out[34]:
Random Forest Classifier ROC Curve
In [35]:
# subset data for ROC plot purposes
X_rf_roc = X_train[:,[0,1,7,9,12,15,16,17,18,19,20,21,22,23,24,25,26,27]] #data only; no labels
y_rf_roc = y_train # no data; labels only
In [36]:
# binarize labels
from sklearn.preprocessing import label_binarize
y_rf_roc = label_binarize(y_rf_roc, classes=[0, 1, 2])
n_classes = y_rf_roc.shape[1]
In [37]:
# split training and test sets
from sklearn.model_selection import train_test_split
X_train_rf_roc, X_test_rf_roc, y_train_rf_roc, y_test_rf_roc = train_test_split(X_rf_roc, y_rf_roc, test_size=.7, random_state=1)
In [38]:
# predict class probabilities; roc_curve needs continuous scores rather than hard labels
from sklearn.multiclass import OneVsRestClassifier
classifier = OneVsRestClassifier(RandomForestClassifier(criterion='entropy', n_estimators=600, max_features=6,
                                                        random_state=1,n_jobs=-1))
y_score = classifier.fit(X_train_rf_roc, y_train_rf_roc).predict_proba(X_test_rf_roc)
In [39]:
# ROC curve and ROC area for each class
from sklearn.metrics import roc_curve, auc
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_rf_roc[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
In [40]:
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test_rf_roc.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
In [41]:
# scipy.interp was a deprecated alias of np.interp and is unavailable in recent SciPy releases
# aggregate false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
# interpolate ROC curves
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
# average, and compute AUC
mean_tpr /= n_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])
# plot ROC curves
plt.figure()
lw = 2
plt.plot(fpr["micro"], tpr["micro"],
label='micro-average ROC curve (area = {0:0.2f})'
''.format(roc_auc["micro"]),
color='m', linestyle=':', linewidth=4)
plt.plot(fpr["macro"], tpr["macro"],
label='macro-average ROC curve (area = {0:0.2f})'
''.format(roc_auc["macro"]),
color='k', linestyle=':', linewidth=4)
colors = ['r', 'g', 'b']
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest Multi-Class ROC Curve (Training Data Only)')
plt.legend(loc="lower right")
plt.show()
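An ROC curve is traced by sweeping a threshold over continuous scores, so `roc_curve` is more informative when given probabilities (for example from `predict_proba`) than hard 0/1 predictions. A small sketch on made-up labels and scores, which are assumptions for illustration only:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# hypothetical binary labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.60, 0.50])
hard = (scores >= 0.5).astype(int)  # what .predict() would return at a 0.5 cut-off

fpr_hard, tpr_hard, _ = roc_curve(y_true, hard)
fpr_soft, tpr_soft, _ = roc_curve(y_true, scores, drop_intermediate=False)

# hard labels give a single operating point between (0, 0) and (1, 1)
print(len(fpr_hard), auc(fpr_hard, tpr_hard))
# probabilities give one point per distinct score, tracing the full curve
print(len(fpr_soft), auc(fpr_soft, tpr_soft))
```

With hard labels the "curve" is two straight segments through one point, so the resulting area describes a single threshold and not the classifier's ranking ability.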
In [42]:
# value_counts sorts by frequency, so positions 0, 1, 2 run from the most to the least common class
train_df.interest_level.value_counts().iloc[1]
Out[42]:
In [43]:
train_df.interest_level.value_counts().iloc[0] / len(train_df.interest_level)
Out[43]:
In [44]:
train_df.interest_level.value_counts().iloc[1] / len(train_df.interest_level)
Out[44]:
In [45]:
train_df.interest_level.value_counts().iloc[2] / len(train_df.interest_level)
Out[45]:
In [46]:
train_df.interest_level.value_counts().iloc[0] / train_df.interest_level.value_counts().iloc[2]
Out[46]:
In [47]:
train_df.interest_level.value_counts().iloc[1] / train_df.interest_level.value_counts().iloc[2]
Out[47]:
KNN Classifier w/ All Features
In [48]:
# scale features for KNN
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
# separate training data, training labels, and testing data
X_train_knn_features = X_train[:,[0,1,7,9,12,15,16,17,18,19,20,21,22,23,24,25,26,27]]
y_train_knn_features = y_train
X_test_knn_features = X_test[:,[0,1,7,9,12,15,16,17,18,19,20,21,22,23,24,25,26,27]]
# standardize training and testing data; fit the scaler on the training data only
# and reuse its mean and variance for the test data
X_train_knn_features_std = stdsc.fit_transform(X_train_knn_features)
X_test_knn_features_std = stdsc.transform(X_test_knn_features)
In [49]:
# initialize knn model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(metric='minkowski', p=1, n_neighbors=4, n_jobs=-1)
In [50]:
# no grid search; only cross validation
knn_scores = cross_val_score(knn, X_train_knn_features_std, y_train_knn_features, scoring='accuracy', cv=10)
knn_scores.mean() - knn_scores.std(), knn_scores.mean(), knn_scores.mean() + knn_scores.std()
Out[50]:
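Scaling the whole training set before cross validation lets each validation fold influence the scaler's mean and variance. A hedged alternative is to put the scaler inside a pipeline so it is refit on every training fold. The data below is a hypothetical stand-in generated with `make_classification`, not the RentHop data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical stand-in data; three classes like low / medium / high
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=1)

# the scaler is refit on the training part of every fold, so no fold sees
# statistics computed from its own validation rows
knn_pipe = make_pipeline(StandardScaler(),
                         KNeighborsClassifier(metric='minkowski', p=1, n_neighbors=4))
pipe_scores = cross_val_score(knn_pipe, X_demo, y_demo, scoring='accuracy', cv=10)
print(pipe_scores.mean(), pipe_scores.std())
```

The same pipeline object can later be fit on the full training set and applied to the test set, which keeps the scaling consistent between the two.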
KNN Classifier Using Features from Random Forest Feature Selection
In [51]:
# scale features for KNN
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
# separate training data, training labels, and testing data
X_train_knn_features_select = X_train[:,[7,9,12,16,17,18,20,21,27]]
y_train_knn_features_select = y_train
X_test_knn_features_select = X_test[:,[7,9,12,16,17,18,20,21,27]]
# standardize training and testing data; fit the scaler on the training data only
# and reuse its mean and variance for the test data
X_train_knn_features_select_std = stdsc.fit_transform(X_train_knn_features_select)
X_test_knn_features_select_std = stdsc.transform(X_test_knn_features_select)
In [52]:
# initialize knn model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(metric='minkowski', p=1, n_neighbors=4, n_jobs=-1)
In [53]:
# no grid search; only cross validation
knn_fs_scores = cross_val_score(knn, X_train_knn_features_select_std, y_train_knn_features_select, scoring='accuracy', cv=10)
knn_fs_scores.mean() - knn_fs_scores.std(), knn_fs_scores.mean(), knn_fs_scores.mean() + knn_fs_scores.std()
Out[53]:
KNN Classifier ROC Curve
In [54]:
# subset data for ROC plot purposes
X_knn_roc = X_train[:,[0,1,7,9,12,15,16,17,18,19,20,21,22,23,24,25,26,27]] #data only; no labels
y_knn_roc = y_train # no data; labels only
In [55]:
# binarize labels
from sklearn.preprocessing import label_binarize
y_knn_roc = label_binarize(y_knn_roc, classes=[0, 1, 2])
n_classes = y_knn_roc.shape[1]
In [56]:
# split training and test sets
from sklearn.model_selection import train_test_split
X_train_knn_roc, X_test_knn_roc, y_train_knn_roc, y_test_knn_roc = train_test_split(X_knn_roc, y_knn_roc, test_size=.7, random_state=1)
In [57]:
# predict class probabilities; roc_curve needs continuous scores rather than hard labels
from sklearn.multiclass import OneVsRestClassifier
classifier = OneVsRestClassifier(KNeighborsClassifier(n_neighbors=8, p=1, metric='minkowski',n_jobs=-1))
y_score = classifier.fit(X_train_knn_roc, y_train_knn_roc).predict_proba(X_test_knn_roc)
In [58]:
# ROC curve and ROC area for each class
from sklearn.metrics import roc_curve, auc
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_knn_roc[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
In [59]:
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test_knn_roc.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
In [60]:
# scipy.interp was a deprecated alias of np.interp and is unavailable in recent SciPy releases
# aggregate false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
# interpolate ROC curves
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
# average, and compute AUC
mean_tpr /= n_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])
# plot ROC curves
plt.figure()
lw = 2
plt.plot(fpr["micro"], tpr["micro"],
label='micro-average ROC curve (area = {0:0.2f})'
''.format(roc_auc["micro"]),
color='m', linestyle=':', linewidth=4)
plt.plot(fpr["macro"], tpr["macro"],
label='macro-average ROC curve (area = {0:0.2f})'
''.format(roc_auc["macro"]),
color='k', linestyle=':', linewidth=4)
colors = ['r', 'g', 'b']
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('KNN Multi-Class ROC Curve (Training Data Only)')
plt.legend(loc="lower right")
plt.show()
Logistic Regression Classifier
What is Logistic Regression?
- classification model that can perform well on linearly separable classes
- can be used for both binary and multi-class classification (via One-vs-Rest)
- logistic regression is grounded in probability, specifically the odds in favor of a particular event
- odds = (probability of event we want to predict / (1 - probability of event we want to predict))
- logit function, i.e. logarithm of the odds =
log(probability of event we want to predict / (1 - probability of event we want to predict))
- logit function takes inputs between 0 and 1 and transforms them into real numbers, which can be expressed as a linear combination of the features
- logit(probability(event we want to predict given that the observation has features x)) = w0x0 + w1x1 + ... + wnxn
- we want the actual probability, so we use the logistic function (sometimes called sigmoid), which is the inverse of the logit function
1 / (1 + e^-(w0x0 + w1x1 + ... + wnxn))
- result is a probability that an observation belongs to a particular class
- we can also convert the probabilities into binary outcomes via a quantizer (ex. if probability > 50%, then label as class 1)
- optimizing via gradient ascent: the weights for logistic regression are optimized by maximizing the log-likelihood function
- optimizing via gradient descent: the log-likelihood function is rewritten as a cost function that can be minimized in order to optimize the weights for logistic regression
- with either optimization method, wrong predictions are penalized with an increasingly larger cost the more confident they are
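The relationship between the net input, the logit, and the logistic function can be checked numerically. This is a minimal sketch; the weights `w` and observation `x` are made-up values, with `x0 = 1` so that `w0` acts as the bias term:

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Logistic function: maps a real-valued net input z back to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical weights (w0 is the bias, paired with x0 = 1) and one observation
w = np.array([-1.0, 0.8, 0.5])
x = np.array([1.0, 2.0, -1.0])

z = w @ x                 # net input w0*x0 + w1*x1 + w2*x2
p = sigmoid(z)            # probability of the positive class
label = int(p > 0.5)      # quantizer: threshold the probability at 50%

print(z, p, label)
print(logit(p))           # applying logit to p recovers the net input z
```

Because the logit and the logistic function are inverses, a positive net input always corresponds to a probability above 50%.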
In [61]:
# scale features for logistic regression
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
# separate training data, training labels, and testing data
X_train_lr_features = X_train[:,[0,1,7,9,15,16,17,18,19,20,21,22,23,24,25,26,27]]
y_train_lr_features = y_train
X_test_lr_features = X_test[:,[0,1,7,9,15,16,17,18,19,20,21,22,23,24,25,26,27]]
# standardize training and testing data; fit the scaler on the training data only
# and reuse its mean and variance for the test data
X_train_lr_features_std = stdsc.fit_transform(X_train_lr_features)
X_test_lr_features_std = stdsc.transform(X_test_lr_features)
In [65]:
# initiate logistic regression model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='l2', C=10, random_state=1, n_jobs=-1)
lr.fit(X_train_lr_features_std, y_train_lr_features)
Out[65]:
In [66]:
# no grid search; only cross validation
lr_scores = cross_val_score(lr, X_train_lr_features_std, y_train_lr_features, scoring='accuracy', cv=10)
lr_scores.mean() - lr_scores.std(), lr_scores.mean(), lr_scores.mean() + lr_scores.std()
Out[66]:
Let's check out the coefficients for the Logistic Regression Model
In [67]:
# coefficients for scaled units
# class 0 ('high' under LabelEncoder's alphabetical ordering) vs class 1 and 2
# obtain features labels
lr_feature_labels = full_df.columns[[1,2,8,10,16,17,18,19,20,21,22,23,24,25,26,27,28]]
#X_train_lr_features = X_train[:,[0,1,7,9,15,16,17,18,19,20,21,22,23,24,25,26,27]]
# obtain the coefficients for class 0
lr_feat_importance = lr.coef_[0]
# create indices, sorted from largest to smallest coefficient
lr_indices = np.argsort(lr_feat_importance)[::-1]
# print variables and their standardized coefficients
for f in range(X_train_lr_features_std.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, lr_feature_labels[lr_indices[f]], lr_feat_importance[lr_indices[f]]))
In [68]:
# coefficients for scaled units
# class 1 ('low' under LabelEncoder's alphabetical ordering) vs class 0 and 2
# obtain features labels
lr_feature_labels = full_df.columns[[1,2,8,10,16,17,18,19,20,21,22,23,24,25,26,27,28]]
# obtain the coefficients for class 1
lr_feat_importance = lr.coef_[1]
# create indices, sorted from largest to smallest coefficient
lr_indices = np.argsort(lr_feat_importance)[::-1]
# print variables and their standardized coefficients
for f in range(X_train_lr_features_std.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, lr_feature_labels[lr_indices[f]], lr_feat_importance[lr_indices[f]]))
Logistic Regression Classifier Using Features from Random Forest Feature Selection
In [71]:
# scale features for logistic regression
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
# separate training data, training labels, and testing data
X_train_lr_features_select = X_train[:,[7,9,16,17,18,20,21,27]]
y_train_lr_features_select = y_train
X_test_lr_features_select = X_test[:,[7,9,16,17,18,20,21,27]]
# standardize training and testing data; fit the scaler on the training data only
# and reuse its mean and variance for the test data
X_train_lr_features_select_std = stdsc.fit_transform(X_train_lr_features_select)
X_test_lr_features_select_std = stdsc.transform(X_test_lr_features_select)
In [72]:
# initiate logistic regression model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='l2', C=10, random_state=1, n_jobs=-1)
In [73]:
# no grid search; only cross validation
lr_fs_scores = cross_val_score(lr, X_train_lr_features_select_std, y_train_lr_features_select, scoring='accuracy', cv=10)
lr_fs_scores.mean() - lr_fs_scores.std(), lr_fs_scores.mean(), lr_fs_scores.mean() + lr_fs_scores.std()
Out[73]:
Logistic Regression Classifier ROC Curve
In [74]:
# subset data for ROC plot purposes
X_lr_roc = X_train[:,[0,1,7,9,12,15,16,17,18,19,20,21,22,23,24,25,26,27]] #data only; no labels
y_lr_roc = y_train # no data; labels only
In [75]:
# binarize labels
from sklearn.preprocessing import label_binarize
y_lr_roc = label_binarize(y_lr_roc, classes=[0, 1, 2])
n_classes = y_lr_roc.shape[1]
In [76]:
# split training and test sets
from sklearn.model_selection import train_test_split
X_train_lr_roc, X_test_lr_roc, y_train_lr_roc, y_test_lr_roc = train_test_split(X_lr_roc, y_lr_roc, test_size=.7, random_state=1)
In [77]:
# predict class probabilities; roc_curve needs continuous scores rather than hard labels
from sklearn.multiclass import OneVsRestClassifier
classifier = OneVsRestClassifier(LogisticRegression(penalty='l2', random_state=1, C=10, n_jobs=-1))
y_score = classifier.fit(X_train_lr_roc, y_train_lr_roc).predict_proba(X_test_lr_roc)
In [78]:
# ROC curve and ROC area for each class
from sklearn.metrics import roc_curve, auc
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_lr_roc[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
In [79]:
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test_lr_roc.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
In [80]:
# scipy.interp was a deprecated alias of np.interp and is unavailable in recent SciPy releases
# aggregate false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
# interpolate ROC curves
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
# average, and compute AUC
mean_tpr /= n_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])
# plot ROC curves
plt.figure()
lw = 2
plt.plot(fpr["micro"], tpr["micro"],
label='micro-average ROC curve (area = {0:0.2f})'
''.format(roc_auc["micro"]),
color='m', linestyle=':', linewidth=4)
plt.plot(fpr["macro"], tpr["macro"],
label='macro-average ROC curve (area = {0:0.2f})'
''.format(roc_auc["macro"]),
color='k', linestyle=':', linewidth=4)
colors = ['r', 'g', 'b']
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression Multi-Class ROC Curve (Training Data Only)')
plt.legend(loc="lower right")
plt.show()
Model Tuning
In [81]:
# based on the accuracy and ROC AUC scores, it seems like the random forest classifier
# is better than the KNN and logistic regression models at correctly classifying the interest
# level of a new apartment listing. therefore, let's focus on tuning the random forest
# model in order to maximize classification performance.
# initiate random forest for grid search and cross validation
forest = RandomForestClassifier(random_state=1,n_jobs=-1)
# create parameter grid for grid search
crit_param = ['gini','entropy']
tree_param = [300,600]
max_feature_param = [3,6]
gs_param_grid = [{'criterion': crit_param,
'n_estimators': tree_param,
'max_features': max_feature_param
}]
In [82]:
# create grid search object
rf_gridsearch = GridSearchCV(estimator=forest, param_grid=gs_param_grid, scoring='accuracy',
cv=5, n_jobs=-1)
In [83]:
# fit grid search model
rf = rf_gridsearch.fit(X_train_rf_features, y_train_rf_features)
In [84]:
rf.best_score_
Out[84]:
In [85]:
rf.best_params_
Out[85]:
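After a grid search, the winning configuration does not have to be copied by hand: with the default `refit=True`, `GridSearchCV` exposes the refit model as `best_estimator_`. The sketch below uses hypothetical stand-in data and a deliberately tiny grid so it runs quickly; the names `demo_grid` and `demo_search` are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# hypothetical stand-in data and a deliberately tiny grid so the sketch runs quickly
X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=5,
                                     n_classes=3, random_state=1)
demo_grid = {'criterion': ['gini', 'entropy'], 'max_features': [3, 6]}

demo_search = GridSearchCV(RandomForestClassifier(n_estimators=25, random_state=1),
                           param_grid=demo_grid, scoring='accuracy', cv=3)
demo_search.fit(X_demo, y_demo)

# with the default refit=True, the best configuration is already refit on all the data
best_rf = demo_search.best_estimator_
print(demo_search.best_params_, round(demo_search.best_score_, 3))
print(best_rf.predict_proba(X_demo[:2]).shape)
```

Using `best_estimator_` keeps the final model in sync with the search results if the grid is changed later.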
In [86]:
# create and fit the best rf model
rf_best = RandomForestClassifier(criterion='entropy', n_estimators=600, max_features=6,
random_state=1,n_jobs=-1)
rf_best.fit(X_train_rf_features, y_train_rf_features)
Out[86]:
Random Forest Classifier Submission for Kaggle
In [87]:
# kaggle submission must have listing id, high interest probability, medium interest probability,
# and low interest probability
rf_submission = pd.DataFrame(test_df.iloc[:,9].copy())
# LabelEncoder sorts labels alphabetically, so the predict_proba columns are
# high, low, medium (not low, medium, high); look each column up by name
rf_proba = rf_best.predict_proba(X_test_rf_features)
rf_classes = list(class_le.classes_)
for level in ['high', 'medium', 'low']:
    rf_submission[level] = rf_proba[:, rf_classes.index(level)]
rf_submission.head()
Out[87]:
In [88]:
# write submission to csv file
rf_submission.to_csv('rf_submission.csv',index=False)