Our data set is the NYC 2023 taxi cab data set. Each trip record
captures pick-up and drop-off times, taxi zone locations, trip
distances, itemized fares, rate types, and driver-reported
passenger counts. We want to identify the most important
factors in predicting tip amounts, to help taxi drivers
maximize the tips they collect.
On inspection, the dataset records only credit card tips; cash
tips are not logged at all. We can't train a model to predict a
value that we do not know, so the cash transactions, roughly 20%
of the nearly 10,000-trip dataset, had to be removed. I created
a pre-tip total by summing all of the costs that make up the
total amount before the tip, and dropped the total amount itself
because it includes the tip and would leak the target into the
features.
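The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the report's actual code: the column names follow the public TLC yellow-taxi schema (where `payment_type == 1` means credit card) and are assumptions about the data used here, the four toy rows are made up, and `prepare` and `COST_COLUMNS` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the trip records. Column names are assumed to match
# the public TLC yellow-taxi schema; the values are invented.
trips = pd.DataFrame({
    "payment_type": [1, 2, 1, 1],  # assumed coding: 1 = credit card, 2 = cash
    "fare_amount": [12.0, 8.5, 52.0, 20.0],
    "extra": [1.0, 0.0, 0.0, 2.5],
    "mta_tax": [0.5, 0.5, 0.5, 0.5],
    "tolls_amount": [0.0, 0.0, 6.5, 0.0],
    "improvement_surcharge": [1.0, 1.0, 1.0, 1.0],
    "congestion_surcharge": [2.5, 2.5, 0.0, 2.5],
    "airport_fee": [0.0, 0.0, 1.75, 0.0],
    "tip_amount": [3.0, 0.0, 12.0, 5.0],
    "total_amount": [20.0, 12.5, 73.75, 31.5],
})

# Every itemized cost that is charged before the tip (hypothetical list).
COST_COLUMNS = [
    "fare_amount", "extra", "mta_tax", "tolls_amount",
    "improvement_surcharge", "congestion_surcharge", "airport_fee",
]


def prepare(trips: pd.DataFrame) -> pd.DataFrame:
    """Keep card trips, add a pre-tip total, and drop the leaky total."""
    # Cash tips are not recorded, so only card trips have a usable target.
    card = trips[trips["payment_type"] == 1].copy()
    # Sum the itemized costs row by row to get the pre-tip total.
    card["pre_tip_total"] = card[COST_COLUMNS].sum(axis=1)
    # total_amount = pre_tip_total + tip_amount, so keeping it would
    # let a model recover the target by subtraction.
    return card.drop(columns=["total_amount"])


prepared = prepare(trips)
print(prepared[["pre_tip_total", "tip_amount"]])
```

Dropping `total_amount` rather than subtracting the tip from it keeps the feature set built only from quantities known before the tip is given.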
I trained a linear regression with 10-fold cross-validation and
a lasso regression whose penalty λ was chosen by grid search
over 0.0001 to 1; the search selected λ = 0.01. The two models
had very similar R² (0.6033 vs 0.6075), but lasso shrank the
coefficients of 15 of the 27 features to exactly 0: ride
duration, passenger count, and most of the pickup days and
times. From the taxi driver's perspective, time of day is not
worth focusing on, despite the EDA suggesting otherwise. Drivers
should instead look for long rides, airport rides, and routes
that pass through tolls, because these are the best predictors
of the tip amount.
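The model comparison above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the report's code: the data is synthetic (8 features, 5 of them irrelevant by construction), the features are standardized before fitting, the grid uses 9 log-spaced values, and scikit-learn calls the penalty `alpha` rather than λ. The numbers it prints will not match the report's.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the trip features: only the first three
# columns actually influence the target.
rng = np.random.default_rng(0)
n_samples, n_features = 400, 8
X = rng.normal(size=(n_samples, n_features))
true_coef = np.array([3.0, 1.5, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=1.0, size=n_samples)

# The same 10 folds are reused for both models.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Baseline: ordinary least squares, scored by mean cross-validated R^2.
ols = make_pipeline(StandardScaler(), LinearRegression())
ols_r2 = cross_val_score(ols, X, y, cv=cv, scoring="r2").mean()

# Lasso: grid search over the penalty from 0.0001 to 1 (log-spaced).
lasso = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))
grid = {"lasso__alpha": np.logspace(-4, 0, 9)}
search = GridSearchCV(lasso, grid, cv=cv, scoring="r2").fit(X, y)

best_alpha = search.best_params_["lasso__alpha"]
lasso_r2 = search.best_score_

# Coefficients that the L1 penalty shrank to exactly zero.
coef = search.best_estimator_.named_steps["lasso"].coef_
n_dropped = int(np.sum(coef == 0))

print(f"OLS   mean R^2: {ols_r2:.4f}")
print(f"Lasso mean R^2: {lasso_r2:.4f} (alpha={best_alpha:g})")
print(f"Lasso zeroed {n_dropped} of {n_features} coefficients")
```

Standardizing inside the pipeline matters for lasso because the L1 penalty acts on coefficient magnitudes, which depend on each feature's scale; whether the report scaled its features is not stated.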
Fig. 05 plots the real fitted coefficients from the report.