Kaya's blog A Data Science Blog:

Predicting Movie Ratings

Movie ratings are publicly available on IMDB and Box Office Mojo, along with features such as runtime, budget, release regions, and the past gross associated with the cast and crew. Combining those with Kaggle’s IMDB 5000 Movie Dataset yields more than 3000 complete movie records, enough for a linear regression analysis that predicts a movie’s rating from these features.

Combined data from IMDB, BoxOfficeMojo, Kaggle
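
A minimal sketch of how such a combination might look in pandas, assuming each source has already been loaded into a DataFrame that shares a `title` column. The column names and toy rows here are hypothetical and are not taken from the actual datasets.

```python
import pandas as pd

# Hypothetical toy frames standing in for the scraped data and the Kaggle data.
scraped = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "runtime_min": [120.0, None, 101.0],
    "release_regions": [40, 12, 25],
})
kaggle = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie D"],
    "budget": [5.0e7, 1.2e7, 3.0e6],
    "imdb_score": [7.1, 6.3, 5.8],
})

# An inner join keeps only the movies that appear in both sources.
combined = scraped.merge(kaggle, on="title", how="inner")

# Keep complete records only, since the regression needs every feature.
complete = combined.dropna().reset_index(drop=True)
print(complete)
```

In practice titles alone can collide (remakes, re-releases), so a real join would probably need a stricter key such as title plus release year.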

A correlation heatmap shows a promising selection of predictors, such as:

  • Runtime minutes
  • Release regions
  • Past total gross of cast/crew
  • Past number of movies
  • Director Facebook likes
  • Actor/Entire cast FB likes
  • Number of faces on poster
  • Budget

Correlation heatmap
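
The numbers behind a heatmap like this come from a plain correlation matrix. Here is a rough sketch on synthetic data. The feature names, effect sizes, and noise level are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for two of the candidate predictors.
runtime = rng.normal(110, 20, n)
faces = rng.poisson(2, n).astype(float)

# Synthetic rating: rises with runtime, falls with faces on the poster.
rating = 6.0 + 0.02 * (runtime - 110) - 0.3 * faces + rng.normal(0, 0.9, n)

df = pd.DataFrame({
    "runtime_min": runtime,
    "faces_on_poster": faces,
    "rating": rating,
})

# Pairwise Pearson correlations between all columns.
corr = df.corr()

# Rank the predictors by the strength of their correlation with the rating.
target_corr = corr["rating"].drop("rating").sort_values(key=abs, ascending=False)
print(target_corr)

# seaborn.heatmap(corr, annot=True) would draw this matrix as a heatmap.
```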

The three models (linear regression, ridge, and lasso) all gave similar results in terms of R^2 and RMSE. An R^2 of 0.2 implies that about 20% of the variation in ratings can be explained by these features, and the RMSE of 0.9 measures the typical size of the errors in the predicted ratings, in rating points.

Results
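
For reference, both metrics are easy to compute by hand. The sketch below uses four made-up ratings and predictions, not values from the actual models.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions account for."""
    ss_res = np.sum((y_true - y_pred) ** 2)            # unexplained variation
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total variation
    return float(1.0 - ss_res / ss_tot)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the ratings."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up ratings and predictions, for illustration only.
y_true = np.array([6.0, 7.0, 8.0, 5.0])
y_pred = np.array([6.5, 6.5, 7.0, 6.0])

print(r_squared(y_true, y_pred))  # 0.5
print(rmse(y_true, y_pred))       # about 0.79
```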


Regularization via the Ridge and Lasso models reduces model complexity, which helps most when there are too many predictors, when the predictors are correlated, and/or when the coefficients have large variances. The goal is a better bias-variance tradeoff: accept a little bias in exchange for lower model variance, which in turn reduces overfitting and lowers the model’s total error when generalizing to a test set.
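
A small sketch of that shrinkage effect with scikit-learn, on synthetic data with two nearly identical predictors. The data and the `alpha` values are assumptions picked for illustration, not the settings used in this analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(42)
n = 200

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # almost a copy of x1 (correlated)
x3 = rng.normal(size=n)                    # unrelated to the target
X = np.column_stack([x1, x2, x3])
y = 1.0 * x1 + rng.normal(scale=1.0, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty shrinks coefficients smoothly
lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty can set coefficients to zero

for name, model in [("ols", ols), ("ridge", ridge), ("lasso", lasso)]:
    print(name, np.round(model.coef_, 3))
```

With correlated predictors, the unpenalized coefficients tend to be unstable, while the penalized models pull them toward smaller values. Lasso often drops redundant predictors entirely.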

Looking at the Lasso coefficients for better interpretability, the strongest predictors of better movie ratings are:

  • Longer movie length
  • Release in more regions
  • Fewer faces on posters
  • Popular directors

Perhaps counter-intuitively, budget is not a predictor of ratings.

Coefficients
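
Coefficients are only comparable across features once the features share a scale. The sketch below standardizes synthetic features before fitting Lasso. The feature names, effect sizes, and `alpha` are hypothetical, and the synthetic budget is deliberately given no effect on the rating.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 400

# Synthetic features on very different scales.
features = pd.DataFrame({
    "runtime_min": rng.normal(110, 20, n),
    "release_regions": rng.integers(1, 60, n).astype(float),
    "budget_musd": rng.gamma(2.0, 20.0, n),
})

# Synthetic rating that depends on runtime and regions but not on budget.
rating = (
    6.0
    + 0.02 * (features["runtime_min"] - 110)
    + 0.01 * features["release_regions"]
    + rng.normal(0, 0.9, n)
)

# Standardize so each coefficient is "rating points per standard deviation".
X_std = StandardScaler().fit_transform(features)
model = Lasso(alpha=0.05).fit(X_std, rating)

coefs = pd.Series(model.coef_, index=features.columns)
coefs = coefs.sort_values(key=abs, ascending=False)
print(coefs)
```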

Looking at the residuals, there was no sign of heteroskedasticity or of any other systematic pattern, such as curvature, which suggests that polynomial regression was unnecessary.

Residuals Ratings
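
Besides eyeballing the plot, two rough numeric checks can back this up. The sketch below runs them on synthetic data that is linear with constant noise by construction, so both checks should come out near zero. It is an informal illustration, not a formal test. StatsModels offers formal ones such as Breusch-Pagan.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000

# Synthetic data: truly linear relationship with constant noise.
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, n)

# Ordinary least squares fit with an intercept.
A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta
resid = y - fitted

# Curvature check: residuals should not track a (centered) squared term.
curvature = np.corrcoef(resid, (x - x.mean()) ** 2)[0, 1]

# Spread check: residual size should not grow or shrink with the fitted value.
spread = np.corrcoef(np.abs(resid), fitted)[0, 1]

print(round(curvature, 3), round(spread, 3))
```

A curvature value far from zero would hint that a polynomial term is missing, while a spread value far from zero would hint at heteroskedasticity.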

Data source: Box Office Mojo / IMDB / Kaggle IMDB 5000 Movie Dataset

Tools

  • Beautiful Soup
  • StatsModels
  • SciKitLearn
  • Seaborn
  • Matplotlib
  • Pandas/Numpy

MTA Turnstile Data Analysis

The MTA’s publicly available data includes turnstile information that allows detailed analysis of ridership by station, date, and other categories. However, the raw data below contains irregularities such as duplicate entries, so it takes some cleaning before the set can be useful.

Raw data
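
One way to handle the duplicates, sketched with pandas. The column names are an assumption about the turnstile file layout, and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows; the second 04:00 reading is a duplicate.
raw = pd.DataFrame({
    "C/A": ["A002"] * 4,
    "UNIT": ["R051"] * 4,
    "SCP": ["02-00-00"] * 4,
    "STATION": ["EXAMPLE ST"] * 4,
    "DATE": ["08/27/2016"] * 4,
    "TIME": ["00:00:00", "04:00:00", "04:00:00", "08:00:00"],
    "ENTRIES": [5799442, 5799463, 5799463, 5799492],
})

# A reading is identified by its turnstile and timestamp.
key = ["C/A", "UNIT", "SCP", "STATION", "DATE", "TIME"]

n_dupes = int(raw.duplicated(subset=key).sum())
clean = raw.drop_duplicates(subset=key, keep="first").reset_index(drop=True)

print(n_dupes, len(clean))  # 1 3
```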

Although the data also includes turnstile exit counts, this exploratory data analysis focuses on entries. To find the daily entry volume at each station, we sum the entries from all turnstiles within the station, where each turnstile’s daily volume is accumulated over six 4-hour periods.

Daily entry at a station
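
A sketch of that aggregation, assuming the `ENTRIES` column is a running counter, so each period's volume is the difference between consecutive readings. The `TURNSTILE` column, station name, and numbers are hypothetical.

```python
import pandas as pd

# Two made-up turnstiles at one station, four readings each.
readings = pd.DataFrame({
    "STATION": ["EXAMPLE ST"] * 8,
    "TURNSTILE": ["T1"] * 4 + ["T2"] * 4,
    "DATETIME": pd.to_datetime(
        ["2016-08-27 00:00", "2016-08-27 04:00",
         "2016-08-27 08:00", "2016-08-27 12:00"] * 2
    ),
    "ENTRIES": [100, 110, 150, 230, 5000, 5005, 5025, 5100],
})

readings = readings.sort_values(["STATION", "TURNSTILE", "DATETIME"])

# Entries during each period = change in the counter for that turnstile.
readings["PERIOD_ENTRIES"] = (
    readings.groupby(["STATION", "TURNSTILE"])["ENTRIES"].diff()
)

# The first reading of each turnstile has no previous value to compare against.
periods = readings.dropna(subset=["PERIOD_ENTRIES"])

# Sum all periods and all turnstiles per station and day.
daily = (
    periods.assign(DATE=periods["DATETIME"].dt.date)
    .groupby(["STATION", "DATE"])["PERIOD_ENTRIES"]
    .sum()
)
print(daily)
```

Real counters can reset or run backwards, so negative or implausibly large differences would need to be filtered out before summing.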

With entry volume by day, we can see the ridership trend throughout the week: Sunday and Monday typically have the lowest numbers, at roughly less than half the volume of the other days.

Ridership by day of the week

Ridership by day of week at top 10 busiest stations
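
Grouping daily totals by weekday takes only a few lines in pandas. The dates and volumes below are invented to mimic the shape described above. They are not the actual MTA numbers.

```python
import pandas as pd

# Two invented weeks of daily totals, starting on a Monday.
daily = pd.DataFrame({
    "DATE": pd.date_range("2024-01-01", periods=14, freq="D"),
    "ENTRIES": [100, 220, 230, 225, 210, 205, 95] * 2,
})

daily["DAY"] = daily["DATE"].dt.day_name()

# Average by weekday, then restore calendar order (groupby sorts A-Z).
order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]
by_day = daily.groupby("DAY")["ENTRIES"].mean().reindex(order)
print(by_day)
```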

The combined entry volume by station also highlights the five busiest stations: 34th St/Penn, Grand Central/42nd St, 34th St/Herald Square, 23rd St, and 14th St/Union Square.

Top 5 stations by ridership
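
Ranking stations is a group-and-sort. Here is a sketch with placeholder station names and made-up volumes.

```python
import pandas as pd

# Placeholder stations with two made-up daily totals each.
station_daily = pd.DataFrame({
    "STATION": ["S1", "S1", "S2", "S2", "S3", "S3", "S4", "S4",
                "S5", "S5", "S6", "S6", "S7", "S7"],
    "ENTRIES": [10, 20, 50, 60, 5, 5, 70, 80, 40, 30, 1, 2, 90, 100],
})

# Total entries per station over the whole period.
totals = station_daily.groupby("STATION")["ENTRIES"].sum()

# The five stations with the largest totals, largest first.
top5 = totals.nlargest(5)
print(top5)
```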

Finally, a histogram of total ridership by station shows a stark contrast between the few busiest stations and the majority of stations, which have very low traffic.

Ridership distribution
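
To get a feel for that shape, here is a sketch that bins synthetic station totals. The lognormal draw is only an assumption used to mimic a right-skewed distribution. It is not fitted to the MTA data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic station totals with a long right tail.
totals = rng.lognormal(mean=10, sigma=1.0, size=400)

# Bin the totals into 20 equal-width buckets.
counts, edges = np.histogram(totals, bins=20)

# In a right-skewed distribution the mean sits above the median,
# because a handful of very busy stations pull it up.
print(np.median(totals) < np.mean(totals))
print(counts)

# matplotlib's plt.hist(totals, bins=20) would draw the same bins as a chart.
```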

Data source: MTA Turnstile Data

Tools

  • Seaborn
  • Matplotlib
  • Pandas/Numpy