Film Sales in International Markets

The past two weeks have flown by. This Friday I finished a project regarding the percentage of the foreign gross total of a film that comes from Hungary. Although Hungary is a small country, I made the assumption that the foreign distribution of films has a long tail. The motivation of this project was to use machine learning to determine which countries would be the best smaller countries to distribute a film in to maximize foreign gross.

The data I used was from the foreign tab for each film in the past six years on BoxOfficeMojo. I scraped the data using BeautifulSoup. Formulating a question and scraping the data took me the better part of a week. Then, on Tuesday of this week, a deadline for a minimum viable product was sprung upon us. This means that we had to have something to show for our data, ideally a regression of some sort – by the end of the day. Yikes!

By the end of day Tuesday, I had scrapped together linear regression using only budget to predict Hungarian percentage.

mvp

Although plotting the linear regression on top of the data points was giving me trouble, it was obvious that budget alone was not going to be a very strong predictor. I kept my hopes up, however.

Towards the end of the week, I had prettied up my data and broken out some categorical variables. The features I had were runtime, budget (in millions), G/PG/PG-13/R rating (categorial), genre (categorical) and Hungarian distributor (categorical). I had also created three more features: a boolean value of whether the Hungarian and American distributors were the same, and two time-delta variables: the number of days after the foreign release that the Hungarian release occurred, and the number of days after the American release that the first foreign release occurred. I was ready to run my machine learning regression.

The machine learning tool I chose was ridge regularization. This method creates a linear regression model with an error term that corrects for the features chosen for the regression. The coefficient on the error term is the lambda value. A higher lambda indicates a worse linear regression and will predict a more generalized model. A lower lambda indicates a good linear regression that needs little correction.

I had slightly more than 500 entries in my dataset, so I reserved about 100 for my test data set. With the 400 remaining, I used k-fold cross-validation with a k of 4. This means I tested my model on itself four times for each lamdba value I chose. I then calculated the standard error of the mean for each cross-validation set and averaged it over the chosen lamdba. Comparing across lambdas, my standard error decreased each time I increased the lamdba. Unfortunately, this indicated that my chosen features were not good predictors for the Hungarian percentage of foreign gross total.

Given further time to investigate this project, I would search the web for features that may be better predictors for performance of a film in a given country. When I began the project, I operated under the assumption that my model would be applicable for any given country, but I now believe that features more relevant to the given country would be more accurate predictors. Until next time!

-Em

Written on January 31, 2016