Natural Language Processing for Movie Roles

PROBLEM STATEMENT

The representation of character roles and genders in movies has significant social impact. To study the distribution of roles and genders in movies, we need to first extract this information. How can we effectively extract characters’ roles and genders from movie summaries and credit lists?

DATASET

Credit lists and summaries for around 34,000 US movies, each with 10 or more IMDb reviews and an IMDb and/or Wikipedia summary.

RESULT

  • We found 114,922 character name variants in the summaries, for which we were able to extract 71,216 candidate character roles (2.1 roles per movie).
  • On a semi-random evaluation set of 10 movies, our algorithm achieved 54% precision (the proportion of correct, descriptive character roles among those that were matched) and 46% recall (the proportion of character names from summaries that were correctly matched with IMDb credits).
  • View our project report here.
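
As a rough illustration of how the two evaluation figures above are defined, the sketch below computes precision and recall from raw counts. The function names and the counts in the usage lines are hypothetical and are not taken from the project's evaluation set.

```python
def precision(correct_roles: int, matched_roles: int) -> float:
    """Share of matched roles that are both correct and descriptive."""
    # Guard against an empty denominator (no roles matched at all).
    return correct_roles / matched_roles if matched_roles else 0.0


def recall(correctly_matched_names: int, summary_names: int) -> float:
    """Share of character names in summaries correctly matched with credits."""
    return correctly_matched_names / summary_names if summary_names else 0.0


# Made-up counts, for illustration only.
example_precision = precision(correct_roles=10, matched_roles=20)
example_recall = recall(correctly_matched_names=3, summary_names=4)
print(f"precision={example_precision:.2f}, recall={example_recall:.2f}")
```

The two measures use different denominators: precision is taken over the roles the algorithm matched, while recall is taken over the character names found in the summaries.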

RESPONSIBILITIES

Information Architecture, Data Cleaning, POS and NER Tagging, Name Extraction, Name Filtering, Alias Association, Regex and Chunking
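
To give a flavor of the regex side of this work, here is a minimal sketch of one way a candidate role could be pulled from a summary: a pattern that looks for a capitalized name followed by an appositive such as "a retired detective". This is not the project's actual pipeline. The pattern, the function name `extract_roles`, and the sample sentence are all assumptions made for illustration, and a real system would combine such patterns with POS and NER tags.

```python
import re

# Hypothetical pattern: "<Name>, a/an/the <role>" ending at a comma or period.
APPOSITIVE = re.compile(
    r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)*), "
    r"(?:a|an|the) "
    r"(?P<role>[a-z][a-z\- ]*?)(?=[,.])"
)


def extract_roles(summary: str) -> list[tuple[str, str]]:
    """Return (character name, candidate role) pairs found in a summary."""
    return [(m.group("name"), m.group("role")) for m in APPOSITIVE.finditer(summary)]


# Invented sample text, not drawn from the dataset.
sample = "Jane Doe, a retired detective, returns to the city. She meets Sam, the mayor."
print(extract_roles(sample))
```

A pattern this simple misses roles expressed in other constructions (for example "detective Jane Doe") and can pick up non-role appositives, which is one reason name filtering and alias association are needed alongside it.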