Parsing University Challenge questions with youtube-dl and Python

October 28, 2017
[Image: University Challenge]

I’m a big fan of the BBC show University Challenge. Unfortunately, it doesn’t seem to have much of an online fanbase. The subreddit, for example, is completely inactive (though I did find a Tumblr called Cuties of University Challenge). Sean Blanchflower keeps track of overall season statistics, but that’s about as involved as anyone seems to be.

Anyway, I’ve been meaning to create a database of questions from the show, but I can only commit to something like that if I can automate it to the point where it takes me less than 5 minutes per episode. Enter youtube-dl and Python (on Linux):

youtube-dl -x --sub-lang id --write-sub --skip-download

For this video, the default “en” subtitles are actually Google’s “Auto-Caption” subtitles, generated using speech recognition, and they are poor quality. Thankfully, the uploader has also provided the original English subtitles, but filed them under “id” (Indonesian), which is why the command asks for --sub-lang id. --write-sub saves the subtitle file, and --skip-download means only the subtitles are downloaded, not the video. VTT format looks like this:
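As a rough illustration (the cue text below is made up, not taken from the episode), a WebVTT file is a `WEBVTT` header followed by cues, each with a `start --> end` timing line and one or more lines of text. A minimal sketch for pulling the cue text out in Python, where `SAMPLE_VTT` and `cue_texts` are hypothetical names chosen for this example:

```python
import re

# Made-up sample in WebVTT layout; the wording is illustrative only.
SAMPLE_VTT = """WEBVTT

00:00:01.000 --> 00:00:04.000
Hello, and welcome to the show.

00:00:04.500 --> 00:00:07.250
Your starter for ten.
Fingers on buzzers.
"""

# A timing line looks like "00:00:01.000 --> 00:00:04.000"; the hours part is optional.
STAMP = r"(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}"
TIMING = re.compile(rf"^{STAMP} --> {STAMP}")

def cue_texts(vtt):
    """Return the text of each cue, joining multi-line cues with spaces."""
    cues = []
    for block in vtt.strip().split("\n\n"):
        lines = block.splitlines()
        # Everything after the timing line is cue text; blocks without
        # a timing line (such as the WEBVTT header) are skipped.
        for i, line in enumerate(lines):
            if TIMING.match(line):
                cues.append(" ".join(part.strip() for part in lines[i + 1:]))
                break
    return cues

print(cue_texts(SAMPLE_VTT))
```

Real subtitle files can also contain styling tags and positioning settings, which this sketch does not try to handle.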

I didn’t realize quite how easy it would be to parse this. Jeremy Paxman begins every round of questions by saying “10 points for this”, and each show ends with a “GONG”. Those two markers are all the structure needed, so 13 lines of Python do the trick:
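A minimal sketch of that idea, not necessarily the original script: assuming the subtitle text has already been joined into one string, and that the phrases appear in the subtitles exactly as “10 points for this” and “GONG”, the transcript can be cut at each marker. The function name `split_rounds` and the sample transcript are invented for illustration.

```python
def split_rounds(transcript, marker="10 points for this", end="GONG"):
    """Split transcript text into chunks, one per starter question."""
    # Ignore everything from the first gong onwards.
    body = transcript.split(end)[0]
    # Text before the first marker is the introduction, so drop it.
    chunks = body.split(marker)[1:]
    # Trim the punctuation left behind by the marker phrase.
    return [chunk.lstrip(" .,:\n").rstrip() for chunk in chunks]

# Invented transcript, only to show the shape of the result.
sample = (
    "Welcome. 10 points for this. What is 2+2? Four. Correct. "
    "Your bonuses are on maths. 10 points for this. "
    "Name the capital of France. GONG And at the gong..."
)

for number, chunk in enumerate(split_rounds(sample), start=1):
    print(number, chunk)
```

Each chunk holds a starter question together with whatever followed it before the next starter, so the answers and bonus questions are still mixed in.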

This parses out every starter question and the three bonus questions following it:

Now, obviously it will take a bit of extra work to separate the bonus questions from the starters in this output. Furthermore, I’m hoping to organize the questions by topic, and to preserve the time tags (so as to link back to the video).
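One possible way to use the time tags, sketched under assumptions: a cue's start timestamp is converted to whole seconds and appended to a YouTube watch URL as the `t` parameter. The helper names and the `VIDEO_ID` placeholder are hypothetical, not part of any existing script.

```python
def vtt_to_seconds(stamp):
    """Convert a WebVTT timestamp such as '00:12:34.500' to whole seconds."""
    parts = stamp.split(":")
    seconds = float(parts[-1])
    minutes = int(parts[-2])
    # The hours field is optional in WebVTT timestamps.
    hours = int(parts[-3]) if len(parts) == 3 else 0
    return int(hours * 3600 + minutes * 60 + seconds)

def video_link(video_id, stamp):
    """Build a watch URL that starts playback at the given timestamp."""
    return f"https://www.youtube.com/watch?v={video_id}&t={vtt_to_seconds(stamp)}s"

# VIDEO_ID is a placeholder, not a real video identifier.
print(video_link("VIDEO_ID", "00:12:34.500"))
```

Storing the start timestamp alongside each parsed question would be enough to generate such links later.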