An Interactive 3D Recommender for TED Talks

Imagine that all TED talks form a universe, where galaxies are talks with similar topics. Wouldn’t it be fun to create a map of the galaxies so that we can navigate visually? Using topic modeling and Google’s Embedding Projector, I created an interactive 3D map of the TED talks, in which people can explore topics they are interested in and get intuitive recommendations. Tools applied include NLTK, Gensim, Word2Vec, scikit-learn, and pymongo. (Click here to jump to the recommender demo.)

The universe of TED talks

TED, which stands for “Technology, Entertainment, and Design”, is a conference series that began in 1984 and is now curated by Chris Anderson under the slogan “ideas worth spreading”. More than 2600 talks are available to stream at TED.com. Each talk is labeled with a list of topic tags on the website, and a list of “related talks” is provided, likely based on those tags. Going through the list of all tags is not a trivial task, because there are more than 400 of them! Alternatively, since the transcripts of the talks are also available on TED.com, why not let the machine read them and select for you?

Data preparation

My dataset consists of 2524 full transcripts, most of them from Kaggle and a few of the most recent ones gathered by web scraping. These talks vary widely in length: the longest runs over an hour and more than 9000 words, while the shortest, in terms of word count, are pure performances without any talking.

Despite the high quality of the transcript files, some cleanup is still necessary before vectorizing.
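The post doesn’t spell out the exact steps, so here is a minimal sketch of the kind of cleanup I mean, using NLTK: strip parenthetical stage directions such as “(Laughter)”, lowercase, tokenize, drop stopwords, and lemmatize. The variable `transcripts` (a list of raw transcript strings) is a placeholder name.

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_transcript(text):
    # Remove stage directions such as "(Laughter)" or "(Applause)"
    text = re.sub(r'\([^)]*\)', ' ', text)
    # Lowercase, tokenize, keep alphabetic tokens, drop stopwords, lemmatize
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words]

clean_docs = [clean_transcript(doc) for doc in transcripts]  # transcripts: list of raw strings
```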

Then, to find the position of each talk in this TED universe, you need to give each talk coordinates. One way to do so is to use a vectorizer. Our old friend the count vectorizer does a simple word count over the corpus vocabulary, while TF-IDF penalizes words that are too common across the corpus, which is often a better way to define key terms at the individual-document level. But even if we keep only the top 2000 words in the vocabulary, that’s still 2000 dimensions, which will make our space travel very slow!
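As a rough sketch of that step in scikit-learn (the 2000-word cap is just the example figure above, and `clean_docs` is carried over from the cleanup sketch):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Join the cleaned tokens back into strings, one per talk
docs = [' '.join(tokens) for tokens in clean_docs]

tfidf = TfidfVectorizer(max_features=2000)   # keep only the top 2000 terms
X_tfidf = tfidf.fit_transform(docs)          # shape: (n_talks, 2000), sparse
```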

Dimension reduction

Topic modeling can help us reduce the dimensions. Instead of using words as features, we map each document onto a much smaller topic space, say 50 topics. There are several choices: LDA versus matrix-factorization methods such as LSA. Which one is better? First, let’s look at the output topics. Instead of judging them manually, we can compute a coherence score for each topic. Comparing example topics and their scores, LDA topics are on average the most coherent and LSA topics the least. Next, a quick KMeans clustering: LDA wins the silhouette-score competition again, and in fact suggests it needs only 15 clusters. Is LDA the true winner? Let’s take a look at what related talks each method actually recommends.
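A rough sketch of that comparison, using Gensim for the topic models and coherence scores and scikit-learn for the clustering. The 50-topic and 15-cluster settings are just the figures from the text, and `clean_docs` is the tokenized corpus from the cleanup sketch; the author’s actual pipeline may differ.

```python
import numpy as np
from gensim import corpora, models
from gensim.models import CoherenceModel
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Build a Gensim dictionary and bag-of-words corpus from the tokenized talks
dictionary = corpora.Dictionary(clean_docs)
corpus = [dictionary.doc2bow(doc) for doc in clean_docs]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=50, passes=10)
lsa = models.LsiModel(corpus, id2word=dictionary, num_topics=50)

# Topic coherence (higher is better)
for name, model in [('LDA', lda), ('LSA', lsa)]:
    cm = CoherenceModel(model=model, texts=clean_docs,
                        dictionary=dictionary, coherence='c_v')
    print(name, cm.get_coherence())

# Dense document-topic matrix for LDA, then a quick KMeans + silhouette check
doc_topics = np.array([[p for _, p in lda.get_document_topics(bow, minimum_probability=0.0)]
                       for bow in corpus])
labels = KMeans(n_clusters=15, random_state=0).fit_predict(doc_topics)
print('silhouette:', silhouette_score(doc_topics, labels))
```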

TED talk recommender

Let’s make a recommender using nearest neighbors with plain vanilla Euclidean distance, with TED.com’s own recommendations as the benchmark.
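A minimal sketch with scikit-learn, assuming `doc_topics` is the document-topic matrix for whichever model is being tested (the LSA vectors, for instance) and `titles` is the list of talk titles in the same row order:

```python
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(metric='euclidean')
nn.fit(doc_topics)

def recommend(talk_index, n=20):
    # Return the titles of the n nearest talks, excluding the query talk itself
    _, idx = nn.kneighbors(doc_topics[talk_index:talk_index + 1], n_neighbors=n + 1)
    return [titles[i] for i in idx[0] if i != talk_index][:n]

print(recommend(0))
```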

LSA does significantly better, while LDA comes out worst. In fact, LSA manages to match 3 out of 6 of TED’s related talks within the first 20 talks it recommends. That’s really not bad, especially considering that the TED website’s own recommendations draw on more than 400 topic tags and perhaps user ratings as well.
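The benchmark itself is just an overlap count: for each talk, how many of TED.com’s related talks appear in our top-20 list. A hedged sketch, assuming a hypothetical `related_talks` dictionary mapping each title to TED’s own list:

```python
def benchmark(talk_index, n=20):
    # How many of TED.com's related talks show up in our top-n recommendations?
    ours = set(recommend(talk_index, n))
    theirs = set(related_talks[titles[talk_index]])  # TED.com's "related talks"
    return len(ours & theirs)
```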

Now it’s time to explore! Click the link below to try it yourself, and use the suggested visualization settings below once the page has loaded.

Recommender Demo

Visualization setting:

  • Data set: LSA-100_topics
  • Dimension: 3D
  • Projection Method: “T-SNE”
  • Perplexity: 13
  • Iterations: 1000

With the settings above, here is what you will see: the talks laid out as a 3D point cloud, with each dot labeled by its talk title. Click on any of the dots to see similar talks highlighted.

Quick tips on publishing your dataset and visualization in Google’s Embedding Projector:

Go to: http://projector.tensorflow.org/

Click the “Load Data” button on the left to load your data. The “tensor data” is the matrix of points to project; for me it’s a 2524-by-50 matrix (2524 documents, 50 topics, from the LSA output), saved in tab-separated format (*.tsv). The “metadata” is the list of names to display for the points; for me it’s a list of 2524 talk titles, one title per line, also saved as a tsv file. That’s it! Choose t-SNE on the lower left for a better visualization.
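A sketch of writing those two files out. The names `doc_topics` (here, the 2524-by-50 LSA matrix) and `titles` are carried over from the sketches above, and the file names are placeholders:

```python
import numpy as np

# Tensor data: one row per talk, one column per topic, tab-separated, no header
np.savetxt('ted_lsa_topics.tsv', doc_topics, delimiter='\t')

# Metadata: one talk title per line, in the same order as the rows above
with open('ted_titles.tsv', 'w', encoding='utf-8') as f:
    f.write('\n'.join(titles))
```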

To publish the visualization so that you have a URL with your data already loaded, first save your tsv files somewhere public, for example in your GitHub project repo, and get the links to the files. Click the “Publish” button on the left and copy the .json template file. Change the names and the links so they point to the tsv files on your GitHub, then put the .json file on your GitHub too. Copy and paste the URL of the .json file into the last field of the “Publish” dialog window, click the “Test your shareable URL” button at the bottom, and you’ll get the projector link with your data loaded. The .json and .tsv files for the demo above can be found in my GitHub project_4 folder.
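The config file itself is tiny. A sketch of generating it in Python, following the field layout of the template the “Publish” dialog hands you; the tensor name, file names, and GitHub URLs below are all placeholders:

```python
import json

config = {
    "embeddings": [{
        "tensorName": "TED talks (LSA topics)",   # display name, your choice
        "tensorShape": [2524, 50],                # rows x columns of the tensor tsv
        "tensorPath": "https://raw.githubusercontent.com/<user>/<repo>/master/ted_lsa_topics.tsv",
        "metadataPath": "https://raw.githubusercontent.com/<user>/<repo>/master/ted_titles.tsv",
    }]
}

with open('projector_config.json', 'w') as f:
    json.dump(config, f, indent=2)
```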


Written on March 12, 2018