The RecSys Challenge is an annual competition that seeks the best new ideas in recommender systems, applied each year to a specific context. The challenge is a key component of the ACM Conference on Recommender Systems series and draws both industry professionals and academics.
The 2020 edition of the RecSys conference was sponsored by high-profile tech companies such as Twitter, Netflix, Google, Amazon, and Spotify. The RecSys Challenge was co-organized by academic institutions from around the globe, including Politecnico di Bari, Free University of Bozen-Bolzano, TU Wien, University of Colorado Boulder, and Universidade Federal de Campina Grande.
Considering that the teams with accepted, winning contributions would be given the chance to present their findings during the RecSys Challenge 2020 Workshop, our Factored team was raring to go!
Building A High-Caliber Data Science and AI Team
For the 2020 edition of the RecSys Challenge, participants were tasked with using a large dataset of user-tweet interactions to build a model that predicted how a user would most likely interact with a specific tweet, with the options being: Reply, Like, Retweet or Retweet With Comment. To support this, Twitter released a dataset of roughly 200 million public tweets, obtained by subsampling over a 2-week timeframe. The dataset contained engagement response variables, user features and tweet features.
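One way to picture the prediction target is as four independent yes/no labels per user-tweet pair, one for each engagement type. The snippet below is a minimal sketch of that framing, not the challenge's actual schema: the field names and the idea that each engagement is stored as an optional timestamp are assumptions made for illustration.

```python
from typing import Dict, Optional

# Hypothetical field names for the four engagement types named in the challenge.
ENGAGEMENT_TYPES = ("reply", "like", "retweet", "retweet_with_comment")


def engagement_labels(timestamps: Dict[str, Optional[int]]) -> Dict[str, int]:
    """Turn optional engagement timestamps into one binary label per type.

    Assumption: a missing (None) timestamp means the user did not engage
    in that way with the tweet.
    """
    return {name: int(timestamps.get(name) is not None) for name in ENGAGEMENT_TYPES}


# A made-up user-tweet pair where the user only liked the tweet.
example = {"reply": None, "like": 1581234567, "retweet": None, "retweet_with_comment": None}
labels = engagement_labels(example)
```

Framed this way, a model can output one probability per engagement type rather than a single class.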
It’s also important to note that the challenge took full consideration of data protection and privacy frameworks. If a user chose to delete a particular tweet, or their data in general, from Twitter, the dataset used for the challenge would immediately be updated accordingly; only data that users had already shared and hadn’t deleted was used. In keeping with this, submissions were re-evaluated whenever the dataset changed, so the leaderboard was never fixed. That made for a more motivating challenge, as no position on the leaderboard was guaranteed!
The Factored Team Approach
The challenge was taken on by a group of 6 Factored engineers: 2 focused on data engineering, a further 2 on deep learning, and the remaining 2 worked in both areas. This is an example of how we at Factored systematically assess and approach projects. This clear division of responsibilities meant that our team could efficiently focus on the specific tasks at hand whilst also thinking of the bigger picture as a collective.
So, how did we go about providing a solution for this challenge? Building a prediction model based on 200 million user-tweet interactions was no small feat, but this is how we tackled it.
First, we built a data processing pipeline to efficiently handle large volumes of data (over 800 GB) and to perform feature engineering tasks such as extracting additional structured features from users, detecting the topic of a tweet and grouping users based on their similarities. We also built a custom deep learning model, taking ideas from state-of-the-art recommender system models such as DeepFM, xDeepFM and AutoInt. This approach was able to efficiently model and capture both low-order and high-order interactions between features. Architectural tricks such as batch normalization, dropout and skip connections also helped to improve the performance of our model.
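To give a feel for what "low-order interactions between features" means, here is a small sketch of the second-order term from a factorization machine, the component that DeepFM pairs with a deep network. This is not our model's code: the function names, the toy inputs and the pure-Python implementation are illustrative assumptions. It shows the naive sum over all feature pairs next to the equivalent linear-time form.

```python
from itertools import combinations
from typing import Sequence


def fm_pairwise_naive(x: Sequence[float], v: Sequence[Sequence[float]]) -> float:
    """Sum of <v_i, v_j> * x_i * x_j over all feature pairs i < j."""
    total = 0.0
    for i, j in combinations(range(len(x)), 2):
        dot = sum(a * b for a, b in zip(v[i], v[j]))
        total += dot * x[i] * x[j]
    return total


def fm_pairwise_fast(x: Sequence[float], v: Sequence[Sequence[float]]) -> float:
    """Same quantity via 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]."""
    total = 0.0
    for f in range(len(v[0])):
        weighted = [v[i][f] * x[i] for i in range(len(x))]
        total += sum(weighted) ** 2 - sum(w * w for w in weighted)
    return 0.5 * total


# Toy example: 3 features, each with a 2-dimensional latent vector (made-up values).
x = [1.0, 0.0, 2.0]
v = [[0.1, 0.2], [0.3, 0.4], [0.5, -0.6]]
naive_score = fm_pairwise_naive(x, v)
fast_score = fm_pairwise_fast(x, v)
```

Both functions return the same value; the second avoids the explicit loop over pairs, which matters when the number of features is large.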
Throughout the challenge, we utilized an array of technologies at our disposal including Apache Spark, Hadoop, TensorFlow, AWS EMR, AWS EC2 and AWS S3.
What Did We Achieve Exactly?
In the end, our team of diligent AI experts placed in the top 1% of this year’s Twitter-focused RecSys Challenge, out of 1,000 participating teams. As a result, our team was invited to share its findings and corresponding research paper with the academic institutions that co-organized this year’s challenge.
To achieve this accolade, we processed hundreds of millions of samples using Spark and large-scale distributed systems on AWS. We used Natural Language Processing (NLP) and clustering algorithms to extract information from the Twitter data, then trained a deep learning model on it. The model improved recommendation performance and gave greater insight into how individual Twitter users would most likely engage with a particular tweet.
We’re also proud to say that our approach kept training costs down, for example by using CPUs instead of GPUs for the steps of the training process that did not need a GPU. We also precomputed neural embeddings from tweet text just once instead of recomputing them each time we trained the model, saving a significant amount of time overall.
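The precompute-once idea can be sketched as a simple on-disk cache: compute the embeddings the first time, store them, and load the stored copy on every later training run. The snippet below is a hedged illustration only. The helper name, the JSON file format and the toy embedding function are all assumptions standing in for a real text-embedding model and storage layer.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def load_or_compute_embeddings(
    path: str, texts: List[str], embed_fn: Callable[[str], List[float]]
) -> Dict[str, List[float]]:
    """Return cached embeddings if the file exists; otherwise compute and save them."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    embeddings = {text: embed_fn(text) for text in texts}
    with open(path, "w") as fh:
        json.dump(embeddings, fh)
    return embeddings


calls: List[str] = []  # records how often the (expensive) embedding step runs


def toy_embed(text: str) -> List[float]:
    """Stand-in for a real neural text encoder (assumption for illustration)."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
tweets = ["hello world", "hi"]
first_run = load_or_compute_embeddings(cache_path, tweets, toy_embed)
second_run = load_or_compute_embeddings(cache_path, tweets, toy_embed)
```

On the second call the embedding function is never invoked; the results come straight from the cache file.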
In today’s world of constant interaction and information sharing, learning more about user habits on Twitter gives insight into the most relevant content being produced on the platform and how users are choosing to engage with it.
All in all, our engineers had a great time flexing their recommender system muscles and combining their knowledge and experience to tackle a real-life problem and provide tangible, valuable insights through the prediction model they built.
We were delighted to see them present their research paper at the RecSys Challenge 2020 Workshop to show off the hard work they put in and share the key insights and solutions that they came up with! We’d like to take this opportunity to recognize the team of engineers who worked on this project: Carlos, Juan Manuel, Camilo, David, and Cristian. Great work, team! We are proud of you.
If you’d like to chat to someone from the Factored team about our work on this project, or to find out more about how we help companies to build top-notch AI solutions, don’t hesitate to get in touch at: email@example.com