A real-time topic-based recommender system for social network feeds
Modern social networks are organized around authors. People connect with each other by sending friend requests or following one another before they share information, so readers cannot retrieve information by interest. Some apps, such as Zite, allow users to define topics of interest and offer articles from multiple blogs and news feeds, but no such recommender system exists for short feeds like tweets or Facebook posts.
We built a recommender system, codenamed 'Minerva', that lets users browse the newest social network feeds by their interests. The goals of this system are:
- Let users choose their interests in natural language, such as 'NASA' or 'Software'. The system returns the newest feeds related to the user's choice from multiple sources (Facebook, Twitter, Tumblr, Weibo, RSS) and orders them by publish time.
- Label each feed with its categories.
Fig. 1. Example of category labels for a feed
The box indicates the categories of the feed. Users can continue exploring feeds through related categories.
This article summarizes the high-level implementation of our system.
We use raw data to train low-level models: the inverse document frequency (IDF) of words and the LDA word-topic count matrix (NWZ). Next, we compute word and topic distributions as features for Wikipedia categories. Each category has a name in natural language.
We need a corpus from public sources to make sense of the categories. Wikipedia holds more than 4,000,000 articles and 400,000 valid categories; it is contributed by human volunteers, and its quality meets our needs, making it ideal for this research.
In the online production scenario, when our data sources send new feeds from the social networks, we compute the same kinds of features (e.g. Wikipedia categories) for each feed, then save and index those features in the database for querying. When a user wants to read feeds for one or more categories, we fetch the features of those categories and query inside the database system to retrieve the related feeds. To label a feed, we use the feed's features to search for the top-n categories by feature similarity.
The data flow for training the low-level features and category features is shown in Figure 2. We get feed data from WebFusion, and Wikipedia dumps from the official Wikipedia web site; the dumps are parsed into a database. To transform the raw data into a corpus, we strip formatting, segment the text, and represent words as numbers. The entire corpus is then combined into the training set for the low-level models.
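The corpus-preparation step above can be sketched as follows. This is a minimal illustration, not Minerva's actual code: the function names are hypothetical, and a production pipeline would use a proper tokenizer (including a CJK segmenter for Weibo text) rather than this naive regular expression.

```python
import re

def segment(text):
    """Naive segmentation: lowercase, then split on anything that is
    not a letter or digit. Real systems need a proper tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_corpus(documents):
    """Turn raw documents into lists of integer word ids plus a shared
    vocabulary, so downstream models work on numbers instead of strings."""
    vocab = {}
    corpus = []
    for doc in documents:
        ids = []
        for word in segment(doc):
            if word not in vocab:
                vocab[word] = len(vocab)  # assign the next free id
            ids.append(vocab[word])
        corpus.append(ids)
    return corpus, vocab

corpus, vocab = build_corpus(["NASA launches a rocket",
                              "Software eats the world"])
```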
Because of the humongous amount of data for a single worker (about 22 GB for Wikipedia and 7 GB of raw feed data), we performed the computations in parallel on a cluster with 172 VCPUs and 224 GB of RAM. We store raw data in MongoDB, result data in PostgreSQL, and intermediate data in HDFS; the data processing framework is Hadoop MapReduce.
First we compute the inverse document frequency of each word. It is used as a word feature and to reduce document size for LDA inference. Next, we compute NWZ; this is one of the major steps of the topic model. We run Gibbs sampling for 1024 topics and 100 iterations. NWZ is later used to compute the topic probability distribution of texts.
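The IDF computation is simple arithmetic; in Minerva it runs as a Hadoop MapReduce job over the whole corpus, but it can be condensed into plain Python to show the idea. This sketch uses the standard log(N / df) form, which is an assumption since the paper does not give the exact formula.

```python
import math
from collections import Counter

def inverse_document_frequency(corpus):
    """corpus: list of documents, each a list of word ids.
    Returns {word_id: idf} where idf = log(N / df)."""
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each word at most once per document
    return {w: math.log(n_docs / c) for w, c in df.items()}

# Word 0 appears in all three documents, so its IDF is zero;
# words 1 and 3 each appear in a single document.
idf = inverse_document_frequency([[0, 1, 2], [0, 2], [0, 3]])
```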
With the low-level models prepared, we can work on the most important part of the system: extracting features for each category from Wikipedia. We collect the categories and connect them to their parent categories. The Wikipedia category hierarchy is a graph, and categories should inherit features from their subcategories. To avoid cyclic dependencies and keep the feature factors focused, we limit the number of hops. We then get, for each article, a list of its categories and neighboring categories. We map the article's words to each item in the category list, and reduce the words for each category to obtain its bag of words. Because the bag of words may be too large for some categories, we use the IDF to compute TF-IDF for each word in the category and take the top 2000 words as a new, compact bag of words. Category features consist of the topic distribution and the words' TF-IDF, both derived from the bag of words. Since the two features are sparse vectors, we represent their factors as maps. To store the features in the database for retrieval, we use the PostgreSQL extension HStore, which can store and index map data.
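The compaction step for a category's bag of words can be sketched as below. The function name is illustrative; the `idf` map is assumed to come from the low-level model trained earlier, and the paper's value of k is 2000.

```python
from collections import Counter

def compact_bag_of_words(words, idf, k=2000):
    """words: list of word ids pooled from a category's articles.
    Computes TF-IDF per word and keeps the k highest-scoring words,
    returning a sparse map {word_id: tf_idf}."""
    tf = Counter(words)
    total = len(words)
    scores = {w: (c / total) * idf.get(w, 0.0) for w, c in tf.items()}
    top = sorted(scores.items(), key=lambda kv: -kv[1])[:k]
    return dict(top)

# Toy example: word 1 is frequent but has a low IDF, so it is dropped
# when we keep only the top two words.
bag = compact_bag_of_words([1, 1, 2, 3], {1: 0.5, 2: 2.0, 3: 2.0}, k=2)
```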
That is all we need to prepare offline. When the system is online, feeds stream into it from the data sources. Minerva extracts the same features from feeds as from the training sets (word TF-IDF and topic distribution) and stores them in the database with the feed's identifier. The content of a feed is not stored with its features; it stays in the WebFusion database, and Minerva uses the identifier to fetch the content for users.
When a user requests feeds for a category or a compound of categories, Minerva uses the category features to find similar feeds directly in the database. We built a classifier inside the database engine to predict whether a feed belongs to the category. First, the database finds the rows that contain feature factors of the category. Then it computes a score for each feed according to a hypothesis function, and a threshold on the score determines the result. Logistic regression could be used to train this model for better results; in our experiment, we use a simpler hypothesis function to compute the feed score.
Here D is the feed's topic-distribution feature and W is the TF-IDF of its words; t is the threshold, and if h(x) > t the feed belongs to the topic. Because the number of candidate categories for each feed may be large, we do not determine the category list beforehand; instead, we predict the categories when the user asks. For performance, Minerva runs this algorithm inside the database. With the help of indexes on the maps, the database does not need to scan all records to compute their h(x).
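The exact form of h(x) is given as an equation in the paper and is not reproduced here; a plausible sketch consistent with the description (a similarity over the sparse topic-distribution map D and TF-IDF map W, thresholded at t) is shown below. The unweighted sum of two sparse dot products is an assumption, as are all the names.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {index: value} maps.
    Iterating the smaller map keeps the cost proportional to its size."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)

def h(feed, category):
    """Hypothetical score combining both feature maps."""
    return sparse_dot(feed["D"], category["D"]) + sparse_dot(feed["W"], category["W"])

def belongs(feed, category, t):
    """The feed is labeled with the category when the score exceeds t."""
    return h(feed, category) > t

# Toy features: "D" maps topic id -> probability, "W" maps word id -> TF-IDF.
feed = {"D": {0: 0.5, 1: 0.5}, "W": {7: 1.0}}
category = {"D": {0: 1.0}, "W": {7: 0.5}}
```

The same scoring function also serves the reverse direction: computing h(x) for a feed against many categories and keeping the top n.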
Users can also get the category list for a feed in reverse: we use the feed's features to compute h(x) for the candidate categories and return the top n in descending order.
Fig. 2. The data flow of the training procedure.
Fig. 3. The composition of the bag of words for categories that have subcategories.
* Thanks to Yuhao Zhu @ USC for reviewing a draft of this article