The Demo is ready

I am thrilled to announce that the first tech demo of the new WebFusion (codenamed odin) is complete.

The address of the demo is: http://www.shisoft.net/

This demo exhibits the feed-recommendation feature: it recommends feeds (from about 1,000 Twitter accounts I follow for this demo) based on user-selected categories. The feature is an imitation of the mobile app Zite, which has been acquired by Flipboard[1]. Simply type your thoughts into the text box, and the system will give you hints automatically.

[Screenshot: category hints suggested under the text box]

You can then use your mouse or keyboard to select an item. If you prefer to let the system recommend feeds for a combination of categories, feel free to select more than one. Then click the search button to submit your query.

[Screenshot: recommended feeds for the selected categories]

Your recommended feeds should be displayed in no time. In the image above, you can see four feeds from Twitter. You can scroll down the page to load more, until there are no more feeds in the database related to your selected categories. Your results will vary between attempts because the system receives and processes new feeds in real time; you can always see the newest feeds as soon as they arrive in the system.

The system can also recommend feeds for user input that does not match any category in my database (updated 8 January 2016). Every internal category is prefixed with the symbol "♦", indicating that it has comprehensive features in my database and that the system can provide accurate recommendations for it. A category without the symbol means the system will derive features from the words you typed, so the recommendations may be less accurate than those for internal categories.

You can switch categories by typing new names into the text box, or simply click the links labeled "Related Categories" below each feed.

This is NOT a search engine. The system matches content to categories using both keywords and meaning, including the probability distribution over topics.

This feature is still experimental. After filtering, there are over 100,000 categories the user can select for recommendations. The training processes are all unsupervised, so it is impossible for me to check the accuracy of all of them. As far as I know, there are some flaws that make it imperfect. There are many overlapping categories with almost the same features, and their recommendation results are identical. The system can distinguish the Apple computer company from apples on trees, but it cannot tell the difference between "Google" and "Microsoft", because their topics are the same and their keywords are almost identical. Since all of the categories and their data come from Wikipedia dumps, the quality cannot be controlled, and some categories are also overfitted toward their more popular subcategories.

This feature is part of the new WebFusion. In the demo version, the "Share" and "Reply" buttons are not functional yet; they are only placeholders.

The dedicated server for this demo is currently in Shanghai, because the system is huge and I cannot afford virtual servers from a cloud provider like DigitalOcean. Because of the national Great Firewall, U.S. users may see timeouts or find the service temporarily unavailable. I am trying to overcome these problems. If you have any difficulty viewing this demo, please don't hesitate to contact me.

* I have done my best to improve the experience of visiting this demo by redirecting traffic through 3 more servers. The outbound server is in San Francisco alongside this blog; the origin server is still in Shanghai.

* I have already replaced SSH with Shadowsocks for the internal proxy service to Twitter and other feed sources blocked in China. But the route to my Shadowsocks server is also unstable. I solved this by using HAProxy to load-balance and fail over across 9 different routes to the Shadowsocks servers in the U.S., and it works.

* It has been a long time since I first started collecting Twitter feeds and computing their features. I have made the parameters slightly stricter, so the category recommendations should be more accurate.

* You can get more technical details here:

Real-time topic-based social network feeds recommending system

Hao Shi

2015-04-18

PDF File

Real-time topic-based social network feeds recommending system

Modern social networks are organised around authors. People connect with each other by sending friend requests or following before they share information, so readers cannot get information by interest. Some apps, like Zite, allow users to define the topics they are interested in and offer articles from multiple blogs and news feeds. But there is no such recommending system for short feeds like tweets or Facebook messages.

We built a recommending system, codenamed ‘Minerva’, that lets users browse the newest social network feeds by their interests. The goals of this system are:

  • Let users choose their interests in natural language, like ‘NASA’ and ‘Software’. The system returns the newest feeds from multiple sources such as Facebook / Twitter / Tumblr / Weibo / RSS related to the user's choice, ordered by publish time.
  • Label feeds with their categories.


Fig. 1. Example of category labels for a feed.

The box indicates the categories of the feed. Users can continue exploring feeds through related categories.

This article summarizes the high-level implementation of our system.

We use raw data to train low-level models such as the inverse document frequency (IDF) of words and the LDA word-topic counts (NWZ). Next we compute word and topic distributions as features for the categories in Wikipedia. Each category has a name in natural language.

We need a public-domain corpus to make sense of categories. Wikipedia holds more than 4,000,000 articles and 400,000 valid categories; it is contributed by human volunteers and its quality meets our needs. It is ideal for this research.

In the online production scenario, when our data sources deliver new feeds from social networks, we compute their features in the same way as for Wikipedia categories, then save and index the features in the database for querying. When a user wants to read feeds for one or more categories, we fetch the category features and run the query inside the database system to retrieve feeds related to the chosen categories. To label feeds, we use the feed features to search for the top n categories by feature similarity.

The data flow of training the low-level and category features is described in Figure 2. We get feed data from WebFusion and Wikipedia dumps from the official Wikipedia web site. The dumps were parsed into the database. To transform the raw data into a corpus, we segmented the text, stripped formatting, and represented words as numbers. The entire corpus is combined into the training set for the low-level models.

Because the amount of data is too large for a single worker (about 22 GB of Wikipedia raw data and 7 GB of feed raw data), we performed the computations in parallel on a cluster with 172 vCPUs and 224 GB of RAM. We store raw data in MongoDB, result data in PostgreSQL, and intermediate data in HDFS. The data processing framework is Hadoop MapReduce.

First we compute the inverse document frequency (IDF) of each word. This is used both as a word feature and to reduce document size for LDA inference. Next, we compute NWZ, one of the major steps of the topic model: we run Gibbs sampling with 1024 topics for 100 iterations. NWZ is later used to compute the topic probability distribution of texts.
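As a rough illustration only (not the actual Minerva code), the IDF step could be sketched in Clojure like this, assuming docs is a sequence of tokenized documents:

    ;; idf(w) = log(N / df(w)), where df(w) counts the documents containing w
    (defn idf-model [docs]
      (let [n  (count docs)
            df (frequencies (mapcat distinct docs))]
        (into {} (map (fn [[w c]] [w (Math/log (/ n (double c)))]) df))))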

Once the low-level models are ready, we can work on the most important part of the system: extracting features for each category from Wikipedia. We need to collect categories and connect them to their parent categories. The Wikipedia category hierarchy is a graph, and categories should inherit features from their subcategories. To avoid cyclic dependencies and to keep the feature factors focused, we limit the number of hops. We then obtain, for each article, a list of its categories and neighbouring categories, map the article's words to each item in that category list, and reduce the words for each category to get its bag of words. The bag of words may be too large for some categories, so we use the IDF to compute TF-IDF for each word in the category and take the top 2000 words as a new, compact bag of words (see the sketch below). Category features consist of the topic distribution and the word TF-IDF values, both of which can be derived from the bag of words. The two features are both sparse vectors, so we represent their factors as maps. To store the features in the database for retrieval, we use the PostgreSQL extension HStore to store and index the map data.
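The compaction step could look roughly like the following sketch, assuming word-counts is a category's raw bag of words (word → count) and idf-model is the map computed above; the names are illustrative, not the real code:

    ;; keep only the top-n words of a category, ranked by TF-IDF weight
    (defn compact-bag [idf word-counts n-keep]
      (let [total (double (reduce + (vals word-counts)))]
        (->> word-counts
             (map (fn [[w c]] [w (* (/ c total) (get idf w 0.0))]))
             (sort-by second >)
             (take n-keep)
             (into {}))))

    ;; e.g. (compact-bag idf-model category-word-counts 2000)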

That is all we need to prepare offline. When the system is online, feeds stream into the system from the data sources. Minerva extracts features from each feed just as for the training set (word TF-IDF and topic distribution) and stores them in the database along with the feed identifier. The content of a feed is not stored with its features; it is saved in the WebFusion database, and Minerva uses the identifier to fetch contents for users.
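As a hedged sketch of this step with clojure.java.jdbc, the features could be written to HStore columns as below; the table and column names are assumptions for illustration, not the real schema:

    (require '[clojure.java.jdbc :as jdbc]
             '[clojure.string :as str])

    ;; naive hstore literal, assuming keys and values contain no quotes
    (defn ->hstore [m]
      (str/join ", " (map (fn [[k v]] (format "\"%s\"=>\"%s\"" (name k) v)) m)))

    (defn save-feed-features! [db feed-id topics words]
      (jdbc/execute! db
        ["INSERT INTO feed_features (feed_id, topics, words) VALUES (?, ?::hstore, ?::hstore)"
         feed-id (->hstore topics) (->hstore words)]))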

When a user requests feeds for a single category or a combination of categories, Minerva uses the category features to find similar feeds directly in the database. We built a classifier in the database engine to predict whether a feed belongs to a category. First, the database finds the rows that contain feature factors of the category; then it computes a score for each feed according to a hypothesis function, and a threshold on the score determines the result. In principle, we could use logistic regression to train this model for better results; in our experiment, we use a simpler hypothesis function to score feeds, like:

[Equation image: the hypothesis function h(x)]

Here D is the feed's topic distribution and W is the TF-IDF of its words; t is the threshold, and if h(x) > t the feed belongs to the topic. Because the number of candidate categories for each feed may be large, we do not determine the category list beforehand; instead, we predict the categories when the user asks. For performance reasons, Minerva runs this algorithm inside the database. With the help of indexes on the maps, the database does not need to scan all records to compute h(x).
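Since the formula itself is only available as an image above, here is a hedged illustration of the kind of scoring it describes, built from dot products of the sparse topic and word maps; it is not the original hypothesis function:

    ;; dot product of two sparse vectors represented as maps
    (defn sparse-dot [a b]
      (reduce-kv (fn [acc k v] (+ acc (* v (get b k 0.0)))) 0.0 a))

    ;; score a feed against a category; both carry :topics and :words maps
    (defn feed-score [category feed]
      (+ (sparse-dot (:topics category) (:topics feed))
         (sparse-dot (:words category) (:words feed))))

    ;; the feed is labelled with the category when the score clears the threshold t
    (defn belongs? [category feed t]
      (> (feed-score category feed) t))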

Users can also get the category list for a feed in reverse: we use the feed features to compute h(x) for categories and take the top n in descending order.
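Continuing the sketch above, the reverse direction could be as simple as scoring every candidate category and keeping the best n; again, this is illustrative only:

    ;; label a feed with its top-n categories by score
    (defn label-feed [categories feed n]
      (->> categories
           (map (juxt :name #(feed-score % feed)))
           (sort-by second >)
           (take n)))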


Fig. 2. The data flow of the training procedure.


Fig. 3. The composition of the bag of words for categories that have subcategories.

 

* Thanks to Yuhao Zhu @ USC for reviewing a draft version of this article.

External scripts with ClojureScript

An important concept in optimising a web site is to reduce the number of files and compress them. Lein-cljsbuild for ClojureScript does a great job here: it combines the various cljs files and compiles them into one js file, removing unnecessary code and whitespace. But when there are external js files in the project, how to manage them becomes an issue.

This article provides some approaches to do so. I have been doing research on this for some time, and I think I need to take some notes for the record.

I had high expectations for the :externs approach. I thought of it as a way to attach the contents of the external js file to the compiled cljs output, but I cannot find anything from the externs js file in the compiled file, and the compiler throws warnings about it. I am not very sure what it actually does.

I also tried the :foreign-libs approach, but still got nothing in the output after compilation. I think I misunderstood the manual, but I don't know which part.
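For the record, this is roughly how the two options look in the :cljsbuild compiler settings; the paths and the :provides name are placeholders. As far as I can tell from the documentation, :externs only tells the Google Closure compiler which external names to leave untouched under advanced optimizations and never copies the file into the output, and a :foreign-libs file is only bundled when a namespace listed in its :provides is actually required from the ClojureScript code, which may be the part I missed.

    :cljsbuild {:builds
                [{:source-paths ["src-cljs"]
                  :compiler {:output-to     "resources/public/js/app.js"
                             :optimizations :advanced
                             ;; names preserved for the Closure compiler, not bundled
                             :externs       ["externs/mylib.js"]
                             ;; bundled only when the provided namespace is required
                             :foreign-libs  [{:file "resources/public/js/mylib.js"
                                              :provides ["mylib"]}]}}]}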

The final approach I used to resolve this issue is to attach a <script> tag to the web page after all of the compiled js files have loaded.
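A minimal ClojureScript sketch of that workaround; the path is a placeholder:

    ;; append an external <script> after the compiled application code has loaded
    (defn load-external-js! [src]
      (let [tag (.createElement js/document "script")]
        (set! (.-src tag) src)
        (.appendChild (.-body js/document) tag)))

    ;; e.g. (load-external-js! "/js/some-external-lib.js")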