Open Source Machine Learning for Billion Size Data Graphs

Tags:

By Pablo Rodriguez (@pabloryr) , Director of Research and Innovation at Telefónica Digital

20 February 2014: Plenty of information is encoded in network graphs, be it social channels, the web, the Protein Interaction network, or the phone communications network. Graphs encode relationships with vertices representing nodes, and edges representing the relationship. For instance in the Web graph, vertices correspond to individual pages, while edges correspond to how pages link to one another.

Analysing such graphs is indispensable to many organisations and an integral part of their analytical capabilities

Examples where using such analysis is useful include:

Discover influential users or detect communities in social networks,
Improve fraud detection through the analysis of network transactions, or
Develop movie recommending systems or suggestions for new blogs or products to users.

For instance, a movie recommendation system is able to predict what people may like (e.g. which movies to watch) based on past historical data (e.g. movie ratings). That information is encoded in a graph that connects users, movies and their ratings. Using a model-based computation system (for instance matrix factorization), one is able to predict user ratings for movies, and therefore, recommend movies based on those predictions. This sort of mechanism is used by companies like Netflix, Amazon, or Telefonica on a daily basis to improve customer experience and user satisfaction.

Such graphs can have billions of vertices and millions of nodes (e.g. 100 million movie ratings), and extracting the right information out of them is often a daunting task in terms of computational resources and algorithmic capabilities.

The most interesting information is often extracted using Machine Learning algorithms, which learn from the data and extract interesting features. For instance, the set of clusters of a given graph, its popularity ranking, the shortest path between nodes, or the set of most influential nodes in the network.

To ease the analysis of such graphs, data is often partitioned across clusters of commodity hardware using big data processing systems such as Hadoop or similar. Hadoop supports a data-parallel computation model, where individual computers work on their own part of the data and results are merged at the end to produce the final solution. However, due to the inherent inter-dependency that characterises graphs, implementing algorithms such as Machine Learning processing algorithms on top of Hadoop becomes difficult and inefficient.

In this regard, Giraph is a new framework for Hadoop which supports a parallel-data computing model of graphs, allowing the machines to exchange messages among them while the algorithm is running. Giraph is now an Open Source project backed by Facebook, which can be seen as an ad-on on top of Hadoop that gives it the capabilities to process graphs at scale.

Over the last three years a team of researchers at Telefonica Digital (http://tid.aerstudio.com/research) has worked on developing advanced Machine Learning algorithms for graph mining that can run on top of Giraph. These research efforts have produced a set of advanced machine learning and ranking algorithms, which have been implemented in a software package call Okapi. Okapi includes state-of-the-art features, such as the latest algorithms for collaborative filtering, recommender systems and social network analysis for Big Data.

Following the spirit of community building and openness that pushed us to contribute to the launch of Firefox OS and foster the Open Web device, we have also decided to release Okapi as an Open Source project under the Apache 2.0 License, hopefully becoming a key open-source Machine Learning library for Apache Giraph. Algorithms include tensor factorization for mean average precision optimisation (TFMAP), collaborative less-is-more filtering (CLiMF), or fake account detection (sybilrank).

More is currently in the pipeline, and Okapi can deal today with Internet scale data, even on a relatively small cluster. The full set of algorithms is currently implemented in http://grafos.ml/index.html#Okapi, and the code has already received multiple community contributions from various academic institutions around the world. We hope an active community of contributors will form to make this effort a great success, developing modern Open Source Big Data tools for a new cloud environments that will benefit all. For more information, please visit http://grafos.ml/