Building a Recommendation Engine Using a Graph Database



A very common use case for Graph Databases is to support client applications that provide suggestions for their customers or rate products or services based on user behavior or preference. For the InfiniteGraph 3.4 release, we built a Podcast Recommendation Sample using the features available in IG 3.4 and previous releases.

A recommendation engine is typically built using a graph database for three reasons:

  1. Recommendations use multiple factors that have complex relationships
  2. Recommendations are given in real-time
  3. Recommendations utilize multiple levels of relationships

For example, recommending purchases might involve looking at purchase-event information and the many types of relationships between purchases, including customer preferences; website, store, or brand loyalty; and general popularity. It may also be beneficial to look at relationships between customers, including trends within communities, socio-economic groups, etc. This type of relationship analytics is most successful when it looks at multiple levels of depth to see how pervasive trends may be. This blog will look at complex relationships and multiple levels of relationships in graph data sets to see how they can be mined to develop an interesting and compelling recommendation algorithm.


To create a successful recommendation service, care must be taken to determine what types of data are required to make the recommendations compelling. For example, if we recommended podcasts based only on popularity, we would be offering recommendations along a single differentiating factor: all queries would return the same top set of podcast titles. Even two factors of differentiation are uninteresting. Consider using category and popularity: all queries within the same category would return the same most-popular results, and since there are only 16 iTunes categories, this is very likely to occur. Accordingly, we looked for other factors that would allow us to provide unique results for most, if not all, of our podcast queries.

Here is a list of differentiating factors that we used to design our recommendation algorithm and provide our simple user interface.

  • Category
  • Popularity within category
  • Keywords
  • Related Podcast Data
  • Ownership/Authorship

On top of this, we delved deeper into the connections in the dataset (across multiple degrees of separation for related data) to offer more comprehensive recommendations. Randomness could also provide some differentiation, but we decided that subsequent queries with the same parameters should return the same results, to give the user more confidence in them. With all of this, we were able to offer a unique ranked list of podcast recommendations for most queries.
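To make these factors concrete, here is a minimal, self-contained sketch of how several differentiating factors might be folded into a single rank value. The weights and method names here are illustrative assumptions for this post, not the values used by the actual sample:

```java
// Illustrative ranking sketch: combine several differentiating factors
// into one integer rank. The weights are assumptions for the example.
public class PodcastRank {
    static int rank(boolean categoryMatch, int sharedKeywords, int popularity, boolean sharedAuthor) {
        int score = 0;
        if (categoryMatch) score += 4;     // same category as the query
        score += 2 * sharedKeywords;       // each shared keyword adds weight
        score += Math.min(popularity, 3);  // cap popularity so it can't dominate
        if (sharedAuthor) score += 3;      // shared owner/author relationship
        return score;
    }

    public static void main(String[] args) {
        // A podcast in the same category sharing two keywords:
        System.out.println(rank(true, 2, 5, false)); // prints 11
    }
}
```

Because every factor contributes independently, two podcasts rarely tie on all of them at once, which is what gives most queries a unique ordering.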

Because we are using a graph database, the navigation engine gives us a fast way to retrieve the data that feeds the recommendation engine, enabling real-time results. We also store information on the edges themselves (in the edge types), which lets us retrieve more relevant data quickly. The following shows the data model used in the sample.

Image 1: The Podcast Data Model

We store most of the data in the vertex objects, but there is also critical data in the edges, like "keywords" in the CategorizedBy edge and "pubDate" in the PublishedBy edge. We didn't use the episode data to provide another layer of differentiation for the results, but this could be done. For example, as an extension to the podcast recommendation ranking algorithm, you could recommend the podcast with more current episodes or with a higher episode count.

Sources of Data

The main source of our dataset was a GitHub project that scraped the feeds of all publicly available podcasts and presented them as TSV-formatted compressed files. This gave us a way to get historical podcast data, but parsing the backlog of the entire podcast dataset was not realistic (i.e., parsing the RSS feeds for all of the podcasts would take too long). Therefore, we needed a way to identify the top relevant podcasts. We used the iTunes Search API, which can list the top podcasts per category in JSON format. After identifying relevant podcasts, we parsed their RSS feed data to get episode, category, and keyword data. Finally, we had to parse information about relevant user preferences from a custom file format.

Challenges of Data ETL

One main hurdle was that each of the data sources came in a different format that we had to parse. This was mitigated by using parsing libraries and APIs: open-csv, which allowed us to parse the TSV-formatted files; json-smart, which allowed us to parse the JSON streams; and the standard XMLStreamReader, which allowed us to parse the XML RSS feeds. These libraries made the transform cycle much more feasible.
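As a minimal illustration of the transform step, splitting one line of a TSV dump can be done with a single call (the real sample used open-csv for this; the field layout shown is made up):

```java
// Minimal sketch of the TSV transform step. The actual sample used the
// open-csv library; this only illustrates the idea.
public class TsvLine {
    static String[] parse(String line) {
        // -1 keeps trailing empty fields, so every row has the same width.
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        String[] fields = parse("The Vergecast\tTechnology\tweekly");
        System.out.println(fields.length); // prints 3
    }
}
```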

Another significant hurdle was avoiding duplication when ingesting data (like People) that appears in the RSS feeds of multiple podcasts. To handle this, we used a merge operation that is available as a new feature in IG 3.4 called a "representation task".

Code Sample 1: Using a Representation

// Represent the Episode vertex by its id (resolves the existing vertex).
VertexRepresentation episodeRep = new VertexRepresentation(new RepresentVertexId(Episode.class, episodeId));
// Represent the author by name; a new Person vertex is created only if
// no Person with this name already exists.
VertexRepresentation authorRep = new VertexRepresentation(new RepresentVertexId(Person.class, "name",
	authorName), new Person(authorName));
// Represent the AuthoredBy edge, incoming to the episode vertex.
EdgeRepresentation authoredByRep = new EdgeRepresentation(new RepresentEdgeId(AuthoredBy.class,
	EdgeKind.INCOMING, (short) 0));
// Merge: the query and the add/update happen in one optimized step.
graph.represent(authorRep, authoredByRep, episodeRep);

This "representation task" is useful if you are doing a single-process merge. It is orders of magnitude faster than performing an individual query followed by an individual add/update, because it encapsulates the query and the add/update into one optimized step. This makes the Load stage of ETL much easier and quicker. For more information on using the tasking framework, see the Wiki topic; see also the Subgraph Sample for how to use the tasking framework in a multi-threaded or multi-process environment.
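Outside of InfiniteGraph, the merge ("get or create, then link") semantics that the representation task performs can be sketched with a plain map keyed on the identifying field. This only illustrates the semantics, not the task's actual implementation:

```java
import java.util.*;

// Plain-Java sketch of merge semantics: look up the author by name,
// create the entry only if it is absent, then link the episode to it.
public class MergeSketch {
    static Map<String, List<String>> authorToEpisodes = new HashMap<>();

    static void represent(String authorName, String episodeId) {
        // computeIfAbsent mirrors "create the Person vertex only if no
        // vertex with this name exists", then we add the AuthoredBy link.
        authorToEpisodes.computeIfAbsent(authorName, k -> new ArrayList<>())
                        .add(episodeId);
    }

    public static void main(String[] args) {
        represent("Jane Doe", "ep-1");
        represent("Jane Doe", "ep-2"); // reuses the existing author, no duplicate
        System.out.println(authorToEpisodes.get("Jane Doe").size()); // prints 2
    }
}
```

The point of the representation task is that this lookup-or-create step runs inside the database engine as one operation, instead of a round trip for the query and another for the write.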

We also performed a batch query to avoid duplicating podcasts after getting the top lists by category and when adding the related information. The BatchQuery is a new feature in IG 3.4 that lets you perform an optimized scan for a large number of objects of the same type all at once. This is very helpful when resolving a list of unique identifiers to a list of persistent elements, such as mapping a table of primary keys to their persistent objects. Similarly, resolving a list of podcasts to their persistent elements in a single scan saves time.
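The idea behind the batch resolution can be sketched in plain Java: rather than issuing one query per title, make a single pass over the stored objects and pick out every title in the request set. The class and field names here are illustrative, not the BatchQuery API:

```java
import java.util.*;

// Sketch of batch resolution: one scan over the stored podcasts finds
// every requested title, instead of one query per title.
public class BatchResolve {
    record Podcast(String title) {}

    static List<Podcast> resolve(List<Podcast> stored, Set<String> wanted) {
        List<Podcast> found = new ArrayList<>();
        for (Podcast p : stored) {                  // single pass over the type's extent
            if (wanted.contains(p.title())) found.add(p);
        }
        return found;
    }

    public static void main(String[] args) {
        List<Podcast> stored = List.of(new Podcast("The Vergecast"),
                new Podcast("iMore show"), new Podcast("Serial"));
        System.out.println(resolve(stored, Set.of("The Vergecast", "Serial")).size()); // prints 2
    }
}
```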

Writing the Recommendation Algorithm

First we identified a simple user interface to provide access to recommendations. The user interface is based on the available dataset and what the user might be asking for. We identified three types of queries of interest:

  • By title
  • By genre or category
  • By keyword

Next, we created an algorithm that retrieves data matching these components while also retrieving secondary related data that can populate a secondary set of recommendations. For example, a popular podcast X could match a category, while other top podcasts Y and Z that are related to X but in a different category could also be recommended, with a lower rank. Shared keywords between podcasts X and Z could then increase the rank of podcast Z. Likewise, when querying by title, the algorithm can return a related podcast or one that shares a category and keywords. In other words, the user query is just an entry point into the algorithm and doesn't limit the types of connections that can be found. The output, as seen below, includes both the rank value and the types of matches that contributed to the rank.

Sample Output 1: Using the Podcast Recommendation Algorithm

Search for related and recommended podcasts based on: 
	(T)itle    Return other podcasts preferred by
		    listeners of this podcast.
	(G)enre    Return podcasts in this genre.
	(K)eywords Return podcasts represented by these
		    keywords.
	Type "E(x)it" to exit the program.
	Type "(P)rint" for a list of genres.
For example, to search the 'Technology' genre using
the keywords 'android' and 'apple', type:
	G=Technology K=android,apple
>> G=Technology K=android,apple

Found 194 podcast(s) in 0.124 seconds.
- Kid Friday - apps, websites, gadgets, games, fun! [10]
	Matched Category: [Technology]
	Matched keywords: [apple, android]

- This is Only a Test [10]
	Matched Category: [Technology]
	Matched keywords: [apple, android]

- The Vergecast [10]
	Matched Category: [Technology]
	Matched keywords: [apple, android]

- iMore show [6]
	Matched Category: [Technology]
	Matched keywords: [apple]

- Accidental Tech Podcast [6]
	Matched Category: [Technology]
	Matched keywords: [apple]

This provides just one example of how to go about writing a recommendation algorithm that leverages a graph database. More data could be added to make the recommendation algorithm more effective and to give more interesting results.

The Engine Behind the Algorithm

The key to making the recommendation algorithm work is the query and navigation engine behind it. Without these pieces, the recommendations would not be as rich and diverse, nor would they be served as quickly. Several features in IG 3.4 that contributed to the sample are worth noting.

Path Match Navigations

When getting recommendations by title, the recommendation algorithm executes two navigational queries. The first involves querying for related titles by one or two degrees.

Code Sample 2: Identifying Related Podcasts

PathCollector pCollector = new PathCollector();
PolicyChain policyChain = new PolicyChain(new MaximumPathDepthPolicy(2));
policyChain.addPolicy(new EdgeTraversalPolicy(EdgeKind.BIDIRECTIONAL, EdgeKind.INCOMING, EdgeKind.OUTGOING));
Navigator n = targetPc.navigate(null, "[Podcast]-[Related]->...[Podcast]", policyChain, pCollector);

Notice how the "..." in the path-match pattern allows for any number of steps, while the MaximumPathDepthPolicy set to 2 limits the navigation to two path steps. Also, we allow both outgoing and incoming edges, but in the post-processing of the navigation results we give paths with incoming edges a lower rank. Finally, we are using a PathCollector result handler because it allows for optimized navigation queries.
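The post-processing step described above can be sketched as a small scoring pass over each returned path. The direction weights here are illustrative assumptions, not the sample's actual values:

```java
import java.util.*;

// Sketch of path post-processing: paths traversed through outgoing
// Related edges score higher than those reached via incoming edges.
public class PathRanker {
    enum Direction { OUTGOING, INCOMING }

    static int pathScore(List<Direction> edgeDirections) {
        int score = 0;
        for (Direction d : edgeDirections) {
            score += (d == Direction.OUTGOING) ? 3 : 1; // incoming edges count less
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(pathScore(List.of(Direction.OUTGOING, Direction.OUTGOING))); // prints 6
        System.out.println(pathScore(List.of(Direction.INCOMING)));                     // prints 1
    }
}
```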

We are also using path matching for the shared owner/author relationship between podcasts.

Code Sample 3: Finding Shared Owner/Author Podcasts

n = targetPc.navigate(null, "[Podcast]-[OwnedBy|AuthoredBy]->[Person]-[OwnedBy|AuthoredBy]->[Podcast]", null, pCollector);

Using "|" to separate types in the path-match pattern lets us combine what would otherwise be two separate navigations into one. Again, we use the PathCollector result handler for optimized navigation.

Optimized Edge Discovery Methods

When searching for matching categories or keywords, we use two techniques for optimized edge discovery. The first is the PreloadCacheForHandlesPolicy, which allows us to front-load the Java cache when getting the vertex/edge objects from the VertexHandle or EdgeHandle. The second is the Vertex#getAllEdges(className, predicate) method, which we use to retrieve all of the CategorizedBy edges.

Code Sample 4: Get All Edges

// Scan all CategorizedBy edges on the category vertex (no predicate).
for (EdgeHandle edgeHandle : category.getAllEdges(CategorizedBy.class.getName(), null)) {
   // Follow the edge to the podcast vertex on the other side.
   Podcast podcast = (Podcast) edgeHandle.getPeer().getVertex();
   // ... rank the podcast for the recommendation results ...
}

Using the PreloadCacheForHandlesPolicy and the getAllEdges(...) method on the Vertex allows us to optimize lookups for associated edge information.

Using Indices and Variables for Querying

When we create the database, we set up native graph indices for Podcasts, People, and Categories. This allows us to get fast results when querying, both during ingest and from within the recommendation algorithm.

Code Sample 5: Setting up Indices

IndexManager indexManager = graph.getIndexManager();
// add indexes for people, categories and podcasts
indexManager.addGraphIndex("people", Person.class.getName(), new String[] {"name"}, true, "Person");
indexManager.addGraphIndex("categories", Category.class.getName(), new String[] { "name" }, true, "Category");
indexManager.addGraphIndex("podcasts", Podcast.class.getName(), new String[] { "title" }, false, "Podcast");

Once the indices have been set up, we get optimal behavior for all of our queries. In the recommendation algorithm, we commonly query for a Podcast by title, so we also use variable queries. These allow us to set up a common query for a particular type and field name and execute it for different values of that field.

Code Sample 6: Configuring variable query for Podcasts

// Created once and cached: a variable query for Podcast lookups by title.
podcastNameQuery = graph.createQuery(Podcast.class.getName(), "title==$podcastName:STRING");
// For each lookup, bind the variable and execute the prebuilt query.
podcastNameQuery.setStringVarValue("podcastName", podcastName);
return podcastNameQuery.getSingleResult();

This allows us to bypass the creation of the query every time that we need to execute it. By doing this, we can reuse the query and get the best possible speed for single lookups.
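The reuse pattern itself, build the expensive lookup structure once, then run many cheap lookups against it, can be sketched in plain Java. Here a precomputed index map stands in for the prepared query; the names are illustrative:

```java
import java.util.Map;

// Sketch of the build-once, look-up-many pattern behind variable queries.
// A precomputed title index stands in for the prepared query object.
public class ReusableLookup {
    private final Map<String, String> titleIndex; // title -> podcast id

    ReusableLookup(Map<String, String> titleIndex) {
        this.titleIndex = titleIndex;             // "create the query" once
    }

    String lookup(String title) {
        return titleIndex.get(title);             // "rebind the variable" and execute
    }

    public static void main(String[] args) {
        ReusableLookup q = new ReusableLookup(Map.of("Serial", "p-42"));
        System.out.println(q.lookup("Serial")); // prints p-42
    }
}
```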

Representations of Data

Reiterating this concept from earlier: the tasking framework allows users to combine the query and add/update into a single merge step, making the process of adding unique data from different sources much easier. For more information on using "representation tasks" in your application, see our Wiki topics.


We hope you enjoy the new podcast recommendation sample and, more importantly, the exciting new features of IG 3.4 that made it possible. As always, if you have any feedback or questions, feel free to check out our Google group. For more information about Objectivity or InfiniteGraph, or if you want to build your own recommendation engine application, visit our website or contact Objectivity support. Happy Trails!