Declarative Objectivity (DO) Language : Querying for Data : Performing Graph Queries
Performing Graph Queries
In this chapter, you’ll see how to perform graph queries with DO to read data from a federated database. If you have used query languages such as Cypher, then DO graph queries should look familiar to you.
When you perform a graph query, you view your data as a graph, and base your search criteria on how the data of interest is connected to other data within that graph. You express the search criteria as patterns that match characteristics like the length or composition of paths that connect the data.
In this chapter, we will see how to use patterns to qualify paths within a data graph, and then return values extracted from those paths. For example, you can use this approach to find the crime movies that received 5-star ratings from lawyers, and then obtain the titles of those movies, the ages of the lawyers who liked them, and so on. This technique can be used to perform collaborative filtering in recommender systems—if a new user likes a movie that has been highly rated by a particular user population, what other movies might that new user like?
Note:In the next chapter, Navigating a Graph, we will use graph queries to return not just values from paths, but entire paths themselves, which are useful when you want to know all about how certain objects are connected.
Data Stored as Graphs
A federated database is ideally suited for graph queries because it stores your persistent data not only as objects, but as objects in a graph. In fact, the object model for storing persistent data is a general implementation of a graph model such as those used by other popular data stores.
Anatomy of a Graph
Recall from Data Stored as Objects that persistent data is represented as objects of specific classes (types), where an object has named values corresponding to the attributes of its class. A relationship between two objects exists when an attribute of one of the objects has a reference to the other.
Notice how similar this is to a typical graph model, in which persistent data is represented as vertices that are connected by edges. Each edge is directed, which means that it starts at one vertex and ends at another. A vertex can have a type and any number of named values. (If you are familiar with Cypher, you might recognize these as labels and properties, respectively.)
You can think of the object model as the general case for representing data, while the graph model is specialized for representing connectivity between objects. When talking about graphs and graph queries, we’ll use the specialized vocabulary such as vertex and edge, but we’ll keep the terms class and attribute values.
Borrowing the Sample Schema and Data Set from the previous chapter, we can represent the movie-ratings data set as a graph. This diagram shows a small portion of that graph:
In this diagram, we see that each object is a vertex, and each reference from one object to another is an edge. In this particular graph, every edge is paired with an inverse edge—for example, each Rating object has a user edge leading to a User object, which, in turn, has a reviews edge leading back to the Rating object. (To reduce clutter, the diagram represents each pair of inverse edges as a two-headed arrow.) Inverse edges are very useful, but not required—many data sets form graphs in which some pairs of vertices are connected by “one-way arrows” (edges without inverses).
Notice that in some cases, an edge corresponds to an attribute that holds a single reference. For example, a Rating object’s user attribute holds a reference to a particular User object. Such an edge corresponds to a to-one relationship between the Rating object and the User object.
In other cases, an edge corresponds to one of many references that are elements of a collection. For example, a User object’s reviews attribute holds a collection of references, and each of those references is an edge that connects to a particular Rating object. Such an edge is part of a to-many relationship between the User object and its Rating objects.
Paths in a Graph
Within a graph, you can follow directed edges to traverse from one vertex to the next. A particular traversal can describe a path through the graph, where a path is a chain of alternating vertices and edges in which no vertex or edge is repeated. A path begins on a vertex and ends on a different vertex.
For example, in the following subgraph, two distinct paths lead from a particular 4-star rating of the movie Charade to the two genres shown for the movie:
Paths are a central concept for querying graphs. Depending on the nature of the data set, a path may represent:
A particular configuration of significant objects from which values of interest can be discovered or derived. For example, from a particular configuration of movies, ratings, and users, we can find other movies that similar users will probably like.
A particular set of movements that can be traced among significant objects. For example, given a set of channels between persons, devices, or locations, we can find particular flows of information, commerce, pathogens, and so on.
In this chapter, we will see queries that extract values from configurations of objects. Furthermore, we will work only with paths, although it’s worth noting here that DO queries can distinguish between paths and trails, which allow vertices to be revisited, as long as no edge is repeated. (In the next chapter, Navigating a Graph, we’ll query for paths and trails that represent a flow.)
When Edges Need Values
The vertices in a graph normally store values of various types, in addition to storing references. For example, a Rating vertex stores a numeric score and a timestamp, in addition to storing a reference to a User vertex and a reference to a Movie vertex.
In some data sets, it is meaningful for edges to store values that are not part of any vertex. Indeed, you could imagine using a Rating object as an edge to connect a pair of User and Movie vertices. In this scenario, the rating score and timestamp values would belong to the edge itself.
Whether an edge is implemented as a reference to an object or as an object in its own right is determined by the schema for the data set. By default, a schema is set up so that the objects of every class are vertices, and every connecting edge corresponds either to a reference attribute or to an element in a collection of references. As an alternative, it is possible to configure some classes to be edge classes, so their objects are considered edge objects.
Note:The examples in this chapter use the movie-ratings data set, in which all edges are implemented as references.  
A First Look at a Graph Query
Like other DO queries, a DO graph query is composed of clauses. The following two clauses are required:
A MATCH clause, which searches a graph for paths that match (conform to) a specified pattern.
A RETURN clause, which specifies the information to be returned from the matched paths, if any are found.
Note:You can optionally add a WHERE clause to further qualify paths that match the pattern; we’ll see examples later, in Filtering Vertices With an Optional WHERE Clause.
Here is a very simple graph query that returns pairs of vertices of any type that are connected by a single edge:
This example is quite simple, and not especially useful. Nevertheless, it shows several important aspects of a graph query:
You can think of the MATCH clause as the graph-query counterpart of the FROM clause in a by-value query. Whereas a FROM clause specifies a query’s source as objects, a MATCH clause specifies a query’s source as paths.
You can think of a path pattern as ASCII art that uses text symbols to represent the ovals and arrows that you might use to draw vertices and edges in a diagram:
You can use variables in a path pattern to represent the individual components of the paths being matched. Here, two comma-separated variables (a and b) are assigned to the vertices in the pattern. The RETURN clause then uses these variables to refer to any matched vertices as values to be returned.
The path pattern within the MATCH clause is the central feature of a graph query because it describes the combination of vertices and edges you are looking for. Whether you want to find all paths that represent a recurrent configuration of similar vertices, or a particular path that represents a singular configuration of specific vertices, you identify the path(s) of interest by providing a pattern to be matched.
The following sections show the basic building blocks of a path pattern. In our examples, we’ll continue using the movie ratings data set described in Sample Schema and Data Set.
Patterns for Describing Vertices
The minimal pattern for describing a vertex in a path is an empty pair of parentheses. This minimal pattern does not convey any additional information about the vertex, but instead matches a vertex of any type containing any values. We could imagine using such as minimal pattern simply to indicate that some vertex is present at a particular point the path being described: 
Specifying the Type of a Vertex
Normally, you know the types of the vertices in the paths you are looking for. You can specify the schema class of the vertex objects to be matched by including :className between the parentheses. For example, the following partial pattern indicates that a vertex of class Movie must be present in the path being described:
Assigning a Variable to a Vertex
You can assign a variable to a vertex to enable other clauses in the query to refer to it. For example, the following partial pattern assigns a variable to the Movie vertex. Doing so enables the RETURN clause to return the matched vertex objects as values:
The variable assignment precedes the type designation if there is one. When no type is designated, a variable simply occurs by itself within the parentheses—for example, (m). You can introduce variables even if you never use them later in the query.
A variable name can be any string you choose, provided it begins with a letter. A variable name does not need to correspond to any particular name or value in the database. Short variable names tend to be most convenient, although you may prefer more descriptive names, such as(filmFromBeforeYear2000:Movie).
Specifying Conditions on the Values of a Vertex
You commonly need to describe paths in which vertices of a particular type have attribute values that must meet particular conditions. You can do this by adding inline qualification directly in the pattern.
In the simplest case, a pattern for a vertex can specify a literal value to be compared against the value of a particular attribute. You separate the attribute name and the literal value with a colon, and enclose them in curly braces. The attribute must belong to the class that is specified for the vertex. A vertex matches the pattern only if it is of the correct type and has the specified value for the named attribute. For example, the following partial pattern matches a Movie vertex only if its title is the string 'Charade':
You can specify more complex conditions on vertices by enclosing a predicate expression in the curly braces. Recall from Filtering the Values to be Returned that you can compose predicate expressions to qualify objects based on a wide variety of tests. For example, the following partial pattern matches a Movie vertex only if its title is a string starting with 'C' and if it was released before 1950:
Notice that you use the colon notation only when comparing an attribute value to a literal value for equality. But it is possible to compose a predicate expression that does the same thing. Consequently, the following two partial patterns are equivalent:
(m:Movie {title: 'Charade'})    // equality comparison notation
(m:Movie {title == 'Charade'})  // predicate expression syntax
Patterns for Describing Edges
You can describe various aspects of the edges between adjacent vertices in a path pattern.
Specifying Edge Direction
The minimal pattern for describing edges between pairs of adjacent vertices is an arrow formed by two hyphens and a greater-than symbol. The arrow indicates the edge direction. For example, the following pattern is matched by edges that originate from a Movie vertex and lead to a Genre vertex: 
Notice that the arrow always starts with the “from” vertex and points to the “to” vertex. The order of vertices can be reversed to match any edges leading from a Genre vertex to a Movie vertex: 
Identifying an Edge by Type
You can identify a particular edge of interest by including type information about it. Doing so is particularly useful if you are querying a graph in which a “from” vertex is connected to a “to” vertex through several different types of edges. For example, this could happen if our Movie class were associated with the Genre class through two different attributes—say, a genres attribute (for referencing all of a movie’s genres), and a primaryGenre attribute (for referencing just the single most important genre). For some queries, it could be important to identify which of these edges to consider when searching for matching paths. Omitting the edge type would allow the pattern to match edges that correspond to either attribute.
You specify edge-type information in a pair of square brackets within the arrow. The information you specify depends on whether the edge is implemented as a reference or as an edge object (see When Edges Need Values).
Edges Implemented as References
If the edge of interest is implemented as a reference, you specify the name of the reference attribute by including [:attributeName] within the arrow that represents the edge. For example, the following pattern describes edges that correspond to the genres attribute of a Movie object:
Notice that the specified reference attribute (genres) is defined by the schema class of the preceding “from” vertex (Movie), which is explicitly specified in the pattern.
Note:When you identify an edge using the name of a reference attribute, you must also specify the type of the “from” vertex explicitly; see Specifying the Type of a Vertex.
An error is reported if you omit the type of the “from” vertex, or if you include a vertex type that does not define the specified reference-attribute name:
Error: Edge class or edge attribute 'genres' not found.
Error: Edge class or edge attribute 'movieGenre' not found.
Edges Implemented as Edge Objects
If the edge of interest is implemented as an edge object, you specify the name of the edge class by including [:edgeClassName ] within the arrow that represents the edge. You do not need to specify the type of the preceding vertex.
Assigning a Variable to an Edge
As is the case for vertices, you can assign variables to edges if you want to enable other clauses in the query to refer to them:
You can combine a variable with a type identifier:
Note:Variables can be omitted from edges that you don’t plan to return explicitly. Every pattern must include at least one variable, which may represent either an edge or a vertex.
Specifying a Number of Consecutive Edges
You can specify a pattern that matches a number of consecutive edges without having to explicitly represent the vertices between them. In effect, a single arrow is interpreted as a chain of N edges when you specify *N in a pair of square brackets within the arrow. For example, the following pattern describes a pair of vertices (an Occupation vertex and a Genre vertex) that are connected by 4 consecutive edges:
This pattern is equivalent to the following (more cumbersome) pattern, which explicitly represents all four edges along with their three intermediate vertices:
(o:Occupation {name: 'farmer'})-->()-->()-->()-->(g:Genre {name: 'Comedy'}) 
If you don’t know the exact number of edges, you can specify a range—for example:
(o:Occupation {name: 'farmer'})-[*2..4]->(g:Genre {name: 'Comedy'}) 
Specifying a series of consecutive edges is especially useful when the distance between two endpoint vertices is significant. In a social-relationships graph, for example, we might be interested in pairs of Person vertices that represent friends or friends of friends, but not more distant acquaintances.   
Composing Path Patterns
So far, we’ve looked at isolated examples of patterns describing edges and vertices. In this section, we’ll use these building blocks to compose patterns that match longer and more interesting paths.
A Simple Path Pattern
The pattern in a graph query can represent every vertex and every edge explicitly in the paths to be matched. There is no particular limit to the number of vertices and edges you can describe in a pattern. Pattern length is driven by the paths you want to match.
Consider the following graph query, which returns the crime movies that are rated highly by lawyers:
MATCH (o:Occupation {name: 'lawyer'})-->(u:User)-->(r:Rating {score:5})-->(m:Movie)-->(g:Genre {name:'Crime'}) RETURN m;
Notice that the pattern in this graph query explicitly represents all five vertices and the edges between them.
The following diagram shows two of the matched paths in our sample data set:
Note:Each path matched by this graph query originates with an Occupation vertex and end with a Genre vertex. Paths that flow in the opposite direction (from Genre vertices to Occupation vertices) are not matched, even though they contain the same vertices.
Alternative Path Descriptions
The DO language provides flexibility in the way you specify paths in a graph query. You can choose the way that you find most readable.
Consider the graph query shown in A Simple Path Pattern above. Notice that the Occupation vertex in this example mainly serves to qualify User vertices. Furthermore, it doesn’t seem likely that we’ll want to use the variable (o) to refer to the matched Occupation vertices elsewhere in the query. So, we can trim the size of our pattern a little by treating the Occupation object’s name as a value that is reachable from a User object (see Reachable Attributes and Elements). That is, we can rewrite the pattern for matching User vertices so that it uses the dot operator in a predicate expression to test for the qualifying occupation name:
MATCH (u:User { == 'lawyer'})-->(r:Rating {score:5})-->(m:Movie)-->(g:Genre {name:'Crime'}) RETURN m;
Similar reasoning can apply to the Genre vertex, which mainly serves to qualify Movie vertices in this example. Because a Movie object holds a collection of Genre objects, we use the ANY operator in a predicate expression to test whether any of a Movie’s genres have the qualifying name ('Crime'):
MATCH (u:User { == 'lawyer'})-->(r:Rating {score:5})-->(m:Movie {ANY(genres, name =='Crime')}) RETURN m;
Note:A path pattern must explicitly include a vertex with an assigned variable if you’ll need the variable to refer to that vertex elsewhere in the query.
These examples show that when you use inline qualification to test a value that is reachable through an attribute, you need to know whether the attribute holds a single reference or a collection of references. This level of detail is needed, for example, to know when to use the dot operator or the ANY operator.
In contrast, when you represent a relationship as an explicit edge and vertex, you do not need to know whether the edge traverses a to-one relationship or is part of a to-many relationship. You use the same path-pattern notation either way.
Filtering Vertices With an Optional WHERE Clause
So far we have seen examples of inline qualification in a graph query, where conditions for qualifying vertices are specified directly within the path pattern in the MATCH clause. As an alternative, you can add a WHERE clause for specifying such conditions.
A WHERE clause in a graph query works much the same way as a WHERE clause in a by-value query (see Filtering the Values to be Returned). In both cases, the WHERE keyword introduces a predicate expression for qualifying objects. In a graph query, the predicate expression uses variables from the path pattern to identify the vertices to be qualified. A path is matched only if all of its vertices meet the specified conditions.
In the following example, the WHERE clause allows paths to be matched only if they contain User vertices that represent lawyers, Rating vertices that represent 5-star reviews, and Movie vertices that represent crime movies:
The WHERE clause in this graph query uses variables u, r, and m to refer to User, Rating, and Movie vertices, respectively. All of these variables are assigned in the path pattern.
Using a WHERE clause can make larger patterns more readable by consolidating all conditions in a single place. If you like, you can mix techniques, using inline qualification for some vertices and a WHERE clause for others.
More interestingly, a WHERE clause makes it possible to qualify paths by comparing values obtained from different vertices. The following rather artificial example illustrates this by matching the paths that connect a User vertex to a Movie vertex only if the user-identifier of the User is equal to the catalog number of the Movie.
MATCH (u:User)-->(r:Rating)-->(m:Movie) WHERE u.userId == m.catalogNum RETURN m;
Including Vertices of the Same Type
Sometimes a graph query needs to find paths that pass through multiple vertices of the same type. When this is the case, you must assign different variables to these vertices.
For example, the following graph query finds users who gave a 3-star rating to the movie Fargo, and then looks for any other movies that those same users gave 3 stars to.
MATCH (m:Movie {title == 'Fargo'}) -->(r:Rating {score:3}) --> (u:Users) -->(rr:Rating {score:3})-->(mm:Movie) RETURN mm;
The pattern in this query has two Movie vertices and two Rating vertices, each with a different variable assigned to it.
The following diagram shows one of the matched paths in our sample data set:
Even though the path contains multiple vertices of the same type, it does not traverse through any particular vertex more than once. Consequently, the Movie object representing Fargo is not among the returned results. (As we will see in Paths, Trails, and Walks, a trail can pass through the same vertex as long as it does not traverse the same edge more than once.)
Returning Values From a Graph Query
You can use the RETURN clause in graph query to return values extracted from each matched path, or to return the matched paths themselves. You specify what you want by including variables from the path pattern. Consequently, the path pattern must introduce at least one variable. For now, we’ll look at examples that return vertices. (We’ll see examples of returned edges and paths later, in Navigating a Graph.)
Returning Whole Vertices
You can return a particular vertex from each matched path by specifying the variable that was assigned to that vertex in the pattern. Each returned value is a projection from a matched vertex. If the pattern specified a type for the vertex, the projection includes all of the vertex’s attribute values.
For example, the RETURN clause in the following graph query specifies the variable m that was assigned to Movie vertices in the path pattern. Consequently, this query returns the Movie vertex from each of the matched paths.
Here is an excerpt from the DO runner’s output for this sample query. (For brevity, this excerpt shows the projections of the referenced vertex from just two of the matched paths. Furthermore, the lists held by the genres and ratings attributes have been omitted.)
      __identifier__: 4-29-1-95,
      __identifier__: 6-21-1-296,
      title:'Lethal Weapon',
Notice that projections for a particular vertex will appear as many times as there are matched paths that contain that vertex. If only one user rated the movie Fargo with 5 stars, the query results will contain only one projection for the corresponding Movie vertex. In the actual data set, 34 lawyers gave that movie the highest rating, so the actual query results will contain 34 projections showing the values for Fargo. (You can consolidate duplicate results using a GROUP BY clause, as shown in Example: Predicting Movie Preferences.)  
Note:Duplicate projections indicate that different paths pass through the same vertex. A given vertex does not appear twice in the same path.  
Returning Values From a Vertex
You can request projections that include particular attribute values from a vertex in the matched paths. You do this by using the dot operator with the variable that references the vertex. For example, the following graph query returns only the values of the title attribute from the Movie vertices within the matched paths:
MATCH (o:Occupation {name: 'lawyer'}) -->(u:User)-->(r:Rating {score:5})-->(m:Movie)-->(g:Genre {name:'Crime'}) RETURN m.title;
Here is an excerpt from the query’s output showing the requested projection for just one of the matched paths:
You can obtain indirect values that are reachable from a referenced vertex by combining the variable with the dot and subscript operators (see Reachable Attributes and Elements). For example, the following graph query returns the names of all the genres associated with the Movie vertex in each matched path:
MATCH (o:Occupation {name: 'lawyer'}) -->(u:User)-->(r:Rating {score:5})-->(m:Movie)-->(g:Genre {name:'Crime'}) RETURN;
You can use expressions to return transformed values. For example, the following graph query returns the titles of Movie vertices as strings in which the characters are uppercased:
MATCH (o:Occupation {name: 'lawyer'}) -->(u:User)-->(r:Rating {score:5})-->(m:Movie)-->(g:Genre {name:'Crime'}) RETURN UPPER(m.title);
Returning Values From Multiple Vertices
You can obtain values from multiple vertices within each matched path by including multiple variables in the RETURN expression. For example, the following graph query returns both the movie titles and the zip codes of the lawyers who like crime movies:
MATCH (o:Occupation {name: 'lawyer'}) -->(u:User)-->(r:Rating {score:5})-->(m:Movie)-->(g:Genre {name:'Crime'}) RETURN m.title, u.zipCode;
Even though the values are extracted from different objects, they are included in the same projection:
     u.zipCode: '98443'
Of course, you can use obtain multiple values from the same vertex by using the same variable multiple times in the RETURN expression.
Returning Values Under an Alias
As with by-value queries, a graph query allows you to specify an alias for the names of returned values. For example, the following graph query uses the requested aliases as the names for the values in the returned projections:
MATCH (o:Occupation {name: 'lawyer'}) -->(u:User)-->(r:Rating {score:5})-->(m:Movie)-->(g:Genre {name:'Crime'}) RETURN m.title AS Title, u.zipCode AS 'Zip Code';
Here is an example of a returned projection:
     Zip Code: '98443'
Returning All Variables
The special expression RETURN * is equivalent to listing every variable in the RETURN clause. Consequently:
MATCH (o:Occupation {name: 'lawyer'}) -->(u:User)-->(r:Rating {score:5})-->(m:Movie)-->(g:Genre {name:'Crime'}) RETURN *;
is equivalent to:
MATCH (o:Occupation {name: 'lawyer'}) -->(u:User)-->(r:Rating {score:5})-->(m:Movie)-->(g:Genre {name:'Crime'}) RETURN o, u, r, m, g;
Example: Predicting Movie Preferences
The Movie Ratings data set can used to predict preferences among a group users. Say, for example, that a user has given Toy Story a five-star rating. What other movies is that user likely to enjoy? We can answer that question by looking at the movie preferences of the users within our data set.
Let’s start with a query similar to the one shown in Including Vertices of the Same Type. This time, we’ll construct a path that traverses from Toy Story, through its 5-star ratings, to the users that gave those ratings, and then we’ll extend that path to include the other 5-star ratings given by those users. When the DO query matches that path, it returns the title from each of the 56832 ratings:
MATCH (m:Movie {title == 'Toy Story'}) -->(r:Rating {score == 5}) --> (u:Users) -->(rr:Ratings {score == 5}) RETURN; 
Of course, there aren’t 56832 different 5-star movies in the data set. Rather, the query returns one projection for each rating represented by the variable rr. Consequently, any movie that was rated highly by more than one user is listed multiple times in the result set.
We can return more useful results by using a GROUP BY clause to group the 5-star ratings by movie title:
MATCH (m:Movie {title == 'Toy Story'}) -->(r:Rating {score == 5}) --> (u:Users) -->(rr:Ratings {score == 5}) GROUP BY RETURN; 
This narrows the result set down to 2589 projections, where each projection represents the group of 5-star ratings for a particular movie. But these results are still not especially helpful—for example, we can’t tell which movies received 5-star ratings from only a few users and which received such ratings from many users.
What we really want to find are the highly co-rated movies—that is, the movies that are most likely to get a 5-star rating from users who liked Toy Story. Consequently, for each 5-star movie, we now want to count how many ratings it got, and then list just the top 10:
MATCH (m:Movie {title == 'Toy Story'}) -->(r:Rating {score == 5}) --> (u:Users) -->(rr:Ratings {score == 5}) GROUP BY ORDER BY COUNT() DESC TAKE 10 RETURN, COUNT(); 
Star Wars: E{...}     401
Toy Story 2           385
Raiders of t{...}     373
Star Wars: E{...}     346
Shawshank Re{...}     327
American Beauty       323
Matrix, The           316
Schindler's List      311
Sixth Sense, The      303
Saving Priva{...}     300
10 object(s) returned.
Next Step
In this chapter, we saw how to view your data set as a graph and use graph queries to find values. Now you are ready to navigate a graph to find out all the ways two objects are connected in a data set. Read:
Navigating a Graph