Introduction to the Titan graph database

This article is the first in a series. It introduces the Titan graph database and shows how to access it via the Gremlin console. It also introduces a basic schema for the ESecLog domain that will be used in future articles.

Introduction

In the last couple of years, graph databases have gained a lot of popularity (or hype). A graph database focuses on entities (vertices) and their relations (edges). As an example, Wikipedia can easily be represented as a graph. In a simple approach, one can model the articles as entities and the hyperlinks as relationships between the articles.
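To make the vertex/edge idea concrete, here is a minimal plain-Java sketch of the Wikipedia example. It does not use Titan or any graph library; the class `WikiGraphSketch` and its methods are hypothetical names chosen for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model: articles are vertices, hyperlinks are directed edges.
class WikiGraphSketch {
    // Maps each article title to the titles it links to.
    private final Map<String, Set<String>> links = new LinkedHashMap<String, Set<String>>();

    // Registers an article (vertex) if it is not known yet.
    void addArticle(String title) {
        if (!links.containsKey(title)) {
            links.put(title, new LinkedHashSet<String>());
        }
    }

    // Adds a directed hyperlink (edge); both articles are created if missing.
    void addLink(String from, String to) {
        addArticle(from);
        addArticle(to);
        links.get(from).add(to);
    }

    // Returns the outgoing links of an article, or an empty set if it is unknown.
    Set<String> linksFrom(String title) {
        Set<String> out = links.get(title);
        return out == null ? new LinkedHashSet<String>() : out;
    }

    int articleCount() {
        return links.size();
    }

    public static void main(String[] args) {
        WikiGraphSketch wiki = new WikiGraphSketch();
        wiki.addLink("Graph database", "Graph theory");
        wiki.addLink("Graph database", "Database");
        System.out.println(wiki.linksFrom("Graph database"));
    }
}
```

A graph database stores and indexes this kind of structure for you, so following a link does not require a join over a link table.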

In our experience, it depends on your domain whether a graph or a relational database is the better fit. A relational database is better suited for data that can be modelled in tables, while graphs are better for domains with many interconnections and patterns between entities.

With Domain-Driven Design (DDD), one often already has a domain graph model that can then be transferred into a graph database.

Example: A Wikimedia network graph showing the connections of programming languages.

Titan

In the ESecLog project we use the Titan graph database. Titan is an open source project developed by Aurelius under the Apache 2 license. It supports different storage backends (Apache Cassandra, Apache HBase or Oracle Berkeley DB) and integrates natively with the TinkerPop technology stack.

TinkerPop

The TinkerPop stack provides standard libraries and interfaces for different graph database vendors (e.g. Titan, Neo4j, OrientDB, …). The TinkerPop stack is best explained from bottom to top:

Blueprints: Collection of interfaces and implementations for graph databases, similar to JDBC.
Pipes: A dataflow processing framework
Gremlin: A graph traversal language
Frames: An object-graph mapper
Furnace: Standard graph algorithms for undirected and directed graphs
Rexster: A graph server that exposes the underlying graph via REST
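The JDBC comparison for Blueprints can be illustrated with a small plain-Java sketch: application code is written against an interface, and the backend behind it can be swapped. Note that `MiniGraph` and `ListBackedGraph` are hypothetical stand-ins invented for this example; they are not part of the Blueprints API.

```java
import java.util.ArrayList;
import java.util.List;

class LayeringSketch {
    // Hypothetical vendor-neutral interface; a stand-in, NOT the Blueprints API.
    interface MiniGraph {
        int addVertex(String name); // returns the id of the new vertex
        int vertexCount();
    }

    // One possible "vendor": stores vertex names in a list, ids are list positions.
    static final class ListBackedGraph implements MiniGraph {
        private final List<String> names = new ArrayList<String>();

        public int addVertex(String name) {
            names.add(name);
            return names.size() - 1;
        }

        public int vertexCount() {
            return names.size();
        }
    }

    // Application code depends only on the interface, so the backend can be exchanged.
    static int loadCustomers(MiniGraph graph, String... customers) {
        for (String customer : customers) {
            graph.addVertex(customer);
        }
        return graph.vertexCount();
    }

    public static void main(String[] args) {
        System.out.println(loadCustomers(new ListBackedGraph(), "Hobbes", "Calvin"));
    }
}
```

The layers above Blueprints (Pipes, Gremlin, Frames and so on) benefit from the same idea: they are written once against the common interfaces.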

Titan implements the Blueprints API and thus lets you use the complete TinkerPop technology stack. In this introductory post we will use Gremlin and start to define a simple database model that we will continue to use and extend in future articles.

Installation

Titan version 0.5.3 was released in December 2014. Make sure that you have Java installed.

Download and unzip the server into a folder of your choice:

[codesyntax lang="bash"]
<pre>$ wget http://s3.thinkaurelius.com/downloads/titan/titan-0.5.3-hadoop2.zip
$ unzip titan-0.5.3-hadoop2.zip
$ cd titan-0.5.3-hadoop2
$ sudo bin/titan.sh -c cassandra-es start
Forking Cassandra...
Running `nodetool statusthrift`.. OK (returned exit status 0 and printed string "running").
Forking Elasticsearch... 
Forking Titan + Rexster...
Connecting to Titan + Rexster (127.0.0.1:8184).... OK (connected to 127.0.0.1:8184).
Run rexster-console.sh to connect.</pre>
[/codesyntax]

Now Titan is running and you can connect either via Rexster or Gremlin. In this post we are going to use the Gremlin connection and open a TitanGraph. The Gremlin shell that we are going to use is a Groovy shell, so you can load Java or Groovy classes into it and write plain Groovy. A good reference for those familiar with SQL is available at http://sql2gremlin.com/.

[codesyntax lang="groovy"]
<pre>$ bin/gremlin.sh

\,,,/
(o o)
-----oOOo-(_)-oOOo-----

// Load a new graph
gremlin&gt; g = TitanFactory.open("cassandra:localhost")
==&gt;titangraph[cassandra:[localhost]]</pre>
[/codesyntax]

ESecLog

We are part of the joint research project ESecLog that is being funded by the Federal Ministry of Education and Research (BMBF).

Security is a top priority in air freight logistics, but screening procedures can be very time-consuming and costly. The freight’s security status is monitored throughout the entire transport chain and aggregated into a digital freight fingerprint.

Let's create a simplified air cargo delivery model: A shipment is ordered by a customer. The shipment consists of a varying number of pieces. The pieces are later placed onto a pallet, and pallets can even be bundled onto other pallets.

ESeclogSimple: A simple ESecLog domain model.

In this model, every vertex has a name, and every edge has a timestamp. We have four vertex labels: Customer, Shipment, Piece and Pallet. We have three edge labels: "orders", "consists-of" and "is-bundled-on".
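Before turning this into a Titan schema, it can help to see the model as plain data. The following Java sketch is independent of Titan. The edge directions (Customer to Shipment, Shipment to Piece, Piece or Pallet to Pallet) are our reading of the description above, and all names and timestamps are made-up sample values.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the simplified ESecLog model; not Titan code.
class ESecLogModelSketch {
    enum VertexLabel { CUSTOMER, SHIPMENT, PIECE, PALLET }
    enum EdgeLabel { ORDERS, CONSISTS_OF, IS_BUNDLED_ON }

    // Every vertex carries a label and a name.
    static final class Node {
        final VertexLabel label;
        final String name;

        Node(VertexLabel label, String name) {
            this.label = label;
            this.name = name;
        }
    }

    // Every edge carries a label and a timestamp.
    static final class Link {
        final Node from;
        final EdgeLabel label;
        final Node to;
        final int time;

        Link(Node from, EdgeLabel label, Node to, int time) {
            this.from = from;
            this.label = label;
            this.to = to;
            this.time = time;
        }
    }

    final List<Node> nodes = new ArrayList<Node>();
    final List<Link> links = new ArrayList<Link>();

    Node addNode(VertexLabel label, String name) {
        Node node = new Node(label, name);
        nodes.add(node);
        return node;
    }

    Link connect(Node from, EdgeLabel label, Node to, int time) {
        Link link = new Link(from, label, to, time);
        links.add(link);
        return link;
    }

    // Builds one example delivery; all names and timestamps are invented.
    static ESecLogModelSketch example() {
        ESecLogModelSketch m = new ESecLogModelSketch();
        Node customer = m.addNode(VertexLabel.CUSTOMER, "Hobbes");
        Node shipment = m.addNode(VertexLabel.SHIPMENT, "shipment-1");
        Node piece = m.addNode(VertexLabel.PIECE, "piece-1");
        Node inner = m.addNode(VertexLabel.PALLET, "pallet-1");
        Node outer = m.addNode(VertexLabel.PALLET, "pallet-2");
        m.connect(customer, EdgeLabel.ORDERS, shipment, 1);
        m.connect(shipment, EdgeLabel.CONSISTS_OF, piece, 2);
        m.connect(piece, EdgeLabel.IS_BUNDLED_ON, inner, 3);
        m.connect(inner, EdgeLabel.IS_BUNDLED_ON, outer, 4); // pallet bundled on another pallet
        return m;
    }

    public static void main(String[] args) {
        for (Link l : example().links) {
            System.out.println(l.from.name + " -" + l.label + "@" + l.time + "-> " + l.to.name);
        }
    }
}
```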

Create the Graph Schema

Let's create the schema and indices for the above model. In case you want to clear your entire Titan graph first, you can do the following:

gremlin> g.shutdown()
==>null
gremlin> TitanCleanup.clear(g)
==>null
gremlin> g = TitanFactory.open("cassandra:localhost")
==>titangraph[cassandra:[localhost]]

Be aware that the graph needs to be shut down before clearing it.

The KeyIndex declaration differs depending on whether you are constructing a TinkerGraph or a TitanGraph.

TinkerGraph

For the TinkerGraph, the ESecLog indices can be defined as follows:

gremlin> tg = new TinkerGraph()
==>tinkergraph[vertices:0 edges:0]
gremlin> tg.createKeyIndex('time', Edge.class)
==>null
gremlin> tg.createKeyIndex('name', Vertex.class)
==>null

// Adding a Vertex with name = Hobbes
gremlin> tg.addVertex(null).setProperty("name","Hobbes")
==>null
// See how many vertices/edges the graph has
gremlin> tg
==>tinkergraph[vertices:1 edges:0]

// Listing all vertices (uppercase 'V')
gremlin> tg.V
==>v[0]

// Query a specific vertex (lowercase 'v') by id or name
gremlin> tg.v(0)
==>v[0]
gremlin> tg.V("name","Hobbes")
==>v[0]
gremlin> tg.v(0).map
==>{name=Hobbes}
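Conceptually, a key index is a lookup table from a property value to the elements that carry it, so a query by name does not have to scan every vertex. The following plain-Java sketch illustrates that idea only; it is not how TinkerGraph implements its indices, and `KeyIndexSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrates the idea of a key index on "name"; not TinkerGraph's implementation.
class KeyIndexSketch {
    // Vertex id = position in this list; the value is the vertex's name property.
    private final List<String> vertexNames = new ArrayList<String>();
    // The index: name value -> ids of all vertices with that name.
    private final Map<String, List<Integer>> nameIndex = new HashMap<String, List<Integer>>();

    // Adds a vertex and keeps the index in sync.
    int addVertex(String name) {
        vertexNames.add(name);
        int id = vertexNames.size() - 1;
        List<Integer> ids = nameIndex.get(name);
        if (ids == null) {
            ids = new ArrayList<Integer>();
            nameIndex.put(name, ids);
        }
        ids.add(id);
        return id;
    }

    // Index lookup: one map access instead of a scan over all vertices.
    List<Integer> findByName(String name) {
        List<Integer> ids = nameIndex.get(name);
        return ids == null ? new ArrayList<Integer>() : ids;
    }

    public static void main(String[] args) {
        KeyIndexSketch graph = new KeyIndexSketch();
        graph.addVertex("Hobbes");
        System.out.println(graph.findByName("Hobbes")); // prints [0]
    }
}
```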

TitanGraph

In Titan 0.5.x, you use TitanManagement to configure the schema and define indices. Be aware that you need to define the indices before creating any graph objects. One trick we use is to query for a graph node that is only used for storing the schema version. If that node exists, we upgrade the schema if necessary; otherwise we initialize the schema and indices.

g = TitanFactory.open("cassandra:localhost");
TitanManagement management = g.getManagementSystem();

final PropertyKey name = management.makePropertyKey("name").dataType(String.class).make();
TitanGraphIndex namei = management.buildIndex("name",Vertex.class).addKey(name).unique().buildCompositeIndex();
management.setConsistency(namei, ConsistencyModifier.LOCK);

final PropertyKey time = management.makePropertyKey("time").dataType(Integer.class).make();
TitanGraphIndex timei = management.buildIndex("time",Edge.class).addKey(time).buildCompositeIndex();
management.setConsistency(timei, ConsistencyModifier.LOCK);

management.makeVertexLabel("Customer").make();
management.makeVertexLabel("Shipment").make();
management.makeVertexLabel("Piece").make();
management.makeVertexLabel("Pallet").make();
management.makeEdgeLabel("orders").make();
management.makeEdgeLabel("consists-of").make();
management.makeEdgeLabel("is-bundled-on").make();

// Persist the schema changes
management.commit();
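The schema-version trick described above boils down to a small decision: no version node means initialize, an older version means upgrade, otherwise do nothing. The following plain-Java sketch shows only that control flow. The map stands in for the dedicated schema node, and `CURRENT_VERSION`, `ensureSchema` and the returned strings are hypothetical names; in a real setup, the commented places would contain the TitanManagement calls.

```java
import java.util.HashMap;
import java.util.Map;

// Control-flow sketch of the schema-version check; the map stands in for a graph node.
class SchemaVersionSketch {
    static final int CURRENT_VERSION = 2; // hypothetical version of the schema in the code

    // Stand-in for the vertex that only stores the schema version.
    private final Map<String, Integer> schemaNode = new HashMap<String, Integer>();

    // Pass null to simulate a fresh graph without a schema node.
    SchemaVersionSketch(Integer existingVersion) {
        if (existingVersion != null) {
            schemaNode.put("version", existingVersion);
        }
    }

    String ensureSchema() {
        Integer stored = schemaNode.get("version");
        if (stored == null) {
            // Fresh graph: property keys, indices and labels would be created here.
            schemaNode.put("version", CURRENT_VERSION);
            return "initialized";
        }
        if (stored < CURRENT_VERSION) {
            // Older schema: migration steps would run here.
            schemaNode.put("version", CURRENT_VERSION);
            return "upgraded";
        }
        return "up-to-date";
    }

    Integer storedVersion() {
        return schemaNode.get("version");
    }

    public static void main(String[] args) {
        SchemaVersionSketch fresh = new SchemaVersionSketch(null);
        System.out.println(fresh.ensureSchema());
        System.out.println(fresh.ensureSchema());
    }
}
```

Running the check on every startup keeps the initialization idempotent: the schema is created exactly once, before any data is written.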

// Adding a Customer Vertex with name = Hobbes
gremlin> g.addVertexWithLabel("Customer").setProperty("name","Hobbes")
==>null

// See how many vertices/edges the graph has
gremlin> g.V.count()
==>1

// Listing all vertices (uppercase 'V')
gremlin> g.V
==>v[256]


// Query a specific vertex (lowercase 'v') by id or name
gremlin> g.v(256)
==>v[256]
gremlin> g.v(256).map
==>{name=Hobbes}
gremlin> g.v(256).label
==>Customer
gremlin> g.V("name","Hobbes").map
==>{name=Hobbes} 

// Query the vertices for all vertices with label "Customer"
gremlin> g.query().has("label","Customer").vertices()
==>v[256]

In future articles we will see how to use Gremlin to query or transform the graph, as well as how to traverse the graph recursively with time frames.