Over the past decade I focused mainly on relational databases. I had the chance to design, work with and optimize databases ranging from small (MB) installations to medium-sized (TB), highly available enterprise clusters. In the past few years I turned to NoSQL databases on a couple of projects, with Cassandra and MongoDB. NoSQL is really cool to move to, as long as you are not too used to having all the “fancy” logic that comes from the structure and the SQL engine itself, like searching, ordering or joining data. Nowadays, with memory for caching and computation power on hand in the application layer, the trend is changing: from retrieving as much structured data as possible in a single database query to structuring the data in the application and storing it in a cache.
My adventure with Neo4j started almost 2 years ago, when we started to look for alternatives to a MyISAM enterprise database. The data was not changing too often, but there were really complex queries and some tables had 300+ columns. This was crying out for a redesign, or at least a rethink from the ground up. So we started to look into the possibilities, and Neo4j was the closest to our needs. It has really nice built-in functions for search (WHERE predicate functions), with string pattern matching and join-like queries. A couple of useful functions I would categorize as “weird”, based on my SQL and NoSQL background, like “head()”, which gives the first element of a list, and “coalesce()”, which returns the first non-NULL value in a list. The engine itself does a pretty good job and accessing data is quite fast, but there were quite a few queries to re-jig to get the best performance. I suggest creating at least a sample project before diving deep into Neo4j.
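For readers coming from SQL, the behaviour of those two functions can be mimicked in a few lines of Python. This is only a sketch of the semantics as I understand them, not how Neo4j implements them: Cypher's null is modelled as None, and the Python function names simply mirror the Cypher ones.

```python
from typing import Any, Optional, Sequence


def head(items: Sequence[Any]) -> Optional[Any]:
    """Mimic Cypher's head(): the first element of a list, or None if it is empty."""
    return items[0] if items else None


def coalesce(*values: Any) -> Optional[Any]:
    """Mimic Cypher's coalesce(): the first argument that is not None, else None."""
    for value in values:
        if value is not None:
            return value
    return None


if __name__ == "__main__":
    print(head(["first", "second", "third"]))
    print(coalesce(None, None, "fallback"))
```

Note that coalesce() only skips missing values, so a 0 or an empty string still counts as a valid result.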
I created a simple performance test a couple of months ago to see how performance changes between versions. The base project is available through my GitHub account here. In that simple project I download the Alexa list of the top 1 million sites and insert it using different approaches. I ran both batch and single-entry inserts, and I have to admit that the engine performs quite well under this load in both cases. Obviously it cannot compete with a clustered Cassandra environment, but in the end complex queries are what matter most for projects turning to graph databases.
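The difference between the two insert approaches can be sketched roughly as below. This is not the code from my test project, just a minimal illustration under a few assumptions: the `Site` label and the `rank`/`domain` properties are made up, the `$param` syntax is the one used by recent Neo4j versions, and `run` is a hypothetical callable standing in for whatever executes a query (for example a thin wrapper around a driver session).

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

# Illustrative Cypher; the Site label and its properties are assumptions.
SINGLE_QUERY = "CREATE (:Site {rank: $rank, domain: $domain})"
BATCH_QUERY = "UNWIND $rows AS row CREATE (:Site {rank: row.rank, domain: row.domain})"

Row = Dict[str, Any]
Runner = Callable[[str, Dict[str, Any]], None]  # hypothetical query executor


def batched(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield consecutive lists of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def insert_single(rows: Iterable[Row], run: Runner) -> int:
    """Send one query per row; returns the number of queries sent."""
    sent = 0
    for row in rows:
        run(SINGLE_QUERY, row)
        sent += 1
    return sent


def insert_batched(rows: Iterable[Row], run: Runner, size: int = 1000) -> int:
    """Send one UNWIND query per batch; returns the number of queries sent."""
    sent = 0
    for batch in batched(rows, size):
        run(BATCH_QUERY, {"rows": batch})
        sent += 1
    return sent


if __name__ == "__main__":
    sample = [{"rank": i, "domain": f"site{i}.example"} for i in range(1, 6)]
    log: List[str] = []
    print(insert_single(sample, lambda query, params: log.append(query)))
    print(insert_batched(sample, lambda query, params: log.append(query), size=2))
```

The batched variant sends far fewer queries for the same data, which is usually where the time is saved, although the best batch size depends on the setup.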
For mixed-datatype “reads” I created a project with Irish election results from 1918 to roughly today, which is a really mixed data set. This project was part of the graph theory module I did at the time. The project can be forked from GitHub, here. The data can be initialized with a Python crawler script (or a single import script, or a full download). I ran into a couple of problems with this project regarding import and export. The collected data produces a running database of 175+ MB, which is not a problem for the community version to crunch, but it causes some issues when trying to import after a “suggested” export. The problem arises because the exported data is wrapped in a single transaction, which runs longer than I can wait (45+ minutes) with no result. After playing around with exports and imports I found that breaking the export file into chunks is the best way to import.
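The chunking idea can be sketched like this. It is a simplified illustration, not the script I used: it assumes the export can be reduced to a list of independent statements (nothing in one chunk refers to identifiers created in another), and the `BEGIN`/`COMMIT` markers are placeholders that should be replaced with whatever the import tool in use expects.

```python
from typing import Iterable, Iterator, List


def chunk_statements(
    statements: Iterable[str],
    chunk_size: int,
    begin: str = "BEGIN",    # placeholder transaction marker (assumption)
    commit: str = "COMMIT",  # placeholder transaction marker (assumption)
) -> Iterator[str]:
    """Group statements into chunks, each wrapped in its own transaction."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[str] = []
    for statement in statements:
        statement = statement.strip()
        if not statement:
            continue  # skip blank lines from the export
        chunk.append(statement)
        if len(chunk) == chunk_size:
            yield "\n".join([begin, *chunk, commit]) + "\n"
            chunk = []
    if chunk:
        # Flush the last, possibly shorter, chunk.
        yield "\n".join([begin, *chunk, commit]) + "\n"


if __name__ == "__main__":
    export = ["CREATE (:A);", "CREATE (:B);", "", "CREATE (:C);"]
    for number, text in enumerate(chunk_statements(export, chunk_size=2), start=1):
        print(f"-- chunk {number}")
        print(text, end="")
```

Each chunk can then be written to its own file and imported separately, so a failure or a slow chunk does not hold up the whole import.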
The project itself contains some interesting queries that I would categorize as “fancy”, and which are quite slow in relational databases. These include path and edge searches, which are only possible there with complex nested selects and/or joins.
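To give a feeling for what a path search does, here is a minimal breadth-first search in plain Python over an adjacency list. It is only an illustration of the idea with made-up node names, not a query from the project and not how Neo4j works internally. In Cypher the same kind of search is a single pattern, something like `MATCH p = shortestPath((a)-[*]-(b)) RETURN p`, whereas in SQL each extra hop typically means another self-join.

```python
from collections import deque
from typing import Dict, Hashable, List, Optional


def shortest_path(
    graph: Dict[Hashable, List[Hashable]], start: Hashable, goal: Hashable
) -> Optional[List[Hashable]]:
    """Return one shortest path from start to goal (by hop count), or None."""
    if start == goal:
        return [start]
    previous: Dict[Hashable, Optional[Hashable]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour in previous:
                continue  # already reached by a path that is at least as short
            previous[neighbour] = node
            if neighbour == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


if __name__ == "__main__":
    # Hypothetical directed graph, purely for illustration.
    demo = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
    print(shortest_path(demo, "A", "E"))
```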
Graph databases offer a nice set of performance advantages over relational and NoSQL databases in certain cases, but I would suggest looking into them deeply before diving in with a real project, as they cannot solve all performance problems and, at the same time, they can create serious headaches.
The Neo4j site contains really nice, easy-to-follow documentation with examples to run. Packtpub.com has a couple of nice titles on the topic that are worth reading, like Neo4j Graph Data Modeling and Neo4j High Performance.
Neo4j is widely used, so it has drivers for the most commonly used languages, from its native Java to PHP. By default the storage can also be accessed through REST API requests, which gives a simple integration option for any project. After some testing it seems that using languages other than its native Java is always somewhat slower, but still in the “OK” range.
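As a rough idea of what the REST route looks like from a language without a dedicated driver, the sketch below builds (but does not send) an HTTP request using only the Python standard library. The endpoint URL is an assumption: the path shown is the transactional endpoint I remember from older versions, and it differs between Neo4j releases, so check the documentation for the version in use. Authentication is left out for brevity.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

# Assumed endpoint; the exact path depends on the Neo4j version.
DEFAULT_URL = "http://localhost:7474/db/data/transaction/commit"


def build_cypher_request(
    statement: str,
    parameters: Optional[Dict[str, Any]] = None,
    url: str = DEFAULT_URL,
) -> urllib.request.Request:
    """Build a POST request that carries one Cypher statement as JSON."""
    payload = {
        "statements": [
            {"statement": statement, "parameters": parameters or {}}
        ]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_cypher_request(
        "MATCH (n) RETURN count(n) AS nodes"
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually send it against a running server:
    # with urllib.request.urlopen(request) as response:
    #     print(json.load(response))
```

Every call made this way pays for the HTTP round trip and the JSON encoding, which is one plausible reason for the speed difference I saw.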
I suggest giving it a go and seeing the “Graph magic” for yourself… 🙂