Recently, I have been exploring the possibility of using a graph database at work. The best way to know if this technology works is to try it out. In this post, I will set up a Neo4j server using a Docker image and play around with it.
What’s Neo4j
Neo4j is a highly scalable, robust native graph database.
Neo4j uses Cypher, its query language, to query data in the database. The community has also created an extensive range of drivers for other programming languages.
There are other graph databases available. You can view the list here.
As I’m very new to graph databases, I have no idea which technical aspects I should be concerned about. I have chosen Neo4j for the time being as it is open-source, regularly updated and also available under a commercial license.
Set Up Neo4j with a Docker Image
If you have not installed Docker for Mac (or for another OS), you can download the installer from this page.
Pull the latest Docker image from Docker Hub:
docker pull neo4j
Then, start a Neo4j container:
docker run --publish=7474:7474 --publish=7687:7687 --env=NEO4J_AUTH=none --volume=$HOME/neo4j/data:/data --volume=$HOME/neo4j/import:/var/lib/neo4j/import neo4j
Now, we can access Neo4j through the web browser at http://localhost:7474.
We have bound the volume ~/neo4j/data so that the database persists outside of the container. We have also bound an import volume so that we can import files at a later stage.
Since we are only testing the technology, we disable authentication on the server by passing --env=NEO4J_AUTH=none to docker run. You should never do this if you are going to run this container in production.
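If you would rather confirm from a script that the container is answering, a minimal sketch using only the Python standard library could look like the following. The helper name is_neo4j_up is my own, and it only checks that something responds on the HTTP port, not that the database itself is healthy.

```python
import urllib.error
import urllib.request


def is_neo4j_up(url="http://localhost:7474", timeout=3):
    """Return True if an HTTP server answers at the given URL (hypothetical helper)."""
    # Ignore any proxy settings, since the container runs on this machine
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status
        return True
    except OSError:
        # Connection refused, timeout, name resolution failure, ...
        return False


if __name__ == "__main__":
    print("Neo4j reachable:", is_neo4j_up())
```

If this prints False right after docker run, give the container a few seconds to finish starting and try again.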
Using Neo4j
Once you access http://localhost:7474, you will be greeted with the dashboard. The dashboard shows a stream of results for the queries that you run in the editor window.
Website Recommendation Engine
In this post, I will build a website recommendation engine. This engine will recommend new websites to a user based on what they have read in the past.
Preparing Data
Before I move on to Neo4j, I need to create some data. I use the Feedly API to search for websites and populate the data. You can find the snippet here.
This is what the extracted data looks like:
Fortunately, we can load CSV data into Neo4j using a Cypher clause called LOAD CSV. I will not go through the details. If you need more information, you can read the guide here.
Let’s copy the file into the import directory that we mounted for the Neo4j container:
cp feedly.csv ~/neo4j/import/
Let’s check that our file can be processed by Neo4j. In the editor window:
// check first few raw lines
LOAD CSV WITH HEADERS FROM "file:///feedly.csv" AS line
WITH line
RETURN line
LIMIT 5;
You should be able to get similar results:
Create Our Property Graph
Now it’s time to create some nodes!
In our recommendation engine, we have two labels: Website and Tag. Each Website is tagged with one or more Tags.
First, we create a uniqueness constraint for Website and for Tag:
// Add constraints for uniqueness (run as two separate statements)
CREATE CONSTRAINT ON (website:Website) ASSERT website.name IS UNIQUE;
CREATE CONSTRAINT ON (tag:Tag) ASSERT tag.name IS UNIQUE;
Then, we load the data into Neo4j and create the relationships:
LOAD CSV WITH HEADERS FROM "file:///feedly.csv" AS line
WITH line, SPLIT(line.tags, "|") AS tags
// For each tag in tags, create a node with name as the property.
// Then create the relationship between this website and the tag.
// As a tag can be repeated, we use MERGE to ensure only one node is created for each tag.
FOREACH (each_tag IN tags |
  MERGE (website:Website { name: line.website, subscriber: toInteger(line.subscribers) })
  MERGE (tag:Tag { name: each_tag })
  MERGE (website)-[:TAG]->(tag))
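To make it clearer what the query above does with each row, here is a rough plain-Python sketch of the same idea, with no database involved. The sample rows are made up for illustration; only the column names (website, subscribers, tags) and the "|" separator come from the query. A set stands in for MERGE, since adding the same tag twice still leaves a single entry.

```python
import csv
import io

# Hypothetical sample rows; only the column names match the real file
SAMPLE_CSV = """website,subscribers,tags
example-food-blog.com,25000,food|recipes
example-tech-news.com,120000,tech|apple|mac
example-mac-tips.com,8000,apple|mac
"""


def build_graph(csv_text):
    """Mimic the LOAD CSV query: one node per website, one per distinct tag."""
    websites = {}          # website name -> subscriber count
    tags = set()           # distinct tag names (the set plays the role of MERGE)
    relationships = set()  # (website, tag) pairs, i.e. the TAG relationships
    for line in csv.DictReader(io.StringIO(csv_text)):
        name = line["website"]
        websites[name] = int(line["subscribers"])
        for each_tag in line["tags"].split("|"):
            tags.add(each_tag)
            relationships.add((name, each_tag))
    return websites, tags, relationships


websites, tags, relationships = build_graph(SAMPLE_CSV)
print(len(websites), "websites,", len(tags), "tags,", len(relationships), "relationships")
```

Although "apple" and "mac" each appear on two rows, they end up as one tag each, which is the behaviour we rely on MERGE for in Cypher.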
You can check that you have created the relationship correctly using the following query:
MATCH (website:Website)-[:TAG]->(tag:Tag {name: "apple"})
return website, tag
This is what you will probably get:
Adding More Nodes & Relationships
Now that we have our data ready in Neo4j, we can start to create some users.
We have two users, Alice and Bob. Alice loves to eat and catch up on technology news. On the other hand, Bob likes cats and dogs. He also likes to eat and drink alcohol.
In the editor, we set our parameter.
// Use parameter to store the data
:param props: [{"name": "Alice" }, {"name": "Bob"}]
Then, we unwind the parameter and create the nodes.
// Create user based on the given parameters
UNWIND $props AS userMap
MERGE (user:User {name:userMap.name})
RETURN user
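The same parameterised query can also be sent from Python instead of the browser editor. The sketch below is only an illustration: create_users is a name I made up, and it expects a session object from the Neo4j Python driver, whose run() method accepts Cypher parameters as keyword arguments.

```python
def create_users(session, names):
    """MERGE one User node per name, passing the names as a query parameter."""
    props = [{"name": name} for name in names]
    query = (
        "UNWIND $props AS userMap "
        "MERGE (user:User {name: userMap.name}) "
        "RETURN user.name AS name"
    )
    # Keyword arguments to session.run() become Cypher parameters ($props here)
    result = session.run(query, props=props)
    return [record["name"] for record in result]


# Usage against a real server (needs the neo4j package and a running container):
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("bolt://localhost:7687")
#   with driver.session() as session:
#       print(create_users(session, ["Alice", "Bob"]))
```

Passing the names as a parameter, rather than building the query string by hand, keeps the query text constant and avoids quoting problems.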
Now, we want to create relationships between the users and the tags.
MATCH (tag:Tag) WHERE tag.name IN ["food", "apple", "mac", "tech"]
MATCH (user:User {name:"Alice"})
MERGE (user)-[:LOVE]->(tag)
RETURN tag, user;
MATCH (tagB:Tag) WHERE tagB.name IN ["food", "cats", "dogs", "alcohol", "mac", "apple"]
MATCH (userB:User {name:"Bob"})
MERGE (userB)-[:LOVE]->(tagB)
RETURN tagB, userB;
To check if you have executed the commands correctly, you can run the following:
MATCH (user:User)-[:LOVE]->(tag:Tag)
return user, tag
And this is what you will see:
Let’s say Alice has already subscribed to some of the websites related to food. We can create a new relationship between Alice and those websites.
// Subscribe Alice to some random websites
MATCH (user:User {name: "Alice"})
// Note that your IDs might be different from mine
MATCH (website:Website) WHERE ID(website) IN [504, 517, 541, 501, 522, 478, 503, 500, 634]
MERGE (user)-[:SUBSCRIBE]->(website)
RETURN user, website
What’s Next?
Now, let’s recommend some websites for Alice and Bob to get started. We can run the following command:
MATCH (user:User)-[:LOVE]->(tag:Tag)
MATCH (website:Website)-[:TAG]->(tag)
return user, tag, website
This is what we will get:
Basically, we query the database to extract all the websites that are tagged with the tags that Alice and Bob are interested in.
We can make the query even more precise. As Alice is a very picky reader, we can recommend only "food" websites that she has not subscribed to yet and that have more than 20,000 subscribers.
MATCH (user:User {name: "Alice"})-[:LOVE]->(tag:Tag {name:"food"})
MATCH (website:Website)-[:TAG]->(tag) WHERE NOT (user)-[:SUBSCRIBE]->(website) and website.subscriber > 20000
return user, tag, website
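To spell out the logic of this query, here is a small in-memory sketch in Python. It is not how Neo4j evaluates the query; it just applies the same three conditions (shared tag, not yet subscribed, enough subscribers) to plain dictionaries. All of the data and the function name recommend are hypothetical.

```python
def recommend(loves, website_tags, subscribers, subscribed, min_subscribers=20000):
    """Return websites sharing a tag with the user, minus ones already subscribed."""
    picks = []
    for website, tags in website_tags.items():
        if not (tags & loves):
            continue  # no shared tag, so no (website)-[:TAG]->(tag) match
        if website in subscribed:
            continue  # WHERE NOT (user)-[:SUBSCRIBE]->(website)
        if subscribers.get(website, 0) <= min_subscribers:
            continue  # website.subscriber > 20000 in the query
        picks.append(website)
    return sorted(picks)


# Hypothetical data, not taken from the Feedly export
website_tags = {
    "example-food-blog.com": {"food", "recipes"},
    "example-tiny-kitchen.com": {"food"},
    "example-cat-pics.com": {"cats"},
    "example-big-eats.com": {"food"},
}
subscribers = {
    "example-food-blog.com": 25000,
    "example-tiny-kitchen.com": 900,
    "example-cat-pics.com": 50000,
    "example-big-eats.com": 40000,
}

print(recommend({"food"}, website_tags, subscribers,
                subscribed={"example-big-eats.com"}))
```

Here only example-food-blog.com survives: the tiny kitchen site has too few subscribers, the cat site shares no tag, and the last one is already subscribed.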
This is what it looks like:
Neo4j, Python and REST API
As far as I can see, there are two ways to access data in Neo4j from an application. First, we can install the Python driver to access the graph database. Second, we can access the data directly through the REST API.
You can read more about the Python driver here.
Basically, you use the Neo4j Python driver to connect to the database. Then you can run queries through the driver.
#!/usr/bin/env python3
# Older 1.x releases of the driver exposed these names under neo4j.v1
from neo4j import GraphDatabase, basic_auth
password = "neo4j"  # read this from an environment variable in production
driver = GraphDatabase.driver("bolt://localhost", auth=basic_auth("neo4j", password))
# open a session against the database
db = driver.session()
# recommend popular "food" websites that Alice has not subscribed to yet
results = db.run("MATCH (user:User {name: 'Alice'})-[:LOVE]->(tag:Tag {name:'food'}) "
                 "MATCH (website:Website)-[:TAG]->(tag) WHERE NOT (user)-[:SUBSCRIBE]->(website) and website.subscriber > 20000 "
                 "return user.name as user, tag.name as tag, website.name as link, website.subscriber as count")
for record in results:
    print(record["user"], record["link"], record["count"])
# release the session and the connection pool
db.close()
driver.close()
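For the second option, Neo4j also accepts Cypher over HTTP as a JSON document. The sketch below only builds such a request with the standard library and does not send it. The helper name build_cypher_request is my own, and the default path is an assumption tied to the server version: /db/data/transaction/commit is the transactional endpoint in Neo4j 3.x, while 4.x and later use /db/<database name>/tx/commit.

```python
import json
import urllib.request


def build_cypher_request(statement, parameters=None,
                         base_url="http://localhost:7474",
                         path="/db/data/transaction/commit"):
    """Build (but do not send) a POST request for the transactional HTTP endpoint.

    The default path assumes Neo4j 3.x; adjust it for newer versions.
    """
    payload = {"statements": [{"statement": statement,
                               "parameters": parameters or {}}]}
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )


request = build_cypher_request(
    "MATCH (user:User {name: $name})-[:LOVE]->(tag:Tag) RETURN tag.name AS tag",
    {"name": "Alice"},
)
print(request.full_url)

# To actually send it against a running server:
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response)["results"])
```

The driver is usually the more convenient choice from Python, but the HTTP route is handy from languages or tools that have no driver available.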
I have also created an API for my app using Flask. You can see the code here. Here’s a preview of the API:
Conclusion
I find that a graph database is really useful for modeling connected data. The hardest part is modeling the graph based on the given data.
Neo4j can be set up easily. Cypher is fairly straightforward and easy to use. The challenge is to formulate the queries so that they return the expected results.