Clustering and Prediction in Neo4j

Jeet Singh
May 1, 2023

Introduction

Neo4j is a popular database management system for storing and querying highly connected data as graphs. It uses the property graph model to describe and analyze complicated connections between data elements. Unlike standard relational databases, Neo4j stores data as nodes and relationships, which makes it easier to query and analyze patterns in the data. It ships with Cypher, a query language for extracting insights from graph data, and supports a number of plugins for graph visualization and data science tasks.

This post summarizes the results of running various queries on a dataset of Twitch streamers in Neo4j. Readers interested only in clustering can skip ahead to the “Clustering Using GDS” section.

Setup

The dataset used here has over 150,000 nodes and 6.5 million edges. Loading it with Cypher’s LOAD CSV clause turned out to be very slow. After some trial and error, we found that the import command of Neo4j’s neo4j-admin terminal tool could set up the network in under a minute.

That command loads the full network of 168,114 nodes and 6,797,557 edges in around 20 seconds while keeping all node properties. With LOAD CSV, by contrast, loading the edges alone would take more than 40 minutes.

It’s worth noting that the header files passed to the command, nodes_header.csv and edges_header.csv, specify each column’s data type in addition to its name. Because values are imported as strings by default, this is critical for a correct import.

Keep in mind that this command loads each attribute as a node property rather than as a label. The Clustering Using GDS section shows how to convert a property into a label.
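The typed header files described above can be sketched as follows. This is a minimal illustration, not the exact files used for this post: only numeric_id and views are named in the text, so the column list is an assumption, and the real dataset has more columns. Each header field follows the import tool's name:type convention, with :START_ID and :END_ID marking the two ends of an edge.

```python
import csv
from pathlib import Path

# Assumed columns: numeric_id and views are the only properties named in
# this post; a real header would list every column of the data file.
NODE_COLUMNS = [("numeric_id", "ID"), ("views", "int")]
EDGE_COLUMNS = [("", "START_ID"), ("", "END_ID")]


def header_row(columns):
    """Format (name, type) pairs as import header fields such as 'views:int'."""
    return [f"{name}:{type_}" for name, type_ in columns]


def write_headers(directory):
    """Write nodes_header.csv and edges_header.csv and return their paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = {}
    for filename, columns in (
        ("nodes_header.csv", NODE_COLUMNS),
        ("edges_header.csv", EDGE_COLUMNS),
    ):
        path = directory / filename
        with path.open("w", newline="") as handle:
            csv.writer(handle).writerow(header_row(columns))
        paths[filename] = path
    return paths
```

Without the :int suffix, views would be stored as a string, and sorting by it would compare text rather than numbers.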

Cypher Queries

Cypher, Neo4j’s query language, is widely used in data science because it can express complicated relationships between data and extract useful insights from large datasets. It is designed to be both approachable and flexible, so data scientists can write complex queries that others can still read easily.

The simplest Cypher query is:

match (n) return n

This query returns every node in the network. Note, however, that the Neo4j Browser caps the number of nodes it renders for a single query. The cap can be changed in the Browser settings, but on a graph this large only a subset of the nodes will be visible on screen at any given moment.

The 10 nodes with the most outgoing connections can be retrieved with:

match (s)-[]->(t) return s.numeric_id, count(t) as connections order by connections desc limit 10
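To make explicit what the query above computes, here is a small pure-Python sketch of the same ranking on a toy edge list. The edge list and the function name are made up for illustration; nothing here talks to Neo4j.

```python
from collections import Counter


def top_by_out_degree(edges, k=10):
    """Return the k (source, out_degree) pairs with the most outgoing edges."""
    # Count how many times each node appears as the source of an edge.
    out_degree = Counter(source for source, _target in edges)
    return out_degree.most_common(k)


# Toy directed edges (source, target); hypothetical data.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 1), (4, 1)]
print(top_by_out_degree(edges, k=2))  # [(1, 3), (2, 2)]
```

As with the Cypher pattern, a node with no outgoing edges (node 3 here) never appears in the result, and incoming edges are not counted.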

To rank nodes by number of views rather than connections, the query is:

match (n) return n.numeric_id, n.views as views order by views desc limit 10

Clustering Using GDS

Clustering is the process of grouping similar nodes according to some criterion. Neo4j provides the Graph Data Science (GDS) plugin, which includes a number of clustering techniques under the heading “Community Detection.” To build clusters in this dataset, we used GDS’s Louvain community detection algorithm, which produced 19 distinct clusters.
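Louvain works by repeatedly moving nodes between communities to increase a score called modularity, which rewards partitions with many edges inside communities and few between them. The sketch below is not the GDS implementation and not a full Louvain. It only computes modularity for a given partition of a toy undirected graph, to show what the algorithm is optimizing. The graph and community names are made up.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to its community id.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # summed degree of the community's nodes
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    # Q = sum over communities of (internal / m) - (degree_sum / 2m)^2
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}

print(round(modularity(edges, split), 3))   # 0.357
print(round(modularity(edges, merged), 3))  # 0.0
```

Splitting the graph into its two triangles scores higher than lumping everything together, which is the kind of improvement Louvain searches for.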

To run Louvain, the network must first be projected into a named in-memory graph in the GDS graph catalog. This can be done with the following command:

CALL gds.graph.project.cypher(
  'twitch',
  'MATCH (n)
   RETURN
     id(n) AS id,
     n.views AS views',
  'MATCH (n)-[]->(m) RETURN id(n) AS source, id(m) AS target'
)
YIELD
  graphName, nodeCount AS nodes, relationshipCount AS rels
RETURN graphName, nodes, rels

This stores a projected graph named “twitch,” with views as a node property, in memory. The projection lasts until it is dropped or the database is restarted.

Louvain clustering can then be run with:

call gds.louvain.write('twitch', {writeProperty:'louvain'})

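Running this command executes the Louvain algorithm and writes each node’s community ID back to the database as a node property named “louvain.” The resulting labels are numeric strings, so they must be wrapped in backticks in later queries. To display the clusters individually, convert the property into a node label with the following query, which also removes the property afterwards.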

match (n)
call apoc.create.addLabels([id(n)], [toString(n.louvain)])
yield node
with node remove node.louvain return node

Thank you for reading!
