This hands-on tutorial in Python demonstrates integration of Senzing and Neo4j to construct an Entity Resolved Knowledge Graph:
- Use three datasets describing businesses in Las Vegas: ~85K records, ~2% duplicates.
- Run entity resolution in Senzing to resolve duplicate business names and addresses.
- Parse results to construct a knowledge graph in Neo4j.
- Analyze and visualize the entity resolved knowledge graph.
We'll walk through example code based on Neo4j Desktop and the Graph Data Science (GDS) library to run Cypher queries on the graph, preparing data for downstream analysis and visualizations with Jupyter, Pandas, Seaborn, and PyVis.
The code is simple to download and easy to follow, and it is presented so you can try it with your own data. The whole tutorial takes about 35 minutes to run.
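To make the "parse results" step above more concrete, here is a minimal sketch of turning one line of entity resolution output into graph nodes and edges. It assumes the results arrive as JSON lines with a RESOLVED_ENTITY block (holding an ENTITY_ID and its RECORDS) and a RELATED_ENTITIES list. Those field names, the sample values, and the helper parse_export_line are assumptions for illustration, so check them against your own export before relying on them.

```python
import json

# Hypothetical sample of one exported line. The field names and values are
# assumptions for illustration and may differ in your own export.
SAMPLE_EXPORT_LINE = json.dumps({
    "RESOLVED_ENTITY": {
        "ENTITY_ID": 1,
        "RECORDS": [
            {"DATA_SOURCE": "DATASET_A", "RECORD_ID": "A1"},
            {"DATA_SOURCE": "DATASET_B", "RECORD_ID": "B7"},
        ],
    },
    "RELATED_ENTITIES": [
        {"ENTITY_ID": 2, "MATCH_KEY": "+ADDRESS"},
    ],
})


def parse_export_line(line):
    """Convert one JSON line into an entity node plus its outgoing edges."""
    dat = json.loads(line)
    ent = dat["RESOLVED_ENTITY"]
    entity_id = ent["ENTITY_ID"]

    # One node per resolved entity.
    node = {"label": "Entity", "uid": entity_id}
    edges = []

    # Each input record that was resolved into this entity becomes an edge
    # from the entity to the (data source, record id) pair.
    for rec in ent.get("RECORDS", []):
        edges.append({
            "type": "RESOLVES",
            "src": entity_id,
            "dst": (rec["DATA_SOURCE"], rec["RECORD_ID"]),
        })

    # Possible relations between entities become entity-to-entity edges.
    for rel in dat.get("RELATED_ENTITIES", []):
        edges.append({
            "type": "RELATED",
            "src": entity_id,
            "dst": rel["ENTITY_ID"],
            "match_key": rel.get("MATCH_KEY"),
        })

    return node, edges


if __name__ == "__main__":
    node, edges = parse_export_line(SAMPLE_EXPORT_LINE)
    print(node)
    for edge in edges:
        print(edge)
```

Keeping the parsing separate from the database writes makes it easier to inspect the nodes and edges (for example in Pandas) before loading anything into Neo4j.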
Why? For one example, the popularity of retrieval augmented generation (RAG) for making AI applications more robust has boosted recent interest in knowledge graphs (KGs). When the entities, relations, and properties in a KG draw on your domain-specific data to strengthen your AI app, compliance issues and audits rush to the foreground.
TL;DR: sense-making of the data coming from a connected world. During the transition from data integration to KG construction, you need to make sure the entities in your graph get resolved correctly. Otherwise, your AI app downstream will struggle with the kinds of details that make people get concerned, very concerned, very quickly: e.g., billing, deliveries, voter registration, crucial medical details, credit reporting, industrial safety, security, and so on.
Highly recommended:
- "Entity Resolved Knowledge Graphs"
- "Analytics on Entity Resolved Knowledge Graphs", Mel Richey (2023)
In this tutorial we'll work in two environments. The configuration and coding are at a level that should be comfortable for most people working in data science. You should be familiar with how to:
- clone a public repo from GitHub
- launch a server in the cloud
- use Linux command lines
- write some code in Python
Total estimated project time: 35 minutes.
Cloud computing budget: running Senzing in this tutorial cost a total of $0.04 USD.
Clone this repo, change into the ERKG directory, and set up your local environment:
git clone https://github.com/DerwenAI/ERKG.git
cd ERKG
python3.11 -m venv venv
source venv/bin/activate
python3 -m pip install -U pip wheel setuptools
python3 -m pip install -r requirements.txt
We're using Python 3.11 here, although this code should run with most of the recent Python 3.x versions.
First, launch Jupyter:
./venv/bin/jupyter lab
Then based on the tutorial, follow the steps shown in these notebooks:
You can view the results, an interactive visualization of the entity resolved knowledge graph, by loading examples/big_vegas.2.html in a web browser.
The full HTML+JavaScript is large and may take several minutes to load.
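If you need to clear the database and start over, run this in Neo4j Desktop. In Neo4j Browser, prefix the query with :auto, since CALL { ... } IN TRANSACTIONS is only allowed in an implicit (auto-commit) transaction: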
MATCH (n)
CALL {
  WITH n
  DETACH DELETE n
} IN TRANSACTIONS
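If you would rather reset the graph from Python, for example at the top of a notebook, the same query can be sent through a driver session. The sketch below is an assumption-laden illustration, not code from this repo: clear_database is a hypothetical helper, and it expects any session-like object whose run(query) returns a result with a consume() method, which is how sessions in the official neo4j Python driver behave. It uses session.run because that executes as an auto-commit transaction, which CALL { ... } IN TRANSACTIONS requires.

```python
def clear_database(session, batch_size=10000):
    """Delete all nodes and relationships in batches.

    `session` is assumed to behave like a neo4j driver session: `run(query)`
    executes in an auto-commit transaction and returns a result object
    with a `consume()` method.
    """
    batch_size = int(batch_size)
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")

    # Same query as above, with an explicit batch size per inner transaction.
    query = (
        "MATCH (n) "
        "CALL { WITH n DETACH DELETE n } "
        f"IN TRANSACTIONS OF {batch_size} ROWS"
    )

    # Exhaust the result so the deletes have completed before returning.
    session.run(query).consume()
    return query


# Example usage, assuming the `neo4j` package is installed. The URI and
# credentials are placeholders, not values from this tutorial:
#
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("bolt://localhost:7687",
#                                 auth=("neo4j", "your-password"))
#   with driver.session() as session:
#       clear_database(session)
#   driver.close()
```

Batching keeps memory use bounded on a graph of this size, since a single DETACH DELETE over every node would hold the whole deletion in one transaction.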
Many thanks to: @akollegger, @brianmacy