PyTorch Geometric is a Python library for dealing with graph algorithms.
Use the poetry package manager inside a virtual environment to install the project dependencies. Install pyenv beforehand.
python3 -m venv .venv
source .venv/bin/activate
pip3 install poetry --no-cache
poetry install
Now create and start the desired Neo4j container.
CONTAINER=$(docker run -d \
-p 7474:7474 -p 7687:7687 \
-v $(pwd)/data/neo4j_db/data:/data \
-v $(pwd)/data/neo4j_db/logs:/logs \
-v $(pwd)/data/neo4j_db/import:/var/lib/neo4j/import \
--name test-neo4j-stx-books-recommender44 \
-e NEO4J_apoc_export_file_enabled=true \
-e NEO4J_apoc_import_file_enabled=true \
-e NEO4J_apoc_import_file_use__neo4j__config=true \
-e NEO4J_AUTH=neo4j/stx_books_pass \
-e NEO4JLABS_PLUGINS='["apoc", "graph-data-science"]' \
-e NEO4J_ACCEPT_LICENSE_AGREEMENT=yes \
neo4j:4.4-enterprise
)
Note: we use version 4.4 here because the APOC release for Neo4j 5.x was not yet stable (as of 30.01). This might change in the future.
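Before loading any data, it can help to confirm that the container actually accepts connections. The sketch below only checks that the Bolt port published by the docker command above (7687) is reachable on localhost; the helper names bolt_port_is_open and wait_for_bolt are hypothetical and not part of this repository, and an open port does not guarantee that the database has finished starting.

```python
import socket
import time


def bolt_port_is_open(host="localhost", port=7687, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable or timed out: treat all of these as "not up yet".
        return False


def wait_for_bolt(host="localhost", port=7687, retries=30, delay=2.0):
    """Poll the port until it opens or the retries run out."""
    for _ in range(retries):
        if bolt_port_is_open(host, port):
            return True
        time.sleep(delay)
    return False
```

Calling wait_for_bolt() right after the docker run command returns True as soon as the port opens, or False after roughly a minute of retries.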
Once the Docker container is up and running, create the contents using the queries in the YOUR_DOCKER_NEO_LOCATION/db_loader.cypher file.
You have a few options:
- (Easy mode) Run them in the Neo4j Browser by copy-pasting.
- In a terminal, run:
$ docker exec $CONTAINER /var/lib/neo4j/bin/cypher-shell -u neo4j -p stx_books_pass -f YOUR_DOCKER_NEO_LOCATION/db_loader.cypher
or, for interactive mode (to copy-paste like in the browser):
$ docker exec -ti $CONTAINER /var/lib/neo4j/bin/cypher-shell -u neo4j -p stx_books_pass
Before running your code, you need to define all the variables stored in .env, especially:
MLFLOW_USER=
MLFLOW_PASSWORD=
MLFLOW_URL=
So either use your own MLflow account or a dockerized MLflow instance.
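A small pre-flight check can catch missing settings before training starts. The sketch below assumes the variables from .env have already been exported into the process environment (for example by your shell or a dotenv loader); the helper name missing_env_vars is hypothetical and not part of this repository.

```python
import os

# Variable names taken from the .env list above.
REQUIRED_VARS = ("MLFLOW_USER", "MLFLOW_PASSWORD", "MLFLOW_URL")


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing settings: " + ", ".join(missing))
    print("All MLflow settings are defined.")
```

Passing a plain dict instead of relying on os.environ makes the check easy to exercise in tests.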
Once the graph database has been populated, the following schema should be visible. You can also inspect it yourself by calling:
CALL db.schema.visualization()
- Users - our users, with some attributes (including first_name, last_name, etc.)
- Titles - specific books with their metadata. Connected to users by the RATED_BY and READ_BY relations. RATED_BY carries a weight (0-10) and is used to compute FastRP embeddings, from which we classify and obtain our recommendations (modelled via RECOMMENDED_TO)
- Authors - node for the author of a given book, with its metadata (via the WRITTEN_BY relation)
- YearsOfPublications - node for a specific year of publication (via the WRITTEN_IN_YEAR relation)
- Publishers - node for the publisher of a given book (via the PUBLISHED_BY relation)
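To give a feel for how weighted ratings can turn into embeddings, here is a simplified, pure-Python sketch of the idea behind random-projection embeddings (FastRP in the graph-data-science plugin). It is an illustration under stated assumptions, not the GDS implementation and not the code in main.py: the function name, the toy node names, the dimension and the iteration weights are all hypothetical.

```python
import math
import random


def fastrp_like_embeddings(edges, dim=8, iteration_weights=(1.0, 1.0), seed=42, s=3):
    """edges: iterable of (node_a, node_b, weight), treated as undirected."""
    rng = random.Random(seed)
    neighbours = {}
    for a, b, w in edges:
        neighbours.setdefault(a, []).append((b, w))
        neighbours.setdefault(b, []).append((a, w))

    def random_vector():
        # Very sparse projection: +sqrt(s) or -sqrt(s) with prob 1/(2s) each, else 0.
        scale = math.sqrt(s)
        vec = []
        for _ in range(dim):
            r = rng.random()
            vec.append(scale if r < 1 / (2 * s) else -scale if r < 1 / s else 0.0)
        return vec

    def l2_normalise(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else vec

    nodes = sorted(neighbours)  # fixed order keeps the result reproducible
    current = {n: random_vector() for n in nodes}
    result = {n: [0.0] * dim for n in nodes}
    for iteration_weight in iteration_weights:
        nxt = {}
        for n in nodes:
            # Weighted average of the neighbours' vectors from the previous step.
            total = sum(w for _, w in neighbours[n])
            acc = [0.0] * dim
            for m, w in neighbours[n]:
                for i in range(dim):
                    acc[i] += w * current[m][i]
            nxt[n] = l2_normalise([x / total for x in acc]) if total else acc
        for n in nodes:
            for i in range(dim):
                result[n][i] += iteration_weight * nxt[n][i]
        current = nxt
    return result


# Toy ratings: u1 and u2 rated the same books with the same weights.
ratings = [("u1", "t1", 8), ("u1", "t2", 5), ("u2", "t1", 8), ("u2", "t2", 5), ("u3", "t3", 9)]
embeddings = fastrp_like_embeddings(ratings)
```

Because each vector is built only from a node's neighbourhood, users with identical rating patterns (u1 and u2 here) end up with identical embeddings, which is what makes such vectors usable for similarity-based recommendations.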
A more detailed schema (with specific indices in the csv view) can be read here
Our dataset comes from a Kaggle Dataset. It was limited to 50k and, for readability, Users were given fixtures (first name, last name) generated by faker, so any similarity to a real person is pure coincidence :)
Then run the following command in the terminal to train the model and create the new RECOMMENDED_TO relationships.
python3 main.py
The relationship runs from Titles to Users:
(Titles)-[:RECOMMENDED_TO]->(Users)
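For illustration, writing a batch of such relationships could look roughly like the sketch below. It only builds a parameterised Cypher statement and its parameters; how main.py actually writes the results is not shown in this README, so the function name, the use of internal node ids and the MERGE-based approach are assumptions.

```python
def build_recommendation_batch(pairs):
    """Build a Cypher statement and parameters for (title_id, user_id) pairs.

    Assumes the ids are internal Neo4j node ids, as returned by id(n).
    """
    rows = [{"title_id": title_id, "user_id": user_id} for title_id, user_id in pairs]
    query = (
        "UNWIND $rows AS row\n"
        "MATCH (t:Titles) WHERE id(t) = row.title_id\n"
        "MATCH (u:Users) WHERE id(u) = row.user_id\n"
        # MERGE keeps the write idempotent: re-running does not duplicate edges.
        "MERGE (t)-[:RECOMMENDED_TO]->(u)"
    )
    return query, {"rows": rows}


query, params = build_recommendation_batch([(10, 1), (11, 1)])
```

The resulting query and params can be handed to whichever Neo4j client you use; sending the rows as a parameter keeps the statement text constant across batches.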
Below is a fraction of the new relationships:
What the embedding process (into the temporary book_titles graph) looks like:
- TODO: See also our blog-post!
- link to STX blogpost here for more (TODO: or copy-paste here)
Graph-based recommendations give us a very powerful tool for searching by different criteria, where our imagination is the only limit.
Recommendations for a specific user (in this case Patti Jacobs):
MATCH paths=(u:Users {first_name: 'Patti', last_name: 'Jacobs'})<-[:RECOMMENDED_TO]-(t:Titles) RETURN paths;
List of readers who love "Pride and Prejudice", to check what they have in common (for results, see the CSV):
MATCH (romance_lovers:Users)-[:READ_BY]->(n:Titles) WHERE n.title = 'Pride and Prejudice'
MATCH (other_book:Titles)-[:RECOMMENDED_TO]->(romance_lovers:Users)
WHERE id(other_book) <> id(n)
RETURN other_book.author AS author, other_book.title AS title;
What are the best guesses for readers of the top-5 books? In the Cypher below, the sub-query obtains the first part (the five most-read titles and their readers), and the full query shows the recommendations for those readers. For results, see the CSV.
CALL {
// Five most-read titles, together with the internal ids of their readers
MATCH (users:Users)-[:READ_BY]->(n:Titles)
WITH COUNT(n) AS counter, n, COLLECT(id(users)) AS user_ids
RETURN n.title AS title, counter, user_ids
ORDER BY counter DESC
LIMIT 5
}
WITH user_ids
UNWIND user_ids AS user_id
// user_ids holds internal node ids, so match on id(u)
MATCH (u:Users)<-[:RECOMMENDED_TO]-(t2:Titles)
WHERE id(u) = user_id
RETURN t2
LIMIT 10;
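What the sub-query computes can be mirrored on toy data in a few lines of Python: group the read events by title, count the readers and keep the most-read titles. The function name and the sample pairs below are hypothetical and serve only to illustrate the aggregation.

```python
from collections import defaultdict


def top_read_titles(read_pairs, limit=5):
    """read_pairs: iterable of (user_id, title). Returns (title, count, user_ids)."""
    readers = defaultdict(list)
    for user_id, title in read_pairs:
        readers[title].append(user_id)
    # Same idea as COUNT(...) + COLLECT(...) followed by ORDER BY ... DESC LIMIT.
    ranked = sorted(readers.items(), key=lambda item: len(item[1]), reverse=True)
    return [(title, len(user_ids), user_ids) for title, user_ids in ranked[:limit]]


sample = [(1, "A"), (2, "A"), (3, "A"), (1, "B"), (2, "B"), (3, "C")]
top_two = top_read_titles(sample, limit=2)
```

The collected user ids are what the outer query then unwinds to look up each reader's recommendations.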
Here we limit the recommendations to readers based in the US who have already rated books published after 1984!
For results, see the CSV.
CALL {
MATCH (u:Users)-[r:RATED_BY]->(t:Titles)
WITH lTrim(split(u.location, ',')[-1]) AS location, t, u
WHERE location = 'usa' AND t.year_of_publication > 1984
RETURN t, u
LIMIT 10
}
WITH u
MATCH (u)<-[:RECOMMENDED_TO]-(t2:Titles)
RETURN t2.author AS recommended_author, t2.title AS recommended_title
LIMIT 5;
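The filter in the sub-query relies on the last comma-separated part of the location string. The Python sketch below mirrors that expression so the behaviour is easy to check by hand; the sample location strings are hypothetical and assume a lowercase "city, state, country" layout.

```python
def last_location_part(location):
    """Python counterpart of lTrim(split(location, ',')[-1])."""
    return location.split(",")[-1].lstrip()


def is_us_reader_of_recent_book(location, year_of_publication, min_year=1984):
    """True when the location ends in 'usa' and the book is newer than min_year."""
    return last_location_part(location) == "usa" and year_of_publication > min_year


print(last_location_part("seattle, washington, usa"))
print(is_us_reader_of_recent_book("toronto, ontario, canada", 1990))
```

Note that the comparison is case-sensitive and only leading whitespace is stripped, so values such as "USA" or "usa " would not match.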
Pulling data from Neo4j and loading the results back into Neo4j is done with the ["graph-data-science", "apoc"] plugins.
As a visualisation, an example of the new mapping can be found in the sample/results.txt file, but it is not updated after new training.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.