PoliGraph is a framework developed by the UCI Networking Group. This repository is a fork of their work, adapted to run on an M1 MacBook and to work with a dataset of Google Play Store privacy policies.
This repository hosts the source code for PoliGraph, including:
- PoliGraph-er software - see instructions below.
- Evaluation scripts under `evals/`.
- PoliGraph analysis scripts under `analyses/`.
- Dataset preparation scripts under `datasets/`.
- Model training scripts under `models/`.
PoliGraph is part of the Policy-Technology project of the UCI Networking Group.
If you create a publication based on PoliGraph and/or its dataset, please cite the original paper as follows:
@inproceedings{cui2023poligraph,
  title     = {{PoliGraph: Automated Privacy Policy Analysis using Knowledge Graphs}},
  author    = {Cui, Hao and Trimananda, Rahmadi and Markopoulou, Athina and Jordan, Scott},
  booktitle = {Proceedings of the 32nd USENIX Security Symposium (USENIX Security 23)},
  year      = {2023}
}
I am currently testing all the code in this repository on an M1 MacBook with the following configuration:
- CPU: M1 Max
- Memory: 64 GiB
- OS: macOS Sonoma 14.6
PoliGraph-er is the NLP software used to generate PoliGraphs from the text of a privacy policy.
PoliGraph-er is written in Python. The original repo suggests using Conda; this fork uses pip instead.
Create a virtual environment using venv and activate it:
$ python3 -m venv venv
$ source ./venv/bin/activate
After cloning this repository and creating a virtual environment, change the working directory to the cloned directory.
Install the necessary pip packages using:
$ pip install -r requirements.txt
Initialize the Playwright library (used by the crawler script):
$ playwright install
Download `poligrapher-extra-data.tar.gz` from here. Extract its contents to `poligrapher/extra-data`:
$ tar xf /path/to/poligrapher-extra-data.tar.gz -C poligrapher/extra-data
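If the `poligrapher/extra-data` directory does not already exist, create it first and then re-run the extraction, since tar's `-C` option expects an existing directory:
$ mkdir -p poligrapher/extra-data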
Download the PrivacyPoliciesDataset_Brianna dataset from here. Extract its contents to `PrivacyPoliciesDataset_Brianna`.
Install the PoliGraph-er (`poligrapher`) library:
$ pip install --editable .
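As a quick sanity check of the editable install, you can confirm that the package resolves from within the virtual environment (an `ImportError` here means the install did not succeed):
$ python -c "import poligrapher; print(poligrapher.__file__)"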
Here we illustrate how to generate a PoliGraph from a real privacy policy. We use the following privacy policy webpage as an example:
$ POLICY_URL="https://web.archive.org/web/20230330161225id_/https://proteygames.github.io/"
First, run the HTML crawler script to download the webpage:
$ python -m poligrapher.scripts.html_crawler ${POLICY_URL} example/
The directory `example/` will be used to store all the intermediate and final output associated with this privacy policy.
Second, run the `init_document` script to preprocess the webpage and run the NLP pipeline on the privacy policy document:
$ python -m poligrapher.scripts.init_document example/
Third, execute the `run_annotators` script to run annotators on the privacy policy document:
$ python -m poligrapher.scripts.run_annotators example/
Lastly, execute the `build_graph` script to generate the PoliGraph:
$ python -m poligrapher.scripts.build_graph example/
The generated graph is stored at `example/graph-original.yml`. You may use a text editor to view it. The format is human-readable and fairly straightforward.
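If you would rather inspect it programmatically, here is a minimal sketch that loads the file with PyYAML. This assumes the dump is plain YAML with no custom tags; if that assumption does not hold, `yaml.safe_load` would need to be swapped for a more permissive loader:

```python
import yaml

# Load the generated PoliGraph (assumes plain YAML; see note above).
with open("example/graph-original.yml") as f:
    graph = yaml.safe_load(f)

# Print the top-level structure to get a feel for the format.
print(type(graph).__name__)
if isinstance(graph, dict):
    for key, value in graph.items():
        print(key, type(value).__name__)
```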
Alternatively, if you run `build_graph` with the `--pretty` parameter, it will generate a PoliGraph in the GraphML format (`example/graph-original.graphml`), which can be imported into some graph editor software:
$ python -m poligrapher.scripts.build_graph --pretty example/
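As an alternative to a graph editor, here is a short sketch of how the GraphML output could be inspected with networkx (assumed to be available, since PoliGraph itself builds on it):

```python
import networkx as nx

# Load the GraphML file produced by build_graph --pretty.
g = nx.read_graphml("example/graph-original.graphml")

print(f"{g.number_of_nodes()} nodes, {g.number_of_edges()} edges")
for src, dst, attrs in g.edges(data=True):
    # Edge attributes carry whatever annotations build_graph emitted.
    print(f"{src} -> {dst}: {attrs}")
```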
For more instructions on how to view the graphs, please refer to the document Viewing a PoliGraph.
The `init_document`, `run_annotators`, and `build_graph` scripts support batch processing. Simply supply multiple directories as arguments:
$ python -m poligrapher.scripts.init_document dataset/policy1 dataset/policy2 dataset/policy3
$ python -m poligrapher.scripts.run_annotators dataset/policy1 dataset/policy2 dataset/policy3
$ python -m poligrapher.scripts.build_graph dataset/policy1 dataset/policy2 dataset/policy3
If all the subdirectories under `dataset` contain valid crawled webpages, you may simply supply `dataset/*` and let the shell expand the arguments, as shown below.
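For example, assuming every subdirectory of `dataset/` holds a crawled policy, the whole pipeline can be run in three batch invocations:
$ python -m poligrapher.scripts.init_document dataset/*
$ python -m poligrapher.scripts.run_annotators dataset/*
$ python -m poligrapher.scripts.build_graph dataset/*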
We will update the documentation under the `docs/` directory to explain the usage of the other scripts.
Please refer to the documents USENIX Security 2023 Artifact Evaluation and Artifact Evaluation (Additional Experiments) for instructions on reproducing the main results of the original paper.
To use this with the Brianna dataset, simply run the command below:
$ python3 briannaRun.py
Output graphs will be saved to the `output/GooglePlay_Privacy_Policies/{filename}` and `output/IndividWebsite_Privacy_Policies/{filename}` directories, and the graphs stored there will then be fed into Neo4j.
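For orientation only, here is a minimal sketch of what a driver script along these lines might do. The dataset layout, directory names, and pipeline order are assumptions based on the steps above, not the actual contents of `briannaRun.py`:

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical layout: one crawled policy per subdirectory.
DATASET_DIR = Path("PrivacyPoliciesDataset_Brianna")

def run_pipeline(policy_dirs):
    """Run the three PoliGraph stages, each as one batch invocation."""
    for script in ("init_document", "run_annotators", "build_graph"):
        subprocess.run(
            [sys.executable, "-m", f"poligrapher.scripts.{script}",
             *map(str, policy_dirs)],
            check=True,
        )

if __name__ == "__main__":
    run_pipeline(sorted(p for p in DATASET_DIR.iterdir() if p.is_dir()))
```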
To start the import of graphs into Neo4j, simply run the command below:
$ python3 graphImport.py
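For illustration, here is one way such an import could be wired up with the official `neo4j` Python driver and networkx. The connection settings, node label, and relationship type below are assumptions made for the sketch, not what `graphImport.py` actually does:

```python
import glob

import networkx as nx
from neo4j import GraphDatabase

# Assumed local connection settings; adjust for your Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def import_graph(tx, path):
    # Assumes the GraphML variant of each graph was generated (--pretty).
    g = nx.read_graphml(path)
    for node in g.nodes:
        # Hypothetical label; the real script may model nodes differently.
        tx.run("MERGE (n:PoliGraphNode {name: $name})", name=str(node))
    for src, dst, attrs in g.edges(data=True):
        tx.run(
            "MATCH (a:PoliGraphNode {name: $src}), (b:PoliGraphNode {name: $dst}) "
            "MERGE (a)-[:RELATES_TO {label: $label}]->(b)",
            src=str(src), dst=str(dst), label=str(attrs.get("label", "")),
        )

with driver.session() as session:
    for path in glob.glob("output/**/graph-original.graphml", recursive=True):
        session.execute_write(import_graph, path)
driver.close()
```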