Datavoids Web Simulator

This web simulator explores the dynamics of data voids and the effectiveness of different mitigation strategies. A data void occurs when there is a lack of relevant information online for certain search keywords, which can be exploited to spread disinformation. The entire English Wikipedia is used as the dataset to simulate a web of hyperlinked pages.

Our simulator models an adversarial game between disinformers and mitigators, each attempting to influence search engine rankings by adding content to fill these voids:

  • Simulation of Data Voids: the simulator constructs data voids by removing relevant pages from the Wikipedia dataset, creating an environment where search queries return few or no relevant results, mimicking real-world scenarios where data voids can be exploited.

  • Adversarial Game Model: disinformers and mitigators take turns adding content to the dataset. Disinformers aim to promote misleading information, while mitigators attempt to counteract this by adding accurate information (a minimal sketch of this turn-taking loop appears after this list).

  • Evaluation of Strategies: the simulator evaluates various strategies for both disinformers and mitigators. Strategies include Random, Greedy, and Multiobjective approaches, each with different resource allocations and impact levels.

  • Tracking and Analysis: the simulator tracks the changes in search result rankings over time, providing insights into the effectiveness of different mitigation efforts.
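
As a rough illustration of the turn-taking structure above, here is a minimal sketch in Python. All names in it (Simulator, add_page, rank_gap, random_strategy) are hypothetical and do not correspond to the repository's actual classes or functions; the real simulator operates on the Wikipedia graph in Postgres and uses the search ranking described below.

# Minimal illustrative sketch of the adversarial turn-taking loop described above.
# All names here (Simulator, add_page, rank_gap, random_strategy) are hypothetical
# and do NOT correspond to the repository's actual classes or functions.
from dataclasses import dataclass, field

@dataclass
class Simulator:
    graph: dict = field(default_factory=dict)    # page -> set of linked pages
    history: list = field(default_factory=list)  # rank-gap trajectory over turns

    def add_page(self, page, links):
        """A player fills part of the data void by adding a page with outgoing links."""
        self.graph[page] = set(links)

    def rank_gap(self):
        """Toy metric: difference between mitigator and disinformer page counts."""
        mit = sum(1 for p in self.graph if p.startswith("mit"))
        dis = sum(1 for p in self.graph if p.startswith("dis"))
        return mit - dis

def random_strategy(prefix, step):
    """Random strategy: add an arbitrary page with no targeted linking."""
    return f"{prefix}_page_{step}", []

def run_game(steps=10):
    sim = Simulator()
    for step in range(steps):
        # Disinformer and mitigator alternate edits each turn.
        for prefix in ("dis", "mit"):
            page, links = random_strategy(prefix, step)
            sim.add_page(page, links)
        sim.history.append(sim.rank_gap())
    return sim.history

print(run_game())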

Some results from our main research paper are illustrated below. The following figure shows the difference in effects between the mitigator and the disinformer at every turn of the web search simulation across four data void scenarios:

figure3

This figure illustrates the costs associated with different mitigator strategies at each turn of the web search simulation across the same four data void scenarios as the previous figure.

figure4

Running the simulator

Python environment

pipenv install
pipenv shell
pipenv --venv # Show the virtual environment path

Select that environment path as the Python interpreter whenever you run a Jupyter Notebook in VSCode.

Project configuration

The project is configured by a config.json file. This file is not included in this repository, but a template is available as config.template.json.

An example:

{
  "database": {
    "host": "localhost",
    "user": "postgres",
    "password": "postgres",
    "database": "wikidump"
  },
  "stored_functions_dir": "./database/functions/",
  "target_groups": [
    "mit",
    "dis",
    "None"
  ],
  "groups_colors": {
    "mit": "#16a085",
    "dis": "#f39c12",
    "None": "#000000"
  },
  "mitigator_keyword": "mit",
  "disinformer_keyword": "dis",
  "mit_keywords": [],
  "dis_keywords": [],
  "target_node": {
    "bay": 404412,
    "fre": 15537745
  },
  "page_rank_at_each_step": false,
  "compute_initial_rage_rank": false,
  "top_k": 1000,
  "steps_config": {
    "max_steps": -1,
    "max_atomic_steps": -1,
    "on_each_node": true,
    "on_each_edge": false
  },
  "costs": {
    "budget": -1 
  },
  "labeling_hops": 1,
  "datavoids": [],
  "output_filename": null,
  "gzip": true
}
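
As a hedged sketch (not the repository's actual loading code), the configuration could be read and used to open a database connection as follows, assuming the psycopg2 driver:

# Minimal sketch, assuming psycopg2 is installed; not the repository's own loader.
import json
import psycopg2

with open("config.json") as f:
    config = json.load(f)

# Open a connection using the "database" section of the config.
conn = psycopg2.connect(
    host=config["database"]["host"],
    user=config["database"]["user"],
    password=config["database"]["password"],
    dbname=config["database"]["database"],
)

# A few of the simulation parameters from the example above.
print(config["top_k"], config["costs"]["budget"], config["steps_config"]["max_steps"])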

Create the database

Install Postgres and create a database to contain the whole Wikipedia dataset.

create database wikidump;

After connecting to the database, create the necessary tables:

drop table if exists nodes, nodes_info, edges, redirects, rank;

create table nodes(
  id serial primary key, 
  grp varchar,
  active boolean default true
);

create table nodes_info (
    id integer primary key,
    url character varying,
    content text,
    content_vector tsvector,
    date_added timestamp without time zone
);
create index index_name on nodes_info (url);

create table edges(
  src int references nodes,
  des int references nodes,
  active boolean default true,
  primary key (src, des)
);

create table rank(
  id int primary key
  -- rank algorithm can add columns to it
);

create table info(
  id varchar primary key,
  prop varchar
);

create table redirects(
  from_title varchar primary key, 
  to_title varchar
);
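
Once data has been imported (next step), a quick sanity check such as the following can confirm that the nodes and edges tables are populated. This is a hedged sketch using psycopg2, not part of the repository:

# Sanity-check sketch for the schema above (assumes psycopg2; not repository code).
import psycopg2

conn = psycopg2.connect(host="localhost", user="postgres",
                        password="postgres", dbname="wikidump")
with conn, conn.cursor() as cur:
    cur.execute("select count(*) from nodes where active")
    n_nodes = cur.fetchone()[0]
    cur.execute("select count(*) from edges where active")
    n_edges = cur.fetchone()[0]
    print(f"{n_nodes} active nodes, {n_edges} active edges")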

Data can then be imported from the official Wikipedia dump files with:

PYTHONPATH=.:loaders/wikiextractor python loaders/load_wiki_dump.py config.json  ~/path/to/enwiki-multistream 

Alternatively, a database dump of an already imported Wikipedia dataset is available and can be restored with:

psql -U postgres -d wikidump -f ~/path/to/wikidump.sql

Load stopwords

Execute the following:

pipenv run python ./loaders/load_stopwords.py

The stopwords are contained in data/stopwords and come from the following sources. Each entry lists the source name, the size of its stopword list, and a description:

CoreNLP (Hardcoded) 28 Hardcoded in src/edu/stanford/nlp/coref/data/WordLists.java and the same in src/edu/stanford/nlp/dcoref/Dictionaries.java
Ranks NL (Google) 32 The short stopwords list below is based on what we believed to be Google stopwords a decade ago, based on words that were ignored if you would search for them in combination with another word. (ie. as in the phrase "a keyword").
Lucene, Solr, Elasticsearch 33 (NOTE: Some config files have extra 's' and 't' as stopwords.) An unmodifiable set containing some common English words that are not usually useful for searching.
MySQL (InnoDB) 36 A word that is used by default as a stopword for FULLTEXT indexes on InnoDB tables. Not used if you override the default stopword processing with either the innodb_ft_server_stopword_table or the innodb_ft_user_stopword_table option.
Ovid (Medical information services) 39 Words of little intrinsic meaning that occur too frequently to be useful in searching text are known as "stopwords." You cannot search for the following stopwords by themselves, but you can include them within phrases.
Bow (libbow, rainbow, arrow, crossbow) 48 Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering. Short list hardcoded. Also includes 524 SMART derived list, same as MALLET. See http://www.cs.cmu.edu/~mccallum/bow/rainbow/
LingPipe 76 An EnglishStopTokenizerFactory applies an English stop list to a contained base tokenizer factory
Vowpal Wabbit (doc2lda) 83 Stopwords used in LDA example
Text Analytics 101 85 Minimal list compiled by Kavita Ganesan consisting of determiners, coordinating conjunctions and prepositions http://text-analytics101.rxnlp.com/2014/10/all-about-stop-words-for-text-mining.html
LexisNexis® 100 “The following are 'noise words' and are never searchable: EVER HARDLY HENCE INTO NOR WERE VIZ. Others are 'noisy keywords' and are searchable by enclosing them in quotes.”
Okapi (gsl.cacm) 108 Cacm specific stoplist from Okapi
TextFixer 119 From textfixer.com Linked from Wiki page on Stop words.
DKPro 127 Postgresql (Snowball derived)
Postgres 127 “Stop words are words that are very common, appear in almost every document, and have no discrimination value.”
PubMed Help 133 Listed in PubMed Help pages.
CoreNLP (Acronym) 150 A set of words that should be considered stopwords for the acronym matcher
NLTK 153 According to email, van Rijsbergen (1979) "Information retrieval" (Butterworths, London). It is slightly expanded from the Postgres postgresql.txt list, which was presumably borrowed from Snowball.
Spark ML lib 153 (Note: Same as NLTK) They were obtained from Postgres; the English list has been augmented.
MongoDB 174 Commit says 'Changed stop words files to the snowball stop lists'
Quanteda 174 Has SMART and Snowball Default Lists. Source
Ranks NL (Default) 174 (Note: Same as Default Snowball Stoplist, but RanksNL frequently cited as source) “This list is used in [Ranks NL] Page Analyzer and Article Analyzer for English text, when you let it use the default stopwords list.”
Snowball (Original) 174 Default Snowball Stoplist.
Xapian 174 (Note: uses Snowball Stopwords) “It has been traditional in setting up IR systems to discard the very commonest words of a language - the stopwords - during indexing.”
R tm 174 R tm package uses snowball list and also has SMART.
99webTools 183 “Stop Words are words which do not contain important significance to be used in Search Queries. Most search engine filters these words from search query before performing search, this improves performance.”
Deeplearning4J 194 DL4J stopwords are in two places: stopwords and stopwords.txt. Probably derived from Snowball. Some unusual entries, e.g. ----s.
Reuters Web of Science™ 211 “Stopwords are common, frequently used words such as articles (a, an, the), prepositions (of, in, for, through), and pronouns (it, their, his) that cannot be searched as individual words in the Topic and Title fields. If you include a stopword in a phrase, the stopword is interpreted as a word placeholder.”
Function Words (Cook 1988) 221 “This list of 225 items was compiled for practical purposes some time ago as data for a computer parser for student English.” Paper
Okapi (gsl.sample) 222 This Okapi is the BM25 Okapi. (Note: Included stopword text file is from all “F” “H” terms, as defined by defs.h) The GSL file contains terms that are to be dealt with in a special way by the indexing process. Each type is defined by a class code.
Snowball (Expanded) 227 NOTE: This Includes the extra words mentioned in comments “An English stop word list. Many of the forms below are quite rare (e.g. 'yourselves') but included for completeness.”
DataScienceDojo 250 Used in a real-time sentiment AzureML demo for a meetup
CoreNLP (stopwords.txt) 257 Note: "a", "an", "the", "and", "or", "but", "nor" hardcoded in StopList.java also includes punctuation (!!, -lrb- …)
OkapiFramework 262 This is NOT the BM25 Okapi (at least I don't think so). This list is used in the Okapi Framework; this Okapi is the Localization and Translation Okapi.
Azure Gallery 310 Slightly modified glasgow list.
ATIRE (NCBI Medline) 313 NCBI wrd_stop stop word list of 313 terms extracted from Medline. Its use is unrestricted. The list can be downloaded from here
Go 317 Go stopwords library. This is the glasgow list without 'computer' 'i' 'thick' - has 'thickv'
scikit-learn 318 Uses Glasgow list, but without the word “computer”
Glasgow IR 319 Linguistic resources from Glasgow Information Retrieval group. Lots of copies and edits of this one. Eg: xpo6 has mistakes – has quote instead of 'lf' eg: herse" instead of herself - comes up as one of the top results in google search.
xpo6 319 Used in the Humboldt Digital Library and Network and documented in a blog post. Likely derived from the Glasgow list.
spaCy 326 Improved list from Stone, Denis, Kwantes (2010) Paper
Gensim 337 Same as spaCy (Improved list from Stone, Denis, Kwantes (2010))
Okapi (Expanded gsl.cacm) 339 Expanded cacm list from Okapi
C99 and TextTiling 371 UIMA wrapper for the java implementations of the segmentation algorithms C99 and TextTiling, written by Freddy Choi
Galago (inquery) 418 The core/src/main/resources/stopwords/inquery list is same as Indri default.
Indri 418 Part of Lemur Project
Onix & Lextek 429 This stopword list is probably the most widely used stopword list. It covers a wide number of stopwords without getting too aggressive and including too many words which a user might search upon. This wordlist contains 429 words.
GATE (Keyphrase Extraction) 452 Stopwords used in GATE Keyphrase Extraction Algorithm
Zettair 469 Zettair is a compact and fast text search engine designed and written by the Search Engine Group at RMIT University. It was once known as Lucy.
Okapi (Expanded gsl.sample) 474 Same as okapi_sample.txt but with “I” terms (not default Okapi behaviour! but may be useful)
Taporware 485 TAPoRware Project, McMaster University - modified Glasgow list – includes numbers 0 to 100, and 1990 to 2020 (for dates presumably) also punctuation
Voyant (Taporware) 488 Voyant uses taporware list by default, includes extra thou, thee, thy – presumably for Shakespeare corpus. Trombone repo also has Glasgow and SMART in resources.
MALLET 524 Default MALLET stopword list. (Based on SMART I think) See Docs
Weka 526 Like Bow (Rainbow, which is SMART) but with extra ll ve added to avoid words like you'll,I've etc. Almost exactly the same as mallet.txt
MySQL (MyISAM) 543 MyISAM and InnoDB use different stoplists. Taken from SMART but modified
Galago (rmstop) 565 Includes some punctuation, utf8 characters, www, http, org, net, youtube, wikipedia
Kevin Bougé 571 Multilang lists compiled by Kevin Bougé. English is SMART.
SMART 571 SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System is an information retrieval system developed at Cornell University in the 1960s.
ROUGE 598 Extended SMART list used in ROUGE 1.5.5 Summary Evaluation Toolkit – includes extra words: reuters, ap, news, tech, index, 3 letter days of the week and months.
tonybsk_1.txt 635 Unknown origin - I lost the reference.
Sphinx Search Ultimate 665 An extension for Sphinx has this list.
Ranks NL (Large) 667 A very long list from ranks.nl
tonybsk_6.txt 671 Unknown origin - I lost the reference.
Terrier 733 Terrier Retrieval Engine “Stopword list to load can be loaded from the stopwords.filename property.”
ATIRE (Puurula) 988 Included in ATIRE See Paper
Alir3z4 1298 List of common stop words in various languages. The English list looks like merged from several sources.

TF-IDF

Compute TF-IDF following the notebook tf_idf.ipynb. This step is a notebook rather than a plain Python script to allow enough configurability: by default IDF is computed only on the topics of interest, while a command to compute it for all pages in Wikipedia is available but commented out, since it might take several weeks to run on a laptop.

Keep in mind that the stored procedure compute_idf_seeds_nodes used in this notebook takes a long time to execute; on an M1 Pro MacBook Pro it finished in 7 hours.
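
For reference, the sketch below shows the standard TF-IDF computation in plain Python. It is only a conceptual illustration, not the repository's stored procedure, which runs inside Postgres over the imported pages:

# Conceptual TF-IDF illustration only; the repository computes it inside Postgres.
import math
from collections import Counter

docs = {
    1: "procedural languages describe how to compute",
    2: "declarative languages describe what to compute",
}

def tf_idf(docs):
    n = len(docs)
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    # df[t] = number of documents containing term t
    df = Counter(t for tokens in tokenized.values() for t in set(tokens))
    return {
        doc_id: {
            t: (count / len(tokens)) * math.log(n / df[t])
            for t, count in Counter(tokens).items()
        }
        for doc_id, tokens in tokenized.items()
    }

print(tf_idf(docs)[1]["procedural"])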

Alternatively, a precomputed TF-IDF is stored in wikidump_tf_idf.sql, which you can import with:

psql -U postgres -d wikidump -f ~/path/to/wikidump_tf_idf.sql

Performances (optional)

Postgres

Increase shared_buffers to 1/4 of your RAM

sudo nvim /usr/local/var/postgres@14/postgresql.conf

Edit shared_buffers; for example, the following values for 16 GB of RAM were calculated with PGTune (pgtune.leopard.in.ua):

max_connections = 40
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 500
random_page_cost = 1.1
work_mem = 26214kB
min_wal_size = 4GB
max_wal_size = 16GB

Restart Postgres:

brew services restart postgresql@14

Test with wikilite (optional)

To run faster simulations during development, it is possible to use a smaller copy of the Wikipedia page network containing only the labeled nodes, their neighbors, and a random sample of the unlabeled ones.

To do this, first import the Wikipedia dump into another database:

create database wikilite;

Import the dumps following the same steps described above for wikidump.

Then, whenever you run a simulation, use wikilite as the database instead of wikidump. This database name is reserved for this kind of execution, in which tables such as nodes are copied into smaller versions to improve simulation performance. A hedged sketch of how such a reduced copy could be built is shown below.
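
The query below is illustrative only: it assembles labeled nodes, their neighbors, and a random sample of unlabeled nodes. The table name nodes_lite, the NULL-group convention, and the 1% sampling rate are assumptions, not the repository's actual procedure:

# Illustrative only: build a reduced copy of the nodes table with labeled nodes,
# their neighbors, and a 1% random sample of the rest. Table name, NULL-group
# convention, and sampling rate are assumptions.
import psycopg2

conn = psycopg2.connect(host="localhost", user="postgres",
                        password="postgres", dbname="wikilite")
with conn, conn.cursor() as cur:
    cur.execute("""
        drop table if exists nodes_lite;
        create table nodes_lite as
        select * from nodes n
        where n.grp is not null                                   -- labeled nodes
           or exists (select 1 from edges e join nodes m on m.id = e.src
                      where e.des = n.id and m.grp is not null)   -- neighbors (in)
           or exists (select 1 from edges e join nodes m on m.id = e.des
                      where e.src = n.id and m.grp is not null)   -- neighbors (out)
           or random() < 0.01;                                    -- random sample
    """)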

Important Folders and files

  • docs folder contains more details about the simulator, including descriptions of the strategies, how costs were modeled, and various documents explaining the decision making behind them.

  • database folder contains the SQL files used by the code. For example, functions/searchrank.sql contains the search rank algorithm, which performs PageRank and text search ranking, implemented respectively in functions/pagerank.sql and functions/tsrank.sql.

  • data folder contains various data needed for the simulator to work. Most importantly, data/datavoids_per_topic_filtered.json contains the topics used for the paper, while data/datavoids_per_topic.json contains additional topics that were considered but found not to be compelling.

  • hcp folder contains files useful for running the simulator on the HPC

  • results folder is where results are saved when simulations are running

  • tests folder contains the tests. Important ones are:

    • strategy-evaluation-all-topics.ipynb, which takes several days to finish and runs all simulations for all topics. It produces files in the results folder; for example, "pro_dec-eval-all-rnd-greedy.csv" is saved for the simulation of the topic Procedural Languages vs. Declarative Languages, running random vs. greedy respectively ('eval-all' is just a label in the simulation indicating the evaluation of all topics).
    • To plot the results of those runs, use strategy-evaluation-all-topics-results-single-plots.ipynb, which creates files in the results folder; for example, images such as opt-pes-eval-all-base-rnd-rnd-eval-all-greedy-rnd.pdf are created in results/images_against_disinformer.
  • PubVisualizations also contains scripts and data used to produce some of the paper's visualizations.
