This project provides a web scanning tool for analysing interactions between websites and browser extensions. It allows automated large-scale web studies targeting the security and privacy problems caused by these interactions.
The goal of this project is to identify websites that abuse browser extensions for malicious purposes. In particular, the following two issues are targeted.
- Extension discovery using web accessible resources
  - We detect if a website tries to discover which extensions are installed in a user’s browser using Web Accessible Resources (WARs).
  - Paper: A. Sjösten, S. Van Acker, and A. Sabelfeld. "Discovering Browser Extensions via Web Accessible Resources" (Link)
- Message passing between websites and extensions
  - There are several risks originating from the message passing APIs available for extensions and websites. We analyse which websites make use of these capabilities with a focus on potentially malicious behaviour. For this we save all messages sent on a website using window.postMessage, runtime.sendMessage, runtime.connect and port.postMessage.
  - Paper: D. F. Somé. "EmPoWeb: Empowering Web Applications with Browser Extensions" (Link)
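To illustrate the first issue: in Chromium, a web accessible resource is addressed by a URL of the form chrome-extension://<extension-id>/<path>, where the extension ID consists of 32 characters from a to p. A website probing for installed extensions therefore issues requests to such URLs. The following minimal sketch flags these requests. The function name extract_extension_id is hypothetical and not taken from the scanner's code.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Chromium extension IDs are 32 characters drawn from the letters a-p.
EXTENSION_ID = re.compile(r"[a-p]{32}")


def extract_extension_id(url: str) -> Optional[str]:
    """Return the extension ID if the URL targets a chrome-extension:// resource, else None."""
    parts = urlsplit(url)
    if parts.scheme != "chrome-extension":
        return None
    host = parts.netloc.lower()
    # fullmatch ensures the whole host is a well-formed extension ID.
    return host if EXTENSION_ID.fullmatch(host) else None
```

A request observed on a page can then be classified by checking whether extract_extension_id returns a value for its URL.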
This project consists of two parts.
- A scanning system to perform web studies based on the Chrome DevTools Protocol.
- An analysis tool for automated evaluations of the data collected by the scanners. Details regarding the specific evaluations can be found in analysis/README.md.
The scanner detects and saves requests to WARs. Additionally, all observed messages are collected. These can be benign or malicious, which is decided in a later analysis step. The scanning process consists of the following six steps which are repeated for each scanned website.
- Retrieve website URL from the list of domains to crawl.
- Open website in new browser tab.
- When the website has loaded, start recording potentially malicious behaviour.
- Wait 20 seconds (while simulating user interaction).
- Close the browser tab.
- Save recorded data as results to the database.
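The six steps above can be sketched as a simple loop. This is only an illustration of the control flow, not the scanner's actual implementation: the RecordingTab interface and the open_tab and save callables are hypothetical stand-ins for the browser control and database code, and the user interaction is reduced to a single call before the waiting period.

```python
import time
from typing import Callable, Iterable, List, Protocol


class RecordingTab(Protocol):
    """Hypothetical interface of a browser tab; not the scanner's real API."""

    def wait_until_loaded(self) -> None: ...
    def start_recording(self) -> None: ...
    def simulate_interaction(self) -> None: ...
    def close(self) -> List[dict]: ...


def scan(
    urls: Iterable[str],
    open_tab: Callable[[str], RecordingTab],
    save: Callable[[str, List[dict]], None],
    dwell_seconds: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run the per-website scanning steps for every URL."""
    for url in urls:                   # 1. retrieve the next URL
        tab = open_tab(url)            # 2. open it in a new tab
        tab.wait_until_loaded()
        tab.start_recording()          # 3. record once the page has loaded
        tab.simulate_interaction()
        sleep(dwell_seconds)           # 4. wait (20 seconds by default)
        records = tab.close()          # 5. close the tab, collect the records
        save(url, records)             # 6. persist the results
```

Passing sleep as a parameter keeps the loop testable without actually waiting 20 seconds per website.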
The detection logic is implemented as a Python program that uses the pychrome module to control the Chromium web browser via the Chrome DevTools Protocol. All three components reside on the same server. To parallelise the scanning process, multiple instances of the scanner can run simultaneously on separate servers.
The data collected by this tool is stored in a PostgreSQL database, which can be installed on a separate database server. To track the progress of a scan and to detect errors or system crashes, we employ the external monitoring system Healthchecks.io.
In case you only want to scan a few websites or want to test the scanning system, you can follow this quick start guide consisting of only five steps. For performing larger scans, have a look at the following section where the usage for larger production environments is explained.
- Clone this repository.
- Install the following packages.
sudo apt-get install python3 python3-pip python3-dev libpq-dev chromium-browser git
- Install all needed Python modules.
pip3 install -r scanner/requirements.txt
- Disable the database and the healthchecks in scanner/config.cfg by setting the corresponding values to False. Moreover, create a new entry in the section input where the key is your computer's hostname (as given by the hostname command) and the value is the path to your input file, e.g. myhost=../input_data/extscan01.csv.
- Start the scan.
cd scanner/
python3 main.py
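The hostname-to-input-file mapping from step 4 can be illustrated with Python's configparser module. This is a minimal sketch, assuming only what is stated above (a section named input whose keys are hostnames). The helper input_file_for_host is hypothetical and may differ from how the scanner reads its configuration.

```python
import configparser
import socket
from typing import Optional


def input_file_for_host(
    config: configparser.ConfigParser, hostname: Optional[str] = None
) -> str:
    """Look up the input file configured for this machine in the [input] section."""
    hostname = hostname or socket.gethostname()
    try:
        return config["input"][hostname]
    except KeyError:
        raise KeyError(f"no entry for host {hostname!r} in section [input]") from None


# Example configuration with a single host entry, as in the quick start guide.
config = configparser.ConfigParser()
config.read_string("[input]\nmyhost=../input_data/extscan01.csv\n")
print(input_file_for_host(config, "myhost"))
```

If the lookup fails, the hostname reported in the error message is the key that needs to be added to the input section.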
In order to perform large-scale studies it is necessary to analyse a large number of websites. This can be achieved by parallelising the scanning system. For example, a scanner can be a virtual machine (tested using Ubuntu 18.04.4 LTS) equipped with additional software. These scan servers can be configured and deployed using Ansible playbooks.
- Create a fork of this repository. Depending on your use case make this a private repository.
- Clone your new repository to your local development machine.
As input data, a list of domains to scan is required. For example, the Tranco top-list can be used for this. The input file must be a comma-separated CSV file containing the rank and the domain itself.
1,a.com
2,b.com
3,c.com
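A file in this format can be read with Python's csv module. The following is a minimal sketch, assuming the file has no header row. The function name read_domains is hypothetical.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def read_domains(handle: TextIO) -> Iterator[Tuple[int, str]]:
    """Yield (rank, domain) pairs from a 'rank,domain' CSV without a header row."""
    for row in csv.reader(handle):
        if not row:  # skip blank lines
            continue
        rank, domain = row
        yield int(rank), domain.strip()


# Parse the three example lines shown above.
sample = io.StringIO("1,a.com\n2,b.com\n3,c.com\n")
print(list(read_domains(sample)))
```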
First, split the original input file up into multiple files so that each scanner processes a different list. For this configure the parameters in the Python file scanner/inputgeneration/inputgenerator.py. The following example creates 20 files containing URLs from the given input file. To distribute the URLs equally a chunking mechanism is used. The algorithm goes through the input file and always appends the next 25 URLs to an output file. Then it continues with the next file.
InputGenerator.split_input_data_into_files(
    input_file="../input_data/top-1m.csv",
    output_directory="../input_data/top-1m/",
    n=20,
    chunk_size=25
)
Run the generator.
cd scanner/
python3 inputgeneration/inputgenerator.py
The following listing gives an example output for the input chunking algorithm for n = 2 and a chunk size of 5 (with numbers instead of URLs).
# Input
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
# Output
0: [0, 1, 2, 3, 4, 10, 11, 12, 13]
1: [5, 6, 7, 8, 9]
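The chunking behaviour shown in this listing can be reproduced with a few lines of Python. This is a sketch of the described round-robin distribution, not the code of InputGenerator itself. The function name split_round_robin is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_round_robin(items: Sequence[T], n: int, chunk_size: int) -> List[List[T]]:
    """Distribute items over n buckets, chunk_size consecutive items at a time."""
    if n < 1 or chunk_size < 1:
        raise ValueError("n and chunk_size must be positive")
    buckets: List[List[T]] = [[] for _ in range(n)]
    for start in range(0, len(items), chunk_size):
        # The k-th chunk is appended to bucket k mod n.
        target = (start // chunk_size) % n
        buckets[target].extend(items[start:start + chunk_size])
    return buckets


# Same parameters as in the listing: n = 2, chunk size 5, numbers 0 to 13.
for index, bucket in enumerate(split_round_robin(list(range(14)), n=2, chunk_size=5)):
    print(f"{index}: {bucket}")
```

Because chunks are handed out in turn, every output file receives a mix of high-ranked and low-ranked domains instead of one contiguous slice of the list.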
Now add these files to your repository. This is important as the scan servers configured in the following steps will download their list of URLs directly from the repository.
git add input_data/top-1m/
git commit -m "Added input files"
git push origin master
- Create the input files with the domains to scan as described above.
- Configure the scanner according to your needs in the file scanner/config.cfg. At least define which input files to use in [input] and provide your database credentials in [database]. If you want to monitor your scan using healthchecks.io also add your api_key in [healthcheck].
- Define your Ansible inventory, i.e. the addresses of the scan and DB servers, in ansible/hosts.
- Set your operating system credentials (for the VMs) and database credentials in ansible/variables.yml. Also define your Ansible user and whether a jump server shall be used in this file.
- Create an SSH deploy key that will be used by the scan servers to fetch the code from your (private) repository. Place your secret and public key in ansible/files/id_rsa_deploy and ansible/files/id_rsa_deploy.pub respectively and add the key to your repo's deploy keys on GitHub. This is not necessary when using a public repository.
- Depending on the local network your scanners and database are running in, change the database access settings in ansible/playbook_deploy_dbserver.yml.
During deployment the source code and the URL input files will be cloned from the defined repository directly to the servers. Sensitive data should not reside within the repo, therefore scanner/config.cfg, ansible/files/id_rsa_deploy and ansible/files/id_rsa_deploy.pub will be uploaded from your local development machine to the servers.
Important: Please read through the playbooks carefully before running them for the first time. They are written targeting VMs that are solely used for the purpose of scanning and modify the operating system. For example, snapd will be removed from all scan servers (for storage capacity reasons).
To install all necessary servers perform the following steps in the given order.
First, reboot all servers.
ansible-playbook ansible/playbook_reboot.yml -K -i ansible/hosts
Deploy and configure all scan servers.
ansible-playbook ansible/playbook_deploy_scanserver.yml -K -i ansible/hosts
Deploy the database server and database.
ansible-playbook ansible/playbook_deploy_dbserver.yml -K -i ansible/hosts
Start the scan on all servers.
ansible-playbook ansible/playbook_run_scan.yml -K -i ansible/hosts
During/after a scan, the log files can be downloaded using the following command.
ansible-playbook ansible/playbook_collect_logs.yml -K -i ansible/hosts
The analysis tool is described in detail in analysis/README.md.
This project is licensed under the terms of the MIT license.