GPS is a scanning platform that learns and predicts the location of IPv4 services across all 65K ports. GPS uses application-, transport-, and network-layer features to probabilistically model and predict service presence, and computes its service predictions in 13 minutes. Compared to exhaustive scanning, GPS finds 92.5% of all services across all ports using 131x less bandwidth and with 204x higher precision.
To learn more about GPS' system and performance, check out the original paper appearing at Sigcomm '22.
To run GPS, you need the following capabilities:
- Python v3
- Access to Google BigQuery and the Google Cloud command-line tools (e.g., bq). Users are responsible for their own billing. As long as intermediate tables are not stored in BigQuery for longer than GPS' execution, the total BigQuery cost should be less than $1.
- Access to an Internet scanner (e.g., LZR) and Internet scanning infrastructure. Please make sure to adhere to these scanning best practices.
- Access to a large disk (e.g., 1TB). The final list of service predictions generates a file that is larger than half a terabyte in size.
GPS uses a config.ini configuration file, in which users must specify:
- a BigQuery account
- an existing BigQuery dataset in which GPS can store tables
- the name of the table to which the seed scan was uploaded (see below)
- a local directory where GPS can store predictions
- other GPS parameters (e.g., minimum hitrate)
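A hypothetical config.ini might look as follows. Seed_Table and Pre_Filt_Seed appear elsewhere in this README; every other key name (and the section header) is an illustrative assumption, so check the repository's sample config for the exact keys.

```ini
[GPS]
; BigQuery account/project and dataset (assumed key names)
BQ_Project = my-gcp-project
BQ_Dataset = gps_dataset
; Table holding the uploaded seed scan
Seed_Table = lzr_seed_april2021_filt
; Set to False when the seed is a raw, unfiltered LZR scan
Pre_Filt_Seed = True
; Local directory for downloaded predictions (assumed key name)
Output_Dir = /data/gps_predictions
; Other parameters, e.g., minimum hitrate (assumed key name)
Min_Hitrate = 0.01
```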
GPS relies on an initial seed scan---a sub-sampled IPv4 scan across all 65K ports---from which it learns patterns. A sample seed scan (a 1% IPv4 LZR scan across all 65K ports, collected in April 2021) can be found here. The seed scan has been filtered to keep only real services (i.e., services that send back real data) and hosts that respond on 10 or fewer ports (i.e., removing pseudo services). Please see the LZR paper and the GPS paper for more details on this methodology.
The sample seed scan should just be used for testing purposes. Using this data means that GPS will predict services given the state of the Internet from April 2021. To make up-to-date predictions, please use an up-to-date seed scan.
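The host-level filtering described above (dropping hosts that respond on more than 10 ports, which are likely pseudo services) can be sketched as follows. This is an illustrative re-implementation for intuition only, not the code GPS actually uses:

```python
from collections import Counter

def drop_pseudo_service_hosts(records, max_ports=10):
    """Remove every record belonging to a host that responds on more
    than `max_ports` ports (likely pseudo services, per the LZR paper).
    Each record is assumed to be a dict with at least "ip" and "p"."""
    ports_per_ip = Counter()
    for r in records:
        ports_per_ip[r["ip"]] += 1
    return [r for r in records if ports_per_ip[r["ip"]] <= max_ports]

# Toy example: one host answering on 20 ports (dropped), one on 1 port (kept).
seed = [{"ip": "203.0.113.1", "p": p} for p in range(20)]
seed += [{"ip": "203.0.113.2", "p": 443}]
print(len(drop_pseudo_service_hosts(seed)))  # -> 1
```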
To use the sample seed scan, upload it to BigQuery and update the seed table name in config.ini (i.e., Seed_Table = lzr_seed_april2021_filt).
The following bq command-line invocation uploads the seed scan, lzr_seed_april2021_filt.json, to BigQuery:
bq load --source_format NEWLINE_DELIMITED_JSON --autodetect \
BQ_RESOURCE_PROJECT.BQ_DATASET.SEED_TABLE lzr_seed_april2021_filt.json
The GPS code base currently supports two formats of Internet scans to be used as the seed:
- The raw output of an LZR scan. GPS re-formats it when Pre_Filt_Seed=False is set in config.ini.
- A scan with the following schema:
ip (string), p (port number - integer), asn (integer), data (string), fingerprint (protocol - string), w (TCP window size - integer)
Fields can be added or removed, as long as src/data_features.py is updated accordingly. At minimum, the GPS algorithm expects an IP address, a port number, and some form of layer-7 data for each service in order to compute predictions.
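As an illustration, a single newline-delimited JSON record matching this schema might look like the following; the field values here are made up (documentation IP and ASN ranges):

```python
import json

# One seed-scan service record following the schema above.
record = {
    "ip": "203.0.113.7",                # IP address (string)
    "p": 8443,                          # port number (integer)
    "asn": 64496,                       # autonomous system number (integer)
    "data": "HTTP/1.1 200 OK\r\n",      # layer-7 response data (string)
    "fingerprint": "http",              # protocol label (string)
    "w": 65535,                         # TCP window size (integer)
}

# bq load --source_format NEWLINE_DELIMITED_JSON expects one such
# JSON object per line of the input file:
print(json.dumps(record))
```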
Once config.ini is properly initialized and a valid seed scan has been uploaded to BigQuery, GPS is ready to predict services.
GPS prediction works in two phases:
- GPS predicts at least one service across all IPv4 hosts. To run GPS' first phase, simply run the following:
python gps.py first
GPS outputs, and downloads locally, a short list of sub-networks and ports for the user to scan.
- GPS predicts any remaining services on every host it has discovered in the first phase. To run GPS' second phase, simply run the following:
python gps.py remaining
GPS saves a large list of individual services for the user to scan. During runtime, GPS provides instructions for how best to download that large list.
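Between the two phases, the user feeds GPS' phase-one output to a scanner. A sketch of that glue step is below; note the file layout (subnet,port CSV rows) is an assumption for illustration, not GPS' documented output format:

```python
import csv
import io
from collections import defaultdict

def group_targets_by_port(csv_text):
    """Group predicted sub-networks by port, so each port can be handed
    to a scanner (e.g., ZMap + LZR) in a single invocation."""
    targets = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        subnet, port = row[0], int(row[1])
        targets[port].append(subnet)
    return dict(targets)

# Hypothetical phase-one output: subnet,port pairs.
sample = "192.0.2.0/24,443\n192.0.2.0/24,80\n198.51.100.0/24,443\n"
print(group_targets_by_port(sample))
```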
Once GPS is done running, remember to delete any BigQuery tables you no longer need. GPS does not automatically clean up BigQuery tables, in case the user wants to use or explore the intermediate tables.
When adding functionality to GPS, the user may run into the following BigQuery errors:
400 Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.
Why it happened: This message means that the query has become too long or too deeply nested for BigQuery to process. This can happen if you have added more features, or added more code that references the defined sub-tables.
Solution: Reduce the number of queries that are defined as sub-tables, or split the query in two and run the parts separately (saving to a destination table in the process). This will require some hacking on the GPS source code.
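A minimal sketch of the split-query workaround is shown below. The table and column names (my_project.gps.features, tmp_step1, ip, p) are placeholder assumptions, not part of the GPS schema:

```python
# Instead of one deeply nested query ...
nested = """
SELECT ip, COUNT(*) AS n_ports
FROM (SELECT ip, p FROM `my_project.gps.features` WHERE p < 1024)
GROUP BY ip
"""

# ... materialize the inner sub-query as its own table first,
# then run the outer query against that table.
step1 = "SELECT ip, p FROM `my_project.gps.features` WHERE p < 1024"
step2 = "SELECT ip, COUNT(*) AS n_ports FROM `my_project.gps.tmp_step1` GROUP BY ip"

# Each step can then be run with, e.g.:
#   bq query --use_legacy_sql=false \
#       --destination_table my_project:gps.tmp_step1 "<step1 SQL>"
for sql in (step1, step2):
    print(sql)
```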
Copyright 2022 The Board of Trustees of The Leland Stanford Junior University
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.