When a user visits websites, the visits always leave a record. The final goal of this project is to predict which user visited a given set of domains.

For more details, see the paper this project follows: Users' Fingerprinting Techniques from TCP Traffic (ACM SIGCOMM 2017, Workshop 3 Big-DAMA).


The dataset is the one provided with the paper (link).

I converted it to my own format.


  • I7 6700
  • Ubuntu 16.04
  • Python 2.7
  • Sklearn
  • Pandas
  • Numpy


1. Cosine similarity

The domain type is optional; the default is 'all'.

$ python --time=[time series type] --domain=[domain type]
timeTypes = [1, 6, 24]
domainTypes = ['all', 'core', 'support']

2. Predicting Models

The domain type is optional; the default is 'all'.

$ python --model=[model type] --time=[time series type] --domain=[domain type]
modelTypes = ['knn', 'navieBayes', 'randomForest', 'neuralNetwork']
timeTypes = [1, 6, 24]
domainTypes = ['all', 'core', 'support']
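
The options above could be parsed along the following lines. This is only a sketch: the flag names and allowed values come from the usage lines, but the use of argparse, the helper name build_parser, and the choice to make --model and --time required are assumptions, not the project's actual code.

```python
import argparse


def build_parser():
    """Hypothetical parser for the --model, --time and --domain options."""
    parser = argparse.ArgumentParser(description="Predict users from visited domains.")
    parser.add_argument(
        "--model",
        choices=["knn", "navieBayes", "randomForest", "neuralNetwork"],
        required=True,  # assumption: the usage text does not say whether this is required
    )
    parser.add_argument("--time", type=int, choices=[1, 6, 24], required=True)
    # The domain type is optional and defaults to 'all', as stated above.
    parser.add_argument("--domain", choices=["all", "core", "support"], default="all")
    return parser


args = build_parser().parse_args(["--model=knn", "--time=6"])
print(args.model, args.time, args.domain)
```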


I use cosine similarity and several prediction models from sklearn to make the predictions. Put simply, we want to use the visited domains to predict which user, identified by IP address, made the visits.

1. Cosine similarity

$ python

Step 1. First, read the training and testing data files.

Step 2. Load the data into pandas like this:

    |	training data 	|
    |	testing  data 	|

Step 3. Transform it into training and testing matrices like this:

	training / testing data:
    |      | visits to domain 1   | visits to domain 2   | ... | visits to domain N   |
    | IP 1 |                      |                      | ... |                      |
    | IP 2 |                      |                      | ... |                      |
    |                                        .                                        |	
    |                                        .                                        |
    |                                        .                                        |
    | IP N |                      |                      | ... |                      |
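
A minimal sketch of how such a matrix could be built with pandas, assuming the raw data has one row per visit. The column names ip and domain, the sample rows, and the helper to_count_matrix are hypothetical.

```python
import pandas as pd

# Hypothetical visit log: one row per (IP, domain) visit.
visits = pd.DataFrame({
    "ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.2"],
    "domain": ["a.com", "b.com", "a.com", "a.com", "c.com"],
})


def to_count_matrix(log, domains=None):
    """Rows are IPs, columns are domains, cells are visit counts."""
    counts = pd.crosstab(log["ip"], log["domain"])
    if domains is not None:
        # Align columns so training and testing matrices share the same domains.
        counts = counts.reindex(columns=domains, fill_value=0)
    return counts


counts = to_count_matrix(visits)
print(counts)
```

Passing the training matrix's columns as domains when building the testing matrix keeps the two matrices comparable column by column.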

Step 4. Get each domain's flag, which marks it as a core domain or a support domain.

	domainFlag: (dtype: pandas dataFrame)
    |          |  is core domain  |
    | domain 1 |                  |
    | domain 2 |                  |
    |             .               |
    |             .               |
    |             .               |
    | domain N |                  |

Step 5. Convert it to a numpy array, because multiplying numpy arrays is very fast.

	domainWeight: (dtype: numpy array)
    |                | domain 1 | domain 2 | ... | domain N |
    | is core domain |          |          | ... |          |
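
One way the flag table could become a weight vector is sketched below. It assumes the weight is a 0/1 mask that keeps only core or only support domains; the column name is_core, the sample values, and the helper domain_weight are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical domainFlag table: index is the domain, one boolean column.
domain_flag = pd.DataFrame(
    {"is_core": [True, False, True]},
    index=["a.com", "b.com", "c.com"],
)


def domain_weight(flag, domain_type="all"):
    """Return a 0/1 weight per domain for 'all', 'core' or 'support'."""
    is_core = flag["is_core"].to_numpy(dtype=bool)
    if domain_type == "core":
        return is_core.astype(float)
    if domain_type == "support":
        return (~is_core).astype(float)
    return np.ones(len(is_core))


# Visit counts for 2 IPs over the 3 domains above.
counts = np.array([[2, 1, 0], [1, 0, 1]], dtype=float)

# Broadcasting multiplies every row by the weight vector in one step.
core_counts = counts * domain_weight(domain_flag, "core")
print(core_counts)
```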

Step 6. Compute the accuracy using cosine similarity: choose the candidate with the largest similarity value as the prediction, then check whether it is correct.
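
The cosine similarity step can be sketched with numpy as follows. It assumes each testing row is compared against every training row and that row i in both matrices belongs to the same user; the sample matrices and the helper cosine_similarity_matrix are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_norm = np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = np.linalg.norm(b, axis=1, keepdims=True)
    # Replace zero norms with 1 so rows without visits do not divide by zero.
    a_unit = a / np.where(a_norm == 0, 1.0, a_norm)
    b_unit = b / np.where(b_norm == 0, 1.0, b_norm)
    return a_unit.dot(b_unit.T)


# Toy visit-count matrices: rows are IPs, columns are domains.
train = np.array([[5, 0, 1], [0, 4, 2]], dtype=float)
test = np.array([[4, 1, 0], [1, 3, 3]], dtype=float)

similarity = cosine_similarity_matrix(test, train)
predicted = similarity.argmax(axis=1)  # most similar training row per testing row
expected = np.array([0, 1])            # assumed true training row for each testing row
accuracy = (predicted == expected).mean()
print(accuracy)
```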

2. Predicting Models

2.1 K Nearest Neighbor
2.2 Naive Bayes
2.3 Random forest
2.4 MLPClassifier (Neural Network)
$ python --model=[model type] --time=[time series type] --domain=[domain type]
modelTypes = ['knn', 'navieBayes', 'randomForest', 'neuralNetwork']
timeTypes = [1, 6, 24]
domainTypes = ['all', 'core', 'support']

Step 1. Read the data from a CSV file. The data format is:

    | Index | IP address | Domain | flag | date | IP label |
    |       |            |        |      |      |          |

Step 2. Set training / testing data X and Y.

    |  Index  |  Domain  |
    |         |          |

    |  Index  | IP label |
    |         |          |
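
Steps 1 and 2 might look like the sketch below. The column names, the inline CSV rows, and the one-hot encoding of the domain via get_dummies are assumptions for illustration; the source does not say how the domain is encoded.

```python
import io

import pandas as pd

# Hypothetical CSV content following the column layout described above.
csv_text = u"""index,ip,domain,flag,date,ip_label
0,10.0.0.1,a.com,1,2017-01-01,0
1,10.0.0.2,b.com,0,2017-01-01,1
2,10.0.0.1,c.com,1,2017-01-02,0
"""

data = pd.read_csv(io.StringIO(csv_text), index_col="index")

# X: the domain, one-hot encoded so the models receive numeric features.
X = pd.get_dummies(data["domain"])
# Y: the IP label to predict.
Y = data["ip_label"]
print(X.shape, list(Y))
```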

Step 3. Fit the models on the training data.

models.fit(trainingX, trainingY)

Step 4. Predict the labels of the testing data with the models.

testingPredicingY = models.predict(testingX)

Step 5. Compute the accuracy with sklearn's metrics module.

accuracy = metrics.accuracy_score(testingY, testingPredicingY)
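
Steps 3 to 5 can be put together as in the sketch below. The mapping from model type to sklearn estimator, the hyperparameters, the helpers build_model and evaluate, and the toy one-hot data are all assumptions; only the fit, predict and accuracy_score calls come from the steps above.

```python
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier


def build_model(model_type):
    """Map a --model value to an sklearn estimator (hyperparameters are assumptions)."""
    if model_type == "knn":
        return KNeighborsClassifier(n_neighbors=1)
    if model_type == "navieBayes":
        return MultinomialNB()
    if model_type == "randomForest":
        return RandomForestClassifier(random_state=0)
    if model_type == "neuralNetwork":
        return MLPClassifier(random_state=0)
    raise ValueError("unknown model type: %s" % model_type)


def evaluate(model_type, trainingX, trainingY, testingX, testingY):
    """Fit on the training data, predict the testing data, return the accuracy."""
    models = build_model(model_type)
    models.fit(trainingX, trainingY)
    testingPredicingY = models.predict(testingX)
    return metrics.accuracy_score(testingY, testingPredicingY)


# Toy one-hot domain features (columns: domain a, b, c) and IP labels.
trainingX = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]])
trainingY = np.array([0, 0, 1, 1, 2, 2])
testingX = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
testingY = np.array([0, 1, 2])

accuracy = evaluate("knn", trainingX, trainingY, testingX, testingY)
print(accuracy)
```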