Whenever a user visits a website, the visit leaves a record. The final goal of this project is to predict which user visited a set of domains.
For more details, refer to the paper "Users' Fingerprinting Techniques from TCP Traffic" from ACM SIGCOMM 2017, Workshop 3 (Big-DAMA).
The dataset is the one provided with the paper. link
I converted it to my own format.
- I7 6700
- Ubuntu 16.04
- Python 2.7
- Sklearn
- Pandas
- Numpy
The domain type is optional; the default is 'all' (all domains).
$ python cosine-similarity.py --time=[time series type] --domain=[domain type]
timeTypes = [1, 6, 24]
domainTypes = ['all', 'core', 'support']
The domain type is optional; the default is 'all' (all domains).
$ python modelPredict.py --model=[model type] --time=[time series type] --domain=[domain type]
modelTypes = ['knn', 'navieBayes', 'randomForest', 'neuralNetwork']
timeTypes = [1, 6, 24]
domainTypes = ['all', 'core', 'support']
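As a rough illustration of how these options could be parsed, here is a minimal sketch using argparse. The function name build_parser and the choice to make --model and --time required are assumptions; the actual scripts may handle their arguments differently.

```python
import argparse

MODEL_TYPES = ['knn', 'navieBayes', 'randomForest', 'neuralNetwork']
TIME_TYPES = [1, 6, 24]
DOMAIN_TYPES = ['all', 'core', 'support']


def build_parser():
    # Hypothetical parser, not taken from the real scripts.
    parser = argparse.ArgumentParser(description='Predict users from visited domains.')
    parser.add_argument('--model', choices=MODEL_TYPES, required=True)
    parser.add_argument('--time', type=int, choices=TIME_TYPES, required=True)
    # The domain type is optional and falls back to 'all'.
    parser.add_argument('--domain', choices=DOMAIN_TYPES, default='all')
    return parser


# Equivalent to: python modelPredict.py --model=knn --time=6
args = build_parser().parse_args(['--model=knn', '--time=6'])
```

Leaving out --domain gives args.domain == 'all', which matches the default described above.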
I use several prediction models from sklearn, as well as cosine similarity, to make the predictions. Simply put, we want to use the visited domains to predict which user, identified by IP address, made the visits.
$ python cosine-similarity.py
Step 1. Read the training and testing data files.
Step 2. Load the data into pandas, laid out like this:
allDataset:
+-------------------+
| training data |
+-------------------+
| testing data |
+-------------------+
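A minimal sketch of this stacking step, assuming both parts are already loaded as pandas DataFrames. The column names ip and domain and the sample rows are made up for illustration. One likely benefit of stacking is that the full list of domains can be taken from both parts at once.

```python
import pandas as pd

# Made-up stand-ins for the training and testing files.
training = pd.DataFrame({'ip': ['10.0.0.1', '10.0.0.2'],
                         'domain': ['a.com', 'b.com']})
testing = pd.DataFrame({'ip': ['10.0.0.1'],
                        'domain': ['c.com']})

# Stack training on top of testing; the keys keep the two parts separable.
allDataset = pd.concat([training, testing], keys=['training', 'testing'])

# Every domain seen in either part, usable as a shared column list later.
all_domains = sorted(allDataset['domain'].unique())
```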
Step 3. Transform it into training data and testing data tables like this:
training / testing data:
+---------------------------------------------------------------------------------+
| | visit domain 1 times | visit domain 2 times | ... | visit domain N times |
|---------------------------------------------------------------------------------|
| IP 1 | | | ... | |
|---------------------------------------------------------------------------------|
| IP 2 | | | ... | |
|---------------------------------------------------------------------------------|
| . |
| . |
| . |
|---------------------------------------------------------------------------------|
| IP N | | | ... | |
+---------------------------------------------------------------------------------+
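One way to build such a table is pandas.crosstab, which counts how often each (IP, domain) pair occurs. This is only a sketch: the column names and the visit log are invented, and the original script may build the table differently.

```python
import pandas as pd

# Made-up visit log: one row per visit.
visits = pd.DataFrame({
    'ip': ['10.0.0.1', '10.0.0.1', '10.0.0.2', '10.0.0.1', '10.0.0.2'],
    'domain': ['a.com', 'b.com', 'a.com', 'a.com', 'c.com'],
})

# Rows are IPs, columns are domains, cells are visit counts.
visit_counts = pd.crosstab(visits['ip'], visits['domain'])
```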
Step 4. Get each domain's flag, which marks it as a core domain or a support domain.
domainFlag: (dtype: pandas dataFrame)
+-----------------------------+
| | is core domain |
|-----------------------------|
| domain 1 | |
|-----------------------------|
| domain 2 | |
|-----------------------------|
| . |
| . |
| . |
|-----------------------------|
| domain N | |
+-----------------------------+
Step 5. Convert it to a numpy array, because multiplication on numpy arrays is very fast.
domainWeight: (dtype: numpy array)
+-------------------------------------------------------+
| | domain 1 | domain 2 | ... | domain N |
|-------------------------------------------------------|
| is core domain | | | ... | |
+-------------------------------------------------------+
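The sketch below shows how a flag table could be turned into a weight vector and multiplied with a count matrix. It assumes the flag is stored as 1 for core and 0 for support and that the weight is used as a mask over the domain columns. The column name is_core and the numbers are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical flag table: 1 means core domain, 0 means support domain.
domain_flag = pd.DataFrame({'is_core': [1, 0, 1]},
                           index=['a.com', 'b.com', 'c.com'])

# Visit counts, one row per IP and one column per domain (same order as above).
counts = np.array([[2, 1, 0],
                   [1, 0, 1]])

core_weight = domain_flag['is_core'].values.astype(float)
support_weight = 1.0 - core_weight

# Broadcasting multiplies every row by the weight vector in one step.
core_counts = counts * core_weight
support_counts = counts * support_weight
```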
Step 6. Calculate the accuracy using cosine similarity. The candidate with the largest similarity value is chosen as the prediction, which is then checked against the correct answer.
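A minimal sketch of this step, assuming each testing row is compared against every training row and takes the label of the most similar one. The arrays and labels are invented for illustration. cosine_similarity comes from sklearn.metrics.pairwise.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up visit-count rows (one row per IP, one column per domain).
train_counts = np.array([[5, 0, 1],
                         [0, 4, 2]])
train_labels = np.array(['user_a', 'user_b'])
test_counts = np.array([[4, 1, 0],
                        [0, 2, 3]])
test_labels = np.array(['user_a', 'user_b'])

# similarity[i, j] is the cosine similarity of test row i and training row j.
similarity = cosine_similarity(test_counts, train_counts)

# For each test row, take the label of the most similar training row.
predicted = train_labels[similarity.argmax(axis=1)]
accuracy = (predicted == test_labels).mean()
```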
$ python modelPredict.py --model=[model type] --time=[time series type] --domain=[domain type]
modelTypes = ['knn', 'navieBayes', 'randomForest', 'neuralNetwork']
timeTypes = [1, 6, 24]
domainTypes = ['all', 'core', 'support']
Step 1. Read the data from a CSV file. The data format is:
dataset:
+------------------------------------------------------+
| Index | IP address | Domain | flag | date | IP label |
|------------------------------------------------------|
| | | | | | |
+------------------------------------------------------+
Step 2. Set training / testing data X and Y.
X:
+--------------------+
| Index | Domain |
|--------------------|
| | |
+--------------------+
Y:
+--------------------+
| Index | IP label |
|--------------------|
| | |
+--------------------+
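The split into X and Y could look like the sketch below. The column names and rows are made up. One-hot encoding the domain with pandas.get_dummies is an assumption, added because sklearn models need numeric features; the original script may encode domains differently.

```python
import pandas as pd

# Made-up rows in the dataset layout shown above.
dataset = pd.DataFrame({
    'ip': ['10.0.0.1', '10.0.0.1', '10.0.0.2', '10.0.0.2'],
    'domain': ['a.com', 'b.com', 'a.com', 'c.com'],
    'flag': [1, 0, 1, 0],
    'date': ['2017-01-01'] * 4,
    'ip_label': [0, 0, 1, 1],
})

# X: the domain feature, one-hot encoded (assumed encoding).
X = pd.get_dummies(dataset['domain'])
# Y: the label of the IP that made the visit.
Y = dataset['ip_label']
```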
Step 3. Fit the models on the training data.
models.fit(trainingX, trainingY)
Step 4. Use the models to predict the testing data.
testingPredicingY = models.predict(testingX)
Step 5. Calculate the accuracy with the metrics module.
accuracy = metrics.accuracy_score(testingY, testingPredicingY)
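Steps 3 to 5 can be put together as in the sketch below, shown for the knn model only. The tiny arrays are made up, and n_neighbors=1 is an arbitrary choice for the example, not the setting used in this project.

```python
import numpy as np
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

# Made-up one-hot domain features and IP labels.
trainingX = np.array([[1, 0, 0],
                      [1, 0, 0],
                      [0, 1, 0],
                      [0, 0, 1]])
trainingY = np.array([0, 0, 1, 1])
testingX = np.array([[1, 0, 0],
                     [0, 1, 0]])
testingY = np.array([0, 1])

# Fit on the training data, predict the testing data, then score.
model = KNeighborsClassifier(n_neighbors=1)
model.fit(trainingX, trainingY)
testingPredictedY = model.predict(testingX)
accuracy = metrics.accuracy_score(testingY, testingPredictedY)
```

The other model types would follow the same fit, predict, score pattern with a different sklearn estimator.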