User-fingerprinting-and-profiling

Description

Whenever a user visits a website, the visit leaves a record. The final goal of this project is to predict which user made the visits, based on the domains visited.

For more details, see the reference paper from ACM SIGCOMM 2017, Workshop 3 Big-DAMA: Users' Fingerprinting Techniques from TCP Traffic.

Dataset

The dataset is the one provided with the paper. link

I converted it to my own format.

Environment

Intel Core i7-6700
Ubuntu 16.04
Python 2.7
scikit-learn
pandas
NumPy

Usage

1. Cosine similarity

The domain type is optional; the default is all domains.

$ python cosine-similarity.py --time=[time series type] --domain=[domain type]

timeTypes = [1, 6, 24]
domainTypes = ['all', 'core', 'support']

2. Predicting Models

The domain type is optional; the default is all domains.

$ python modelPredict.py --model=[model type] --time=[time series type] --domain=[domain type]

modelTypes = ['knn', 'navieBayes', 'randomForest', 'neuralNetwork']
timeTypes = [1, 6, 24]
domainTypes = ['all', 'core', 'support']
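
The flags above could be parsed along these lines. This is only a minimal sketch, assuming argparse: the `build_parser` helper is hypothetical, and treating `--model` and `--time` as required is an assumption drawn from the note that only the domain type is optional.

```python
import argparse

MODEL_TYPES = ['knn', 'navieBayes', 'randomForest', 'neuralNetwork']
TIME_TYPES = [1, 6, 24]
DOMAIN_TYPES = ['all', 'core', 'support']


def build_parser():
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(description='User fingerprinting')
    parser.add_argument('--model', choices=MODEL_TYPES, required=True)
    parser.add_argument('--time', type=int, choices=TIME_TYPES, required=True)
    # --domain is optional and falls back to all domains.
    parser.add_argument('--domain', choices=DOMAIN_TYPES, default='all')
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(['--model=knn', '--time=6'])
print(args.model, args.time, args.domain)
```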

Methods

I use cosine similarity and several prediction models from sklearn. In short, we want to use the domains visited to predict which user, identified by IP address, made the visits.

1. Cosine similarity

$ python cosine-similarity.py

Step 1. Read the training and testing data files.

Step 2. Load the data into a pandas DataFrame laid out like this:

    allDataset:
    +-------------------+
    |   training data   |
    +-------------------+
    |   testing data    |
    +-------------------+

Step 3. Transform it into training data and testing data, each shaped like this:

    training / testing data:
    +---------------------------------------------------------------------------------+
    |      | visit domain 1 times | visit domain 2 times | ... | visit domain N times |
    |---------------------------------------------------------------------------------|
    | IP 1 |                      |                      | ... |                      |
    |---------------------------------------------------------------------------------|
    | IP 2 |                      |                      | ... |                      |
    |---------------------------------------------------------------------------------|
    |                                        .                                        |	
    |                                        .                                        |
    |                                        .                                        |
    |---------------------------------------------------------------------------------|
    | IP N |                      |                      | ... |                      |
    +---------------------------------------------------------------------------------+
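
A table of this shape can be built with a cross-tabulation. The following is a sketch on toy records, not the script's actual code; the column names `ip` and `domain` are assumptions made for illustration.

```python
import pandas as pd

# Toy records; the column names and values are made up for illustration.
records = pd.DataFrame({
    'ip': ['10.0.0.1', '10.0.0.1', '10.0.0.2', '10.0.0.1'],
    'domain': ['a.com', 'b.com', 'a.com', 'a.com'],
})

# Rows are IPs, columns are domains, cells are visit counts.
visit_counts = pd.crosstab(records['ip'], records['domain'])
print(visit_counts)
```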

Step 4. Get each domain's flag, which marks it as a core domain or a support domain.

    domainFlag: (dtype: pandas DataFrame)
    +-----------------------------+
    |          |  is core domain  |
    |-----------------------------|
    | domain 1 |                  |
    |-----------------------------|
    | domain 2 |                  |
    |-----------------------------|
    |             .               |
    |             .               |
    |             .               |
    |-----------------------------|
    | domain N |                  |
    +-----------------------------+
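
One way to derive such a table from per-visit records is sketched below. It assumes each record carries a `Domain` column and a `flag` column where 1 means core and 0 means support; that encoding is a guess, not something the scripts confirm.

```python
import pandas as pd

# Toy per-visit records; the flag encoding (1 = core, 0 = support) is assumed.
records = pd.DataFrame({
    'Domain': ['a.com', 'b.com', 'a.com', 'c.com'],
    'flag': [1, 0, 1, 0],
})

# Keep one row per domain and index the table by domain name.
domainFlag = (records.drop_duplicates('Domain')
                     .set_index('Domain')[['flag']]
                     .rename(columns={'flag': 'is core domain'}))
print(domainFlag)
```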

Step 5. Convert it to a numpy array, because multiplication on numpy arrays is very fast.

    domainWeight: (dtype: numpy array)
    +-------------------------------------------------------+
    |                | domain 1 | domain 2 | ... | domain N |
    |-------------------------------------------------------|
    | is core domain |          |          | ... |          |
    +-------------------------------------------------------+
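
To illustrate why an array is convenient here, the sketch below applies a weight vector to a toy visit-count matrix through broadcasting. The 0/1 encoding of the weights and the variable names are assumptions.

```python
import numpy as np

# Toy visit counts: 2 IPs x 3 domains (values are made up).
visit_counts = np.array([[3, 0, 2],
                         [1, 4, 0]])

# 1 marks a core domain, 0 a support domain (assumed encoding).
is_core = np.array([1, 0, 1])

# Broadcasting multiplies every row by the weight vector in one step.
core_only = visit_counts * is_core
support_only = visit_counts * (1 - is_core)
```

With this layout, selecting the 'core' or 'support' domain type amounts to a single element-wise multiplication.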

Step 6. Compute the cosine similarity between each testing row and every training row. Choose the training row with the largest similarity as the prediction, then check it against the true label to calculate the accuracy.
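
The matching step can be sketched with plain numpy as follows. The data is a toy example and `cosine_similarity_matrix` is a hypothetical helper, so treat this as an illustration of the idea rather than the script's implementation.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_norm = np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = np.linalg.norm(b, axis=1, keepdims=True)
    # Avoid dividing by zero for rows with no visits at all.
    a_norm[a_norm == 0] = 1.0
    b_norm[b_norm == 0] = 1.0
    return (a / a_norm).dot((b / b_norm).T)


train = np.array([[5, 0, 1], [0, 3, 3]])   # one row per known user
test = np.array([[4, 1, 0], [0, 2, 5]])    # rows to be identified
train_labels = np.array([0, 1])
test_labels = np.array([0, 1])

# For each test row, pick the training row with the largest similarity.
similarity = cosine_similarity_matrix(test, train)
predicted = train_labels[similarity.argmax(axis=1)]
accuracy = (predicted == test_labels).mean()
```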

2. Predicting Models

2.1 K Nearest Neighbor

2.2 Naive Bayes

2.3 Random forest

2.4 MLPClassifier (Neural Network)

$ python modelPredict.py --model=[model type] --time=[time series type] --domain=[domain type]

modelTypes = ['knn', 'navieBayes', 'randomForest', 'neuralNetwork']
timeTypes = [1, 6, 24]
domainTypes = ['all', 'core', 'support']

Step 1. Read the data from the CSV file. The data format is:

    dataset:
    +------------------------------------------------------+
    | Index | IP address | Domain | flag | date | IP label |
    |------------------------------------------------------|
    |       |            |        |      |      |          |
    +------------------------------------------------------+

Step 2. Build X (features) and Y (labels) for both the training and testing data.

    X:
    +--------------------+
    |  Index  |  Domain  |
    |--------------------|
    |         |          |
    +--------------------+

    Y:
    +--------------------+
    |  Index  | IP label |
    |--------------------|
    |         |          |
    +--------------------+
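
The split into X and Y might look like the sketch below. How the scripts encode the Domain column is not stated; mapping each domain to an integer code with `pd.factorize` is an assumption made so that sklearn can consume it, and all values are toy data.

```python
import pandas as pd

# Toy rows following the documented columns; all values are made up.
dataset = pd.DataFrame({
    'IP address': ['10.0.0.1', '10.0.0.2', '10.0.0.1'],
    'Domain': ['a.com', 'b.com', 'c.com'],
    'flag': [1, 0, 1],
    'date': ['2017-01-01', '2017-01-01', '2017-01-02'],
    'IP label': [0, 1, 0],
})

# Assumption: domains are mapped to integer codes for the models.
domain_codes, domain_names = pd.factorize(dataset['Domain'])

X = pd.DataFrame({'Domain': domain_codes}, index=dataset.index)
Y = dataset['IP label']
```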

Step 3. Fit the model on the training data.

models.fit(trainingX, trainingY)

Step 4. Predict the labels of the testing data with the model.

testingPredicingY = models.predict(testingX)

Step 5. Calculate the accuracy with sklearn's metrics module.

accuracy = metrics.accuracy_score(testingY, testingPredicingY)
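
Steps 3 to 5 can be put together as in the sketch below. The `MODELS` mapping from the `--model` values to sklearn estimators is hypothetical; in particular, the choice of `MultinomialNB` and the hyperparameters are assumptions, and the data is a toy example.

```python
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Hypothetical mapping from the --model values to sklearn estimators.
MODELS = {
    'knn': lambda: KNeighborsClassifier(n_neighbors=1),
    'navieBayes': lambda: MultinomialNB(),
    'randomForest': lambda: RandomForestClassifier(n_estimators=10,
                                                   random_state=0),
    'neuralNetwork': lambda: MLPClassifier(max_iter=500, random_state=0),
}

# Toy feature rows and labels (values are made up).
trainingX = [[5, 0, 1], [4, 1, 0], [0, 3, 3], [0, 2, 5]]
trainingY = [0, 0, 1, 1]
testingX = [[6, 0, 0], [0, 4, 4]]
testingY = [0, 1]

model = MODELS['knn']()                      # build the chosen estimator
model.fit(trainingX, trainingY)              # Step 3
predictedY = model.predict(testingX)         # Step 4
accuracy = metrics.accuracy_score(testingY, predictedY)  # Step 5
```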
