Skip to content

Use for bayesian integration studies. To optimize integration with use of continuous data measurements of each dataset needs to be grouped in bins based on their value compared to the positive predictive geneset and/or the negative predictive geneset

Notifications You must be signed in to change notification settings

JordyCoolen/Binning

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 

Repository files navigation

Binning

Use for bayesian integration studies. To optimize integration with use of continuous data measurements of each dataset needs to be grouped in bins based on their value compared to the positive predictive geneset and/or the negative predictive geneset.

To have a optimized bayesian integration. This script automates the binning by optimizing for a high likelihood score in a automated and systematic way.

Optimalization options

  1. It is possible to optimize on highest likelihood overall bins (--score 0)
  2. Optimize for a total likelihood centered around zero but by having a optimized highest likelihood score per bin (--score 1)

Currently, both methodes only apply a global optimalization, which is in most cases the preferred method of optimalization.

The bin borders can be globally optimized for 2,3,4 or 5 bins. For optimal results we prefer using (--bins 0) option to let the binning tool calculate the most optimal number of bins and borders (--bins 0).

The binning process uses a negative and postive set of which all identifiers should be present in the dataset. If there are missing values present or there is overlap in identifiers between the negative and positive set the program will give a error indicating what the problem is.

Algorithm

                  reads data file
                        |
                extract genes and values
                        |
                checks double genes
                        |
        get genes from positive and negative
                        |
         check if genes in positive and negative
                are present in input data file
                        |
        check if there is no overlap between
             positive and negative list
                        |
      sort values while maintaining gene order
                        |
            extract sorted genes to list
                        |
            extract sorted values to list
                        |
              get MIN and MAX values
                        |
             set MIN and MAX borders
                        |
  sweeps all combinations of each bin and remembers the score
                        |
  highest scoring option will be given and plotted

Install

download or pull this repository then:

conda create -n <environmentname> python=2.7
conda activate <environmentname>
conda install -c conda-forge matplotlib

Usage

python Binning.py -d <datafile> -n <negativelist> -p <positivelist> -o <outputfilename> [--bins 0/2/3/4] [--type local/global] [--score 0/1]

Parameters

--bins    0: calculates the optimum number of bins with a max of 5 bins, with a high likelihood score (preferred method)
--bins    2/3/4/5 force the code to optimize for this many bins
--score   0: |log2((P/totalP)/(N/totalN))| for optimal likelikhood
--score.  1: log2((P/totalP)/(N/totalN)) for closest to 0 optimalization

P = number of positives in bin
totalP = number of total postives in positivelist
N = number of negatives in bin
totalN = number of total negatives in negativelist

Example results

--score 0 (optimal likelihood method)

Number of Gene/identifiers for the positive and negative set

totalP 53

totalN 294

Result: overall score = 8.13212179365 (this is optimal likelihood method so high overall score)

0: (0, 11, 2.734786296106959, 1.0, 101.0)

1: (11, 18, 5.056714390994322, 101.0, 162.0)

2: (18, 347, 0.3406211065510634, 162.0, 5572.0)

Number of positive and negatives in optimized bins

0 pos 6 0 neg 5 1 pos 6 1 neg 1 2 pos 41 2 neg 288

plot with the results:

alt tag

--score 1 (closest to zero --> 0 <--)

Number of Gene/identifiers for the positive and negative set

totalP 53

totalN 294

Result: overall score = 3.44931450115 (this is result of closest to zero --> 0 <--)

0: (0, 104, 1.1009141949048564, 1.0, 1243.0)

1: (104, 173, 0.6237549837182151, 1243.0, 2441.0)

2: (173, 347, -1.7246453225303384, 2441.0, 5572.0)

Number of positive and negatives in optimized bins

0 pos 29 0 neg 75 1 pos 15 1 neg 54 2 pos 9 2 neg 165

plot with the results:

alt tag

Known problems

--type local

Is currently not working.

Version

27 October 2020

Authors

J.P.M. Coolen and D.R. Garza

License

MIT

About

Use for bayesian integration studies. To optimize integration with use of continuous data measurements of each dataset needs to be grouped in bins based on their value compared to the positive predictive geneset and/or the negative predictive geneset

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages