Merge pull request #2 from BAMeScience/dev

Dev
BAMeScience · Feb 28, 2023 · 40449f4 · 40449f4
2 parents 79be708 + 1256dd9
commit 40449f4
Show file tree

Hide file tree

Showing 16 changed files with 349 additions and 34 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,4 +1,5 @@
 **/.idea
 **/cmake-*
 **/build
+**/pBuild
 **/.vscode
diff --git a/README.md b/README.md
@@ -18,7 +18,7 @@ For building the project, please create (mkdir) a separate build directory. Chan
 
 In order to make use of SIMD instruction AVX2 or AVX512 build with -DAVX_2=ON or -DAVX_512=ON compiler flag. Check if your CPU supports these. If necessary adjust CMakeList.txt according to the preferences of your CPU.
 
-Optionally, export the directory where *mistle* was built as an executable PATH in the *~/.bashrc* file. Add line
+Optionally, export the directory where *mistle* was built as an executable PATH in the *~/.bashrc* file. Add the following line:
 
     export PATH="/home/$USER/path/to/mistle/build:$PATH"
 
@@ -29,16 +29,18 @@ Optionally, export the directory where *mistle* was built as an executable PATH
 
 Build Mistle's fragment ion index from spectral library.
 
-    mistle-build [OPTION...] [optional args]
+    mistle-build -i /path/to/library/ -o /path/to/index/ [optional args]
+
+Required arguments are the input directory, which must contain spectral library files (.msp or .mgf format), and the output directory for the fragment index. 
 
 ### Mistle search
 
 Search experimental mass spectra in Mistle's fragment ion index.
 
 
-    mistle-search [OPTION...] [optional args]
+    mistle-search -s /path/to/search_file.mgf -i /path/to/index/ [optional args]
 
-Use *-h* flag to print the help message. Refer to the [EXAMPLE README](example/README.md) and the example directory to test the program.
+Required arguments are the search file (.mgf or .msp format) and the path to the fragment index. Additionally, output directory and formats can be specified as well as various search parameters. Use *-h* flag to print the help message for more information. Also, refer to the [EXAMPLE README](example/README.md) and the example directory to test the program.
 
 ## Output format
 
@@ -49,10 +51,17 @@ The next line is the header listing all tracked attributes (tab separated).
 
     id	spectrum	charge	hit_rank	match	peptide	isomers	similarity	bias    [...]
 
+A large number of scores and statistics are appended as additional columns (marked [...]). A detailed explanation of the scores can be found in the next section.
+
+Below the header, all matched experimental spetra are listed and indexed by their scan name and the rank of the matched library spectrum. (Rank R is appended with /R to the scan name). See example [output](example/example_results_control.csv).
+
+Alternatively, a pin-tab format that is readable by Percolator (Käll *et al.*, 2007) can be produced, listing the same scores as features. To obtain this output format, the user needs to specify the output path (*-o*) during mistle-search with the file extension *.pin*. Note that the library label needs to be set correctly at index construction (1: target, -1: decoy libary) and the *results.pin* files of target and decoy search need to be concatenated or merged before using Percolator. It's recommended to use the this python [script](scripts/merge_pin_output.py) to merge the query results and correctly update delta scores.
 
-Below, all matched experimental spetra are listed and indexed by their scan name and the rank of the matched library spectrum. (Rank R is appended with /R to the scan name). See example [output](example/index/example_results_control.csv).
+## Scores 
 
+*Similarity* is the preferred baseline score, which is a refined version of the normalized dot product based on square root transformed peak intensities. A *bias* measurement highlights how biased the *similarity* is on a few matching peaks, and a *delta_similarity* score describes the *similarity* difference between the top hit and second-best hit. Additionally, an *annotation_similarity* version of these scores exists, which accounts only for peak intensities matching reference peaks. This is useful when the library consists of fewer annotated or predicted peaks and is less noisy than the query spectra. 
 
+As a high-quality discriminant scoring function we suggest the *avg_bias_adjusted_similarity*, which is composed equally of the *similarity* and *annotation similarity* metrics. Specifically, a *bias-adjusted similarity* (*sim2*) is calculated by the product of *similarity* and *(1-bias)* and is averaged between standard and annotation version. This scoring function provides excellent discrimination between target and decoy matches.
 
 
 
@@ -65,7 +74,3 @@ Input files coming from Windows distributions may have a line ending with \r\n (
 Remove \r character (char 13) using the following commad line
 * *tr -d '\r' < FILE.mgf > FILE_FIXED.mgf*
 
-### Scores 
-
-Similarity is the preferred baseline score. For the dot product certain properties (e.g. bias, lgamma score) are not defined or defined based on the similarity.
-
diff --git a/example/README.md b/example/README.md
@@ -14,7 +14,7 @@ Next, perform example searches
 
     mistle-search -s yeast_exp.mgf -i index/ -o example_results.csv -p 10 -b 0.2 -t 1 --hits_per_spectrum 1
 
-This should produce an *example_results.csv* file in the index/ directory. Compare the output to the *example_results_control.csv*, which is already present in the index/ directory. If they align, *mistle* is configured correctly and is read to use. (Note that floating-point inaccuracy may occur when using different hardware or advanced vector extensions. Results and scores may deviate from the control in the last digit(s)).
+This should produce an *example_results.csv* file in the current directory. Compare the output to the *example_results_control.csv*, which already resides in this directory. If they align, *mistle* is configured correctly and is read to use. (Note that floating-point inaccuracy may occur when using different hardware or advanced vector extensions. Results and scores may deviate from the control in the last digit(s)).
 
 Below, the first match of the list is displayed by a mirror plot (experimental spectrum top; matched simulated spectrum bottom) generated using the python *spectrum_utils* package.
 

diff --git a/example/index/example_results_control.csv → example/example_results_control.csv b/example/index/example_results_control.csv → example/example_results_control.csv
diff --git a/scripts/merge_pin_output.py b/scripts/merge_pin_output.py
@@ -0,0 +1,139 @@
+
+import pandas as pd
+import argparse
+import random
+from matplotlib import pyplot as plt
+
+def parse_args():
+
+	parser = argparse.ArgumentParser()
+	parser.add_argument("-t", "--target",
+						help="target search results file (.pin) format",
+						type=str,required=True)
+	parser.add_argument("-d", "--decoy",
+						help="decoy search results file (.pin) format",
+						type=str,required=True)
+	parser.add_argument("-o", "--output",
+						help="output file (.pin) format",
+						type=str, required=True)
+	parser.add_argument("--score",
+						help="discriminant score for target decoy competition",
+						type=str, default="avg_bias_adjusted_similarity")
+	parser.add_argument("--update_delta_scores",
+						help="Update delta scores when target/decoy are 1st and 2nd ranked hit",
+						action='store_true')
+	parser.add_argument("--main_features",
+						help="Track main features only. Otherwise all features will be tracked, which can reduce separation performance.",
+						action='store_true')
+	parser.add_argument("--drop_redundant_features",
+						help="Track main features only. Otherwise all features will be tracked, which can reduce separation performance.",
+						action='store_true')
+
+
+
+	args = parser.parse_args()
+	return args
+
+
+def update_delta_scores(df1, idx1, df2, idx2):
+    if df1.at[idx1, "delta_avg"] > df1.at[idx1, "avg_bias_adjusted_similarity"] - df2.at[idx2, "avg_bias_adjusted_similarity"]:
+        df1.at[idx1, "delta_similarity"] = df1.at[idx1, "similarity"] - df2.at[idx2, "similarity"]
+        df1.at[idx1, "delta_dot"] = df1.at[idx1, "dot_product"] - df2.at[idx2, "dot_product"]
+        df1.at[idx1, "delta_annotation_similarity"] = df1.at[idx1, "annotation_similarity"] - df2.at[idx2, "annotation_similarity"]
+        df1.at[idx1, "delta_sim2"] = df1.at[idx1, "sim2"] - df2.at[idx2, "sim2"]
+        df1.at[idx1, "delta_avg"] = df1.at[idx1, "avg_bias_adjusted_similarity"] - df2.at[idx2, "avg_bias_adjusted_similarity"]
+    return
+
+def merge_files(args):
+
+    print("+++ Merging target and decoy results (.pin format) +++")
+    df = pd.read_csv(args.target, sep='\t', comment='#', low_memory=False)
+    df_decoy = pd.read_csv(args.decoy, sep='\t', comment='#', low_memory=False)
+
+    target_nans = df.isnull().any(axis=1).sum()
+    decoy_nans = df_decoy.isnull().any(axis=1).sum()
+    if target_nans > 0 or decoy_nans > 0:
+        print(f"Waring: NaN values detected. Dropping {target_nans} target and {decoy_nans} decoy matches.")
+        df.dropna(inplace=True)
+        df_decoy.dropna(inplace=True)
+
+    if not all(df["Label"].unique() == 1):
+        print("Warning: Not all target labels match expected value of 1.")
+
+    if not all(df_decoy["Label"].unique() == -1):
+        print("Warning: Not all decoy labels match expected value of -1.")
+
+    print(f"Detected {df.shape[0]} target and {df_decoy.shape[0]} decoy matches.")
+    scans = df["ScanNr"].unique()
+
+    for num in scans:
+        decoy_match = df_decoy[df_decoy["ScanNr"] == num]
+        if len(decoy_match) == 0:
+            continue
+        elif len(decoy_match) > 1:
+            print("Error: multiple occurance of a ScanNr")
+            exit(1)
+        else:
+            decoy_idx = decoy_match.index[0]
+            decoy_match = decoy_match.iloc[0]
+
+        target_match = df[df["ScanNr"] == num]
+        target_idx = target_match.index[0]
+        target_match = target_match.iloc[0]
+
+        # Equal peptide -> Drop decoy
+        if target_match["Peptide"].replace("L", "I") == decoy_match["Peptide"].replace("L", "I"):
+            df_decoy.drop(decoy_idx, inplace=True)
+            continue
+
+        #print(target_idx, decoy_idx)
+        # Compare score -> Keep higher scoring match
+        if target_match[args.score] > decoy_match[args.score]:
+            if args.update_delta_scores:
+                update_delta_scores(df, target_idx, df_decoy, decoy_idx)
+            df_decoy.drop(decoy_idx, inplace=True)
+        else:
+            if args.update_delta_scores:
+                update_delta_scores(df_decoy, decoy_idx, df, target_idx)
+            df.drop(target_idx, inplace=True)
+
+
+
+    df = pd.concat([df, df_decoy], ignore_index=True)
+    #df["sim2_half"] = df["sim2"] / 2.0
+    #df["sim2_double"] = 2.0 * df["sim2"]
+
+    #cols = ["PSMId", "Label", "ScanNr", "sim2", "sim2_half", "sim2_double", "Peptide", "Proteins"]
+    #df = df[cols]
+
+    if args.main_features:
+        features = ["charge", "similarity", "bias", "delta_similarity", "sim2", "delta_sim2", "annotation_similarity", "annotation_bias", "annotation_sim2", "delta_annotation_similarity", "peak_count_ref", "avg_bias_adjusted_similarity", "delta_avg", "abs_mass_difference", "ppm_difference", "peptide_length", "precursor_mz"]
+        col = ["PSMId", "Label", "ScanNr"] + features + ["Peptide", "Proteins"]
+        df = df[col]
+    if args.drop_redundant_features:
+        df.drop(columns=["x_score", "x_score_dot"], inplace=True)
+        #df.drop(columns=["fragment_standard_deviation", "fragment_weighted_standard_deviation"], inplace=True)
+    #if args.experimental:
+    #    df["exp1"] = 
+    #df.drop(columns=["x_score", "x_score_dot"], inplace=True)
+    #df.drop(columns=["fragment_standard_deviation", "fragment_weighted_standard_deviation"], inplace=True)
+    df.to_csv(args.output, sep="\t", index=False)
+
+    num_targets = sum(df["Label"] == 1)
+    num_decoys = sum(df["Label"] == -1)
+
+
+    print(f"Files merged successfully! {num_targets} targets and {num_decoys} decoys remaining after competition.")
+
+
+
+def main():
+	args = parse_args()
+	merge_files(args)
+
+if __name__ == "__main__":
+    main()
+
+
+# bug fix: Remove -inf values
+# sed -i 's/-inf/-9999/g' yeast_td.pin
diff --git a/scripts/test.csv b/scripts/test.csv
@@ -0,0 +1,8 @@
+
+PSMId	Label	ScanNr	charge	similarity	bias	annotation_similarity	annotation_bias	avg_bias_adjusted_similarity	dot_product	delta_dot	delta_similarity	delta_annotation_similarity	delta_sim2	delta_avg	dot_contrast_angle	similarity_contrast_angle	annotation_contrast_angle	mass_difference	abs_mass_difference	ppm_difference	peptide_length	precursor_mz	peak_count_query	peak_count_ref	fragment_standard_deviation	fragment_weighted_standard_deviation	sim2	annotation_sim2	x_score	x_score_dot	x_lgamma	x_lgamma_dot	st_score	st_score_dot	Peptide	Proteins
+64	1	189	2	1.00149	0.220921	0.823551	0.220921	0.710924	0.873902	0.444778	0.342503	0.202489	0.304451	0.248828	0.676836		0.61602	-0.000854492	0.000854492	1.30494	11	654.814	1000	142	0.408765	0.421254	0.780238	0.641611	0.0	0.0	560.171	560.034	0.73769	0.727923	X.YGRPPDSHHSR.X	Unknown
+
+
+charge	similarity	bias	annotation_similarity	annotation_bias	avg_bias_adjusted_similarity	dot_product	delta_dot	delta_similarity	delta_annotation_similarity	delta_sim2delta_avg	dot_contrast_angle	similarity_contrast_angle	annotation_contrast_angle	mass_difference	abs_mass_difference	ppm_difference	peptide_length	precursor_mz	peak_count_query	peak_count_ref	fragment_standard_deviation	fragment_weighted_standard_deviation	sim2	annotation_sim2	x_score	x_score_dot	x_lgamma	x_lgamma_dot	st_score	st_score_dot	m0
+-0.0833	-0.2382	-0.1786	2.4159	-0.1786	-0.0354	1.3627	0.3950	-0.4690	-0.1501	0.1838	0.9446	0.0305	-1.3467	-0.4678	-0.0915	-0.0932	-0.4207	0.0715	0.1952	0.0000	1.7078	0.0120	0.0322	-0.7453	0.3660	0.1577	0.1706	0.0408	0.1718	0.3820	0.4089	-0.9830
+
diff --git a/src/build_index.cpp b/src/build_index.cpp
@@ -23,6 +23,7 @@ cxxopts::ParseResult parseArgs(int argc, const char* argv[], std::vector<std::st
                 ("o,output", "output directory where indices will be generated", cxxopts::value<std::string>(), "PATH")
                 ("n,num_indices", "number of buckets the fragment ion index will be split in", cxxopts::value<unsigned int>()->default_value("64"), "NUM")
                 ("min_pep_length", "Minimum peptide length for the reference spectrum to be loaded into the index", cxxopts::value<unsigned int>()->default_value("7"), "NUM")
+                ("label", "Give the library a label (1: target; -1: decoy)", cxxopts::value<int>()->default_value("1"), "NUM")
                 ("t,threads", "number of threads (experimental)\n - 1 thread for reading, other threads for processing. Has increased RAM costs (try using more threads or GLIBC_TUNABLES=glibc.malloc.tcache_count=0 for compensation)", cxxopts::value<int>()->default_value("1"), "NUM");
 
         options.parse_positional({"input", "output"});
@@ -52,18 +53,25 @@ cxxopts::ParseResult parseArgs(int argc, const char* argv[], std::vector<std::st
             }
 
             input_directories.push_back(dir_list.substr(start_pos));
+        } else {
+            std::cout << "Argument Error: Missing input directory." << std::endl;
+            exit(1);
         }
         if (result.count("output")) {
             config->idx_path = result["output"].as<std::string>();
+        } else {
+            std::cout << "Argument Error: Missing output directory." << std::endl;
+            exit(1);
         }
         config->minimum_peptide_length = result["min_pep_length"].as<unsigned int>();
+        config->label = result["label"].as<int>();
 
 
         return result;
 
     }
     catch (const cxxopts::OptionException& e) {
-        std::cout << "error parsing options: " << e.what() << std::endl;
+        std::cout << "Error parsing options: " << e.what() << std::endl;
         exit(1);
     }
 }

diff --git a/src/configuration.cpp b/src/configuration.cpp
@@ -26,6 +26,7 @@ bool configuration::save_configuration_to_file(const std::string& config_file_pa
         f << lim << delimiter;
     }
     f << "\n";
+    f << "Label: " << label << "\n";
     f << "Min peptide length: " << minimum_peptide_length << "\n";
     f << "Build command: " << build_command << "\n";
     f.close();
@@ -65,6 +66,14 @@ bool configuration::load_configuration_from_file(const std::string& config_file_
         return false;
     }
 
+    //Third line (Label)
+    getline(f, line);
+    if(line.rfind("Label: ", 0) == 0) {
+        label = std::stoi(line.substr(7, std::string::npos));
+    } else {
+        label = 1;
+    }
+
     f.close();
 
     idx_path = config_file_path.substr(0,config_file_path.rfind('/') + 1);

diff --git a/src/configuration.h b/src/configuration.h
@@ -15,9 +15,11 @@
 class configuration {
 public:
 
-    std::string idx_path = "./test/";
+    std::string idx_path = "";
     std::string precursor_index_path;
     unsigned int num_indices = 24;
+    int label = 1;
+
 
     unsigned int sub_idx_range;
     std::vector<unsigned int> sub_idx_limits;

diff --git a/src/indexing_manager.cpp b/src/indexing_manager.cpp
@@ -69,6 +69,10 @@ indexing_manager::indexing_manager(std::vector<std::string> &input_paths, std::s
             for (const auto & entry : std::filesystem::directory_iterator(path)) {
                 if (entry.path().extension() == ".msp") {
                     lib_files.push_back(entry.path());
+                    file_format = MSP;
+                } else if (entry.path().extension() == ".mgf") {
+                    lib_files.push_back(entry.path());
+                    file_format = MGF;
                 }
             }
         } else {
@@ -87,6 +91,15 @@ indexing_manager::indexing_manager(std::vector<std::string> &input_paths, std::s
         }
 
     }
+    if (!std::filesystem::exists(config->idx_path) || !std::filesystem::is_directory(config->idx_path)) {
+        std:cerr << "Bad output directory" << std::endl;
+        std::cerr << config->idx_path << " is not a directory or does not exist!" << std::endl;
+        exit(1);
+    }
+    if (!config->idx_path.ends_with('/')) {
+        config->idx_path += "/";
+    }
+
 }
 
 
@@ -150,7 +163,7 @@ bool indexing_manager::build_indices() {
 }
 
 bool indexing_manager::set_up_output_streams() {
-
+    
     for (int i = 0; i < config->num_indices; ++i) {
         string file_name = config->idx_path + "frag_idx_" + to_string(i) + ".bin";
         //cout << file_name << endl;
@@ -170,7 +183,6 @@ bool indexing_manager::parse_file(unsigned int file_num) {
 
     string buffer;
 
-
     /*
      * Main loop reading library and creating preliminary indices on the fly
      */

diff --git a/src/match.h b/src/match.h
@@ -13,27 +13,40 @@ class match {
     unsigned int target_id;
 
     float mass_difference;
+    float abs_mass_difference;
+    float ppm_difference;
     unsigned int charge;
     std::vector<std::string> isomers; //tracking homologous peptides
 
     // Default scores
     float similarity;
     float bias;
     float dot_product;
+
 
     //Annotation scores
     float annotation_similarity;
     float annotation_bias;
+    float annotation_sim2;
+
+    //Contrast angles
+    float dot_contrast_angle;
+    float similarity_contrast_angle;
+    float annotation_contrast_angle;
 
     //Additional score
     float delta_dot;
     float delta_similarity;
+    float delta_annotation_sim;
     float delta_sim2;
     int peak_count_query;
     int peak_count_target;
-
+    float peak_mz_standard_deviation;
+    float peak_mz_weighted_standard_deviation;
 
     //Advanced scores
+    float avg_bias_adj_similarity;
+    float delta_avg;
     float sim2;
     float spectraST_score;
     float spectraST_score_dot;