Skip to content

Commit

Permalink
Merge pull request #2 from BAMeScience/dev
Browse files Browse the repository at this point in the history
Dev
  • Loading branch information
ch4perone authored Feb 28, 2023
2 parents 79be708 + 1256dd9 commit 40449f4
Show file tree
Hide file tree
Showing 16 changed files with 349 additions and 34 deletions.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
**/.idea
**/cmake-*
**/build
**/pBuild
**/.vscode
23 changes: 14 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ For building the project, please create (mkdir) a separate build directory. Chan

In order to make use of SIMD instruction AVX2 or AVX512 build with -DAVX_2=ON or -DAVX_512=ON compiler flag. Check if your CPU supports these. If necessary adjust CMakeList.txt according to the preferences of your CPU.

Optionally, export the directory where *mistle* was built as an executable PATH in the *~/.bashrc* file. Add line
Optionally, export the directory where *mistle* was built as an executable PATH in the *~/.bashrc* file. Add the following line:

export PATH="/home/$USER/path/to/mistle/build:$PATH"

Expand All @@ -29,16 +29,18 @@ Optionally, export the directory where *mistle* was built as an executable PATH

Build Mistle's fragment ion index from spectral library.

mistle-build [OPTION...] [optional args]
mistle-build -i /path/to/library/ -o /path/to/index/ [optional args]

Required arguments are the input directory, which must contain spectral library files (.msp or .mgf format), and the output directory for the fragment index.

### Mistle search

Search experimental mass spectra in Mistle's fragment ion index.


mistle-search [OPTION...] [optional args]
mistle-search -s /path/to/search_file.mgf -i /path/to/index/ [optional args]

Use *-h* flag to print the help message. Refer to the [EXAMPLE README](example/README.md) and the example directory to test the program.
Required arguments are the search file (.mgf or .msp format) and the path to the fragment index. Additionally, output directory and formats can be specified as well as various search parameters. Use *-h* flag to print the help message for more information. Also, refer to the [EXAMPLE README](example/README.md) and the example directory to test the program.

## Output format

Expand All @@ -49,10 +51,17 @@ The next line is the header listing all tracked attributes (tab separated).

id spectrum charge hit_rank match peptide isomers similarity bias [...]

A large number of scores and statistics are appended as additional columns (marked [...]). A detailed explanation of the scores can be found in the next section.

Below the header, all matched experimental spetra are listed and indexed by their scan name and the rank of the matched library spectrum. (Rank R is appended with /R to the scan name). See example [output](example/example_results_control.csv).

Alternatively, a pin-tab format that is readable by Percolator (Käll *et al.*, 2007) can be produced, listing the same scores as features. To obtain this output format, the user needs to specify the output path (*-o*) during mistle-search with the file extension *.pin*. Note that the library label needs to be set correctly at index construction (1: target, -1: decoy libary) and the *results.pin* files of target and decoy search need to be concatenated or merged before using Percolator. It's recommended to use the this python [script](scripts/merge_pin_output.py) to merge the query results and correctly update delta scores.

Below, all matched experimental spetra are listed and indexed by their scan name and the rank of the matched library spectrum. (Rank R is appended with /R to the scan name). See example [output](example/index/example_results_control.csv).
## Scores

*Similarity* is the preferred baseline score, which is a refined version of the normalized dot product based on square root transformed peak intensities. A *bias* measurement highlights how biased the *similarity* is on a few matching peaks, and a *delta_similarity* score describes the *similarity* difference between the top hit and second-best hit. Additionally, an *annotation_similarity* version of these scores exists, which accounts only for peak intensities matching reference peaks. This is useful when the library consists of fewer annotated or predicted peaks and is less noisy than the query spectra.

As a high-quality discriminant scoring function we suggest the *avg_bias_adjusted_similarity*, which is composed equally of the *similarity* and *annotation similarity* metrics. Specifically, a *bias-adjusted similarity* (*sim2*) is calculated by the product of *similarity* and *(1-bias)* and is averaged between standard and annotation version. This scoring function provides excellent discrimination between target and decoy matches.



Expand All @@ -65,7 +74,3 @@ Input files coming from Windows distributions may have a line ending with \r\n (
Remove \r character (char 13) using the following commad line
* *tr -d '\r' < FILE.mgf > FILE_FIXED.mgf*

### Scores

Similarity is the preferred baseline score. For the dot product certain properties (e.g. bias, lgamma score) are not defined or defined based on the similarity.

2 changes: 1 addition & 1 deletion example/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ Next, perform example searches

mistle-search -s yeast_exp.mgf -i index/ -o example_results.csv -p 10 -b 0.2 -t 1 --hits_per_spectrum 1

This should produce an *example_results.csv* file in the index/ directory. Compare the output to the *example_results_control.csv*, which is already present in the index/ directory. If they align, *mistle* is configured correctly and is read to use. (Note that floating-point inaccuracy may occur when using different hardware or advanced vector extensions. Results and scores may deviate from the control in the last digit(s)).
This should produce an *example_results.csv* file in the current directory. Compare the output to the *example_results_control.csv*, which already resides in this directory. If they align, *mistle* is configured correctly and is read to use. (Note that floating-point inaccuracy may occur when using different hardware or advanced vector extensions. Results and scores may deviate from the control in the last digit(s)).

Below, the first match of the list is displayed by a mirror plot (experimental spectrum top; matched simulated spectrum bottom) generated using the python *spectrum_utils* package.

Expand Down
File renamed without changes.
139 changes: 139 additions & 0 deletions scripts/merge_pin_output.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,139 @@

import pandas as pd
import argparse
import random
from matplotlib import pyplot as plt

def parse_args():

parser = argparse.ArgumentParser()
parser.add_argument("-t", "--target",
help="target search results file (.pin) format",
type=str,required=True)
parser.add_argument("-d", "--decoy",
help="decoy search results file (.pin) format",
type=str,required=True)
parser.add_argument("-o", "--output",
help="output file (.pin) format",
type=str, required=True)
parser.add_argument("--score",
help="discriminant score for target decoy competition",
type=str, default="avg_bias_adjusted_similarity")
parser.add_argument("--update_delta_scores",
help="Update delta scores when target/decoy are 1st and 2nd ranked hit",
action='store_true')
parser.add_argument("--main_features",
help="Track main features only. Otherwise all features will be tracked, which can reduce separation performance.",
action='store_true')
parser.add_argument("--drop_redundant_features",
help="Track main features only. Otherwise all features will be tracked, which can reduce separation performance.",
action='store_true')



args = parser.parse_args()
return args


def update_delta_scores(df1, idx1, df2, idx2):
if df1.at[idx1, "delta_avg"] > df1.at[idx1, "avg_bias_adjusted_similarity"] - df2.at[idx2, "avg_bias_adjusted_similarity"]:
df1.at[idx1, "delta_similarity"] = df1.at[idx1, "similarity"] - df2.at[idx2, "similarity"]
df1.at[idx1, "delta_dot"] = df1.at[idx1, "dot_product"] - df2.at[idx2, "dot_product"]
df1.at[idx1, "delta_annotation_similarity"] = df1.at[idx1, "annotation_similarity"] - df2.at[idx2, "annotation_similarity"]
df1.at[idx1, "delta_sim2"] = df1.at[idx1, "sim2"] - df2.at[idx2, "sim2"]
df1.at[idx1, "delta_avg"] = df1.at[idx1, "avg_bias_adjusted_similarity"] - df2.at[idx2, "avg_bias_adjusted_similarity"]
return

def merge_files(args):

print("+++ Merging target and decoy results (.pin format) +++")
df = pd.read_csv(args.target, sep='\t', comment='#', low_memory=False)
df_decoy = pd.read_csv(args.decoy, sep='\t', comment='#', low_memory=False)

target_nans = df.isnull().any(axis=1).sum()
decoy_nans = df_decoy.isnull().any(axis=1).sum()
if target_nans > 0 or decoy_nans > 0:
print(f"Waring: NaN values detected. Dropping {target_nans} target and {decoy_nans} decoy matches.")
df.dropna(inplace=True)
df_decoy.dropna(inplace=True)

if not all(df["Label"].unique() == 1):
print("Warning: Not all target labels match expected value of 1.")

if not all(df_decoy["Label"].unique() == -1):
print("Warning: Not all decoy labels match expected value of -1.")

print(f"Detected {df.shape[0]} target and {df_decoy.shape[0]} decoy matches.")
scans = df["ScanNr"].unique()

for num in scans:
decoy_match = df_decoy[df_decoy["ScanNr"] == num]
if len(decoy_match) == 0:
continue
elif len(decoy_match) > 1:
print("Error: multiple occurance of a ScanNr")
exit(1)
else:
decoy_idx = decoy_match.index[0]
decoy_match = decoy_match.iloc[0]

target_match = df[df["ScanNr"] == num]
target_idx = target_match.index[0]
target_match = target_match.iloc[0]

# Equal peptide -> Drop decoy
if target_match["Peptide"].replace("L", "I") == decoy_match["Peptide"].replace("L", "I"):
df_decoy.drop(decoy_idx, inplace=True)
continue

#print(target_idx, decoy_idx)
# Compare score -> Keep higher scoring match
if target_match[args.score] > decoy_match[args.score]:
if args.update_delta_scores:
update_delta_scores(df, target_idx, df_decoy, decoy_idx)
df_decoy.drop(decoy_idx, inplace=True)
else:
if args.update_delta_scores:
update_delta_scores(df_decoy, decoy_idx, df, target_idx)
df.drop(target_idx, inplace=True)



df = pd.concat([df, df_decoy], ignore_index=True)
#df["sim2_half"] = df["sim2"] / 2.0
#df["sim2_double"] = 2.0 * df["sim2"]

#cols = ["PSMId", "Label", "ScanNr", "sim2", "sim2_half", "sim2_double", "Peptide", "Proteins"]
#df = df[cols]

if args.main_features:
features = ["charge", "similarity", "bias", "delta_similarity", "sim2", "delta_sim2", "annotation_similarity", "annotation_bias", "annotation_sim2", "delta_annotation_similarity", "peak_count_ref", "avg_bias_adjusted_similarity", "delta_avg", "abs_mass_difference", "ppm_difference", "peptide_length", "precursor_mz"]
col = ["PSMId", "Label", "ScanNr"] + features + ["Peptide", "Proteins"]
df = df[col]
if args.drop_redundant_features:
df.drop(columns=["x_score", "x_score_dot"], inplace=True)
#df.drop(columns=["fragment_standard_deviation", "fragment_weighted_standard_deviation"], inplace=True)
#if args.experimental:
# df["exp1"] =
#df.drop(columns=["x_score", "x_score_dot"], inplace=True)
#df.drop(columns=["fragment_standard_deviation", "fragment_weighted_standard_deviation"], inplace=True)
df.to_csv(args.output, sep="\t", index=False)

num_targets = sum(df["Label"] == 1)
num_decoys = sum(df["Label"] == -1)


print(f"Files merged successfully! {num_targets} targets and {num_decoys} decoys remaining after competition.")



def main():
args = parse_args()
merge_files(args)

if __name__ == "__main__":
main()


# bug fix: Remove -inf values
# sed -i 's/-inf/-9999/g' yeast_td.pin
8 changes: 8 additions & 0 deletions scripts/test.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@

PSMId Label ScanNr charge similarity bias annotation_similarity annotation_bias avg_bias_adjusted_similarity dot_product delta_dot delta_similarity delta_annotation_similarity delta_sim2 delta_avg dot_contrast_angle similarity_contrast_angle annotation_contrast_angle mass_difference abs_mass_difference ppm_difference peptide_length precursor_mz peak_count_query peak_count_ref fragment_standard_deviation fragment_weighted_standard_deviation sim2 annotation_sim2 x_score x_score_dot x_lgamma x_lgamma_dot st_score st_score_dot Peptide Proteins
64 1 189 2 1.00149 0.220921 0.823551 0.220921 0.710924 0.873902 0.444778 0.342503 0.202489 0.304451 0.248828 0.676836 0.61602 -0.000854492 0.000854492 1.30494 11 654.814 1000 142 0.408765 0.421254 0.780238 0.641611 0.0 0.0 560.171 560.034 0.73769 0.727923 X.YGRPPDSHHSR.X Unknown


charge similarity bias annotation_similarity annotation_bias avg_bias_adjusted_similarity dot_product delta_dot delta_similarity delta_annotation_similarity delta_sim2delta_avg dot_contrast_angle similarity_contrast_angle annotation_contrast_angle mass_difference abs_mass_difference ppm_difference peptide_length precursor_mz peak_count_query peak_count_ref fragment_standard_deviation fragment_weighted_standard_deviation sim2 annotation_sim2 x_score x_score_dot x_lgamma x_lgamma_dot st_score st_score_dot m0
-0.0833 -0.2382 -0.1786 2.4159 -0.1786 -0.0354 1.3627 0.3950 -0.4690 -0.1501 0.1838 0.9446 0.0305 -1.3467 -0.4678 -0.0915 -0.0932 -0.4207 0.0715 0.1952 0.0000 1.7078 0.0120 0.0322 -0.7453 0.3660 0.1577 0.1706 0.0408 0.1718 0.3820 0.4089 -0.9830

10 changes: 9 additions & 1 deletion src/build_index.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ cxxopts::ParseResult parseArgs(int argc, const char* argv[], std::vector<std::st
("o,output", "output directory where indices will be generated", cxxopts::value<std::string>(), "PATH")
("n,num_indices", "number of buckets the fragment ion index will be split in", cxxopts::value<unsigned int>()->default_value("64"), "NUM")
("min_pep_length", "Minimum peptide length for the reference spectrum to be loaded into the index", cxxopts::value<unsigned int>()->default_value("7"), "NUM")
("label", "Give the library a label (1: target; -1: decoy)", cxxopts::value<int>()->default_value("1"), "NUM")
("t,threads", "number of threads (experimental)\n - 1 thread for reading, other threads for processing. Has increased RAM costs (try using more threads or GLIBC_TUNABLES=glibc.malloc.tcache_count=0 for compensation)", cxxopts::value<int>()->default_value("1"), "NUM");

options.parse_positional({"input", "output"});
Expand Down Expand Up @@ -52,18 +53,25 @@ cxxopts::ParseResult parseArgs(int argc, const char* argv[], std::vector<std::st
}

input_directories.push_back(dir_list.substr(start_pos));
} else {
std::cout << "Argument Error: Missing input directory." << std::endl;
exit(1);
}
if (result.count("output")) {
config->idx_path = result["output"].as<std::string>();
} else {
std::cout << "Argument Error: Missing output directory." << std::endl;
exit(1);
}
config->minimum_peptide_length = result["min_pep_length"].as<unsigned int>();
config->label = result["label"].as<int>();


return result;

}
catch (const cxxopts::OptionException& e) {
std::cout << "error parsing options: " << e.what() << std::endl;
std::cout << "Error parsing options: " << e.what() << std::endl;
exit(1);
}
}
Expand Down
9 changes: 9 additions & 0 deletions src/configuration.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ bool configuration::save_configuration_to_file(const std::string& config_file_pa
f << lim << delimiter;
}
f << "\n";
f << "Label: " << label << "\n";
f << "Min peptide length: " << minimum_peptide_length << "\n";
f << "Build command: " << build_command << "\n";
f.close();
Expand Down Expand Up @@ -65,6 +66,14 @@ bool configuration::load_configuration_from_file(const std::string& config_file_
return false;
}

//Third line (Label)
getline(f, line);
if(line.rfind("Label: ", 0) == 0) {
label = std::stoi(line.substr(7, std::string::npos));
} else {
label = 1;
}

f.close();

idx_path = config_file_path.substr(0,config_file_path.rfind('/') + 1);
Expand Down
4 changes: 3 additions & 1 deletion src/configuration.h
Original file line number Diff line number Diff line change
Expand Up @@ -15,9 +15,11 @@
class configuration {
public:

std::string idx_path = "./test/";
std::string idx_path = "";
std::string precursor_index_path;
unsigned int num_indices = 24;
int label = 1;


unsigned int sub_idx_range;
std::vector<unsigned int> sub_idx_limits;
Expand Down
16 changes: 14 additions & 2 deletions src/indexing_manager.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,10 @@ indexing_manager::indexing_manager(std::vector<std::string> &input_paths, std::s
for (const auto & entry : std::filesystem::directory_iterator(path)) {
if (entry.path().extension() == ".msp") {
lib_files.push_back(entry.path());
file_format = MSP;
} else if (entry.path().extension() == ".mgf") {
lib_files.push_back(entry.path());
file_format = MGF;
}
}
} else {
Expand All @@ -87,6 +91,15 @@ indexing_manager::indexing_manager(std::vector<std::string> &input_paths, std::s
}

}
if (!std::filesystem::exists(config->idx_path) || !std::filesystem::is_directory(config->idx_path)) {
std:cerr << "Bad output directory" << std::endl;
std::cerr << config->idx_path << " is not a directory or does not exist!" << std::endl;
exit(1);
}
if (!config->idx_path.ends_with('/')) {
config->idx_path += "/";
}

}


Expand Down Expand Up @@ -150,7 +163,7 @@ bool indexing_manager::build_indices() {
}

bool indexing_manager::set_up_output_streams() {

for (int i = 0; i < config->num_indices; ++i) {
string file_name = config->idx_path + "frag_idx_" + to_string(i) + ".bin";
//cout << file_name << endl;
Expand All @@ -170,7 +183,6 @@ bool indexing_manager::parse_file(unsigned int file_num) {

string buffer;


/*
* Main loop reading library and creating preliminary indices on the fly
*/
Expand Down
15 changes: 14 additions & 1 deletion src/match.h
Original file line number Diff line number Diff line change
Expand Up @@ -13,27 +13,40 @@ class match {
unsigned int target_id;

float mass_difference;
float abs_mass_difference;
float ppm_difference;
unsigned int charge;
std::vector<std::string> isomers; //tracking homologous peptides

// Default scores
float similarity;
float bias;
float dot_product;


//Annotation scores
float annotation_similarity;
float annotation_bias;
float annotation_sim2;

//Contrast angles
float dot_contrast_angle;
float similarity_contrast_angle;
float annotation_contrast_angle;

//Additional score
float delta_dot;
float delta_similarity;
float delta_annotation_sim;
float delta_sim2;
int peak_count_query;
int peak_count_target;

float peak_mz_standard_deviation;
float peak_mz_weighted_standard_deviation;

//Advanced scores
float avg_bias_adj_similarity;
float delta_avg;
float sim2;
float spectraST_score;
float spectraST_score_dot;
Expand Down
Loading

0 comments on commit 40449f4

Please sign in to comment.