Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

alignproteome added #875

Open
wants to merge 33 commits into
base: master
Choose a base branch
from
Open

Conversation

martin-steinegger
Copy link
Member

No description provided.

@@ -632,6 +632,15 @@ std::vector<Command> baseCommands = {
{"targetDB", DbType::ACCESS_MODE_INPUT, DbType::NEED_DATA, &DbValidator::sequenceDb },
{"resultDB", DbType::ACCESS_MODE_INPUT, DbType::NEED_DATA, &DbValidator::resultDb },
{"alignmentDB", DbType::ACCESS_MODE_OUTPUT, DbType::NEED_DATA, &DbValidator::alignmentDb }}},
{"alignproteome", alignproteome, &par.alignproteome, COMMAND_ALIGNMENT,
"Within-result all-vs-all gapped local alignment",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would improve this text.

{"alignproteome", alignproteome, &par.alignproteome, COMMAND_ALIGNMENT,
"Within-result all-vs-all gapped local alignment",
NULL,
"Martin Steinegger <martin.steinegger@snu.ac.kr>",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add you @Gyuuul2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file looks incomplete

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

@@ -136,6 +137,8 @@ template <typename T> bool DBReader<T>::open(int accessType){
}
}
if (dataMode & USE_LOOKUP || dataMode & USE_LOOKUP_REV) {
Debug(Debug::INFO) << "ReadLookup file: " << dataFileName << "\n"; //gyuri
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove this print out.

@@ -144,7 +147,9 @@ template <typename T> bool DBReader<T>::open(int accessType){
}
char* lookupDataChar = (char *) lookupData.getData();
size_t lookupDataSize = lookupData.size();
Debug(Debug::INFO) << "Lookup Data size is " << lookupDataSize << "\n"; //gyuri
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove this print out.

lookupSize = Util::ompCountLines(lookupDataChar, lookupDataSize, threads);
Debug(Debug::INFO) << "Lookup size is " << lookupSize << "\n"; //gyuri
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove this print out.

@@ -52,6 +53,7 @@ set(commons_source_files
commons/BaseMatrix.cpp
commons/Command.cpp
commons/CommandCaller.cpp
# commons/ClusterSpecies.cpp
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove it.

@@ -217,7 +221,7 @@ class Matcher{


static size_t resultToBuffer(char * buffer, const result_t &result, bool addBacktrace, bool compress = true, bool addOrfPosition = false);

static size_t resultToBuffer_str(char * buffer, const result_t &result, bool addBacktrace, bool compress, bool addOrfPosition=false);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

@@ -27,7 +27,9 @@ int alignall(int argc, const char **argv, const Command &command) {
}
unsigned int swMode = Alignment::initSWMode(par.alignmentMode, par.covThr, par.seqIdThr);

DBReader<unsigned int> tdbr(par.db1.c_str(), par.db1Index.c_str(), par.threads, DBReader<unsigned int>::USE_DATA|DBReader<unsigned int>::USE_INDEX);
// DBReader<unsigned int> tdbr(par.db1.c_str(), par.db1Index.c_str(), par.threads, DBReader<unsigned int>::USE_DATA|DBReader<unsigned int>::USE_INDEX);
DBReader<unsigned int> tdbr(par.db1.c_str(), par.db1Index.c_str(), par.threads, DBReader<unsigned int>::USE_DATA|DBReader<unsigned int>::USE_INDEX|DBReader<unsigned int>::USE_LOOKUP);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is not needed anymore

@@ -56,6 +58,7 @@ int alignall(int argc, const char **argv, const Command &command) {
EvalueComputation evaluer(tdbr.getAminoAcidDBSize(), subMat, gapOpen, gapExtend);
const size_t flushSize = 100000000;
size_t iterations = static_cast<int>(ceil(static_cast<double>(dbr_res.getSize()) / static_cast<double>(flushSize)));
Debug(Debug::INFO) << "Number of iterations: " << iterations << " Size of linclust dbr_res : " << dbr_res.getSize() << '\n';
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

}
};

struct __attribute__((__packed__)) memberProteinEntry{
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MemberProteinEntry

@@ -187,7 +187,7 @@ class DBReader : public MemoryTracker {

size_t getSize() const;

unsigned int getProteomeTotalLen(size_t id); //gyuri
unsigned int getProteomeTotalLen(size_t id);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would name this getSourceTotalLen or getSetTotalLen to be consistent with the .source naming

}

// void setEasyAlignproteomeMustPassAlong(Parameters *p){

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to implement this whenever you use the createParameterString(..., true) parameter. Or it won't get passed along.

par.filenames.pop_back();

CommandCaller cmd;
cmd.addVariable("ALIGNPROTEOME_PAR", par.createParameterString(par.alignproteome,true).c_str()); // what?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLUSTER_PAR, THREADS_PAR etc are missing

hash = FileUtil::getHashFromSymLink(tmpDir + "/latest");
}
tmpDir = FileUtil::createTemporaryDirectory(tmpDir, hash);
par.filenames.pop_back();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to pop_back() the output path too and pass it as an environment variable. Or else it will also be passed to createdb.

@@ -22,7 +22,7 @@ class IndexReader {
) : sequenceReader(NULL), index(NULL) {
int targetDbtype = FileUtil::parseDbType(dataName.c_str());
if (Parameters::isEqualDbtype(targetDbtype, Parameters::DBTYPE_INDEX_DB)) {
index = new DBReader<unsigned int>(dataName.c_str(), (dataName + ".index").c_str(), 1, DBReader<unsigned int>::USE_DATA|DBReader<unsigned int>::USE_INDEX);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happened here?

const unsigned int ProteomeId = lookup[i].fileNumber;
const unsigned int ProteinId = lookup[i].id;
ProteomeList[ProteomeId].addSeqLen(tProteinDB.getSeqLen(ProteinId));
if (ProteomeList[ProteomeId].proteomeKey == -1){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: please add spaces around if:
if[space](...)[space]{

std::vector<unsigned int> memberKeys;
memberKeys.reserve(50); // store key for every protein in a cluster

std::vector<bool> isProteomeInCluster(localMemCount.size(), false); ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double ;

also careful about bool vectors. They are not multi-threading friendly. Here its fine, as you don't do any multi-threading

if (memberKeys.size() > 1) { //If not a singleton cluster
ClusterEntry eachClusterRep(memberKeys.size());
//init MemberProteinEntry and add it to memberProteins vector
for (auto& eachMemberKey : memberKeys){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: missing space after the for

for (size_t idx = 0; idx < ProteomeList.size(); idx++){
if (ProteomeList[idx].isCovered()) {
continue;
}else{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: missing spaces around else

DBReader<unsigned int>::LookupEntry* tLookup = tProteinDB.getLookup();
const size_t tLookupSize = tProteinDB.getLookupSize();
unsigned int totalProteomeNumber = tLookup[tLookupSize - 1].fileNumber;
std::vector<ProteomeEntry> ProteomeList(totalProteomeNumber + 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: should be lowercase: proteomeList


int gapOpen, gapExtend;
BaseMatrix *subMat;
subMat = new SubstitutionMatrix(par.scoringMatrixFile.values.aminoacid().c_str(), 2.0, par.scoreBias);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can avoid the explicit allocation here:

SubstitutionMatrix subMat(par.scoringMatrixFile.values.aminoacid().c_str(), 2.0, par.scoreBias);
...

#pragma omp critical
{
for (size_t idx=0; idx < localMemCount.size(); idx++){
ProteomeList[idx].incrementMemCount(localMemCount[idx]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you use __sync_fetch_and_add inside incrementMemCount you don't need the critical

#pragma omp critical
{
for (size_t idx = 0; idx < ProteomeList.size(); idx++) {
ProteomeList[idx].addSeqId(localSeqIds[idx]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here you might need the critical as atomic floats can be tricky to do correctly

proteomeClustWriter.close();
proteinClustWriter.close();
tProteinDB.close();
delete subMat;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you rewrite the subMat init above, you can remove this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants