A short "genome" sequence causes a hang, not a fail #9

rmcolq · 2019-02-15T16:52:00Z

I realise this is not the typical use case, but I was trying to simulate reads with the error profile of nanopore and did not care about the length. I ran the example case just fine, but found that when I simulated reads from a short sequence (2000bps) this caused nanosim-h to start and just hang until my server timed out. Adding 15,000 "A"s to the beginning and end of my short sequence enabled it to work. A sanity check on user input would be a good idea.

nanosim-h -p ecoli_R9_2D -n 500 random_path.fa --unalign-rate 0 --max-len 10000

The text was updated successfully, but these errors were encountered:

karel-brinda · 2019-02-15T16:56:48Z

Hi Rachel, thanks for the report! I completely agree that this needs to be fixed.

mbhall88 · 2019-03-05T19:35:18Z

For anyone coming across this, I found a workaround.
I was working with a small toy reference 300bp long. If run with default parameters it hangs forever due to the length of one of the simulated reads being longer than my actual reference. So if you set --max-len to be the length of your reference then it works fine.

@karel-brinda I wonder if max_readlength should be set by default to the length of the reference rather than infinity?

NanoSim-H/nanosimh/simulate.py

Line 785 in 1e2d374

max_readlength = float("inf")

It doesn't make sense to be able to generate reads longer than the reference?

mbhall88 · 2019-03-05T19:50:50Z

Actually, I have been digging into this a little more and found something interesting. If I indeed set --max-len to the length of the reference (300) it gets part of the way through and then hangs. If I take that down to about 290 it runs fine.
But if I also set --min-len to the same as --max-len then it just hangs.

However, if I make --min-len 1 less than --max-len it has an ETA of 2 hours for 1000 reads - this seems to be expected though as mentioned in the nanosim README

If you want the fasta file I am working with I'll attach it below (it says it's a .txt file but it is actually fasta).

test.txt

ExplodingCabbage · 2019-06-21T11:36:36Z

Pretty sure there's a bug in extract_read. In the use-case where your reference sequence is a single short non-circular chromosome (perhaps because the "chromosome" is really a PCR amplicon you want to simulate reads of), it is highly likely that get_length will return some values greater than your sequence length. When extract_read gets called with such a length value, length gets set to max(seq_len.values()) which in turn equals genome_len. Then we repeatedly try to generate a read start position with this line:

ref_pos = random.randint(0, genome_len)

Then we hit this if/elif/else block:

if ref_pos + length < seq_len[key]:
    new_read = seq_dict[key][ref_pos:ref_pos + length]
    read_info = (key, ref_pos)
    break
elif ref_pos < seq_len[key]:
    break
else:
    ref_pos -= seq_len[key]

But even in the best case where ref_pos is 0, ref_pos + length < seq_len[key] is guaranteed to be false in the single-short-chromosome scenario because length equals seq_len[key]. So new_read never gets set and we stay in the while True: loop generating new random start positions forever.

I'll have a fiddle around and see if there's an easy fix.

ExplodingCabbage · 2019-06-21T13:27:24Z

#16 fixes the hang. However, when running against a short reference with that fix applied, I now see this crash instead:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/bin/nanosim-h", line 11, in <module>
    load_entry_point('NanoSim-H', 'console_scripts', 'nanosim-h')()
  File "/Users/markamery/NanoSim-H/nanosimh/__init__.py", line 19, in nanosimh_simulate
    simulate.main()
  File "/Users/markamery/NanoSim-H/nanosimh/simulate.py", line 1010, in main
    min_l=min_readlength, merge=merge, rnf=rnf, rnf_cigar=rnf_cigar
  File "/Users/markamery/NanoSim-H/nanosimh/simulate.py", line 455, in simulation
    read_mutated, cigar = mutate_read(new_read, new_read_name, out_error, error_dict, kmer_bias)
  File "/Users/markamery/NanoSim-H/nanosimh/simulate.py", line 702, in mutate_read
    tmp_bases.remove(read[key + i])
IndexError: string index out of range

I'll keep debugging. :)

ExplodingCabbage · 2019-06-21T14:09:35Z

I've pushed another commit to #16 to fix the error mentioned above.

ExplodingCabbage · 2020-02-19T11:37:12Z

I think this is fixed on devel and can be closed?

Not yet released to master or PyPI, though.

karel-brinda · 2020-02-19T15:57:30Z

It's included in the devel branch, but not yet in master (sorry for the delay).

You can install the devel version by:

pip install git+https://github.com/karel-brinda/NanoSim-H@devel

karel-brinda added the bug label Feb 15, 2019

ExplodingCabbage mentioned this issue Jun 24, 2019

Fix bugs preventing tool from working on short reference sequences #16

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A short "genome" sequence causes a hang, not a fail #9

A short "genome" sequence causes a hang, not a fail #9

rmcolq commented Feb 15, 2019

karel-brinda commented Feb 15, 2019

mbhall88 commented Mar 5, 2019

mbhall88 commented Mar 5, 2019 •

edited

Loading

ExplodingCabbage commented Jun 21, 2019 •

edited

Loading

ExplodingCabbage commented Jun 21, 2019

ExplodingCabbage commented Jun 21, 2019

ExplodingCabbage commented Feb 19, 2020

karel-brinda commented Feb 19, 2020

A short "genome" sequence causes a hang, not a fail #9

A short "genome" sequence causes a hang, not a fail #9

Comments

rmcolq commented Feb 15, 2019

karel-brinda commented Feb 15, 2019

mbhall88 commented Mar 5, 2019

mbhall88 commented Mar 5, 2019 • edited Loading

ExplodingCabbage commented Jun 21, 2019 • edited Loading

ExplodingCabbage commented Jun 21, 2019

ExplodingCabbage commented Jun 21, 2019

ExplodingCabbage commented Feb 19, 2020

karel-brinda commented Feb 19, 2020

mbhall88 commented Mar 5, 2019 •

edited

Loading

ExplodingCabbage commented Jun 21, 2019 •

edited

Loading