You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
However, in the output file, I noticed that the distribution of read lengths is very different from the estimate provided by the MinKNOW UI:
I first assumed this was a problem with the base calling itself. I confirmed that the discrepancy was not due to failure to base call reads, as the number of reads in the dorado output (1175) matched the number in the minKNOW summary table (1176) minus one that was filtered (I haven't had the chance to figure out why, but I confirmed that this read was not particularly long).
I read in the dorado github that other users had noticed read length discrepancies due to splitting of concatemers and/or reads containing internal adapters. This manifested itself as the appearance of new read IDs in the dorado output file that were not present in the minKNOW pod5 file. I confirmed that this wasn't the case for me. All read IDs in the dorado output file are present in the pod5 file and vice versa, minus the single read that was filtered.
Many resolved issues on the dorado github mentioned problems with demultiplexing and barcode and/or adapter trimming that have been addressed in the most recent release (v0.7.1). So, to cover my bases, I reran my base calling script with this dorado version instead. There were slight differences in summary statistics, as expected, but no major difference in read length distribution that could explain the discrepancy with minKNOW.
Finally, I attempted to perform a read-by-read comparison of lengths from the minKNOW UI and the dorado output. Based on my quick search, it appeared that minKNOW's method for calculating read length (correct me if I'm wrong) in the absence of real time base calling is to multiply read duration by translocation speed (400bp/s in my case). I performed this calculation to get the read-by-read length estimation from the minKNOW summary table and then plotted this against the read length determined by dorado. I also grouped reads by the end reason.
Strikingly, the correlation to the line y=x drops severely after around 14kb. I guess this is not the most unexpected given the method of estimating read length, but I am surprised at the large discrepancy (8.97 Mb total estimated by minKNOW and only 2.74 Mb called by dorado).
I am wondering if anyone else has encountered this problem and if there are any known workarounds to get more accurate estimates on the minKNOW UI side. We were super excited to see a read almost 1 Mb long until the analysis showed it was actually under 200 kb. Hoping not to get our hopes up again :(
The text was updated successfully, but these errors were encountered:
I think the same thing is happening to us.
You saved me a lot of work figuring out where these discrepancies come from.
Thanks for looking at this in so much detail!
Glad to hear we're not the only ones, and happy to share our insights! I was surprised that I couldn't find any mention of this elsewhere, but I'm assuming that's because most folks do use real time base calling, which then enables tracking of actual translocation speed. Hopefully this can get resolved!
I recently did a run with the following configuration and output:
I base called and demultiplexed this data using the following script run on a Linux-based cluster:
However, in the output file, I noticed that the distribution of read lengths is very different from the estimate provided by the MinKNOW UI:
I first assumed this was a problem with the base calling itself. I confirmed that the discrepancy was not due to failure to base call reads, as the number of reads in the dorado output (1175) matched the number in the minKNOW summary table (1176) minus one that was filtered (I haven't had the chance to figure out why, but I confirmed that this read was not particularly long).
I read in the dorado github that other users had noticed read length discrepancies due to splitting of concatemers and/or reads containing internal adapters. This manifested itself as the appearance of new read IDs in the dorado output file that were not present in the minKNOW pod5 file. I confirmed that this wasn't the case for me. All read IDs in the dorado output file are present in the pod5 file and vice versa, minus the single read that was filtered.
Many resolved issues on the dorado github mentioned problems with demultiplexing and barcode and/or adapter trimming that have been addressed in the most recent release (v0.7.1). So, to cover my bases, I reran my base calling script with this dorado version instead. There were slight differences in summary statistics, as expected, but no major difference in read length distribution that could explain the discrepancy with minKNOW.
Finally, I attempted to perform a read-by-read comparison of lengths from the minKNOW UI and the dorado output. Based on my quick search, it appeared that minKNOW's method for calculating read length (correct me if I'm wrong) in the absence of real time base calling is to multiply read duration by translocation speed (400bp/s in my case). I performed this calculation to get the read-by-read length estimation from the minKNOW summary table and then plotted this against the read length determined by dorado. I also grouped reads by the end reason.
Strikingly, the correlation to the line y=x drops severely after around 14kb. I guess this is not the most unexpected given the method of estimating read length, but I am surprised at the large discrepancy (8.97 Mb total estimated by minKNOW and only 2.74 Mb called by dorado).
I am wondering if anyone else has encountered this problem and if there are any known workarounds to get more accurate estimates on the minKNOW UI side. We were super excited to see a read almost 1 Mb long until the analysis showed it was actually under 200 kb. Hoping not to get our hopes up again :(
The text was updated successfully, but these errors were encountered: