[skip ci] docs: update single cell

bcbio · Jan 8, 2021 · 9c8ad93
1 parent 7fae0fa
commit 9c8ad93
Showing 1 changed file with 25 additions and 9 deletions.
diff --git a/docs/contents/single_cell.md b/docs/contents/single_cell.md
@@ -5,11 +5,30 @@
 Bcbio installation paths in this workflow correspond to [O2 bcbio installation](https://wiki.rc.hms.harvard.edu/display/O2).
 Adjust to bcbio installation you are working with.
 
-### 1. Check reference genome and transcriptome - is it a mouse project?
+### 1. Check reference genome and transcriptome
+mouse project:
 - mm10 reference genome: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10
 - transcriptome_fasta: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.fa
 - transcriptome_gtf: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf
 
+human project:
+- hg38 reference genome: /n/shared_db/bcbio/biodata/genomes/Hsapiens/hg38
+- transcriptome_fasta: /n/shared_db/bcbio/biodata/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.fa
+- transcriptome_gtf: /n/shared_db/bcbio/biodata/genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf
+
+These are *spliced* references.
+
+To prepare the unspliced reference, use:
+```bash
+#!/bin/bash
+# $1 = ref-transcripts.gtf
+bname=$(basename "$1" .gtf)
+awk 'BEGIN{FS="\t"; OFS="\t"} $3 == "transcript"{ $3="exon"; print}' "$1" > "$bname.premrna.gtf"
+gffread -g /path/to/genome_reference/genome.fa "$bname.premrna.gtf" -w "$bname.unspliced.fa"
+```
+
+In some datasets, using the unspliced reference can yield 1.5X more counts.
+
 ### 2. Create bcbio project structure in /scratch
 ```
 mkdir sc_mouse
@@ -18,32 +37,29 @@ mkdir config input final work
 ```
 
 ### 3. Prepare fastq input in sc_mouse/input
-- some FC come in 1..4 lanes, merge lanes for every read:
+- if the data come in more than one lane, merge the lanes for each read:
 ```
 cat lane1_r1.fq.gz lane2_r1.fq.gz > project_1.fq.gz
 cat lane1_r2.fq.gz lane2_r2.fq.gz > project_2.fq.gz
 ```
-- cat'ing gzip files sounds ridiculous, but works for the most part, for purists:
-```
-zcat KM_lane1_R1.fastq KM_lane2_R1.fastq.gz | gzip > KM_1.fq.gz
-```
 
-- some cores send bz2 files not gz
+- some sequencing cores send bz2 files instead of gz:
 ```
 bunzip2 *.bz2
 cat *R1.fastq | gzip > sample_1.fq.gz
 ```
 
-- some cores produce R1,R2,R3,R4, others R1,R2,I1,I2, rename them
+- some cores produce R1, R2, R3, R4, others R1, R2, I1, I2; rename the files as follows:
 ```
 bcbio_R1 = R1 = 100 or 86 or 64 bp transcript read
 bcbio_R2 = I1 = 8 bp part 1 of cell barcode
 bcbio_R3 = I2 = 8 bp sample (library) barcode
 bcbio_R4 = R2 = 14 bp = 8 bp part 2 of cell barcode + 6 bp of transcript UMI
 ```
 
-*Some 100 bp libraries produce low count numbers, and trimming them to 61 bp improves counting*:
+*Some 100 bp R1 libraries yield low count numbers, and trimming them to 61 bp improves counting*:
 ```bash
+#!/bin/bash
 # $1 = sample_1.fq.gz
 java -jar /path/to/Trimmomatic-0.39/trimmomatic-0.39.jar SE \
 -threads 10 \