05-genelists.Rmd

# Defining the genelist

Starting from those differential expression results [here](http://degust.erc.monash.edu/degust/compare.html?code=5b2c7805ab8f8c5f2dc8c72e61b049b0#?plot=mds), how do we go about getting a genelist to calculate enrichment on? 


## Activities

Todays exercise follows the process of getting the differentially expressed gene list using excel. You could use another spreadsheet program, or some may prefer a programming language like R .

1. Download the full table of data from either degust, or the csv file here:
[Pezzini2016_SHSY5Ycelldiff_DE_table.csv](https://monashbioinformaticsplatform.github.io/enrichment_analysis_workshop/data/Pezzini2016_SHSY5Ycelldiff_DE_table.csv). Import into excel. <!-- File>Import --> 

2. How many genes are differentially expressed? In these results the FDR Column contains the corrected p-value, and the 'differentiated' column shows the log2 fold-change of differentiated cells vs untreated cells (log2(diff)-log2(undiff)); 0 is unchanged, 1 is doubled, -1 is halved.

    - Significant at 0.01?
    
    - That's a particularly large number of genes - perhaps not unexpected given how much the cells are changed this experiment. How many significant genes also have 2-fold change in expression?

    - For this workshop, get the genes with a FDR<0.01 and 2x fold change (`log2(4)`). Note - most experiments yeild far less differential expression, but the difference between these two cell conditions is pretty extreme! Typically you would only filter at p<0.01 (and occasionally 2-fold change) - you might see 10s to 100s of results. However, this arbitrary threshold gives a more typical number of differentially expressed genes for downstream analysis. An alternative approach could be to take the top 500 genes.

<details>
<summary>Show</summary>
There are 4923 differentially expressed genes, 2149 of which have a 2-fold change in expression. With the aggressive filtering, there are 792 genes left.
</details>

3. How many genes are _tested_? This is your background.

<details>
<summary>Show</summary>
14420 genes tested.
</details>

<!--But with ~20k human genes - why are there genes missing? **14420** --> 

---

## Common gotcha

Can you find SEPT4? Because [_Gene name errors are widespread in the scientific literature_](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7)

You can't revert the gene names automatically (try converting it to text!). You have to avoid it in the first place by importing gene columns as 'text' columns in excel.  See video from HUGO : https://www.genenames.org/help/faq/#!/#tocAnchor-1-25-1

<!--NB: You can ignore these for this workshop, but you want this to be right for publication!-->

---

## Example

An example excel document showing this filtering process is here: [Pezzini2016_SHSY5Ycelldiff_DE_table_filtering.xlsx](https://monashbioinformaticsplatform.github.io/enrichment_analysis_workshop/data/Pezzini2016_SHSY5Ycelldiff_DE_table_filtering.xlsx).