Skip to content

Commit

Permalink
final changes to novel
Browse files Browse the repository at this point in the history
  • Loading branch information
calizilla committed Nov 20, 2024
1 parent 6325e38 commit 49bd63b
Show file tree
Hide file tree
Showing 2 changed files with 45 additions and 15 deletions.
2 changes: 1 addition & 1 deletion 14-novel-species-FEA.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -60,7 +60,7 @@ This axolotl data ticked A through D yet not E (RNA from various tissues were se

The [reference genome](https://www.axolotl-omics.org/dl/AmexG_v6.0-DD.fa.gz) and [GTF gene prediction file](https://www.axolotl-omics.org/dl/AmexT_v47-AmexG_v6.0-DD.gtf.gz) were downloaded from www.axolotl-omics.org. A proteome was created by extracting the predicted peptide sequences from the GTF then retaining the longest isoform per gene with `AGAT` v 1.4.0. The predicated proteome was then annotated with `eggNOG emapper` v 2.1.12.

The annotation output file has been provided to you, and we will import this into R and use the `dplyr` package v 1.1.4 to extract GO and KEGG IDs into the required format for R-based FEA with `clsuterProfiler` and `WebGestaltR`.
The annotation output file has been provided to you, and we will import this into R and use the `dplyr` package v 1.1.4 to extract GO and KEGG IDs into the required format for R-based FEA with `clusterProfiler` and `WebGestaltR`.

The predicted proteome was also annotated with [STRING](https://string-db.org/) v 12.0. As of version 12, STRING includes a feature `Add any organism to STRING / Annotate proteome` ([Szklarczyk etl al 2023](https://pmc.ncbi.nlm.nih.gov/articles/PMC9825434/)). The axolotl proteome was uploaded to STRING and annotation performed on STRING servers in less than 1 day. The [resulting annotation](https://version-12-0.string-db.org/organism/STRG0A90SNX) is persistent and shareable and can be used for all of STRING's search functions including ORA (`Multiple proteins`) and GSEA (`Proteins with Values/Ranks`).

Expand Down
58 changes: 44 additions & 14 deletions day2_Rnotebooks/novel_species.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -146,7 +146,9 @@ tail(ranked_df)

## 2.3 Create gene lists for ORA

For ORA, both tools require vector class gene lists. We will filter for adjusted P value < 0.01 and absolute log2 fold change greater than 1.5. The matrix has already filtered out genes with very low counts so we take all genes present as the background.
For ORA, both tools require vector class gene lists. We will filter for adjusted P value < 0.01 and absolute log2 fold change greater than 1.5.

The matrix has already filtered out genes with very low counts so we take all genes present as the background.

```{r create ORA vector gene lists}
Expand All @@ -169,6 +171,9 @@ head(background)
```

Note the large drop in gene numbers: 100K in GTF, 48K in predicted proteome, 24K expressed in the blastema! By reducing the number of background genes to what are expressed in the studied tissue, we can reduce falsely inflated P values and false positives within our list of enriched terms.


## 2.4 Save gene lists

Saving any outputs generated from R code is vital to reproducibility! You should include all analysed gene lists within the supplementary materials of your manuscript.
Expand All @@ -186,6 +191,8 @@ write.table(ranked_df, file = "Axolotl_rankedFC.txt", sep = "\t", quote = FALSE,

# 3. Reformat annotation files for `clusterProfiler` GO and KEGG analysis

Now we have annotation files and gene lists, we will bring those together to create the custom database files required for R FEA!

## 3.1 Create TERM2GENE files

These are 2 column text files with the term ID (one per line) alongside the ID of the gene that maps to the term. A gene can map to many terms and thus be present on multiple lines. A term can be mapped to more than one gene and thus be present on many lines.
Expand All @@ -200,7 +207,9 @@ We need `GOs` and `KEGG_Pathway` columns.

### 3.1.1 GO TERM2GENE

Next, we will extract the GO IDs from the `emapper` annotation file, and wrangle into the correct format for `clusterProfiler` `TERM2GENE`. There are several steps to this - comments have been included to outline what each step is doing.
Next, we will extract the GO IDs from the `emapper` annotation file, and wrangle into the correct format for `clusterProfiler` `TERM2GENE`.

There are several steps to this - comments have been included to outline what each step is doing.

```{r GO TERM2GENE}
go_term2gene <- eggnog_anno %>%
Expand Down Expand Up @@ -265,16 +274,15 @@ head(kegg_term2gene)

### 3.2.1 GO

The GO Core Ontology file (we have saved in our `workshop` folder and labelled the filename as R object `go_core`) will enable us to assign term descriptions to term IDs and create our `TERM2NAME` files.

Now we will assign term descriptions to term IDs and create our `TERM2NAME` files.

This may take a few moments to run. It will use the `ontology` object we created earlier.
This may take a few moments to run. It will use the `ontology` object we created earlier from the `go.obo` file.

```{r GO TERM2NAME}
# Create term to name table, removing duplicates, missing values and obsolete terms
go_term2name <- go_term2gene %>%
mutate(name = ontology$name[term]) %>%
go_term2name <- go_term2gene %>% # only keep terms that are in our term2gene object (ie, mapped to axolotl)
mutate(name = ontology$name[term]) %>%
dplyr::select(term, name) %>%
distinct() %>%
drop_na() %>%
Expand Down Expand Up @@ -318,7 +326,9 @@ head(kegg_term2name)

## 3.3 Count annotations

How much of our proteome was annotated? What about our DEGs and background? Genes that do not have any annotation are excluded from enrichment analysis, so having an understanding of the extent of annotation for your novel species is very important when interpreting results!
How much of our proteome was annotated? What about our DEGs and background?

Genes that do not have any annotation are excluded from enrichment analysis, so having an understanding of the extent of annotation for your novel species is very important when interpreting results!


Count the number of GO terms found within the genome, and the number of genes with GO annotations:
Expand All @@ -345,7 +355,7 @@ print(paste("Number of unique genes with 1 or more annotation terms:", kegg_uniq
```

47,196 putative axolotl proteins were annotated. That's around 1/4 of our genes mapped to KEGG Pathways, and less than half of our genes mapped to GO! Ouch. As much as we expect this with uncurated novel species genomes, it's still unpleasant to face :-)
47,196 putative axolotl proteins were annotated. That's around 1/4 of our predicted proteins mapped to KEGG Pathways, and less than half of our genes mapped to GO! Ouch. As much as we expect this with uncurated novel species genomes, it's still unpleasant to face :-)


What of the genes in our gene list specifically? We have an uncurated proteome, yet the genes in our input matrix were expressed at a meaningful level within axolotl, so these may actually have a higher annotation percentage than all genes in the proteome.
Expand All @@ -367,6 +377,7 @@ cat("Number of input genes with GO annotations:", unique_genes_with_go, "(",perc
```


```{r report gene list KEGG annotation percentages}
# Filter the term2gene table to only include genes in the background gene list
kegg_filtered_term2gene <- kegg_term2gene %>% filter(gene %in% background)
Expand Down Expand Up @@ -405,6 +416,8 @@ Let's review the help page:

There are parameters for both adjusted P value and q value. Terms must pass all thresholds (unadjusted P, adjusted P, and q value) so the important filter will be the most stringent test applied. Let's go with BH and 0.05 which we have used regularly within this workshop and are fairly common choices in the field.

we need to provide term2gene and term2name, and don't specify an organism.


```{r CP GO ORA }
Expand All @@ -428,7 +441,11 @@ cp_go_ora

91 significantly enriched terms at P.adj < 0.05.

Note that we used the gene list object `degs` which had 247 genes, but the tool has applied the input size as 145 - this is because it is automatically discarding any that do not have annotations. Results would be the same if we instead used `annotated_degs` object. Likewise, the background size is being reported as 15072 (the number annotated) not 24,419 (the total in background list).
Look at the geneRatio column: our gene list object `degs` has 247 genes, but the tool has applied the input size as 145 - this is because it is automatically discarding any that do not have annotations.

Results would be the same if we instead used `annotated_degs` object.

Likewise, the background size is being reported as 15072 (the number annotated) not 24,419 (the total in background list).


Save the results to a text file:
Expand Down Expand Up @@ -574,7 +591,7 @@ Note that in the below code, the first command is identical the one that created
```{r GO GMT}
# Step 1: Extract relevant columns (GO terms and gene IDs) from eggnog_anno
go_data <- eggnog_anno %>%
go_data <- eggnog_anno %>% # use the emapper annotations for axolotl
dplyr::select(GOs, `#query`) %>% # Select the GO terms and the gene IDs
dplyr::filter(GOs != "-") %>% # Filter out rows where the GO ID is missing ("-")
separate_rows(GOs, sep = ",") %>% # Split comma-delimited list of GO terms into separate rows
Expand All @@ -589,7 +606,7 @@ colnames(go_data) <- c("term", "gene")
go_data <- go_data %>%
dplyr::mutate(external_link = paste0("https://www.ebi.ac.uk/QuickGO/term/", term))
# Step 3: Group genes by GO term and concatenate gene list by tab
# Step 3: Group genes by GO term and concatenate gene list by tab so all genes per term are on the same row
go_term_grouped <- go_data %>%
dplyr::group_by(term) %>%
dplyr::summarize(genes = paste(gene, collapse = "\t"), .groups = "drop")
Expand All @@ -606,6 +623,9 @@ go_gmt <- go_term_grouped %>%
# Save to file
write.table(go_gmt, file = "Axolotl_GO.gmt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
# check only the first line (the lines can be long re all genes per term!:
cat(go_gmt$gmt_entry[1:1], sep = "\n")
```


Expand Down Expand Up @@ -652,6 +672,9 @@ kegg_gmt <- kegg_term_grouped %>%
# Save to file
write.table(kegg_gmt, file = "Axolotl_KEGG-pathways.gmt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
# check only the first line (the lines can be long re all genes per term!:
cat(kegg_gmt$gmt_entry[1:1], sep = "\n")
```

## 5.2 Create description objects
Expand Down Expand Up @@ -726,6 +749,11 @@ We will use the same approach as with `clusterProfiler` and run ORA with GO and

## 6.1 WebGestaltR ORA of GO terms

Parameters to note for novel species:

- `organism = others`
- `enrichDatabaseFile = "Axolotl_GO.gmt"`
- `enrichDatabaseDescriptionFile = "Axolotl_GO.des"`

```{r WGR ORA GO }
Expand All @@ -745,7 +773,6 @@ WebGestaltR(
fdrThr = 0.05, # FDR significance threshold
minNum = 10, # Minimum number of genes per category
maxNum = 500, # Maximum number of genes per category
boxplot = TRUE,
outputDirectory = outputDirectory,
projectName = project
)
Expand All @@ -757,7 +784,9 @@ Open the HTML report `WebGestaltR_results/Project_Axolotl_ORA_GO/Report_Axolotl_

Note some term similarity to what we have seen with the past 2 analyses (that's reassuring!)

Change the 'Enrichment Results' view from table to 'Bar chart', then try the 'Affinity propagation' and 'Weighted set cover' term clustering algorithms. Both methods help focus the results.
We no longer have GO Slim, as this needs to call the actual GO database, which we haven't used.

Change the 'Enrichment Results' view from table to 'Bar chart', then try the 'Affinity propagation' and 'Weighted set cover' term clustering algorithms. 'All' has more terms with higher specificty, and the term redundancy has performed clustering to give fewwer terms and provide a more concise overview. It's up to you as the researcher to decided which approach is best suited to your dataset!

Confirm that our GMT file correctly included the term link by selecting a term and clicking the hyperlink at `Analyte set`. Pretty neat huh :-)

Expand All @@ -771,6 +800,7 @@ There is no `seed` parameter for `WebGestaltR` GSEA as there is for `clusterProf
set.seed(123)
```

Again we are specifying `organism = "others"` and providing our GMT and description file:

```{r WGR GSEA KEGG }
Expand Down

0 comments on commit 49bd63b

Please sign in to comment.