MGnify Proteins docs - restructure navigation (#54)
* MGnify Proteins docs - restructure navigation

* More careful wording around the ESM Atlas

* minor wording, formatting, and visual improvements to proteins docs

* Fix typos on big query doc

---------

Co-authored-by: Sandy Rogers <sandyr@ebi.ac.uk>
mberacochea and SandyRogers authored Oct 3, 2024
1 parent f7f2946 commit 78fd332
Showing 13 changed files with 367 additions and 8 deletions.
26 changes: 23 additions & 3 deletions _quarto.yml
@@ -3,6 +3,7 @@ project:
render:
- src/docs.qmd
- src/docs/*.md
- src/docs/*.qmd
- src/notebooks_list.qmd
- src/notebooks/Python Examples/*.ipynb
- src/notebooks/R Examples/*.ipynb
@@ -36,9 +37,28 @@ website:
search: true

contents:
- section: "Documentation"
href: src/docs.qmd
contents: src/docs/*
- href: src/docs/about.md
- section: "Data flow"
contents:
- src/docs/dataflow.md
- src/docs/multiomics_submission.md
- section: "Analyses"
contents:
- src/docs/analysis.md
- src/docs/additional-analyses.md
- section: "Website & API"
contents:
- src/docs/portal.md
- src/docs/api.md
- href: src/docs/mgnify-genomes.md
- section: "MGnify Proteins"
href: src/docs/mgnify-proteins.md
contents:
- src/docs/mgnify-proteins-web.md
- src/docs/mgnify-proteins-sequence-search.md
- src/docs/mgnify-proteins-big-query.qmd
- href: src/docs/faqs.md
- href: src/docs/glossary.md
- href: src/notebooks_list.qmd

tools:
2 changes: 1 addition & 1 deletion src/docs/dataflow.md
@@ -1,5 +1,5 @@
---
title: Dataflow from submission to results
title: From submission to results
author:
- name: MGnify Team
url: https://www.ebi.ac.uk/metagenomics
2 changes: 1 addition & 1 deletion src/docs/genome-viewer.md → src/docs/mgnify-genomes.md
@@ -1,5 +1,5 @@
---
title: MGnify genomes
title: MGnify Genomes
author:
- name: MGnify
url: https://www.ebi.ac.uk/metagenomics
242 changes: 242 additions & 0 deletions src/docs/mgnify-proteins-big-query.qmd
@@ -0,0 +1,242 @@
---
title: BigQuery public dataset
author:
- name: MGnify
url: https://www.ebi.ac.uk/metagenomics
affiliation: EMBL-EBI
affiliation-url: https://www.ebi.ac.uk
date: last-modified
citation: true
description: Description of the MGnify Proteins BigQuery public dataset
---

# MGnify Proteins BigQuery public dataset

The MGnify Protein Database release 2024_04 is hosted on
[Google Cloud Public Datasets](https://console.cloud.google.com/marketplace/product/bigquery-public-data/XXXXX),
and is available to download at no cost under a
[CC0 1.0 Universal Licence](https://creativecommons.org/publicdomain/zero/1.0/legalcode).

A Google Cloud account is required to access the dataset.

BigQuery provides a serverless and highly scalable analytics tool enabling SQL
queries over large datasets.

## Creating a Google Cloud Account

Downloading from the Google Cloud Public Datasets requires a Google Cloud account. See the
[Google Cloud get started](https://cloud.google.com/docs/get-started) page, and
explore the [free tier account usage limits](https://cloud.google.com/free).


::: {.callout-warning}
### Pricing information
After the 90-day trial period has finished, you must upgrade to a paid
Cloud Billing account to continue access. Your free tier access
(including access to the Public Datasets storage bucket) continues, but usage beyond
the free tier will incur costs. Please familiarise yourself with the pricing
for the services that you use to avoid any surprises.

The [free tier](https://cloud.google.com/bigquery/pricing#free-tier) of Google Cloud
comes with [BigQuery Sandbox](https://cloud.google.com/bigquery/docs/sandbox)
with 1 TB of free processed query data each month.
This should be sufficient for running several queries on the MGnify Protein Database,
though the usage depends on the queries.
Please look at the
[BigQuery pricing page](https://cloud.google.com/bigquery/pricing) for more
information.
**Repeated queries within a
month could exceed this limit, and if you have
[upgraded to a paid Cloud Billing account](https://cloud.google.com/free/docs/gcp-free-tier#how-to-upgrade)
you may be charged.**

Staying within these limits is your responsibility, so please keep track of your
billing settings and resource usage in the console.
:::
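
As a rough aid for budgeting, the sketch below estimates how many times a query can be repeated within the monthly free allowance. It assumes the allowance is the 1 TB quoted above, taken as 10^12 bytes; the BigQuery pricing page remains the authority on the exact figure and unit. The function name and the 150 GB example are hypothetical, and the bytes scanned by a real query should be read from the BigQuery console before running it.

```python
# Assumption: the monthly free allowance is the 1 TB quoted above, taken as 10**12 bytes.
FREE_BYTES_PER_MONTH = 10**12


def queries_within_free_tier(bytes_per_query: int,
                             free_bytes: int = FREE_BYTES_PER_MONTH) -> int:
    """Return how many identical queries fit in the monthly free allowance."""
    if bytes_per_query <= 0:
        raise ValueError("bytes_per_query must be positive")
    return free_bytes // bytes_per_query


# Hypothetical example: a query that scans 150 GB every time it runs.
print(queries_within_free_tier(150 * 10**9))  # 6
```

Selecting only the columns you need usually reduces the bytes scanned, since BigQuery bills by the data processed in the referenced columns.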

1. Go to
   [https://cloud.google.com/datasets](https://cloud.google.com/datasets).
2. Create an account:
    1. Click "get started for free" in the top right corner.
    2. Read and agree to the terms of service.
    3. Follow the setup instructions. Note that a payment method is required,
       but this will not be used unless you enable billing.
    4. Access to the Google Cloud Public Datasets storage bucket is always at
       no cost and you will have access to the
       [free tier](https://cloud.google.com/free/docs/gcp-free-tier#free-tier-usage-limits).
3. Set up a project:
    1. In the top left corner, click the navigation menu (three horizontal bars
       icon).
    2. Select: "Cloud overview" -> "Dashboard".
    3. In the top left corner there is a project menu bar (likely says "My
       First Project"). Select this and a "Select a Project" box will appear.
    4. To keep using this project, click "Cancel" at the bottom of the box.
    5. To create a new project, click "New Project" at the top of the box:
        1. Select a project name.
        2. For location, if your organization has a Cloud account then select
           this, otherwise leave as is.


#### Setup

Follow the
[BigQuery Sandbox set up guide](https://cloud.google.com/bigquery/docs/sandbox).

## Database structure

The dataset in BigQuery has the following schema:

```{mermaid}
erDiagram
    ARCHITECTURE ||--o{ PROTEIN : has
    PROTEIN {
        string mgyp PK
        string sequence
        string sequence_sha256sum
        string cluster_representative
        string architecture_hash
        json pfam
    }
    PROTEIN ||--o{ METADATA : has
    CONTIG ||--o{ METADATA : has
    ASSEMBLY ||--o{ METADATA : has
    GENE_CALLER ||--o{ METADATA : has
    METADATA {
        string mgyp FK
        string mgyc FK
        int assembly_id FK
        int gene_caller_id FK
        int start_position
        int end_position
        int strand
        bool complete
        string truncation
    }
    STUDY ||--o{ ASSEMBLY : belongs
    BIOME ||--o{ ASSEMBLY : has
    ASSEMBLY {
        int assembly_id PK
        string accession
        int study_id FK
        int biome_id FK
        string pipeline_version
    }
    STUDY {
        int study_id PK
        string accession
    }
    ASSEMBLY ||--|{ CONTIG : belongs
    CONTIG {
        string mgyc PK
        int assembly_id FK
        string contig_name
        string sequence_hash
        int contig_length
        float kmer_coverage
    }
    BIOME {
        int biome_id PK
        string lineage
    }
    ARCHITECTURE {
        string architecture_hash PK
        string architecture
    }
    GENE_CALLER {
        int gene_caller_id PK
        string gene_caller
        string version
    }
```

### Tables

#### Protein

| Column Name | Mode | Data type | Description |
|--------------------------|----------|-----------|-----------------------------------------------------------------------|
| `mgyp` | REQUIRED | STRING | The MGnify Protein accession |
| `sequence` | | STRING | The protein amino acid sequence |
| `sequence_sha256sum` | | STRING | SHA-256 checksum of the amino acid sequence |
| `cluster_representative` | | STRING | The accession of the protein cluster representative. For cluster representatives, this value is equal to the MGYP. |
| `pfam` | | JSON | Pfam domain annotations for the protein |
| `architecture_hash` | | STRING | SHA-256 checksum of the protein's Pfam architecture string; links to the Architecture table |
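
To check a local sequence against `sequence_sha256sum`, a checksum can be computed with Python's standard `hashlib` module. This is only a sketch: the normalisation applied before hashing (whitespace removed, upper case, UTF-8 bytes) is an assumption and may not match how the column was generated, so compare against a known record before relying on it. The function name is hypothetical.

```python
import hashlib


def sequence_sha256(sequence: str) -> str:
    """Return the hex SHA-256 digest of an amino acid sequence."""
    # Assumption: whitespace is dropped and residues are upper-cased before hashing.
    normalised = "".join(sequence.split()).upper()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


# Made-up three-residue sequence, for illustration only.
digest = sequence_sha256("MKT")
print(len(digest))  # 64 hexadecimal characters
```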

#### Study

| Column Name | Mode | Data type | Description |
|-------------|----------|-----------|---------------------------------|
| `study_id` | REQUIRED | INTEGER | Study ID |
| `accession` | REQUIRED | STRING | The ENA study accession |

#### Metadata

| Column Name | Mode | Data type | Description |
|-------------------|----------|-----------|-----------------------------------------------------------------------|
| `mgyp` | REQUIRED | STRING | Protein MGYP accession |
| `mgyc` | REQUIRED | STRING | Contig MGYC accession |
| `assembly_id` | REQUIRED | INTEGER | Assembly ID |
| `gene_caller_id` | REQUIRED | INTEGER | Gene Caller ID |
| `start_position` | | INTEGER | Start position coordinate of the protein in the contig |
| `end_position` | | INTEGER | End position coordinate of the protein in the contig |
| `strand` | | INTEGER | Strand of the protein on the contig: 1 for positive-strand, -1 for negative-strand. |
| `complete` | | BOOLEAN | True if the protein is full-length; false if it is a fragment. |
| `truncation` | | STRING | Prodigal truncation notation: `00` for a full-length protein; `01`, `10` or `11` for fragments. |
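
The `truncation` codes can be decoded with a small lookup, sketched below. The left/right reading follows Prodigal's `partial` field, where a `1` marks an edge at which the gene runs off the contig. That the column stores exactly these two-character strings is an assumption, and the helper names are hypothetical.

```python
# Meaning of each two-character code, following Prodigal's "partial" field.
TRUNCATION_LABELS = {
    "00": "full-length (both edges present)",
    "10": "fragment, truncated at the left edge",
    "01": "fragment, truncated at the right edge",
    "11": "fragment, truncated at both edges",
}


def describe_truncation(code: str) -> str:
    """Translate a truncation code into a human-readable label."""
    try:
        return TRUNCATION_LABELS[code]
    except KeyError:
        raise ValueError(f"unexpected truncation code: {code!r}") from None


def is_full_length(code: str) -> bool:
    """True only for the code that marks a full-length protein."""
    return code == "00"


print(describe_truncation("01"))  # fragment, truncated at the right edge
```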

#### Gene caller

| Column Name | Mode | Data type | Description |
|------------------|----------|-----------|--------------------------------------|
| `gene_caller_id` | REQUIRED | INTEGER | Gene caller ID |
| `gene_caller` | | STRING | The gene caller software name |
| `version` | | STRING | Software version |

#### Contig

| Column Name | Mode | Data type | Description |
|-------------------|----------|-----------|-----------------------------------------------------------------------|
| `mgyc` | REQUIRED | STRING | The contig MGYC accession |
| `assembly_id` | REQUIRED | INTEGER | Assembly ID |
| `contig_name` | | STRING | The contig name in the assembly files |
| `sequence_hash` | | STRING | SHA-256 checksum of the nucleotide sequence of the contig |
| `contig_length` | | INTEGER | Length of the contig in base pairs (bp) |
| `kmer_coverage` | | FLOAT | k-mer coverage as reported by the assembler |

#### Biome

| Column Name | Mode | Data type | Description |
|-------------------|----------|-----------|-----------------------------------------------------------------------|
| `biome_id` | REQUIRED | INTEGER | Biome ID |
| `lineage` | | STRING | Biome lineage encoded by separating the hierarchy with colons (:). The biomes are based on the GOLD classification |
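
Because the lineage is colon-separated, it can be split into hierarchy levels and tested against a parent biome. The sketch below shows one way to do this; the helper names are hypothetical and the lineage strings are illustrative examples of the format rather than values read from the dataset.

```python
def split_lineage(lineage: str) -> list[str]:
    """Split a colon-separated biome lineage into its hierarchy levels."""
    return [level for level in lineage.split(":") if level]


def is_within(lineage: str, ancestor: str) -> bool:
    """True if `lineage` equals `ancestor` or sits below it in the hierarchy."""
    levels = split_lineage(lineage)
    prefix = split_lineage(ancestor)
    return levels[:len(prefix)] == prefix


# Illustrative lineage string.
marine = "root:Environmental:Aquatic:Marine"
print(split_lineage(marine))
print(is_within(marine, "root:Environmental:Aquatic"))
```

Comparing whole levels rather than raw string prefixes avoids false matches between biomes whose names merely start with the same characters.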

#### Assembly

| Column Name | Mode | Data type | Description |
|-------------------|----------|-----------|-----------------------------------------------------------------------|
| `assembly_id` | | INTEGER | Assembly ID |
| `accession` | REQUIRED | STRING | The ENA assembly accession |
| `study_id` | | INTEGER | Study ID |
| `biome_id` | | INTEGER | Biome ID |
| `pipeline_version`| | STRING | The version of the MGnify pipeline used to call the proteins in this assembly |

#### Architecture

| Column Name | Mode | Data type | Description |
|---------------------|----------|-----------|---------------------------------------------|
| `architecture_hash` | REQUIRED | STRING | SHA-256 checksum of the architecture string |
| `architecture` | REQUIRED | STRING | The Pfam architecture string |
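
The join path implied by the schema (protein to metadata to assembly to biome) can be tried locally before spending any BigQuery quota. The sketch below uses Python's built-in `sqlite3` as a stand-in, not BigQuery itself: the tables hold only a subset of the documented columns, every row is made up, and the accessions are placeholders.

```python
import sqlite3

# Local stand-in for part of the schema; SQLite types replace the BigQuery ones.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (mgyp TEXT PRIMARY KEY, sequence TEXT, cluster_representative TEXT);
CREATE TABLE biome (biome_id INTEGER PRIMARY KEY, lineage TEXT);
CREATE TABLE assembly (assembly_id INTEGER PRIMARY KEY, accession TEXT, biome_id INTEGER);
CREATE TABLE metadata (mgyp TEXT, mgyc TEXT, assembly_id INTEGER,
                       start_position INTEGER, end_position INTEGER, strand INTEGER);
""")

# Made-up rows, for illustration only.
conn.execute("INSERT INTO protein VALUES ('MGYP_EXAMPLE_1', 'MKT', 'MGYP_EXAMPLE_1')")
conn.execute("INSERT INTO biome VALUES (1, 'root:Environmental:Aquatic:Marine')")
conn.executemany("INSERT INTO assembly VALUES (?, ?, ?)",
                 [(10, "ERZ_EXAMPLE_A", 1), (11, "ERZ_EXAMPLE_B", 1)])
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?, ?, ?)",
                 [("MGYP_EXAMPLE_1", "MGYC_EXAMPLE_1", 10, 5, 13, 1),
                  ("MGYP_EXAMPLE_1", "MGYC_EXAMPLE_2", 11, 20, 28, -1)])

# Where was this protein found, and in which biome?
QUERY = """
SELECT p.mgyp, a.accession, b.lineage, m.start_position, m.end_position, m.strand
FROM protein AS p
JOIN metadata AS m ON m.mgyp = p.mgyp
JOIN assembly AS a ON a.assembly_id = m.assembly_id
JOIN biome AS b ON b.biome_id = a.biome_id
WHERE p.mgyp = ?
ORDER BY a.accession
"""
rows = conn.execute(QUERY, ("MGYP_EXAMPLE_1",)).fetchall()
for row in rows:
    print(row)
```

In BigQuery the same joins would need fully qualified table names from the public dataset and named query parameters in place of the `?` placeholder.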


## Licence

Data is available for academic and commercial use, under a
[CC0 1.0 Universal Licence](https://creativecommons.org/publicdomain/zero/1.0/legalcode).

If you make use of the MGnify Protein Database, please cite the following paper:

* [Richardson, L., Allen, B., Baldi, G., Beracochea, M., Bileschi, M. L., Burdett, T., Burgin, J., Caballero-Pérez, J., Cochrane, G., Colwell, L. J., Curtis, T., Escobar-Zepeda, A., Gurbich, T. A., Kale, V., Korobeynikov, A., Raj, S., Rogers, A. B., Sakharova, E., Sanchez, S., Wilkinson, D. J., Finn, R. D. MGnify: the microbiome sequence data analysis resource in 2023. *Nucleic Acids Research* (2023).](https://doi.org/10.1093/nar/gkac1080)
@@ -7,13 +7,12 @@ author:
affiliation-url: https://www.ebi.ac.uk
date: last-modified
citation: true
description: Guide to using MGnify's peptide sequence search service
order: 9
description: Guide to using MGnify's Protein Database sequence search service
---

## Landing page

The sequence search (accessed by following the ‘Sequence search’ link from menu bar)
The sequence search (accessed by following the ‘Sequence search’ link from the MGnify web page menu bar)
provides a search against a catalogue of predicted peptides.

![The landing page of the sequence search tool](images/sequence_search/sequence_search_landing-v5.png){#fig-sequence-search-landing}
64 changes: 64 additions & 0 deletions src/docs/mgnify-proteins-web.md
@@ -0,0 +1,64 @@
---
title: Portal
author:
- name: MGnify
url: https://www.ebi.ac.uk/metagenomics
affiliation: EMBL-EBI
affiliation-url: https://www.ebi.ac.uk
date: last-modified
citation: true
description: Guide to using MGnify's Proteins website
---

# MGnify Proteins Portal

The MGnify Proteins portal provides detailed information for protein cluster representatives derived from metagenomic assemblies. Due to the size of the database, detailed pages are generated exclusively for cluster representatives, rather than for each individual protein sequence.

![Homepage of MGnify Proteins](images/proteins/mgnify-proteins-home-page.png)

## Sequence Search

The MGnify Proteins portal features a search form that allows users to submit queries to the [Sequence Search](mgnify-proteins-sequence-search.md). Search queries are processed by the search service, and the results link back to the protein detail page within the portal.

## Organization of the Protein Detail Page

Each cluster representative from the [Protein Database](mgnify-proteins.md) has a dedicated detail page. This page includes metadata about the protein and, when available, its 3D structure as predicted by the [ESM Metagenomics Atlas](https://esmatlas.com/) team. Not all clusters have a predicted structure, as the release cycles of MGnify Proteins and the ESM Atlas are independent. The MGnify team is not involved in the protein structure predictions; instead the MGnify Proteins portal links to the ESM Atlas service, which is generously provided for the community by the ESM Atlas team.

### Protein Information

This section at the top of the page provides essential details about the protein:

- **Protein Accession**: The unique identifier for the protein (e.g., MGYP000261684433).
- **Cluster Size**: Indicates the number of proteins in the cluster (e.g., 1 protein).
- **Full-Length ORF**: Specifies whether the sequence represents a full-length open reading frame (ORF). A checkmark indicates a full-length ORF, while an 'X' denotes a fragment.
- **Biome**: Displays the [biome(s)](glossary.md#biome) of the samples from which this protein was derived (e.g., "Marine").
- **Pfam Annotations**: A table displaying [Pfam](https://www.ebi.ac.uk/interpro/entry/pfam/) domains identified within the protein.

![Protein Detail Header](images/proteins/mgnify-proteins-detail-header.png)

## 3D Structure

This section displays a 3D structure of the protein, predicted by the [ESM Metagenomics Atlas](https://esmatlas.com/) team using [ESMFold](https://www.science.org/doi/10.1126/science.ade2574):

- **3D Structure Visualization**: A graphic representation of the protein structure generated from the amino acid sequence.
- **More Information**: A link below the visualization directs users to additional details about the [ESM Metagenomics Atlas](https://esmatlas.com/).

![ESMFold Protein Predicted Structure](images/proteins/mgnify-proteins-detail-structure.png)

## Assemblies Information

This table lists the assemblies from which the protein sequence was derived:

- **Study**: The [Study](glossary.md#study) ID associated with the [assembly](glossary.md#assembly).
- **Assembly ID**: The specific ID for the [assembly](glossary.md#assembly).
- **Contig**: The contig within the assembly where the protein sequence is located.
- **Contig Start/End**: Coordinates indicating the start and end positions of the protein on the contig.
- **Strand**: Indicates whether the protein is on the positive or negative DNA strand.

![Assemblies where the Protein was Found](images/proteins/mgnify-proteins-detail-assemblies.png)

## Amino Acid Sequence Viewer

This section provides a detailed view of the amino acid sequence for the protein.

![Protein Amino Acid Sequence](images/proteins/mgnify-proteins-detail-sequence.png)