diff --git a/_quarto.yml b/_quarto.yml index 27d074d..96c63d8 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -3,6 +3,7 @@ project: render: - src/docs.qmd - src/docs/*.md + - src/docs/*.qmd - src/notebooks_list.qmd - src/notebooks/Python Examples/*.ipynb - src/notebooks/R Examples/*.ipynb @@ -36,9 +37,28 @@ website: search: true contents: - - section: "Documentation" - href: src/docs.qmd - contents: src/docs/* + - href: src/docs/about.md + - section: "Data flow" + contents: + - src/docs/dataflow.md + - src/docs/multiomics_submission.md + - section: "Analyses" + contents: + - src/docs/analysis.md + - src/docs/additional-analyses.md + - section: "Website & API" + contents: + - src/docs/portal.md + - src/docs/api.md + - href: src/docs/mgnify-genomes.md + - section: "MGnify Proteins" + href: src/docs/mgnify-proteins.md + contents: + - src/docs/mgnify-proteins-web.md + - src/docs/mgnify-proteins-sequence-search.md + - src/docs/mgnify-proteins-big-query.qmd + - href: src/docs/faqs.md + - href: src/docs/glossary.md - href: src/notebooks_list.qmd tools: diff --git a/src/docs/dataflow.md b/src/docs/dataflow.md index 7e48da9..1958f34 100644 --- a/src/docs/dataflow.md +++ b/src/docs/dataflow.md @@ -1,5 +1,5 @@ --- -title: Dataflow from submission to results +title: From submission to results author: - name: MGnify Team url: https://www.ebi.ac.uk/metagenomics diff --git a/src/docs/images/proteins/mgnify-proteins-detail-assemblies.png b/src/docs/images/proteins/mgnify-proteins-detail-assemblies.png new file mode 100644 index 0000000..21627fd Binary files /dev/null and b/src/docs/images/proteins/mgnify-proteins-detail-assemblies.png differ diff --git a/src/docs/images/proteins/mgnify-proteins-detail-header.png b/src/docs/images/proteins/mgnify-proteins-detail-header.png new file mode 100644 index 0000000..880ac6d Binary files /dev/null and b/src/docs/images/proteins/mgnify-proteins-detail-header.png differ diff --git a/src/docs/images/proteins/mgnify-proteins-detail-sequence.png 
b/src/docs/images/proteins/mgnify-proteins-detail-sequence.png new file mode 100644 index 0000000..3252980 Binary files /dev/null and b/src/docs/images/proteins/mgnify-proteins-detail-sequence.png differ diff --git a/src/docs/images/proteins/mgnify-proteins-detail-structure.png b/src/docs/images/proteins/mgnify-proteins-detail-structure.png new file mode 100644 index 0000000..27049bc Binary files /dev/null and b/src/docs/images/proteins/mgnify-proteins-detail-structure.png differ diff --git a/src/docs/images/proteins/mgnify-proteins-home-page.png b/src/docs/images/proteins/mgnify-proteins-home-page.png new file mode 100644 index 0000000..5d164f2 Binary files /dev/null and b/src/docs/images/proteins/mgnify-proteins-home-page.png differ diff --git a/src/docs/images/proteins/mgnify-proteins-schematic.png b/src/docs/images/proteins/mgnify-proteins-schematic.png new file mode 100644 index 0000000..d12dff4 Binary files /dev/null and b/src/docs/images/proteins/mgnify-proteins-schematic.png differ diff --git a/src/docs/genome-viewer.md b/src/docs/mgnify-genomes.md similarity index 99% rename from src/docs/genome-viewer.md rename to src/docs/mgnify-genomes.md index 8810e65..576d0da 100644 --- a/src/docs/genome-viewer.md +++ b/src/docs/mgnify-genomes.md @@ -1,5 +1,5 @@ --- -title: MGnify genomes +title: MGnify Genomes author: - name: MGnify url: https://www.ebi.ac.uk/metagenomics diff --git a/src/docs/mgnify-proteins-big-query.qmd b/src/docs/mgnify-proteins-big-query.qmd new file mode 100644 index 0000000..7757a6b --- /dev/null +++ b/src/docs/mgnify-proteins-big-query.qmd @@ -0,0 +1,242 @@ +--- +title: BigQuery public dataset +author: + - name: MGnify + url: https://www.ebi.ac.uk/metagenomics + affiliation: EMBL-EBI + affiliation-url: https://www.ebi.ac.uk +date: last-modified +citation: true +description: Description of the MGnify Proteins BigQuery public dataset +--- + +# MGnify Proteins BigQuery public dataset + +The MGnify Protein Database release 2024_04 is hosted on 
+[Google Cloud Public Datasets](https://console.cloud.google.com/marketplace/product/bigquery-public-data/XXXXX), +and is available to download at no cost under a +[CC0 1.0 Universal Licence](https://creativecommons.org/publicdomain/zero/1.0/legalcode). + +A Google Cloud account is required to use the dataset, but the data can be freely +used under the terms of the [CC0 1.0 Universal Licence](https://creativecommons.org/publicdomain/zero/1.0/legalcode). + +BigQuery provides a serverless and highly scalable analytics tool enabling SQL +queries over large datasets. + +## Creating a Google Cloud Account + +Downloading from the Google Cloud Public Datasets requires a Google Cloud account. See the +[Google Cloud get started](https://cloud.google.com/docs/get-started) page, and +explore the [free tier account usage limits](https://cloud.google.com/free). + + +::: {.callout-warning} +### Pricing information +After the trial period has finished (90 days), to continue access, +you are required to upgrade to a billing account. While your free tier access +(including access to the Public Datasets storage bucket) continues, usage beyond +the free tier will incur costs – please familiarise yourself with the pricing +for the services that you use to avoid any surprises. + +The [free tier](https://cloud.google.com/bigquery/pricing#free-tier) of Google Cloud +comes with [BigQuery Sandbox](https://cloud.google.com/bigquery/docs/sandbox) +with 1 TB of free processed query data each month. +This should be sufficient for running several queries on the MGnify Protein Database, +though the usage depends on the queries. +Please look at the +[BigQuery pricing page](https://cloud.google.com/bigquery/pricing) for more +information. 
+**Repeated queries within a +month could exceed this limit and if you have +[upgraded to a paid Cloud Billing account](https://cloud.google.com/free/docs/gcp-free-tier#how-to-upgrade) +you may be charged.** + +This is the user's responsibility so please ensure you keep track of your +billing settings and resource usage in the console. +::: + +1. Go to + [https://cloud.google.com/datasets](https://cloud.google.com/datasets). +2. Create an account: + 1. Click "get started for free" in the top right corner. + 2. Read and agree to the terms of service. + 3. Follow the setup instructions. Note that a payment method is required, + but this will not be used unless you enable billing. + 4. Access to the Google Cloud Public Datasets storage bucket is always at + no cost and you will have access to the + [free tier.](https://cloud.google.com/free/docs/gcp-free-tier#free-tier-usage-limits) +3. Set up a project: + 1. In the top left corner, click the navigation menu (three horizontal bars + icon). + 2. Select: "Cloud overview" -> "Dashboard". + 3. In the top left corner there is a project menu bar (likely says "My + First Project"). Select this and a "Select a Project" box will appear. + 4. To keep using this project, click "Cancel" at the bottom of the box. + 5. To create a new project, click "New Project" at the top of the box: + 1. Select a project name. + 2. For location, if your organization has a Cloud account then select + this, otherwise leave as is. + + +#### Setup + +Follow the +[BigQuery Sandbox set up guide](https://cloud.google.com/bigquery/docs/sandbox). 
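With the sandbox set up, queries can be composed against the tables described under "Database structure". The following is a minimal sketch, not an official client: it only builds a Standard SQL string that lists the members of one protein cluster. The helper name `cluster_members_sql`, the table name `protein` and the accession pattern (`MGYP` followed by digits, inferred from the example accessions) are assumptions, and the dataset path is left as a parameter because this page does not state it.

```python
import re

# Assumption: accessions look like "MGYP" followed by digits,
# as in the example MGYP000261684433.
MGYP_PATTERN = re.compile(r"MGYP\d+")


def cluster_members_sql(dataset: str, representative: str, limit: int = 10) -> str:
    """Build a query listing proteins whose cluster representative is given.

    `dataset` is the fully qualified BigQuery dataset path, which the reader
    must supply. The table name `protein` is a hypothetical stand-in for the
    Protein table in the schema.
    """
    if not MGYP_PATTERN.fullmatch(representative):
        raise ValueError(f"not an MGYP accession: {representative!r}")
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # Only the columns that are needed are selected, not SELECT *.
    return (
        "SELECT mgyp, sequence\n"
        f"FROM `{dataset}.protein`\n"
        f"WHERE cluster_representative = '{representative}'\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # "my-project.my_dataset" is a placeholder, not the real dataset path.
    print(cluster_members_sql("my-project.my_dataset", "MGYP000261684433"))
```

The resulting string can be pasted into the BigQuery console. Selecting only the columns you need helps keep the amount of processed data within the monthly free allowance; `LIMIT` on its own generally does not reduce the bytes a query scans.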
+ +## Database structure + +The dataset in BigQuery has the following schema: + +```{mermaid} +erDiagram + ARCHITECTURE ||--o{ PROTEIN : has + PROTEIN { + string mgyp PK + string sequence + string sequence_sha256sum + string cluster_representative + string architecture_hash + json pfam + } + + PROTEIN ||--o{ METADATA : has + CONTIG ||--o{ METADATA : has + ASSEMBLY ||--o{ METADATA : has + GENE_CALLER ||--o{ METADATA : has + METADATA { + string mgyp FK + string mgyc FK + int assembly_id FK + int gene_caller_id FK + int start_position + int end_position + int strand + bool complete + string truncation + } + + STUDY ||--o{ ASSEMBLY : belongs + BIOME ||--o{ ASSEMBLY : has + ASSEMBLY { + int assembly_id PK + string accession + int study_id FK + int biome_id FK + string pipeline_version + } + + STUDY { + int study_id PK + string accession + } + + ASSEMBLY ||--|{ CONTIG : belongs + CONTIG { + string mgyc PK + int assembly_id FK + string contig_name + string sequence_hash + int contig_length + float kmer_coverage + } + BIOME { + int biome_id PK + string lineage + } + ARCHITECTURE { + string architecture_hash PK + string architecture + } + GENE_CALLER { + int gene_caller_id PK + string gene_caller + string version + } +``` + +### Tables + +#### Protein + +| Column Name | Mode | Data type | Description | +|--------------------------|----------|-----------|-----------------------------------------------------------------------| +| `mgyp` | REQUIRED | STRING | The MGnify Protein accession | +| `sequence` | | STRING | The protein amino acid sequence | +| `sequence_sha256sum` | | STRING | SHA-256 checksum of the amino acid sequence | +| `cluster_representative` | | STRING | The accession of the protein cluster representative. For cluster representatives, this value is equal to the MGYP. 
| +| `pfam` | | JSON | Pfam domain annotations for the protein | +| `architecture_hash` | | STRING | Hash of the protein's Pfam architecture; see the Architecture table | + +#### Study + +| Column Name | Mode | Data type | Description | +|-------------|----------|-----------|---------------------------------| +| `study_id` | REQUIRED | INTEGER | Study ID | +| `accession` | REQUIRED | STRING | The ENA study accession | + +#### Metadata + +| Column Name | Mode | Data type | Description | +|-------------------|----------|-----------|-----------------------------------------------------------------------| +| `mgyp` | REQUIRED | STRING | Protein MGYP accession | +| `mgyc` | REQUIRED | STRING | Contig MGYC accession | +| `assembly_id` | REQUIRED | INTEGER | Assembly ID | +| `gene_caller_id` | REQUIRED | INTEGER | Gene Caller ID | +| `start_position` | | INTEGER | Start position coordinate of the protein in the contig | +| `end_position` | | INTEGER | End position coordinate of the protein in the contig | +| `strand` | | INTEGER | Strand of the protein on the contig: 1 for positive-strand, -1 for negative-strand. | +| `complete` | | BOOLEAN | True if the protein is full-length; false if it is a fragment. | +| `truncation` | | STRING | Prodigal truncation notation: `00` for a full-length protein; `01`, `10` and `11` for fragments. 
| + +#### Gene caller + +| Column Name | Mode | Data type | Description | +|------------------|----------|-----------|--------------------------------------| +| `gene_caller_id` | REQUIRED | INTEGER | Gene caller ID | +| `gene_caller` | | STRING | The gene caller software name | +| `version` | | STRING | Software version | + +#### Contig + +| Column Name | Mode | Data type | Description | +|-------------------|----------|-----------|-----------------------------------------------------------------------| +| `mgyc` | REQUIRED | STRING | The contig MGYC accession | +| `assembly_id` | REQUIRED | INTEGER | Assembly ID | +| `contig_name` | | STRING | The contig name in the assembly files | +| `sequence_hash` | | STRING | SHA-256 checksum of the nucleotide sequence of the contig | +| `contig_length` | | INTEGER | Length of the contig in base pairs (bp) | +| `kmer_coverage` | | FLOAT | k-mer coverage as reported by the assembler | + +#### Biome + +| Column Name | Mode | Data type | Description | +|-------------------|----------|-----------|-----------------------------------------------------------------------| +| `biome_id` | REQUIRED | INTEGER | Biome ID | +| `lineage` | | STRING | Biome lineage encoded by separating the hierarchy with colons (:). 
The biomes are based on the GOLD classification | + +#### Assembly + +| Column Name | Mode | Data type | Description | +|-------------------|----------|-----------|-----------------------------------------------------------------------| +| `assembly_id` | | INTEGER | Assembly ID | +| `accession` | REQUIRED | STRING | The ENA assembly accession | +| `study_id` | | INTEGER | Study ID | +| `biome_id` | | INTEGER | Biome ID | +| `pipeline_version`| | STRING | The version of the MGnify pipeline used to call the proteins in this assembly | + +#### Architecture + +| Column Name | Mode | Data type | Description | +|---------------------|----------|-----------|---------------------------------------------| +| `architecture_hash` | REQUIRED | STRING | SHA-256 checksum of the architecture string | +| `architecture` | REQUIRED | STRING | The Pfam architecture string | + + +## Licence + +Data is available for academic and commercial use, under a +[CC0 1.0 Universal Licence](https://creativecommons.org/publicdomain/zero/1.0/legalcode). + +If you make use of the MGnify Protein Database, please cite the following paper: + +* [Richardson, L., Allen, B., Baldi, G., Beracochea, M., Bileschi, M. L., Burdett, T., Burgin, J., Caballero-Pérez, J., Cochrane, G., Colwell, L. J., Curtis, T., Escobar-Zepeda, A., Gurbich, T. A., Kale, V., Korobeynikov, A., Raj, S., Rogers, A. B., Sakharova, E., Sanchez, S., Wilkinson, D. J., Finn, R. D. MGnify: the microbiome sequence data analysis resource in 2023. 
*Nucleic Acids Research* (2023).](https://doi.org/10.1093/nar/gkac1080) diff --git a/src/docs/sequence-search.md b/src/docs/mgnify-proteins-sequence-search.md similarity index 97% rename from src/docs/sequence-search.md rename to src/docs/mgnify-proteins-sequence-search.md index 395c728..7e53574 100644 --- a/src/docs/sequence-search.md +++ b/src/docs/mgnify-proteins-sequence-search.md @@ -7,13 +7,12 @@ author: affiliation-url: https://www.ebi.ac.uk date: last-modified citation: true -description: Guide to using MGnify's peptide sequence search service -order: 9 +description: Guide to using MGnify's Protein Database sequence search service --- ## Landing page -The sequence search (accessed by following the ‘Sequence search’ link from menu bar) +The sequence search (accessed by following the ‘Sequence search’ link from the MGnify web page menu bar) provides a search against a catalogue of predicted peptides. ![The landing page of the sequence search tool](images/sequence_search/sequence_search_landing-v5.png){#fig-sequence-search-landing} diff --git a/src/docs/mgnify-proteins-web.md b/src/docs/mgnify-proteins-web.md new file mode 100644 index 0000000..447f700 --- /dev/null +++ b/src/docs/mgnify-proteins-web.md @@ -0,0 +1,64 @@ +--- +title: Portal +author: + - name: MGnify + url: https://www.ebi.ac.uk/metagenomics + affiliation: EMBL-EBI + affiliation-url: https://www.ebi.ac.uk +date: last-modified +citation: true +description: Guide to using MGnify's Proteins website +--- + +# MGnify Proteins Portal + +The MGnify Proteins portal provides detailed information for protein cluster representatives derived from metagenomic assemblies. Due to the size of the database, detailed pages are generated exclusively for cluster representatives, rather than for each individual protein sequence. 
+ +![Homepage of MGnify Proteins](images/proteins/mgnify-proteins-home-page.png) + +## Sequence Search + +The MGnify Proteins portal features a search form that allows users to submit queries to the [Sequence Search](mgnify-proteins-sequence-search.md). Search queries are processed by the search service, and the results link back to the protein detail pages within the portal. + +## Organization of the Protein Detail Page + +Each cluster representative from the [Protein Database](mgnify-proteins.md) has a dedicated detail page. This page includes metadata about the protein and, when available, its 3D structure as predicted by the [ESM Metagenomics Atlas](https://esmatlas.com/) team. Not all clusters have a predicted structure, as the release cycles of MGnify Proteins and the ESM Atlas are independent. The MGnify team is not involved in the protein structure predictions; instead the MGnify Proteins portal links to the ESM Atlas service, which is generously provided for the community by the ESM Atlas team. + +### Protein Information + +This section at the top of the page provides essential details about the protein: + +- **Protein Accession**: The unique identifier for the protein (e.g., MGYP000261684433). +- **Cluster Size**: Indicates the number of proteins in the cluster (e.g., 1 protein). +- **Full-Length ORF**: Specifies whether the sequence represents a full-length open reading frame (ORF). A checkmark indicates a full-length ORF, while an 'X' denotes a fragment. +- **Biome**: Displays the [biome(s)](glossary.md#biome) of the samples from which this protein was derived (e.g., "Marine"). +- **Pfam Annotations**: A table displaying [Pfam](https://www.ebi.ac.uk/interpro/entry/pfam/) domains identified within the protein. 
+ +![Protein Detail Header](images/proteins/mgnify-proteins-detail-header.png) + +### 3D Structure + +This section displays a 3D structure of the protein, predicted by the [ESM Metagenomics Atlas](https://esmatlas.com/) team using [ESMFold](https://www.science.org/doi/10.1126/science.ade2574): + +- **3D Structure Visualization**: A graphic representation of the protein structure generated from the amino acid sequence. +- **More Information**: A link below the visualization directs users to additional details about the [ESM Metagenomics Atlas](https://esmatlas.com/). + +![ESMFold Protein Predicted Structure](images/proteins/mgnify-proteins-detail-structure.png) + +### Assemblies Information + +This table lists the assemblies from which the protein sequence was derived: + +- **Study**: The [Study](glossary.md#study) ID associated with the [assembly](glossary.md#assembly). +- **Assembly ID**: The specific ID for the [assembly](glossary.md#assembly). +- **Contig**: The contig within the assembly where the protein sequence is located. +- **Contig Start/End**: Coordinates indicating the start and end positions of the protein on the contig. +- **Strand**: Indicates whether the protein is on the positive or negative DNA strand. + +![Assemblies where the Protein was Found](images/proteins/mgnify-proteins-detail-assemblies.png) + +### Amino Acid Sequence Viewer + +This section provides a detailed view of the amino acid sequence for the protein. 
+ +![Protein Amino Acid Sequence](images/proteins/mgnify-proteins-detail-sequence.png) diff --git a/src/docs/mgnify-proteins.md b/src/docs/mgnify-proteins.md new file mode 100644 index 0000000..1636104 --- /dev/null +++ b/src/docs/mgnify-proteins.md @@ -0,0 +1,34 @@ +--- +title: MGnify Proteins Resource +author: + - name: MGnify + url: https://www.ebi.ac.uk/metagenomics + affiliation: EMBL-EBI + affiliation-url: https://www.ebi.ac.uk +date: last-modified +citation: true +description: Description of the MGnify Proteins resource and related services +--- + +# MGnify Proteins Resource + +## Introduction + +The MGnify Protein Database comprises sequences predicted from [assemblies](glossary.md#assembly) generated from publicly available [metagenomic](glossary.md#metagenomic) datasets. Since its initial release in August 2017, which comprised just under 50 million sequences, it has grown to over 2.4 billion sequences. All sequences have stable accessions, prefixed with MGYP, such as [MGYP000261684433](https://www.ebi.ac.uk/metagenomics/proteins/MGYP000261684433/). Due to the dataset's size, sequences are clustered at 90% identity using [MMseqs2/Linclust](https://github.com/soedinglab/MMseqs2). Despite clustering, the sequences still capture the biological complexity inherent in metagenomic data. + +The dataset is accessible via several platforms: + +- **FTP Server**: Available for download from our [FTP server](http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/). +- **HMMER Sequence Search Webservice**: Accessible through our [Sequence Search service](mgnify-proteins-sequence-search.md). +- **MGnify Proteins Portal**: Explore the data on the [MGnify Proteins web portal](mgnify-proteins-web.md). +- **Google Cloud Public Dataset**: Available as a [BigQuery public dataset](mgnify-proteins-big-query.qmd) on [Google Cloud](https://cloud.google.com/). 
+ +![Schematic of MGnify Proteins resource](images/proteins/mgnify-proteins-schematic.png) + +## Licence + +The data is available for both academic and commercial use under a [CC0 1.0 Universal Licence](https://creativecommons.org/publicdomain/zero/1.0/legalcode). + +If you make use of the MGnify Protein Database, please cite the following paper: + +* Richardson, L., Allen, B., Baldi, G., Beracochea, M., Bileschi, M. L., Burdett, T., Burgin, J., Caballero-Pérez, J., Cochrane, G., Colwell, L. J., Curtis, T., Escobar-Zepeda, A., Gurbich, T. A., Kale, V., Korobeynikov, A., Raj, S., Rogers, A. B., Sakharova, E., Sanchez, S., Wilkinson, D. J., Finn, R. D. MGnify: the microbiome sequence data analysis resource in 2023. *Nucleic Acids Research* (2023). [https://doi.org/10.1093/nar/gkac1080](https://doi.org/10.1093/nar/gkac1080)