GRUPS-rs : A high-performance rust port and update of grups.
GRUPS-rs (Get Relatedness Using Pedigree Simulations) is an ancient-DNA genetic relatedness estimation software relying on pedigree simulations using existing genotype callsets - typically, the 1000G-phase3 dataset. This software is a pure rust implementation and update of the "GRUPS" method, initially developed in Martin, et al. (2017).
GRUPS-rs currently supports the simulation of user-defined pedigree in thousand of replicates, to compute the expected genetic distances of two individuals under various relatedness scenarios (See section: Defining custom pedigrees), including complex inbreeding scenarios such as double first-cousins, three-quarters siblings, sesqui first-cousins, etc.
Modern human contamination, sequencing errors and allele-fixation rate parameters can also be introduced during simulations to account for the biases they may introduce in real-life datasets.
If you plan to install from source, you'll need:
If you plan to use the grups.plots
companion shiny dashboard, you'll need to have R, along with the devtools
library. A version of R
that is >=4.1.2
is recommended.
This project is written in Rust, and thus requires cargo for source compilation. The current minimum supported rust version is 1.66
.
To install the latest version of cargo:
curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh
See Cargo.toml
for a complete list of crate dependencies (these are automatically downloaded by cargo when setting up compilation)
Rust uses ordinally partitionned Sequence Vector Machines (SVMOPs) to find the best separation between expected distributions of relatedness and thus requires the libsvm library to be installed on your workspace. (a version >=2.34
is recommended.)
- On Ubuntu:
sudo apt-get install libsvm-dev
- On Mac OS:
brew install libsvm
The original publication of libsvm can be found here. More specifically, see this paper for a more detailled explanation regarding SVMOPs.
-
Clone this repository (Note the
--recursive
flag is required if you wish to download the appropriate version ofgrups.plots
)git clone --recursive https://github.com/MaelLefeuvre/grups-rs.git cd grups-rs
-
(MacOS only) Manually target the
include
andlib
directory of libsvm, by exporting theLIBSVM_INCLUDE
andLIBSVM_LIBRARY
environment variablesexport LIBSVM_INCLUDE="$(brew --prefix libsvm)/include" export LIBSVM_LIBRARY="$(brew --prefix libsvm)/lib"
-
(Optional) Run the test-suite from the project's root
cargo test --workspace
-
Compile and install
RUSTFLAGS="-Ctarget-cpu=native" cargo install --path .
-
grups-rs
should be located within~/.cargo/bin/
and included in your PATHgrups-rs
grups-rs 0.2.0 Maël Lefeuvre <mael.lefeuvre@mnhn.fr> GRUPS-rs: Get Relatedness Using Pedigree Simulations USAGE: grups-rs [OPTIONS] <SUBCOMMAND> OPTIONS: -h, --help Print help information -q, --quiet Disable warnings -v, --verbose Set the verbosity level (-v -vv -vvv -vvvv) -V, --version Print version information SUBCOMMANDS: cite from-yaml Run GRUPS using a previously generated .yaml config file fst Convert VCF files into FST-indexes (Finite State Transcucer) help Print this message or the help of the given subcommand(s) pedigree-sims Run pwd-from-stdin and perform pedigree simulations in one go pwd-from-stdin Compute the average `PairWise Difference` (PWD) within a pileup file
Installing grups.plots
can be installed using a single command, provided you already have R
and the devtools
library preinstalled.
-
(Optional) Install the devtools library, if it is not preinstalled
R --slave -e 'if (!require("devtools")) install.packages("devtools", repos = c(CRAN = "https://cloud.r-project.org"))'
-
Install
grups.plots
R --slave -e 'devtools::install("./grups.plots")'
See the documentation of grups.plot
to learn how to launch this companion package
There are multiple ways to obtain information about the different submodules of grups-rs
- Typing
grups-rs [submodule] -h
will display a short description of every available command line argument for a given command - Typing
grups-rs [submodule] --help
orgrups-rs [submodule] help
will display a verbose description of every available command line argument.
See the section Parameter List, for a detailled description of every module's command line interface.
grups-rs
requires 4 types of additional input files and datasets to function:
- A reference dataset of phased, modern human genotypes, such as the 1000g-phase3 dataset (See subsection SNP callset).
- An input panel definition file, specifying the name, population, and super-population of each sample found within the reference dataset (See subsection Input panel definition file).
- A set of per-chromosome recombination map files, such as the phaseII HapMap dataset (See subsection Recombination Maps).
- A user-defined pedigree definition file. A set of pre-defined files can be found in the
resources/pedigrees
directory of this repository. See section Defining custom pedigrees, for a detailled explanation on how to create custom template pedigrees.
GRUPS-rs requires an SNP-callset in the form of .vcf
or .vcf.gz
files to perform pedigree simulations. For most intents and purposes, the 1000g-phase3 dataset may provide with a good start, but any dataset of input VCF files will work, provided they carry phased diploid genotypes, and contain the appropriate required tags within the INFO
field - see Caveats.
The 1000g-phase3 dataset can be downloaded here.
GRUPS-rs will require an input panel definition file to distinguish (super-)populations and define samples within your SNP Callset. If you plan to use the 1000G-phase3
callset, a predefined panel can be previewed and downloaded here
This file must be unheaded, tab-separated, and should at least contain the following columns: <SAMPLE-ID> <POP-ID> <SUPER-POP-ID>
GRUPS-rs requires a genetic recombination map to simulate meiosis. For main intents and purposes, and when using the 1000g-phase3
callset, we recommend the HapMap-II-b37 map, which can be downloaded here
If you plan to use an alternative SNP Callset, here are a few caveats you should keep in mind when preparing your input:
-
GRUPS-rs will require an input panel definition file to your SNP Callset. See the Input panel definition file section for the appropriate format, and/or check the 1000G-phase3
.panel
file as a template here. -
As of now, GRUPS-rs does not calculate population allele frequencies by default, and will rather look through the
INFO
field (column 7) of your callset for the appropriate tag. Thus, if you plan to simulate genomes using theEUR
population, each entry within your VCF files should carry aEUR_AF
info tag, specifying the alternative allele frequency for that population. The bcftools +fill-tags plugin documentation to may be of help, should you wish annotate population-specific allele frequencies on your dataset. -
As of now, GRUPS-rs distinguishes relevant SNP coordinates within the callset using various
INFO
field annotations. If you plan to use a different SNP-callset, either ensure your.vcf
files are correctly annotated with the following tags, or make sure to filter all position that are not bi-allelic SNPs have been thoroughly filtered-out from your dataset before using GRUPS-rs:- Poly-allelic sequence variations are distinguished (and ignored) by searching for the
MULTI_ALLELIC
tag. - SNPs are distinguished from other types of mutation by searching for the
VT=SNP
tag.
- Poly-allelic sequence variations are distinguished (and ignored) by searching for the
-
By default, GRUPS-rs will consider the provided SNP callset as being called on the
GRCh37
reference genome. If your callset has been generated using another reference genome, we recommend to provide the software with a fasta index file (.fa.fai
) of your reference, using the--genome
argument (See the pwd-from-stdin parameter list section for more information).
The following section provides an quick explanation of each module of grups-rs
, skip to section Quick Start if you wish to jump in straight ahead with a more pragmatic example.
pedigree-sims
is the main module of grups-rs
, and will both compute the observed pairwise mismatch rates for your samples, and perform pedigree simulations to estimate the most likely relationship between each pair.
A basic example, using provided dummy test files:
grups-rs pedigree-sims --pileup ./tests/test-data/pileup/parents-offspring.pileup \
--data-dir ./tests/test-data/vcf/binary-2FIN-1ACB-virtual/ \
--recomb-dir ./tests/test-data/recombination-map/ \
--pedigree ./tests/test-data/pedigree/tiny_pedigree.txt \
--samples 0-2 \
--sample-names MDH1 MDH2 MDH3 \
--output-dir pedigree-sims-output \
--reps 1000 \
--quiet
This should create a single output directory called grups-output
. More specifically, the pedigree-sims
module will generate all of the output files generated by pwd-from-stdin
, as well as:
- an additional
.result
file, containing summary statistics and results for all pedigree simulations. - a set of
.sims
files, one for each pairwise comparison. These file contain raw simulation results for each pairwise comparison, and are located in thesimulations
subdirectory.
See the section Output Files for a detailled explanation of each output files.
This module is available if you simply wish to quickly examine the pairwise mismatch rate between samples without performing any pedigree simulations.
A basic example, using provided dummy test files:
grups-rs pwd-from-stdin --pileup ./tests/test-data/pileup/parents-offspring.pileup \
--samples 0 2 \
--sample-names MDH1 MDH3 \
--min-depth 2 2 \
--self-comparison
This should create a single output directory called grups-output
. More specifically, pwd-from-stin
is expected to output:
- A
.pwd
file, containing summary statistics and the raw observed pairwise mismatch rate for all pairwise comparisons. - A set of
.blk
files, one for each pairwise comparison. These file contain raw observed pairwise mismatch rates in non-overlapping blocks (--blocksize
= 1Mb, by default).
See the section Output Files for a detailled explanation of each input files.
As GRUPS-rs must repeatly fetch genotypes from randomly sampled individuals during pedigree simsulations, I/O operations can become a strong performance bottleneck when running this software. Working with heavy .vcf.gz
files such as the 1000g project can thus have its toll on performance. For extended uses of grups-rs
on a predefined dataset, indexing your .vcf
files as a set of FSA can in many cases greatly increase performances.
This operation can be done with the fst
module of grups-rs
.
On top of this, FST-indexation has the added benefit of performing prefiltration of unwanted entries within the input `.vcf`` file. Most notably:
- Duplicate coordinate entries.
- Coordinates containing a
MULTI_ALLELIC
tag within theINFO
field. - Entries which are not of type
VT=SNP
.
Furthermore, the fst
module can also be useful to filter out individuals from unwanted population entries, as well as (re-)computing population allele frequencies (see section Performing population subsets with the fst
module).
grups-rs fst --vcf-dir ./tests/test-data/vcf/binary-2FIN-1ACB-virtual/ --output-dir ./test-fst-index
In this example, grups-rs
index any .vcf[.gz]
file found within the provided binary-2FIN-1ACB-virtual
input directory, and output its contents within the test-fst-index
. The expected output is a set of two finite state automaton (.fst
and .fst.frq
), one for each discovered input .vcf[.gz]
file:
.fst
files indexes the genotype information of all retained samples, at each valid genotype coordinate..fst.frq
files indexes population allele frequencies for each valid genotype coordinate.
Altough it remains a one-time operation, FSA-indexation can be quite long and resource intensive (e.g.: around 40 minutes is required to encode the ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
file of the 1000g-phase database). Thus, the use of multithreading across .vcf.gz
files is highly recommended, provided your computer is equipped with multiple cores.
grups-rs fst --threads 22 --vcf-dir ./tests/test-data/vcf/binary-2FIN-1ACB-virtual/ --output-dir ./test-fst-index
Note that multithreading is performed across the number of discovered input .vcf[.gz]
files. Thus, if the directory specified by --vcf-dir
contains 22 files, there is no point in recruiting more than 22 threads.
If the user expects to use only a single pedigree and contaminating population, FSA indexation can be used to filter-out unused samples from the original VCF file. Furthermore, the use of the optional --compute-pop-afs
flag can be useful to (re-)compute population allele frequencies.
grups-rs fst \
--vcf-dir ./tests/test-data/vcf/binary-2FIN-1ACB-virtual/ \
--output-dir ./test-fst-index \
--pop-subset FIN AFR \
--compute-pop-afs
Once the indexation is completed, .fst
and .fst.frq
files can be used seamlessly when performing pedigree simulations. The user merely has to specify the input type using the --mode
argument. Specifying a target directory is performed in the same way, using --data-dir
.
grups-rs pedigree-sims --pileup ./tests/test-data/pileup/parents-offspring.pileup \
--data-dir ./test-fst-index \
--recomb-dir ./tests/test-data/recombination-map/ \
--pedigree ./tests/test-data/pedigree/tiny_pedigree.txt \
--samples 0-2 \
--sample-names MDH1 MDH2 MDH3 \
--reps 1000 \
--quiet \
--mode fst-mmap
FSA-encoded files can be used in one of two ways:
--mode fst-mmap
: This mode is highly recommended if your files are located within an SSD drive, as it allows grups-rs
to efficiently make use of memory-mapped files, to reduce its memory consumption.
--mode fst
: This mode will load each .fst
and .fst.frq
file within grups-rs
heap memory. This may make grups-rs
faster, at the cost of a higher memory footprint. In general, this mode is only recommended if you are sure your computer has enough RAM, and your files are located within an HDD drive.
When executing the pedigree-sims
or pwd-from-stdin
modules, GRUPS-rs will automatically serialize your command line arguments and generate a timestamped .yaml
configuration file containing every provided argument for the given run.
This file will be located at the root of your output directory (which can be specified using --output-dir
).
To relaunch grups-rs
using the exact same configuration, simply run grups-rs
using the from-yaml
module, and provide the path to the desired .yaml
file
grups-rs from-yaml ./grups-output/2022-06-13T162822-pedigree-sims.yaml
Note that the --quiet
and --verbose
arguments found within the .yaml
file are ignored by the from-yaml
module, and must be-respecified when running that command :
grups-rs from-yaml ./grups-output/2022-06-13T162822-pedigree-sims.yaml --verbose
In this example, our objective is to apply grups-rs
on a set of three individuals found buried within a collective grave from Mentesh-Tepe (MT23
MT26
and MT7
). Two of these samples - MT23
and MT26
- were previously determined to be siblings (Garino-Vignon et al. 2023).
In this example, we'll be starting from scratch, and assume that you don't have any of the data required to begin this analysis on your computer. Thus, our workflow can be subdivided in three steps:
- Download and process all of the datasets and files required for this analysis (i.e: (i) 1000g-phase3 dataset, (ii) HapMap-phase2 recombination map, (iii) GRCh37 reference genome, and (iv) Allen Ancient DNA Resource (AADR) "1240k" SNP variant callset).
- Download and pileup our three input individuals
MT23
MT26
andMT7
usingsamtools mpileup
. Preprocessed alignment files for these samples are available on the European Nucleotide Archive, with accession number PRJEB54894. - Index the 1000g-phase3 dataset into a set of FSA-encoded files, using the
fst
module ofgrups-rs
. - Perform kinship analysis, by applying the
pedigree-sims
module on the input pileup file, and visualize results usinggrups.plots
-
Download the 1000g-phase3 dataset (chromosomes 1-22), along with its input panel definition file:
mkdir -p data/1000g-phase3/ URL="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502" wget -nd -r -P data/1000g-phase3/ ${URL}/ALL.chr*.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz* wget -nd -P data/1000g-phase3/ ${URL}/integrated_call_samples_v3.20130502.ALL.panel
-
Download the HapMapII recombination map (chromosomes 1-22).
mkdir -p data/hapmap-phase2-b37/ TARBALL="http://ftp.ncbi.nlm.nih.gov/hapmap/recombination/2011-01_phaseII_B37/genetic_map_HapMapII_GRCh37.tar.gz" wget -qO- ${TARBALL} | tar -xvzf- -C data/hapmap-phase2-b37/ --wildcards 'genetic_map_GRCh37_chr[0-9]*.txt'
-
Download a predefined pedigree definition file.
mkdir -p data/pedigrees PED_URL="https://gist.githubusercontent.com/MaelLefeuvre/b6f3baff1964defea432ebd2c6dca6dc/raw/465ab9478724dba1575d20e45fb88f06055ac55f/simple-pedigree.txt" wget -P data/pedigrees ${PED_URL}
-
Download and decompress the
GRCh37
reference genomeREFGEN_URL="http://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz" mkdir -p data/reference/ wget -P data/reference/ ${REFGEN_URL} bgzip -d data/reference/$(basename ${REFGEN_URL})
-
Download and format the AADR 1240K SNP callset
mkdir -p data/targets AADR_SNP_URL="https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/V52/V52.2/SHARE/public.dir/v52.2_1240K_public.snp" wget -O- ${AADR_SNP_URL} | awk 'BEGIN{OFS="\t"}{print $2, $4}' > data/targets/v52.2_1240K_public.bed
-
Download samples from ENA and pileup these using samtools
- Fetch the ENA URL of
MT23
,MT26
andMT7
; Store these inena-bam-urls.tsv
ENA_TSV="https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJEB54894&result=read_run&fields=submitted_ftp&format=tsv&download=true&limit=0" wget -qO- ${ENA_TSV} | grep -oP "ftp.*/MT[0-9]{1,2}.fastq.*.bam(?=;)" | sed 's/^/ftp:\/\//' > ena-bam-urls.tsv
- run
samtools mpileup
samtools mpileup \ -RBqQ25 \ -l data/targets/v52.2_1240K_public.bed \ -f data/reference/Homo_sapiens.GRCh37.dna.primary_assembly.fa \ -b ena-bam-urls.tsv \ > mentesh-tepe.garino-vignon-commsbio-2023.RBqQ25.1240k.GRCh37.mpileup
- Fetch the ENA URL of
mkdir -p data/fst/
grups-rs fst --threads 22 --vcf-dir data/1000g-phase3/ --output-dir data/fst/EUR --pop-subset EUR --verbose
-
Run
pedigree-sims
grups-rs pedigree-sims \ --pileup mentesh-tepe.garino-vignon-commsbio-2023.RBqQ25.1240k.GRCh37.mpileup \ --data-dir data/fst/EUR \ --recomb-dir data/hapmap-phase2-b37 \ --output-dir grups-rs-output-mentesh-tepe \ --pedigree data/pedigrees/simple-pedigree.txt \ --samples 0-2 \ --sample-names MT7 MT23 MT26 \ --mode fst-mmap \ --reps 1000 \ --min-qual 25 \ --pedigree-pop EUR \ --contam-pop EUR \ --verbose
-
Visualize results with
grups.plots
This can be done in a single command, directly from bash...R --slave -e 'grups.plots::app(data_dir="grups-rs-output-mentesh-tepe")'
...or from an interactive R session.
library(grups.plots) grups.plots::app(data_dir="grups-rs-output-mentesh-tepe")
Defining pedigrees within GRUPS-rs is performed through simple definition files. See the main example template pedigree definition file here. Other examples may be found in the resources/pedigrees subdirectory of this repository. GRUPS-rs currently supports two alternative formats, which are described below.
The standard pedigree definition format of grups-rs
can be subdivided into two sections:
-
The first section takes charge of defining the individuals found within the family tree, and its topology. Here, this particular section extensively, mirrors the commonly found
.ped
file format of thePEDSTATS/QTDT
software, or PLINK's.fam
files.- Here, merely three columns are required from these previously mentionned file formats. Here, the
iid
column specifies the within-family id of the individual, whilefid
andmid
both specify the ids of the individuals parents (See below):# First section: define the pedigree's topology iid fid mid Ind1 0 0 # Ind1 and Ind2 are defined as founder individuals Ind2 0 0 Ind3 Ind1 Ind2 # Ind3 is defined as the offspring of Ind1 and Ind2
- Here, merely three columns are required from these previously mentionned file formats. Here, the
-
The second section of the file takes charge of specifying which pairwise comparisons should
GRUPS-rs
specifically investigate within the template family tree. Here, every line beginning with theCOMPARE
keyword is considered as a "comparison definition" entry byGRUPS-rs
, and is expected to adhere to the following scheme:COMPARE <label> <iid-1> <iid-2>
Where,
<label>
is the user-defined name for the given comparison (e.g. 'first-degree', 'cousins', 'unrelated', etc.)<iid-1>
is the individual id of the first sample being compared<iid-2>
is the individual id of the second sample being compared
Example:
COMPARE Unrelated Ind1 Ind2 COMPARE First Ind1 Ind3 COMPARE Self Ind3 Ind3
Note that, while requiring only three columns, GRUPS-rs
is able to directly parse the previously mentionned .fam
and .ped
formats, provided that users manually annotate the required second section at the bottom of these files. Hence, an quick and intuitive way to design custom pedigree files for GRUPS-rs is to:
- Visually generate the first section of the file, using the interactive
QuickPed
online software (Vigeland M.D. 2022). - Export and save the output of QuickPed as a
.ped
file - Manually append the desired 'COMPARE' entries at the bottom of this file.
On top of the current standard format, GRUPS-rs
remains entirely backwards compatible with the previous file format of `GRUPS``.
In essence, the legacy pedigree definition file of GRUPS is defined and parsed in three distinct steps, each one tied to a keyword within the definition file:
-
INDIVIDUALS
: Define the individuals within the pedigree.-
Individuals are then defined by a unique, line-separated id or name.
-
Ids must not contain any whitespace
-
Dashes, underscores, and other special characters are allowed, but we recommend that users stick to using alphanumeric characters.
-
Example:
# Define individuals within the pedigree INDIVIDUALS father mother child
-
-
RELATIONSHIPS
: Define the parents for each offspring.- Relationships are parsed by targeting the
=repro()
regular expression, through this nomenclature:<offspring-id>=repro(<parent1-id>, <parent2-id>)
- Example:
# Define offspring relationships RELATIONSHIPS child=repro(father,mother)
- Relationships are parsed by targeting the
-
COMPARISONS
Define which pairwise comparisons should GRUPS-rs investigate to compute genetic distances.- Each comparison is defined by a unique, line-separated id or name (e.g. 'parents', 'siblings').
- comparison ids can contain whitespaces, and various special characters (though we recommend sticking to alphanumeric characters and underscores).
- Comparisons are then parsed by targeting the
=compare()
regular expression, through this nomenclature:<relationship-id>=compare(<individual1-id>, individual2-id>)
- Example:
# Define pairwise comparisons COMPARISONS unrelated=compare(father,mother) 1st_degree=compare(child, mother)
Note that comment lines are allowed within the definition file: any line starting with a #
character is ignored during parsing.
Below is a more complete definition file, where the intent would be to investigate the following six kinship ties:
label | description |
---|---|
inbred-self |
compare an inbred individual to itself |
self |
compare an outbred individual to itself (r=1) |
first |
compare two first-degree relatives (r=0.5) |
second |
compare two outbred second-degree relatives (r=0.25) |
inbred-second |
compare two three-quarter siblings relatives due to inbreeding (r=0.25+0.125). |
unrelated |
compare two unrelated individuals (r=0) |
Thus, an appropriate family tree topology could be as follows:
Where founder and simulated individuals are colored in teal and lavander, respectively. Green arrows represents the comparisons that GRUPS-rs is requested to perform.
Here, a template family tree such as this one can be defined as the following:
Standard format
# standard format
Ind1 0 0
Ind2 0 0
Ind3 Ind1 Ind2 # Ind3 and Ind4 are defined as the childreb of Ind1 and Ind2
Ind4 Ind1 Ind2
Ind5 Ind3 Ind4 # Ind5 is defined as an inbred individual, since Ind3 and Ind4 are siblings.
Ind6 0 0
Ind7 Ind4 Ind6
COMPARE inbred-self Ind5 Ind5
COMPARE self Ind3 Ind3
COMPARE first Ind2 Ind4
COMPARE inbred-second Ind5 Ind7
COMPARE second Ind2 Ind7
COMPARE Unrelated Ind1 Ind2
Legacy format:
Alternatively, one could also define this family tree using the legacy format of GRUPS
in such a manner
# legacy format
INDIVIDUALS
Ind1
Ind2
Ind3
Ind4
Ind5
Ind6
Ind7
RELATIONSHIPS
Ind3=repro(Ind1,Ind2) # Ind3 is defined as the child of Ind1 and Ind2.
Ind4=repro(Ind1,Ind2) # Ind4, as the child of Ind1 and Ind2.
Ind5=repro(Ind3,Ind4) # Ind5, as the child of Ind3 and Ind4. (Ind5 is thus an inbred individual, since Ind3 and Ind4 are siblings)
Ind7=repro(Ind4,Ind6) # Ind7, as the child of Ind4 and Ind6.
COMPARISONS
inbred-self=compare(Ind5,Ind5) # Compare inbred individual Ind5 to itself. label this relationship as 'inbred-self'
self=compare(Ind3,Ind3) # Compare outbred individual Ind3 to itself. label this relationship as 'self'
first=compare(Ind2,Ind4) # Compare Ind2 and Ind4. label this relationship as 'first'
inbred-second=compare(Ind5,Ind7) # Compare Ind5 and Ind7. label this relationship as 'inbred-second'
second=compare(Ind2,Ind7) # Compare Ind2 and Ind7. label this relationship as 'second'
unrelated=compare(Ind1,Ind2) # Compare Ind1 and Ind2. label this relationship as 'unrelated'
Note that founder individuels (i.e. Ind1
, Ind2
and Ind6
) are only defined in the INDIVIDUALS
section and do not require to be defined in the RELATIONSHIPS
section.
Other example pedigree definition files may be found in the resources/pedigrees subdirectory of this repository.
.pwd
output files contain summary statistics and results regarding all pairwise raw pwd-from-stdin
module, are tab-separated and headed.
Column | Type | Description |
---|---|---|
Pair_name |
string | Descriptive label for a given pairwise comparison. Labels are in the form <IND i>-<IND j>
|
Raw.Overlap |
integer | Number of overlapping SNPs, which met the provided treshold of --min-depth on both samples |
Raw.Sum.PWD |
float | Sum of the long-term average pairwise mismatch rates for all overlapping positions |
Raw.Avg.PWD |
float | Average Pairwise Mismatch Rate, i.e.: raw Raw.Sum.PWD / Raw.Overlap
|
Raw.CI.95 |
float | Raw 95% Confidence interval for Raw.Avg.PWD
|
Raw.Avg.Phred |
float | Average Phred score for all overlapping positions (Scale: PHRED-33) |
.result
files contain summary statistics and results regarding pedigree simulations results. This most notably contains information regarding the most likely estimated relationship, given pedigree simulations results, as well as all pairwise corrected $\widehat{PWD}^{obs}. This file emanates from the pedigree-sims
module, are tab-separated and headed.
Column | Type | Description |
---|---|---|
Pair_name |
string | Descriptive label for a given pairwise comparison. Labels are in the form <IND i>-<IND j>
|
Most_Likely_rel |
string | Estimated Most Likely Relationship, given pedigree simulations. Note that this column says nothing about significance |
Corr.Overlap |
integer | Corrected Number of overlapping SNPs, after filtering out positions not found within --data-dir or below the provided --maf threshold |
Corr.Sum.PWD |
float | Corrected um of long-term average pairwise mismatch rates, after filtering positions not found within --data-dir , or below the provided --maf threshold |
Corr.Avg.PWD |
float | Corrected Average Pairwise Mismatch Rate, i.e.: Corr.Sum.PWD / Corr.Overlap
|
Corr.CI.95 |
float | Corrected 95% confidence interval for Corr.Avg.PWD
|
Corr.Avg.Phred |
float | Corrected average Phred score, after filtering positions not found within --data-dir , or below the provided --maf threshold` |
Sim.Avg.PWD |
float | Average Most_Likely_rel ) distribution. |
Min.Z_Score |
float | Z-score between Corr.Avg.PWD and the distribution of the most_likely relationship (Most_Likely_rel ) |
.sims
files contain pairwise-specific raw simulation results. In other terms, this file contains information, and most notably the pedigree-sims
module, are tab-separated and unheaded, and are located within the simulations
subdirectory.
Field number | Type | Description |
---|---|---|
0 | integer | Pedigree simulation replicate index |
1 | string | User-defined pedigree comparison label |
2 | string | User-defined label of the first pedigree individual being compared |
3 | string | User-defined label of the second pedigree individual being compared |
4 | string | label of the randomly selected reference for the first individual being compared (Founder individuals only - None , when the individual is simulated) |
5 | string | label of the randomly selected refereence for the second individual being compared (Founder individuals only - None when the individual is simulated) |
6 | integer | Sum of simulated pairwise differences for this comparison |
7 | integer | Sum of simulated overlap for this comparison |
8 | float | Average simulated pairwise mismatch rate, or $\widehat{PWD}^{sim} for this comparison |
.blk
files contain pairwise-specific raw observed pairwise mismatch rates within non-overlapping windows. These files are generated from the pwd-from-stdin
module, are tab-separated and unheaded, and located within the blocks
subdirectory.
Field number | Type | Description |
---|---|---|
0 | integer | Chromosome number |
1 | integer | Block Start (bp) |
2 | integer | Block End (bp) |
3 | integer | Block-specific number of observed overlapping SNPs, which met the provided treshold of --min-depth |
4 | float | Block-specific Sum of long-term average pairwise mismatch rates for all overlapping positions |
The .probs
file is an optional output file of the pedigree-sims
module, which is only emitted when --assign-method
is set to svm
. This file will contain per-class SVM probabilities for every investigated relationship during simulations, and for every pairwise comparison.
.yaml
contain the serialized output of the grups-rs
command line interface. These files may be generated by any submodule of grups-rs
, strictly follow the YAML format specification v1.2. The objective of these yaml files is mainly to ensure reproducibility, by keeping a record of all parameters used during a grups-rs
command. These yaml files may also be used to re-run a grups-rs
command using the exact same configuration, using the from-yaml
submodule (See the from-yaml
module section, for a more detailled explanation as to how to use this submodule).
If you plan to use GRUPS-rs, please cite (Lefeuvre M. et al 2024), as well as the the original publication of GRUPS (Martin D. et al 2017):
Lefeuvre M, Martin M, Jay F, Marsolier M, Bon C. GRUPS-rs, a high-performance ancient DNA genetic relatedness estimation software relying on pedigree simulations. Hum Popul Genet Genom 2024; 4(1):0001. https://doi.org/10.47248/hpgg2404010001
Martin MD, Jay F, Castellano S, Slatkin M. Determination of genetic relatedness from low-coverage human genome sequences using pedigree simulations. Mol Ecol. 2017;26(16):4145-4157. doi:10.1111/mec.14188
Detailled citation instructions can be found by running the grups-rs cite
module.
Print help information for a specific command.
Set the verbosity level of this program. With multiple levels:
-v
: Sets the INFO log level.-vv
: Sets the DEBUG log level.-vvv
: Sets the TRACE log level.
Print version information.
Note that the program will still output warnings by default. Use --quiet
argument to
disable them
Disable warnings.
By default, warnings are emited and redirected to the console, even without verbose mode on. Use this argument to disable this behavior and only display errors.
Specify an input pileup file containing samples to investigate.
Note that in the absence of a --pileup
argument, the program may accept a data stream
from the standard input. i.e:
samtools mpileup -B -q30 -Q30 ./samples/* | pwd_from_stdin [args]
Provide with a list of SNP coordinates to target within the pileup.
Pileup positions which are not found within the provided --targets
file will be exluded
from comparison.
Accepted file formats:
- '.snp' : EIGENSTRAT format. See the convertf documentation for additional information regarding this type of file.
- '.vcf' : Variant Call Format. See the VCF specification for additional information regarding this type of file.
- '.tsv' : a tab-separated file with columns
<CHROMOSOME> <POSITION> <REFERENCE> <ALTERNATE>
- '.csv' : a comma-separated file with columns
<CHROMOSOME> <POSITION> <REFERENCE> <ALTERNATE>
- '.txt' : a space-separated file with columns
<CHROMOSOME> <POSITION> <REFERENCE> <ALTERNATE>
Change the window size of each jackknife block (found within the output .blk
files).
Note that lower block values may drastically increase the memory footprint of grups-rs
.
Restrict comparison to a given set of chromosomes.
This argument may accept slices, such as --chr 9-11
and/or discrete integers such as --chr 1 4 13
.
Fasta indexed reference genome.
By default, grups-rs
will use GRCh37 as a reference genome. Use this argument if you wish to specify an alternative reference. Note that a .fasta.fai
genome index file must be present at the same directory.
Example: specifying --chr 9-11 13 19-22
...will be parsed as: [9, 10, 11, 13, 19, 20, 21, 22]
Minimal required Base Quality (BQ) score to select an observation as a valid candidate for comparison.
Nucleotides whose base quality is lower than the provided threshold are simply filtered-out prior to comparison. Value should be expressed in the Phred+33 scale.
Provide with a list of sample names for printing.
By default, individuals will be referred to using their pileup index. e.g. "Ind0, Ind1, Ind2, etc."
The provided list must of course follow the sorted index order which was provided by --samples
. Note that the length of --samples
and --sample-names
do not need to match. When such is the case, default values are used for the missing names.
Specify the output directory where results will be written.
Note that grups-rs will create the specified leaf directory if it is not present. However, grups-rs
does not allow itself from creating parent directories.
0-based column index of the individuals that should be compared within the file specified by --pileup
Argument may accept slices (inclusive), such as --samples 0-3
and/or discrete integer values such as --samples 7 8
.
Example: --samples 0-3 7 8
will be parsed as: [0, 1, 2, 3, 7, 8]
Provide with the minimal sequencing depth required to perform comparison.
Note that the length of the provided vector does not need to be the same as the one provided for individuals. When such is the case, values of --min-depth will "wrap" around, and start again from the beginning for the next individuals.
Example: Specifying --samples 0-4 --min-depth 2 3 5
would result in:
Sample | Depth |
---|---|
0 | 2 |
1 | 3 |
2 | 5 |
3 | 2 |
4 | 3 |
Do not perform comparison, but rather print out the pileup lines where a comparison could be made.
This is a legacy argument from the initial implementation of GRUPS. Note that lines are printed as soon as there is a valid comparison between ANY pair of individuals. Thus, applying --filter-sites
may not prove very effective when used in conjunction with --self-comparison
Consider deletions when performing comparison.
By default, deletions ('*' character) are not counted as selectible positions for comparison. Using this flag will instead mark them as valid matching positions.
Exclude transitions from the input targets file.
Note that this argument requires the use of --targets
to provide the program with a list of coordinates. The given coordinate file must furthermore explicitely provide with the REF/ALT alleles. See the --targets
argument for additiotnal information regarding valid file formats.
Do not output jackknife blocks for each pair of individuals (.blk
files).
By default, grups-rs
will keep track of the pairwise mismatch rate within windows of size --blocksize
. This information can prove useful to investigate PMR values in sliding windows (this can be visualized with the grups.plots
companion shiny interface), or to compute jackknife estimates of variance.
Using --no-print-block will prevent grups-rs
from computing average pairwise differences in non-overlapping blocks, and will prevent it from outputting any .blk
files.
Filter out tri-allelic sites when given a list of SNP-coordinates.
Note that this arguement requires the use of --targets
to provide the program with a
list of coordinates. The given coordinate file must furthermore explicitely provide with
the REFERENCE and ALTERNATE alleles.
See the --targets
argument for additional information regarding valid formats.
Enable self-comparison mode on individuals.
Note that a minimal depth of 2 is required when comparing individuals to themselves,
since at least two alleles are required to effectively compute pairwise differences at a
given site. This can be activated by specifying a library-wise value of --min-depth
>=2
for all samples.
Note that in cases where the specified --min-depth
was set to 1 for a given individual,
grups-rs will automatically temporarily rescale the given value of --min-depth
to 2 when
performing self-comparisons.
Overwrite existing output files.
By default, grups-rs does not allow itself from overwriting existing results files. Use this flag to force this behaviour.
Path to a directory containing a database of phased modern human genotypes, such as the 1000g-phase3 dataset.
This database may be in the form of a set of VCF files (default). In that case, grups-rs will look for, and load any file ending with the .vcf[.gz] file extension.
Alternatively, users may use a set of FSA-encoded files. To use FSA-encoded files, users must explicitly specify the use of this type of data input, with the --mode argument. When specified, grups-rs will search for any file ending with the .fst[.frq] file extension.
See the documentation of the fst
module, for a more detailled explanation regarding how to generate FSA-encoded databases.
Path to a directory containing a set of chromosome-specific genetic recombination maps, such as the HapMap-phaseII dataset.
Note that GRUPS-rs will search for, and attempt to load any file ending with the '.txt' file extension within that directory. It is therefore highly recommended that this directory ONLY contains the required recombination map files, and nothing else.
The expected input is a headed, tab separated file with columns <chr> <pos(bp)> <rate(cM/Mb)> <Map(cM)>
Example recombination file (click to unroll)
Chromosome Position(bp) Rate(cM/Mb) Map(cM)
chr22 16051347 8.096992 0.000000
chr22 16052618 8.131520 0.010291
... ... ... ...
Path to input pedigree definition file. Examples of such definition files may be found within the resources/pedigrees
subdirectory of this github repository. See section Defining Custom pedigrees for a detailled explanation on how to write custom input pedigree definition files.
Allele frequency downsampling rate, i.e. the proportion of SNPs to keep at true frequency (in percentage).
grups-rs
will use the specified `--af-downsampling-rate`` as a probability to randomly simulate allele fixation during pedigree simulations.
Note that the defined rates should be specified as percentages.
Proportion of filtered SNP positions to include in the analysis (in percentage).
grups-rs
will use the specified --snp-downsampling-rate
as a probability to randomly ignore positions during pedigree simulations.
Note that the defined rates should be specified as percentages.
Contamination rates (or rate ranges) for each pileup individual.
The provided argument(s) may accept hard set values, such as --contam-rate 3.0
, or ranges, such as --contam-rate 0.0-5.0
.
When a contamination rate is provided in the form of a range, pedigree-specific values are picked from a uniform distribution, within that defined range.
Note that contamination rates are specified as percentage, and are tied to the individuals contained within the input pileup. Specifying a constant rate for all individuals is also possible, by defining a unique value or range.
**Example: grups-rs pedigree-sims [...] --samples 0-5 --contam-rate 2.0
implies that all five pileup indidivuals will be assigned a set contamination rate of 2%, during pedigree simulations.
In general, keep in mind that contamination rate values are recycled, if the number of specified values is lower than the number of examined pileup individuals.
Sequencing error rates (or rate ranges) for each pileup individual.
The provided argument(s) may accept hard set values, such as --seq-error-rate 1.0
, or ranges, such as --seq-error-rate 1.0-3.0
. When a sequencing error rate is provided in the form of a range, pedigree-specific values are picked from a uniform distribution, within that defined range.
Note that sequencing error rates are specified as percentage, and are tied to the individuals contained within the input pileup. Specifying a constant rate for all individuals is also possible, by defining a unique value or range.
Example: grups-rs pedigree-sims [...] --samples 0-7 --seq-error-rate 1.0
implies that all seven pileup indidivuals will be assigned a set sequecing error rate of 1%, during pedigree simulations.
In general, keep in mind that sequencing error rate values are recycled if the number of specified values is lower than the number of examined pileup individuals.
Define the expected data input type for pedigree simulations.
This argument is closely tied to the --data-dir
argument, and will define which type of files should grups-rs
look for, as well as how to load them into memory.
(tl;dr: --mode fst-mmap
is recommended for most applications. Use --mode fst
when runtime performance is critical, but memory usage is not an issue.)
-
When using
--mode fst-mmap
grups-rs
will search for files ending with the.fst[.frq]
file extention, and will make use of memory-mapped files to query this reference dataset. This mode is highly recommended if the directory targeted with--data-dir
is located within an SSD drive. However, applying the fst-mmap when working with data found within an HDD drive may have a slight negative impact on performance. -
When using
--mode fst
,grups-rs
will search for files ending with the.fst[.frq]
file extention. These files are then sequentially loaded into RAM and queried. The fst mode may be faster if you have a lot of comparisons to apply, but has has a higher memory footprint compared to the fst-mmap mode. Thus, it is only recommended if your input.fst[.frq]
files are located on an HDD drive. -
When using
--mode vcf
(default),grups-rs
will search for files ending with the.vcf[.gz]
extention, and will directly query these files. This mode is generally slower, but may yield higher performance if you have highly covered data, and your vcf files were previously pre-filtered to only contain bi-allelic SNPs.
Number of pedigree simulation replicates to perform for each pairwise comparisons.
The default provided value of 100 should be considered a bare-minimum, for quick screening. Values in the range of 500 to 1000 replicates is recommended.
Minimum required (super-)population allele frequency to include a given SNP during simulations.
Note that grups-rs
will re-compute the observed average PWD after simulations, to excluding and correct for any position previously excluded through this threshold.
Source population from which founder individuals are selected during pedigree simulations.
Note that when using --mode vcf
--mode vcf
, grups-rs may only use populations for which a <POP>_AF
annotation is present in each INFO field of the VCF files used. To generate finer grained population allele frequencies, we recommend either the use of the bcftools +fill-tags
plugin, or the use of the '--compute-pop-afs' argument when generating FSA-encoded dataset with grups-rs
' fst
module.
Population from which contaminating individuals are selected during pedigree simulations.
Note that when using --mode vcf
, grups-rs may only use populations for which a _AF annotation is present in each INFO field of the VCF files used. To generate finer grained population allele frequencies, we recommend
either the use of the bcftools +fill-tags
plugin, or the use of the '--compute-pop-afs''--compute-pop-afs' argument when generating FSA-encoded dataset with the grups-rs
' fst
module.
Number of random individual genomes with which to contaminate pedigree simulations.
Note that the number of contaminating individuals, are tied to the samples contained within the input pileup, and that the specified values will be recycled if their length is lower than the number of examined pileup samples.
Path to an input panel definition file.
By default, grups-rs
will automatically search for a file ending with the .panel
extension within the directory targeted by --data-dir
. Use the --panel
argument to override this behaviour, and specify a definition file located somewhere else.
Number of additional parallel decompression threads.
Can increase performance when working with BGZF compressed .vcf.gz
files. Note that this parameter has no effect when working with uncompressed .vcf
or .fst[.frq]
files.
Provide the random number generator with a set seed.
Select the method for most likely relationship assignment
tl;dr: use the default --assign-method svm
; only use --assign-method zscore
if you wish to replicate the classification method of the first version of GRUPS.
-
svm: Assign a most likely relationship using Ordinally Partitionned Support Vector Machines. Binary SVMs are instantiated sequentially, from the lowest relatedness order to the highest, in terms of average PWD. Each SVM is fitted against the hypothesis that the observed PWD belongs to a higher degree than the given distribution. These SVMs are then used to generate per-class probabilities of belonging to a given class of relationship. The most likely relationship is assigned by selected the relationship class with the highest per-class probability.
-
zscore: Perform minimum zscore assignation, i.e. the distribution with the lowest z-score from the observed PWD is selected as the most likely candidate. This approach is computationally inexpensive, but may provide with spurious results, when the different distributions carry drastically different standard deviations.
Path leading to the directory containing the set of input reference VCF files that you wish to encode.
Output directory where the generated FSA-encoded files (ending with the .frq[.fst]
extension) should be written to.
Path to an input panel definition file.
By default, grups-rs
will automatically search for a file ending with the .panel
extension within the directory targeted by --data-dir
. Use the --panel
argument to override this behaviour, and specify a definition file located somewhere else.
Population Subset
Subset the index by a given number of (super)-population. (e.g. EUR, AFR). Note that at least one pedigree population and one contaminating population are required to obtain valid index files. (Although, the source and contaminating population may be the same.)
Number of parallel CPU processes when performing FST-Indexation
Note that parallelization is dispatched according to the number of separate .vcf[.gz] files
. Thus, there is no point in invoking more threads than there are reference VCF files to encode.
Number of additional parallel decompression threads when decompressing BGZF-compressed VCF files. (This parameter has no effect when working with uncompressed .vcf
files.)
Also note that decompression threads have a multiplicative effect when combined with '--threads'. Thus, setting --decompression-threads 2
and --threads 22
, will in fact consume up to 44 worker threads.
Recalculate population allele frequencies for each population and super-population tag that can be found within the provided input
panel definition file (specified with --panel
).
For some publicly available datasets, such as the 1000g-phase3 adding this flag can allow for the use of smaller populations as reference or to use customly defined populations.
When unspecified, the program will instead look for <POP>_AF
tags within the VCF's INFO field. These tags can be generated using the bcftools +fill-tags
plugin.
-
Install cargo tarpaulin
cargo install cargo-tarpaulin
-
Run coverage tests
cargo tarpaulin --workspace --command test --out Lcov
The complete documentation for the GRUPS-rs public API can be built using cargo
:
cargo doc --workspace --open --no-deps --document-private-items
Using this command, the documentation's html
entry-point will be located under target/doc/pedigree_sims/index.html
and should automatically open in your default browser.