
D) hogwash inputs

Katie Saund edited this page May 28, 2021 · 25 revisions

Input data

Required data

Phenotype

The phenotype data object must be a matrix. The rows correspond to samples and should be ordered to match the tips of the phylogenetic tree. There should be only one column, which contains the phenotype data. The matrix should have both row names and column names. The row names must exactly match the tree’s tip labels. The phenotype can be either binary (0/1) or continuous. At this time hogwash does not support categorical phenotypes with more than two categories (e.g., ‘A’, ‘B’, & ‘C’).

Discrete phenotype:

Antibiotic_resistance
sample_1 0
sample_2 0
sample_3 1
sample_4 1

Continuous phenotype:

Toxin_production
sample_1 0.10
sample_2 1.20
sample_3 0.05
sample_4 2.70
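
As a minimal sketch of the layout described above, the two example phenotypes could be built in base R as follows. The object names (`pheno_binary`, `pheno_continuous`, `is_binary`) are illustrative assumptions, not part of hogwash.

```r
# Sample names copied from the examples above.
samples <- c("sample_1", "sample_2", "sample_3", "sample_4")

# Binary phenotype: a one-column matrix with row and column names.
pheno_binary <- matrix(c(0, 0, 1, 1),
                       ncol = 1,
                       dimnames = list(samples, "Antibiotic_resistance"))

# Continuous phenotype with the same layout.
pheno_continuous <- matrix(c(0.10, 1.20, 0.05, 2.70),
                           ncol = 1,
                           dimnames = list(samples, "Toxin_production"))

# Hypothetical helper: a phenotype is treated as binary if every value is 0 or 1.
is_binary <- function(mat) all(mat %in% c(0, 1))

# Structural checks that mirror the requirements in this section.
stopifnot(is.matrix(pheno_binary), ncol(pheno_binary) == 1)
stopifnot(!is.null(rownames(pheno_binary)), !is.null(colnames(pheno_binary)))
```

If the phenotype starts out as a data frame, `as.matrix()` is one way to convert it while keeping the row and column names.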

Genotype

The genotype data object must be a matrix. The rows correspond to samples and should be ordered to match the tips of the phylogenetic tree. The columns correspond to individual genotypes. The matrix should have both row names and column names. The row names must exactly match the tree’s tip labels. Genotypes can be SNPs (core genome), genes (accessory genome), or other types (indels, pathways, etc.). Genotypes must be coded in binary (0/1).

Genotype:

SNP_1 SNP_2 SNP_3 SNP_4 SNP_5
sample_1 0 1 1 0 0
sample_2 0 0 0 1 1
sample_3 1 0 0 1 0
sample_4 1 1 1 0 1
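
The sketch below builds the example genotype matrix in base R and reorders its rows to match a set of tip labels. The `tip_labels` vector is a hypothetical stand-in for the tip labels of your own tree (for an `ape` tree object these are stored in `tree$tip.label`), and its order here is made up for illustration.

```r
samples <- c("sample_1", "sample_2", "sample_3", "sample_4")
snps <- paste0("SNP_", 1:5)

# Values copied row by row from the example above.
geno <- matrix(c(0, 1, 1, 0, 0,
                 0, 0, 0, 1, 1,
                 1, 0, 0, 1, 0,
                 1, 1, 1, 0, 1),
               nrow = 4, byrow = TRUE,
               dimnames = list(samples, snps))

# Hypothetical tip order; in practice take this from the tree itself.
tip_labels <- c("sample_3", "sample_1", "sample_4", "sample_2")

# The row names and tip labels must be the same set before reordering.
stopifnot(setequal(rownames(geno), tip_labels))

# Reorder rows to the tip order; drop = FALSE keeps the matrix structure.
geno <- geno[tip_labels, , drop = FALSE]

# Genotypes must be binary and rows must now match the tips exactly.
stopifnot(all(geno %in% c(0, 1)))
stopifnot(identical(rownames(geno), tip_labels))
```

The same row-indexing step can be applied to the phenotype matrix so that both objects share the tree's tip order.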

Phylogenetic tree

The phylogenetic tree should be rooted. If the tree is not rooted, do one of the following:

  • root the tree on an outgroup and then remove the outgroup from the tree, phenotype, and genotype (for example, if tip t4 is the outgroup)
  • use the midpoint rooting method and reorder your phenotype and genotype to the new tip order
  • supply the unrooted tree to hogwash, and the function will midpoint root the tree automatically.

The tree must be fully bifurcating. I recommend building your phylogenetic tree with an outgroup, rooting on the outgroup, and then removing the outgroup prior to running hogwash.
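
The outgroup workflow can be sketched with the `ape` package as shown below. The five-tip tree, its branch lengths, and the phenotype values are hypothetical, with tip t4 assumed to be the outgroup; this is one possible way to do it, not necessarily the steps the author uses.

```r
library(ape)

# Hypothetical unrooted tree (trifurcation at the base); t4 is the assumed outgroup.
tree <- read.tree(text = "(t4:0.9,(t1:0.1,t2:0.2):0.05,(t3:0.3,t5:0.15):0.1);")
stopifnot(!is.rooted(tree))

# Root on the outgroup; resolve.root = TRUE gives a bifurcating root node.
rooted <- root(tree, outgroup = "t4", resolve.root = TRUE)

# Remove the outgroup tip from the tree.
rooted <- drop.tip(rooted, "t4")

# Hypothetical phenotype matrix that still contains the outgroup.
pheno <- matrix(c(0, 0, 1, 1, 0),
                ncol = 1,
                dimnames = list(paste0("t", 1:5), "phenotype"))

# Drop the outgroup row and reorder rows to the tree's tip order in one step.
pheno <- pheno[rooted$tip.label, , drop = FALSE]

# Confirm the tree meets the requirements in this section.
stopifnot(is.rooted(rooted), is.binary(rooted))
stopifnot(identical(rownames(pheno), rooted$tip.label))
```

The genotype matrix would be subset the same way. For midpoint rooting, functions such as `phangorn::midpoint()` or `phytools::midpoint.root()` are available in other packages.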

Optional data

Grouping genotypes key

hogwash allows the user to create ancestral reconstructions for individual genotypes and then condense them into meaningful groups.

Requiring that an individual SNP occur in multiple lineages may be too stringent. If instead all relevant SNPs from a gene are grouped together, the power to identify convergent evolution increases, because grouping can capture larger trends in functional impact at the gene level and reduces the multiple testing correction burden. Example use cases are grouping SNPs into genes or genes into pathways.

The required structure of the grouping genotypes key data object is a matrix. Each row corresponds to a genotype. The first column must have the name of a genotype included in the genotype matrix. The second column must have a name for a group to which the item in the first column belongs. Row names are not required. The column names are used in output plots and therefore must be included.

SNP GROUP
SNP_1 GENE_A
SNP_1 PATH_A
SNP_2 GENE_A
SNP_3 GENE_B
SNP_4 GENE_C
SNP_5 GENE_A
SNP_6 GENE_D
SNP_6 PATH_A
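
A minimal base R sketch of the key shown above follows. It also checks which key entries are absent from the genotype matrix, since the first column must name genotypes that are present there. The `geno_names` vector assumes the five-SNP genotype example from earlier on this page, and the variable names are illustrative. How hogwash itself treats unmatched rows is not covered here.

```r
# Two-column character matrix; row names are not required.
key <- matrix(c("SNP_1", "GENE_A",
                "SNP_1", "PATH_A",
                "SNP_2", "GENE_A",
                "SNP_3", "GENE_B",
                "SNP_4", "GENE_C",
                "SNP_5", "GENE_A",
                "SNP_6", "GENE_D",
                "SNP_6", "PATH_A"),
              ncol = 2, byrow = TRUE,
              dimnames = list(NULL, c("SNP", "GROUP")))

# Assumed column names of the genotype matrix (SNP_1 to SNP_5).
geno_names <- paste0("SNP_", 1:5)

# Key entries whose genotype is not in the genotype matrix.
unmatched <- setdiff(key[, "SNP"], geno_names)

# Keep only the rows that refer to genotypes that are present.
key_matched <- key[key[, "SNP"] %in% geno_names, , drop = FALSE]

# Number of genotypes assigned to each group after filtering.
group_sizes <- table(key_matched[, "GROUP"])
```

Note that a genotype may appear in more than one row, as SNP_1 does here, when it belongs to several groups.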

Grouping method

The user can select either "post-ar" or "pre-ar" as the preferred grouping method. The default is "post-ar". For more, please revisit the grouping section.

Permutation number

The default value is 10,000.

False discovery rate

The default value is 0.15.

Tree type

The user can choose to plot the tree as a "phylogram" (right-facing, square tree; default) or a "fan" (circular). Note that the phylogram is plotted ignoring tree edge lengths, whereas the fan is plotted using tree edge lengths.

Bootstrap support value confidence threshold

The default value is 0.70, based on the value found in Farhat et al.'s 2013 Nature Genetics paper. However, think carefully about the bootstrap confidence threshold you choose. For example, IQ-TREE is an increasingly popular tool for building phylogenetic trees, and its ultrafast bootstrap (UFBoot) support values are not equivalent to standard bootstrap support values. UFBoot support values are only considered high confidence when >= 0.95.

Strain ID

This is a key that categorizes each sample into a user-defined group. It could be strain type, isolation location, species of host, etc. The ID supplied in the column just needs to be of type character. In the output PDF the tree tips will be colored by this supplied ID. ID colors are generated automatically (you can't pick the colors).

strain_id
sample_1 "ribotype_027"
sample_2 "ribotype_014"
sample_3 "ribotype_014"
sample_4 "ribotype_014"

Data set size

Note: hogwash is computationally slow, so very large datasets are unlikely to finish within a reasonable amount of time. We have typically been working with <500 samples and <500,000 genomic variants. In practice, it is possible to split a large genotype matrix into multiple sub-matrices and run hogwash on those sub-matrices in parallel to finish faster.
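
One way to split a genotype matrix by columns is sketched below in base R. The matrix is randomly generated for illustration, and `chunk_size` is an arbitrary assumption. Each sub-matrix keeps every sample row, so it can be paired with the same phenotype and tree.

```r
# Hypothetical genotype matrix: 4 samples by 10 binary variants.
set.seed(1)
geno <- matrix(rbinom(40, 1, 0.5),
               nrow = 4,
               dimnames = list(paste0("sample_", 1:4), paste0("SNP_", 1:10)))

# Assumed number of variants per sub-matrix; tune to your resources.
chunk_size <- 4

# Assign each column to a chunk, then split the column indices by chunk.
chunk_id <- ceiling(seq_len(ncol(geno)) / chunk_size)
col_chunks <- split(seq_len(ncol(geno)), chunk_id)

# Build one sub-matrix per chunk; drop = FALSE keeps single-column chunks as matrices.
sub_matrices <- lapply(col_chunks, function(cols) geno[, cols, drop = FALSE])
```

Each element of `sub_matrices` could then be passed to a separate hogwash run, for example as separate jobs on a cluster. If a grouping key is used, it may be safer to split so that all genotypes from one group stay in the same sub-matrix.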

Next: the hogwash outputs.