GitHub - sabeelmansuri/Epigenomics: Basic lesson on elementary epigenomics.

Overview

In examining the diversity of life, we're often directed to the sequence of an organism's DNA as the source of variation. Differences in these genomes are directly correlated with changes in genetic products, regulatory effects, and molecular interactions--all of which contribute to the great phenotypic variation observed in our world.

However, in this paper, we wish to look beyond the genome. Instead, we introduce the topic of epigenomics, the analysis of changes in gene expression that are not attributable to mutations in the DNA sequence of a genome. Rather, epigenetic changes in expression are caused by biochemical interactions with various proteins and other compounds.

We will briefly describe the biology of the two most well-known epigenomic modifications: DNA Methylation and Histone Modifications. Then, we will describe in detail the analytical techniques and technologies used to quantify each epigenomic modification.

Biology: DNA Methylation

Introduction

The first epigenomic modification discovered was DNA methylation. Uncovered by Rollin Hotchkiss as early as 1948, around the time DNA was being established as the genetic material, DNA methylation remains one of the most influential epigenomic modifications today.¹³

On the surface, it appears to be a simple process. DNA methyltransferases (DNMTs) transfer methyl groups from S-adenosyl methionine (SAM) to the fifth carbon of cytosine residues to form 5-methylcytosine (5mC), thereby directly methylating DNA.³ As we'll see, this small change can have an outsized effect. First, some basics.

I. CpG Islands

As mentioned above, it is cytosine bases that are methylated. Not all cytosines, however, are equally prone to methylation; the majority of methylation happens on cytosines that precede a guanine nucleotide, a motif called a CpG site.³ CpG sites are often grouped in close proximity, forming a CpG island. The overall methylation state of CpG islands is the primary way in which methylation changes gene expression.³

For a region to be considered a CpG island, it must have:

A sequence length longer than 200bp
A GC content of more than 50%
A statistical ratio of observed/expected CpG (cytosine followed by guanine) greater than 0.6
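The three criteria above can be checked directly on a sequence string. The following is a minimal sketch, assuming the commonly used definition of the expected CpG count as (number of C × number of G) / sequence length; the function names are ours and purely illustrative.

```python
def cpg_island_stats(seq):
    """Return (length, GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    # "CG" cannot overlap with itself, so str.count gives the exact number of CpG sites.
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Assumed definition: expected CpG count if C and G were placed independently.
    expected = (c * g) / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return n, gc_fraction, obs_exp


def is_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the three thresholds listed above (all strict inequalities)."""
    n, gc_fraction, obs_exp = cpg_island_stats(seq)
    return n > min_len and gc_fraction > min_gc and obs_exp > min_obs_exp
```

In practice a genome is scanned with a sliding window and each window is tested this way; the sketch only shows the test for a single candidate region.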

In the human genome, there are around 25,000 such CpG islands, about half of which contain transcription start sites.³ Such CpG islands can directly influence expression levels, which we'll explore next.

II. Repression Mechanism

There are two ways in which the methylation of a CpG island near a transcription start site may inhibit gene expression: steric bulk and protein binding.

Steric Bulk

The addition of a methyl group to cytosines increases the steric bulk of the molecule. Specifically, the addition of a CH₃ group adds a physical barrier that functions as a guard for the grooves in the double-helical structure of the DNA.¹³

Because these grooves are critical for protein specificity recognition, the bulk created by the methyl group inhibits the binding of transcription factors and, therefore, downstream gene expression.

Figure 1 Overview of steric hindrance caused by methylation of DNA, resulting in transcriptional repression.

Protein Binding

There exist proteins that bind to CpG motifs if and only if they are methylated. These proteins are called “Methyl-CpG-binding domain proteins” (MBDs).³ MBD binding inherently amplifies the steric hindrance mentioned above.

Their main repressive mechanism, however, is not steric. Rather, MBDs recruit histone deacetylases, proteins that remove acetyl groups from histones. (As we’ll see below, acetylation of histones relaxes chromatin condensation, making DNA more accessible to transcription factors.) When the acetyl groups are removed, chromatin condenses and transcription is significantly repressed.

III. Case Study: DNA Methylation + Cancer

To underscore the importance of DNA methylation on a broader scale, we mention its application with respect to cancer.¹¹ Specifically, DNA hypermethylation of CpG islands near tumor suppressor genes has been observed in cancerous cells. At the same time, oncogenes are often found with abnormally low DNA methylation levels, leading to their overexpression. As these improper methylation patterns proliferate through cell division, a tumor can form.

An upside of this correlation, however, is that observing this pattern in tumor cells has opened the door to epigenetic treatment. There are currently drugs available that aim to reverse the hypermethylation of tumor suppressor genes, and more are in development.

Figure 2 Example of hyper/hypomethylation in a tumor cell.

Biology: Histone Modifications

Introduction

Histones are proteins found in eukaryotic cell nuclei that order DNA into nucleosomes. These components of chromatin are subject to post-translational modifications including methylation, acetylation, phosphorylation, and others still being researched. The histone code hypothesis suggests that these modifications, along with epigenetic markers, influence the recruitment of proteins responsible for regulating gene expression. Multiple modifications work together simultaneously to regulate and change chromatin state and gene expression.¹⁰ Let's explore some of these modifications.

Figure 3 General structure of a DNA-histone complex. The two most common modifications, acetylation and methylation, are shown.

I. Histone Acetylation/Deacetylation

Acetylation is the attachment of an acetyl group to lysine residues of the N-terminal histone tails (specifically H3 and H4) by histone acetyltransferase (HAT).¹⁰ This neutralizes the positive charge of the lysine, weakening the histone's attraction to negatively charged DNA. As a result, the chromatin relaxes into euchromatin, allowing transcription factors to bind and increase gene expression. Conversely, deacetylation by histone deacetylase (HDAC) condenses chromatin into heterochromatin, thereby repressing gene activity.⁷

Figure 4 Acetylation of histones, leading to relaxation of DNA into euchromatin. Note the larger regions of exposed DNA open for transcription after acetylation.

II. Histone Methylation/Demethylation

Unlike histone acetylation, methylation is a post-translational epigenetic modification that does not directly change histone charge or histone-DNA interactions. Instead, a methyl group is added to lysine or arginine residues of histone tails, each impacting transcription differently. More specifically, arginine methylation generally activates transcription, while the effect of lysine methylation depends on the methylation site and degree (mono-, di-, or tri-methylation).¹⁰

Methylation at different sites results in either activation or repression. Some common methylation sites are⁷:

H3K4, H3K36, and H3K79, which result in transcriptional activation
H3K9, H3K27, and H4K20, which silence gene activity/expression

Figure 5 Histone methylations at various locations. Notice that both activation and repression are possible depending on methylation type.

III. Histone Phosphorylation

Unlike both histone methylation and acetylation, histone phosphorylation acts through interactions with other histone modifications and binding proteins. Chromatin remodeling happens when a phosphoryl group attaches to the histone tail. This can occur on all histone core proteins, each with distinct effects. This process plays a primary role in cell division, transcriptional regulation, and DNA damage repair.¹⁰

Figure 6 One example of histone phosphorylation. Various possible signals caused by this particular phosphorylation are described.

Analysis Techniques: DNA Methylation

Background

The goal of DNA Methylation analysis is fairly obvious: we wish to detect what parts of the genome are methylated to identify, confirm, or analyze downregulated regions.

Let's take an example: We have two bacterial samples that should both be expressing a gene that turns them blue. However, one culture appears blue while the other appears white. Given what we've learned, we may hypothesize that, for some reason, the white colony of bacteria has methylated the region containing the gene, thus downregulating it. How can we test this hypothesis?

Suppose genetic sequencing shows the DNA sequence of both bacteria to be identical, yet the white colony shows little to no transcription of this gene. This sets the context for DNA methylation analysis: standard genetic sequencing is unable to distinguish methylated from non-methylated cytosine.⁵ Therefore, we require other analysis techniques, ones that allow us to differentiate between the two:

I. Bisulfite Sequencing

Overview

Bisulfite sequencing is the most widely used DNA methylation analysis technique.¹ The core idea is to convert non-methylated cytosine into uracil while leaving methylated cytosine unchanged, and then sequence the result. The converted cytosine (now uracil) will be detected during sequencing as thymine, so every detected cytosine will be a methylated cytosine, giving a clear indication of which regions are methylated.¹

Figure 7 The conversion of non-methylated cytosines into uracils, which are read as thymines by sequencing technology. Notice that methylated cytosines are unchanged and remain sequenced as cytosines.
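The conversion rule can be mimicked in a few lines. This is a minimal sketch of the idea only, assuming a single strand and complete conversion (real bisulfite treatment is not perfectly efficient); `bisulfite_convert` and its `methylated_positions` argument are hypothetical names introduced for illustration.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate how one bisulfite-treated strand is read by a sequencer.

    seq: DNA string.
    methylated_positions: set of 0-based indices of methylated cytosines.
    Unmethylated C is read as T (via uracil); methylated C stays C.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)
```

For instance, calling `bisulfite_convert("GCATCTAC", {4})` keeps only the cytosine at 0-based index 4 and turns the other two cytosines into thymines.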

Lab Technique

The first milestone in a bisulfite sequencing analysis is the treatment of DNA with bisulfite. There are three major steps¹ in this protocol:

Denaturation of DNA into single strands
Incubation with bisulfite solution at high temperature
Cleaning of DNA; removal of bisulfite and residues

The product of this will be DNA with non-methylated cytosine converted into uracil. Although whole-genome bisulfite sequencing (WGBS) is becoming increasingly viable, it is not yet a common method. Thus, we will only describe the lab technique for region-specific analysis. However, we will mention one WGBS analysis method below.

Having used bisulfite to convert all non-methylated cytosine to uracil, the next step is to use a Polymerase Chain Reaction (PCR) to amplify the region of interest. Traditionally, primers must be selected carefully (for example, by targeting areas that are low in cytosine or not rich in CpG islands) so that cytosines that may have been converted do not inhibit PCR amplification. In parallel, a PCR for the same region is run on DNA that has not been treated with bisulfite.

Both PCR products (bisulfite-treated and native DNA) are cleaned and sequenced using any modern sequencing technique. This yields two digital sequence files:

A bisulfite-treated sequence where each cytosine is a methylated cytosine.
A native DNA sequence that is true to the original sequence.

We now move to computational analyses conducted on these data files.

Computational Analysis

The simplest analysis possible with these two files is aligning the two sequences, and identifying where there is a C-T mismatch. For example, take the two sequences below:

Bisulfite:  GTATCTAT
Native:     GCATCTAC

We see the bisulfite sequence identifies the cytosines at positions 2 and 8 as thymines, but the cytosine at position 5 remains a cytosine. We would conclude that the cytosines at positions 2 and 8 were not methylated, while the cytosine at position 5 was.
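The base-by-base comparison can be sketched as follows. This assumes the two sequences are already aligned to the same length and come from the same strand; the function name and the status labels are ours.

```python
def call_methylation(native, bisulfite):
    """Classify each cytosine of `native` using the aligned bisulfite read.

    Returns a list of (1-based position, status) tuples, one per cytosine
    in the native sequence.
    """
    if len(native) != len(bisulfite):
        raise ValueError("sequences must be aligned to the same length")
    calls = []
    for pos, (n, b) in enumerate(zip(native.upper(), bisulfite.upper()), start=1):
        if n != "C":
            continue  # only cytosines in the native sequence are informative
        if b == "C":
            calls.append((pos, "methylated"))    # protected from conversion
        elif b == "T":
            calls.append((pos, "unmethylated"))  # converted C -> U -> read as T
        else:
            calls.append((pos, "ambiguous"))     # sequencing error or true variant
    return calls


calls = call_methylation("GCATCTAC", "GTATCTAT")
```

Applied to the two example sequences, this reports one methylated cytosine and two unmethylated ones.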

A related but more interesting analysis than a base-by-base comparison is one that answers, "What areas of this region of interest are methylation-rich?" Tools such as MethylCoder⁸ provide the answer by aggregating the results of the base-by-base comparison, determining locales with high aggregate values, and reporting the raw and aggregated results.
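To make the aggregation idea concrete, here is a minimal sliding-window sketch. It is not MethylCoder's algorithm, only an illustration of turning per-base calls into regional methylation fractions; the window and step sizes are arbitrary assumptions.

```python
def windowed_methylation(calls, region_length, window=4, step=2):
    """Aggregate per-cytosine calls into windows.

    calls: list of (1-based position, status) tuples, where status is
           "methylated" or "unmethylated".
    Returns (start, end, fraction_methylated) for each window that
    contains at least one called cytosine.
    """
    results = []
    last_start = max(region_length - window + 1, 1)
    for start in range(1, last_start + 1, step):
        end = min(start + window - 1, region_length)
        in_window = [status for pos, status in calls if start <= pos <= end]
        if not in_window:
            continue  # no cytosines here, so nothing to report
        methylated = sum(1 for status in in_window if status == "methylated")
        results.append((start, end, methylated / len(in_window)))
    return results


example_calls = [(2, "unmethylated"), (5, "methylated"), (8, "unmethylated")]
windows = windowed_methylation(example_calls, region_length=8)
```

Windows with a high fraction are candidates for methylation-rich locales; real tools also account for read depth per site, which this sketch ignores.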

Bonus - Whole Genome Bisulfite Sequencing Analysis

The quintessential question asked of WGBS is, "What regions of the whole genome are methylation-rich?"⁵ Here, the data required for analysis are slightly different than before:

Bisulfite sequencing data (FASTQ)
Reference genome of interest (FASTA)

Recently developed EPIC TABSAT⁴ is one excellent tool for such an analysis. Given these inputs, the following general analysis steps are performed:

Quality assessment of raw data
Read alignment to reference genome
Methylation site analysis and grouping

The tool will then output information about methylation-rich sites, relative methylation level, and more (shown below).

Figure 8 Summary of various data outputs by EPIC TABSAT. Note in particular 1) the lollipop plot that shows % methylation arranged according to samples' accurate chromosomal coordinates and 2) the patternmap showing methylation significance by sample.
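The "relative methylation level" at a site is commonly estimated from aligned reads as the share of reads that still show a cytosine there. The sketch below illustrates that calculation only; it is not EPIC TABSAT's implementation, and the `pileup` structure is a hypothetical input format.

```python
def methylation_levels(pileup):
    """Estimate per-site methylation from bisulfite read counts.

    pileup: dict mapping (chrom, position) -> (c_reads, t_reads), where
            c_reads counts reads showing C (methylated) and t_reads counts
            reads showing T (converted, unmethylated) at a reference cytosine.
    Returns a dict mapping each covered site to c_reads / (c_reads + t_reads).
    """
    levels = {}
    for site, (c_reads, t_reads) in pileup.items():
        total = c_reads + t_reads
        if total == 0:
            continue  # no coverage, so no estimate for this site
        levels[site] = c_reads / total
    return levels


example = methylation_levels({("chr1", 100): (9, 1), ("chr1", 150): (0, 8), ("chr1", 200): (0, 0)})
```

Sites with very low coverage give noisy estimates, so a minimum read depth is usually required before a level is reported.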

II. HELP Assay

Overview

The HpaII tiny fragment Enrichment by Ligation-mediated PCR (HELP) Assay leverages restriction enzyme digestion analysis to determine DNA methylation patterns.⁵

Two restriction enzymes are used:

HpaII, which cuts DNA at CCGG sites where the inner cytosine is not methylated
MspI, which cuts DNA at CCGG sites regardless of cytosine methylation

This results in MspI cutting DNA into some number of additional fragments compared to HpaII, and calculating the magnitude of this difference provides a relative measurement of DNA methylation.
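The difference between the two digests can be illustrated in silico. This sketch assumes the CCGG recognition site shared by HpaII and MspI, treats DNA as a single linear strand, and ignores exactly where within the site each enzyme cuts; the function names are hypothetical.

```python
SITE = "CCGG"  # recognition site shared by HpaII and MspI


def cut_sites(seq, methylated_inner_c=frozenset()):
    """Return (hpaii_sites, mspi_sites) as lists of 0-based site start indices.

    methylated_inner_c: set of 0-based indices of methylated cytosines.
    A site starting at index i has its inner cytosine at index i + 1.
    """
    seq = seq.upper()
    hpaii, mspi = [], []
    i = seq.find(SITE)
    while i != -1:
        mspi.append(i)  # MspI cuts regardless of methylation
        if (i + 1) not in methylated_inner_c:
            hpaii.append(i)  # HpaII is blocked by a methylated inner cytosine
        i = seq.find(SITE, i + 1)
    return hpaii, mspi


def fragment_count(sites):
    """A linear molecule cut at k sites yields k + 1 fragments."""
    return len(sites) + 1


hpaii_sites, mspi_sites = cut_sites("AACCGGTTTCCGGAA", methylated_inner_c={10})
```

Here the second site is methylated, so MspI produces one more fragment than HpaII; the size of that gap grows with the number of methylated sites.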

Figure 9 Overview of two (similar) variations of the HELP assay. Our focus is on the left protocol.

Lab Technique

Two DNA samples are isolated and, in parallel, subjected to either HpaII or MspI digestion.⁶ We assume the HpaII sample has been digested only at CCGG sites where the inner cytosine is not methylated, resulting in some number of fragments. We also assume the MspI sample has been digested at all of the sites that HpaII cut, and additionally at CCGG sites where the inner cytosine is methylated.

Each sample is then subjected to ligation-mediated PCR (LM-PCR).⁶ This protocol first adds linker sequences to every fragment. These sequences are complementary to fluorescently labeled PCR primers, so each fragment is amplified without the worry of complementarity/primer specificity. This yields a fluorescently detectable pool of DNA that has a quantity relative to the initial number of fragments. Importantly, the HpaII and MspI PCR reactions use different fluorescent labels.

Figure 10 Simplified idea of LM-PCR. Note the linker binding, the primer complementarity, and fluorescent label.

Next, a microarray is set up such that it contains binding sites for expected CCGG site cuts (determined using reference sequence analysis).⁶ Equal amounts of each PCR product are added evenly across the microarray, creating a mosaic of MspI and HpaII bound sequences. The microarray is then scanned twice, once for each type of fluorescent label used. The difference in fluorescence between the two is representative of the methylation level.
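One simple way to express the difference between the two scans is a per-probe log ratio. The sketch below is an assumption about how such a comparison could be computed, not the published pipeline; the probe identifiers, intensity values, and pseudocount are hypothetical.

```python
import math


def help_log_ratios(hpaii, mspi, pseudocount=1.0):
    """Compute log2(HpaII / MspI) signal for every probe present in both scans.

    hpaii, mspi: dicts mapping probe_id -> fluorescence intensity.
    A strongly negative ratio means HpaII produced little signal where MspI
    did, which is what we expect at a methylated site.
    The pseudocount (an assumption) avoids taking the log of zero.
    """
    ratios = {}
    for probe in hpaii.keys() & mspi.keys():
        ratios[probe] = math.log2(
            (hpaii[probe] + pseudocount) / (mspi[probe] + pseudocount)
        )
    return ratios


ratios = help_log_ratios({"p1": 7.0, "p2": 0.0}, {"p1": 7.0, "p2": 7.0})
```

In this toy input, probe `p1` has equal signal in both channels (unmethylated), while probe `p2` has signal only in the MspI channel (methylated).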

Computational Analysis

One of the key benefits of this technique is that it's fairly light on dry-lab analysis. Though results are generally more qualitative and inexact, it is a far simpler and easier protocol than bisulfite sequencing.

There have, however, been attempts to increase the quantitative power of the assay. A data analysis pipeline in R was developed to take signal intensity data from the microarrays, run various normalizations, and quantify the differences. This pipeline is not packaged as a tool; it consists of a series of computational steps in R that are beyond the scope of this introductory epigenomics lesson. The detailed paper is linked here for reference.

Review of DNA Methylation Analysis Techniques

As mentioned before, bisulfite sequencing generally provides more quantitatively solid results than the HELP assay, but also requires greater wet-lab effort and computational power.⁵ Indeed, the computational tools for bisulfite sequencing are plentiful, and will generate robust analyses. However, the HELP assay provides a great low-effort alternative for determining DNA methylation levels in a generic context.

From a broader perspective, a recurring limitation of DNA methylation analysis is that any given result is only a snapshot of a single cell at a given point in time. This means repeating an experiment on the same organism may yield vastly different results given that a different cell in a different point in time is used.

A better method would be one that uses continual measurement of methylation instead of endpoint analysis; this way, we may get a better glimpse into the dynamic mechanisms of true genomic methylation. Such tools and techniques are being developed and implemented today.

Analysis Techniques: Histone Modifications

Background

The goal of histone modification analysis is simply to identify the levels of modification or locate where on the genome these modifications are being made.

In this section we will explore two popular techniques to do so:

ChIP-seq, utilized in studying site-specific modifications
Mass Spectrometry (MS), used to precisely compare global levels of specific modifications among different samples

I. ChIP-Seq

Overview

Chromatin Immunoprecipitation (ChIP) is a powerful tool used to analyze protein interactions with DNA. Specific antibodies are used to isolate a protein or modification of interest. This lets us identify the location and abundance of that protein or modification within the genome, giving us insight into chromatin structure and gene expression.⁷

Figure 11 Wetlab protocol for ChIP-Seq.

Let’s take an example: We have two samples, one from clear cell renal carcinoma and one from regular kidney cells.¹² We want to find sites where expression is higher in the clear cell renal carcinoma sample in order to identify potential histone modification sites. For this example, we will be using H3K27ac, meaning acetylation of lysine 27 on histone H3. Given what we’ve learned about histone acetylation and ChIP-seq, how do we find these modification sites?

We know that histone acetylation happens when a HAT attaches an acetyl group to a histone tail. By using ChIP with a modification-specific antibody, we can find where along the genome the histones carrying this acetyl group are located. In our case, we will use an anti-histone H3K27ac antibody.

Lab Technique

Here is the basic protocol⁷ for both samples:

Crosslink cells with formaldehyde
Isolate chromatin and shear it into fragments
Immunoprecipitate with protein-specific antibody
Reverse crosslinks and purify DNA for sequencing

Next, we prepare the sequence libraries by attaching sequence adaptors to both ends of each fragment. We must then perform PCR to amplify the library and check its concentration. The next step is sequencing. Many sequencing techniques can be used in this step.

We now move to computational analyses conducted on these data files.

Computational Analysis

Before jumping into the fun stuff we must first:

Clean the raw reads by removing adaptors and PCR duplicates
Computationally align fragments to the reference genome
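Duplicate removal can be sketched as follows. This assumes duplicates are identified by alignment position, a common convention that the text does not spell out, so the sketch operates on reads that have already been aligned; the tuple layout is a hypothetical stand-in for real alignment records.

```python
def remove_pcr_duplicates(alignments):
    """Keep the first read seen at each (chromosome, start, strand).

    alignments: iterable of (chrom, start, strand, read_id) tuples.
    Reads sharing chromosome, start coordinate, and strand are assumed to be
    PCR copies of the same original fragment.
    """
    seen = set()
    kept = []
    for chrom, start, strand, read_id in alignments:
        key = (chrom, start, strand)
        if key in seen:
            continue  # likely PCR duplicate, drop it
        seen.add(key)
        kept.append((chrom, start, strand, read_id))
    return kept


reads = [
    ("chr1", 100, "+", "r1"),
    ("chr1", 100, "+", "r2"),  # same position and strand as r1
    ("chr1", 100, "-", "r3"),
    ("chr2", 100, "+", "r4"),
]
unique_reads = remove_pcr_duplicates(reads)
```

Established tools refine this idea, for example by keeping the highest-quality read of each duplicate group rather than the first one seen.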

Next, we use peak calling, in which an algorithm identifies regions where there are more reads than background. Popular software packages include MACS, PeakSeq, SICER, and CCAT.² From here, we can visualize the data on a genome browser.

Figure 12 Visualization of peak-calling software output on a genome browser.

We can clearly see that the bottom two rows (clear cell renal carcinoma) have higher levels of binding than the top two rows (regular kidney cells). More specifically, this ChIP-seq shows an active ZNF395 super-enhancer only in the clear cell renal carcinoma cells. We have found where on the genome the mark is enriched and which specific gene is likely overexpressed as a result.¹²
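The peak-calling step described above can be illustrated with a toy example. This is a deliberately simplified sketch, not the algorithm used by MACS or the other tools named: it assumes reads have already been counted in fixed-width bins, uses the overall mean count as a Poisson background rate, and picks an arbitrary p-value cutoff.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed by direct summation."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def call_peaks(bin_counts, p_threshold=1e-3):
    """Return inclusive (start_bin, end_bin) runs of significantly enriched bins."""
    if not bin_counts:
        return []
    lam = sum(bin_counts) / len(bin_counts)  # assumed background: global mean
    enriched = [poisson_sf(count, lam) < p_threshold for count in bin_counts]
    peaks, start = [], None
    for i, flag in enumerate(enriched):
        if flag and start is None:
            start = i  # a new peak begins
        elif not flag and start is not None:
            peaks.append((start, i - 1))  # the current peak ends
            start = None
    if start is not None:
        peaks.append((start, len(enriched) - 1))
    return peaks


peaks = call_peaks([2, 1, 3, 2, 20, 25, 2, 1, 2, 3])
```

Real peak callers estimate the background locally, often from a control sample, and correct for multiple testing; both are omitted here for brevity.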

II. Mass Spectrometry

Overview

Mass Spectrometry (MS) gives us an unbiased quantitative analysis of post-translational histone modifications. Unlike ChIP-seq, MS is designed to output a large variety of histone modifications and their relative abundance within a single analysis. However, this method involves more wet-lab preparation. The bottom-up method is the most popular, in which intact proteins are digested into short peptides for nano-liquid chromatography and mass spectrometry.⁹

Let’s take this example: We want to analyze histones from human embryonic stem cells (hESCs) with and without retinoic acid (important in cell growth, differentiation, and organogenesis). We want to figure out the relative abundance of histone modified peptides⁹. How can we utilize MS to figure this out?

Lab Technique

Because of the sophistication of the lab protocol, we will not be focusing on the biology of the lab technique.

However, a simplified step-by-step protocol⁹ is as follows:

Harvest cells of interest and isolate the nuclei
Perform histone purification
Perform histone variant fractionation
Histone quantification
Histone derivatization at lysine residues
Histone digestion using Trypsin
Propionylation of Histone Peptides at N-termini
Stage-tip desalting

We will now focus on the computational analyses conducted on these MS raw data files.

Computational Analysis

We will import these raw files into software that performs peak area integration. Two popular software packages for this are EpiProfile and Skyline.

Figure 13 Visualization of peak area integration output.

Using these ion chromatograms, we can find the area under the curve to estimate the abundance of each peptide. To find the relative abundance of a given modified form, we divide its area by the total area of all forms (modified and unmodified) of that peptide. Tools like Mascot can be used in carrying out this analysis.⁹
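The relative abundance calculation amounts to a normalization over all forms of one peptide. A minimal sketch, assuming the integrated peak areas are already available; the form names and area values below are made up for illustration and are not results from the cited study.

```python
def relative_abundance(peak_areas):
    """Convert integrated peak areas into relative abundances.

    peak_areas: dict mapping each form of ONE peptide (unmodified and
                modified) to its integrated chromatogram area.
    Returns a dict with the same keys whose values sum to 1.
    """
    total = sum(peak_areas.values())
    if total == 0:
        raise ValueError("total peak area is zero; nothing to normalize")
    return {form: area / total for form, area in peak_areas.items()}


# Hypothetical areas for three forms of a single histone peptide.
abundances = relative_abundance({"unmodified": 50.0, "K9me1": 25.0, "K9ac": 25.0})
```

Because each peptide is normalized separately, abundances are comparable between samples for the same peptide but not directly between different peptides.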

Figure 14 Results of relative abundance analysis.

In figure A, we can see the relative quantification of the histone H3 peptide KQLATKAAR (aa 18 - 26). In figure B, we can see the relative quantification of the histone H3 peptide KSTGGKAPR (aa 9 - 17). In figure C, we can see the relative abundance of detected peptides for histone H3 with and without cell treatment with retinoic acid.⁹

We can see that hESCs had a reduction of acetylated peptides when stimulated for differentiation.

These data only show the 35 modified forms of histone H3 that were quantified.⁹ However, MS has the power to quantify more than 200 proteoforms, including all variants and low-abundance modifications!

Review of Histone Modification Analysis Techniques

The most popular method of analyzing histone modifications is ChIP-seq. It is a very powerful tool for analyzing protein interactions with DNA, and is well suited to finding and quantifying histone modifications. However, this technique has low throughput and is biased against hypermodified proteins. Mass spectrometry, although more tedious, is more precise, returning the relative abundance of several histone variants on a global level in a single analysis.

The analysis of histone modifications is becoming increasingly prevalent in cancer, pathology, and developmental research as well as precision medicine. With the emergence of these new technologies and research, we can expect exciting, innovative approaches to medicine and healthcare.

Concluding Remarks

We've scratched the surface of major epigenomic modifications and some analysis techniques used to quantify them. Yet, the opportunity for exploration and development is vast.

We hope to have, at the very least, demonstrated the importance of these epigenomic modifications, and inspired further research into their future applications.
