RNA sequencing (RNA-seq) is a method of investigating the transcriptome of an organism using deep-sequencing techniques. The RNA content of a sample is directly sequenced after appropriate library construction, providing a rich data set for analysis. The high level of sensitivity and resolution provided by this technique makes it a valuable tool for investigating the entire transcriptional landscape. The quantitative nature of the data and the high dynamic range of the sequencing technology enable gene expression analysis with high sensitivity. The single-base resolution of the data provides information on single nucleotide polymorphisms (SNPs), alternative splicing, exon/intron boundaries, untranslated regions, and other elements. Additionally, prior knowledge of the reference sequence is not required to perform RNA-seq, allowing for de novo transcriptome analysis and detection of novel variants and mutations. RNA-seq is an extremely powerful way to investigate transcriptomes, but it requires care to achieve the highest quality of data.
Factors to consider in RNA-seq
The first factor to consider is enrichment of the sample. Total RNA generally contains only a very small percentage of coding or functional RNA; ribosomal RNA (rRNA: up to 80–90% of the total RNA), and to a lesser degree transfer RNA (tRNA), make up the majority of the RNA in a sample. In order not to use 80–90% of one’s sequencing capacity on repetitive rRNA sequences, rRNA is generally removed from the sample prior to sequencing. This is most often achieved either by specifically depleting rRNA or by selectively enriching for polyadenylated RNA using oligo-dT. Depletion of rRNA preserves information on both coding and noncoding RNA (an important research topic), while enrichment of the poly A fraction preserves only coding mRNA. Poly A enrichment may miss RNAs that lack a poly A tail, as well as RNAs with high turnover rates.
Some other methods of avoiding rRNA also exist, such as selective degradation of abundant transcripts or amplification techniques that are biased away from rRNA. However, these are not as common as rRNA depletion or poly A enrichment, and may have the side effect of skewing the transcript representation away from normal.
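To put the rRNA figures above into numbers, the following minimal sketch estimates how many reads in a run would be expected to come from non-rRNA molecules. The function name `informative_reads`, the run size of 20 M reads, and the 5% residual rRNA after depletion are illustrative assumptions, not values from a specific protocol; only the 80–90% range comes from the text.

```python
def informative_reads(total_reads, rrna_fraction):
    """Estimate the number of reads expected from non-rRNA molecules.

    total_reads   -- total reads produced by the run
    rrna_fraction -- fraction of the library that is rRNA (0.0 to 1.0)
    """
    if not 0.0 <= rrna_fraction <= 1.0:
        raise ValueError("rrna_fraction must be between 0 and 1")
    return round(total_reads * (1.0 - rrna_fraction))


if __name__ == "__main__":
    total = 20_000_000  # hypothetical run size, for illustration only
    scenarios = [
        ("no rRNA removal, 80% rRNA", 0.80),
        ("no rRNA removal, 90% rRNA", 0.90),
        ("after depletion, 5% residual rRNA (assumed)", 0.05),
    ]
    for label, fraction in scenarios:
        usable = informative_reads(total, fraction)
        print(f"{label}: {usable:,} of {total:,} reads are non-rRNA")
```

Under these assumptions, a 20 M read run without rRNA removal yields only 2–4 M non-rRNA reads, which is why depletion or poly A enrichment is usually performed before sequencing.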
Another issue to consider is the size of the RNA to be investigated. RNA transcripts span a wide range of sizes; experiments focusing on small RNA (e.g., microRNA or RNAs in the 15–35 nt size range) generally require specialized purification and library construction protocols that differ from those used for general RNA analysis. Most other size fractions of RNA can be sequenced together, since one of the common steps in RNA-seq is fragmentation of the RNA population down to a common size, such as 200–300 nt.
Once the method of ribosomal removal and the size fraction to be investigated have been chosen, the RNA is made into a library. For most sequencing machines, this involves first fragmenting the RNA, then creating double-stranded cDNA through reverse transcription. This double-stranded cDNA may then be handled as normal genomic DNA throughout the remaining library construction process. If directional information (strandedness) of the RNA is to be preserved, modified library construction protocols must be used, such as ligating adapters directly to mRNA or marking one of the cDNA strands such that it can be removed prior to sequencing.
When planning the sequencing run itself, the three major issues to consider are read depth, read length, and whether or not to use paired-end data. Read depth provides information on the abundance of RNA transcripts, and greater read depth allows more sensitive detection of rare transcripts. Read length is important because longer reads are more sensitive for detecting splicing events (intron–exon boundaries, exon–exon boundaries). Paired-end data provide greater information on transcript structure, particularly with widely spaced exons. Generally speaking, de novo analysis or searches for novel structural variation will require both high read depth and long reads, and will benefit from paired-end sequencing. A typical experiment may have 100–200 M reads, 2 × 50–100 bp. In contrast, expression analysis or profiling will benefit from high read depth, but read length and paired-end data provide little extra advantage. A typical experiment for this application may have 10–30 M reads, 1 × 35–100 bp.
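The typical run sizes above can be compared by total sequence yield. The sketch below is a rough planning aid, not a vendor calculator: the `RunPlan` class and its field names are hypothetical, and it assumes that the quoted read counts refer to sequenced fragments, so a paired-end run produces two reads per fragment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunPlan:
    """Hypothetical description of a sequencing run, for yield estimates."""
    fragments: int     # number of sequenced fragments (assumed meaning of "reads")
    read_length: int   # bases per read
    paired_end: bool   # True for 2 x N runs, False for 1 x N runs

    def total_bases(self):
        """Total bases sequenced: fragments x read length x ends per fragment."""
        ends = 2 if self.paired_end else 1
        return self.fragments * self.read_length * ends


if __name__ == "__main__":
    # Lower and upper bounds of the typical ranges quoted in the text.
    plans = {
        "de novo, low end (100 M, 2 x 50)": RunPlan(100_000_000, 50, True),
        "de novo, high end (200 M, 2 x 100)": RunPlan(200_000_000, 100, True),
        "expression, low end (10 M, 1 x 35)": RunPlan(10_000_000, 35, False),
        "expression, high end (30 M, 1 x 100)": RunPlan(30_000_000, 100, False),
    }
    for label, plan in plans.items():
        print(f"{label}: {plan.total_bases() / 1e9:.2f} Gb")
```

Under that assumption, the de novo examples correspond to roughly 10–40 Gb of sequence, while the expression profiling examples correspond to roughly 0.35–3 Gb, which illustrates how strongly the choice of application drives the sequencing capacity required.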