Chapter 5: Genomics
In this chapter, we will learn how to work with genomic data and genome annotations and associated file formats. It is hoped that this chapter will serve as a basic introduction to genomics, with the understanding that it is a far broader field than the types of genome annotations presented here. Many of the chapters, such as the next chapter “Transcriptomics”, and many before, could be considered as a subject matter as part of the field of genomics.
5.1 The Three Fundamental “Gotchas” of Genomics
Whereas many sciences have three or four fundamental laws that describe them, research in genomics often encounters exceptions to rules rather than universal principles. However, when learning genomics for the first time, researchers often encounter the same problems that cause the same strange things to happen in their data. These errors are so common for beginners in genomics, they can be thought of as the “fundamental gotchas” of genomics:
- Different genome assemblies
- Different chromosome deflines
- 0 vs 1-based coordinates
5.1.1 Different Genome Assemblies/Annotations
The first fundamental gotcha involves the fact that for any model organism, there are typically many different genome assembly versions. For example, one might download two different datasets, such as a set of genomic positions, and try to compare them, only to realize that they were compared against different assemblies of the genome. In many cases, such as human genome assembly versions, each being incremental improvements in the accuracy of the assembly.
5.1.2 Different Chromosome Deflines
Depending on from where you downloaded the genome annotation, you could possibly encounter an assembly that uses one defline for “Chromosome 1”, and another that uses a different one. For example, some annotations such as from UCSC Genome Bioinformatics, would use “chr1”, and other databases may just use “1”. You definitely need to double check these values when comparing different formats.
5.1.3 0 vs 1-based coordinates
The third fundamental gotcha involves comparing data from different file formats, and not being aware that some file formats use [latex]0[/latex]-base positions, and others use [latex]1[/latex]-based positions. Whereas [latex]0[/latex]-based coordinates consider the first position of a chromosome to be position [latex]0[/latex], [latex]1[/latex]-based coordinates consider the first position of a chromosome to be position [latex]1[/latex]. In many ways, [latex]0[/latex] is more natural for computer science, because typically programming languages use [latex]0[/latex]-based coordinates to describe strings. Biologists may be more accustomed to counting positions starting at [latex]1[/latex]. There are many different file formats for storing the positions of genomic locations, each use one of these two position systems, so it’s important to know which is which, and be aware that there is a difference.
Format | Position system |
GFF/GTF | 1-based |
BLAST results | 1-based |
BLAT results | 1-based |
maf | 0-based |
BAM | 0-based |
SAM | 1-based |
BED | 0-based |
wig | 1-based |
5.2 Genomic Data and File Formats
5.2.1 Formats for Genomic Locations
One of the most common forms of genome annotation is that of the genome location. Often we want to annotate and label a binding site for a protein, or a gene by the genomic locations of the exons. Therefore, we need to define file formats that specify genomic locations.
Often, we can describe a genomic location by four variables that specify the “genomic coordinate”. The “genomic coordinate” can be specified by a chromosome, a start position, a stop position, and a strand. Clearly, the fundamental gotchas are important when evaluating these four values. For strand, it is most common to see “+” and “-” to refer to the forward and reverse strand. However, if you look at enough databases you will find things like “F” and “R” to denote strand.
BED Files
BED files are a simple file format to describe genomic locations. The BED file was developed by UCSC Genome Bioinformatics, and is a common format found when downloading data from that site. These descriptions are derived from the descriptions at UCSC (https://genome.ucsc.edu/FAQ/FAQformat.html). The first three required BED fields are:
- chrom: The name of the sequence (e.g. chromosome or scaffold) that the feature is within.
- chromStart: The beginning position of the defined genomic region, in 0-based positions.
- chromEnd: The ending position of the defined genomic region, in 1-based positions. The chromEnd base is not included in the display of the feature, so it behaves much like ranges and substrings in python.
So using this core required format, the BED file can be a useful streamlined representation of genomic positions where strand is not important. We can expand the file format to include other information, for example when we want to store data for gene models. The next columns we could add are:
- name: Defines the name or ID of the feature or genomic region.
- score: A numerical value defining a score from 0 to 1000. Could be replaced with a dot.
- strand: The strand of the genomic region, represented by “+” for forward strand or “-” for reverse complement.
- thickStart: The beginning position of the portion of the genomic region to be drawn larger than the rest, for example a CDS to stand out from the rest of the gene.
- thickEnd: The ending position of the portion of the genomic region to be drawn larger.
- itemRgb: This is used to set the color of the genomic feature for some genome browsers, defined as a value for red, green, and blue as a number from 0 to 250 for each. For example, green would be (0,255,0), or blue would be (0,0,255), or many shades in between, such as a type of blue-green might be (0,100,100).
- blockCount: For features that can be in multiple pieces, such as eons, this sets the number of blocks.
- blockSizes: For features that can be in multiple pieces, such as exons, this comma-separated string of lengths, sets the length of each piece.
- blockStarts: This comma-separated string of positions defines the start position of the exons or blocks.
GFF Files
GFF format is a great way to store gene annotation information. In many ways the columns of GFF are designed for genes, but by including a dot “.” for the columns that don’t apply to your annotation, you can also store more simple annotations such as the locations of a motif instance.
- seqname – The name of the sequence. Must be a chromosome or scaffold.
- source – The program that generated this feature.
- feature – The name of this type of feature. Some examples of standard feature types are “CDS”, “start_codon”, “stop-codon”, and “exon”.
- start – The starting position of the feature in the sequence. The first base is numbered 1.
- end – The starting position of the feature (inclusive).
- score – A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers= darker gray). If there is no score value, enter “.”.
- strand – Valid entries include “+”, “-“, or “.” when strand doesn’t apply.
- frame – If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be “.”.
- group – All lines with the same group are linked together into a single item.
GTF Files
A GTF file is very similar to a GFF file, but with a few different specifications. The first eight GTF fields are the same as GFF. The group field has been expanded into a list of attributes. Each attribute consists of a type/value pair. Attributes must end in a semi-colon, and be separated from any following attribute by exactly one space. The attribute list must begin with the two mandatory attributes:
- gene_id value – A globally unique identifier for the genomic source of the sequence.
- transcript_id value – A globally unique identifier for the predicted transcript.
BAM/SAM
Col | Field | Type | Brief description |
1 | QNAME | String | Query template NAME |
2 | FLAG | Int | bitwise FLAG |
3 | RNAME | String | Reference sequence NAME |
4 | POS | Int | 1-based leftmost mapping POSition |
5 | MAPQ | Int | MAPping Quality |
6 | CIGAR | String | CIGAR string |
7 | RNEXT | String | Ref. name of the mate read |
8 | PNEXT | Int | Position of the mate/next read |
9 | TLEN | Int | observed Template LENgth |
10 | SEQ | String | segment SEQuence |
11 | QUAL | String | ASCII of Phred-scaled base QUALity+33 |
A BAM file is essentially a binary, typically indexed version of SAM. The program [latex]\texttt{samtools}[/latex] can allow quick retrieval of reads from a genomic region. SAM files is an optional output format for many alignment algorithms, but many people prefer to convert them to BAM files because of faster retrieval of reads for a particular genomic position due to indexing.
Quantitive Tracks
We often would like to annotate quantitative data on a genome browser; hence, it is critical to have file formats devoted to this kind of data. Consider, for example, plotting the GC content as a function of position throughout the genome.
BedGraph
track type=bedGraph
chrom1 chromStart1 chromEnd1 dataValue1
chrom2 chromStart2 chromEnd2 dataValue2
Wiggle, and BigWig
Wiggle file format is a common way to display quantitative tracks. The format is relatively simple, but takes on two different version: [latex]\texttt{variableStep}[/latex], and [latex]\texttt{fixedStep}[/latex]. Each of these are specified at the top of the file. For the fixed step, we have a fixed number of bases between positions presented in the file:
fixedStep chrom=chrN start=position step=stepInterval [span=windowSize]
dataValue1
dataValue2
dataValue3
...
Because the start position is specified, and the step size is specified, positions don’t need to be specified in the file. For the variable width, we’ll need to specify the position for each value:
variableStep chrom=chrN [span=windowSize]
chromStart1 dataValue1
chromStart2 dataValue2
chromStart3 dataValue3
...
The optional parameter “span” does not include the brackets when used in practice, and here only indicates that it is optional. The span essentially indicates that a value can be specified as applying to a range of positions, starting at the given chromStart value.
5.3 Genome Browsers
Genome browsers provide an interactive way to navigate the data associated with a genome in a visual way. Much like a web browser or an interactive map application. The file formats given above are exactly the kind of files that a genome browser would read and present visually.
5.3.1 IGV
The Integrated Genome Viewer (IGV) is a powerful desktop genome browser that allows you to relatively easily add and remove tracks and modify them. IGV allows you to export to various image formats, including scalable vector formats such as SVG files. In some cases the available RAM on one’s computer may be a limitation for loading too many tracks into IGV.
5.3.2 UCSC Genome Browser
The UCSC genome browser is a web-based genome browser. You can download, install, and host a version of the UCSC genome browser on your own computer, but you can also add tracks to the genome browser hosted at https://genome.ucsc.edu/cgi-bin/hgGateway.
5.3.3 Gbrowse
Gbrowse is probably one of the most common genome browsers out there. Many databases such as Flybase, or Wormbase have Gbrowse integrated in to the database for users to navigate the genomic data presented. Gbrowse allows for the display to be exported into multiple file formats, including scalable file formats.
5.3.4 JBrowse
Jbrowse is very similar to Gbrowse, but allows for asynchronous queries to the database, effectively making for a faster experience. One can scroll the position rapidly and have features immediately presented to the user without having to reload the page. Jbrowse allows the display to be exported in PNG, but not in a scalable file format.
5.4 Lab 6: Genome Annotation Data
In this lab, we will examine a script that will take a GFF annotation as input, and will extract protein-coding exons, concatenate them, and then translate them into a protein sequence. As usual, let’s work on this project in its own directory.
5.4.1 Part I: Storing a Genome to a Dictionary
A dictionary is a data structure with key-value pairs. Dictionaries are indexed by a “key”, which can be any string. This could be a useful concept for dealing with many sequences from a FASTA file, with the defline as the key.
>>> dict = {}
>>> dict["R2D2"] = "droid"
>>> dict["Vader"] = "Sith"
>>> dict["Yoda"] = "Jedi"
>>> print(dict)
{'Vader': 'Sith', 'R2D2': 'droid', 'Yoda': 'Jedi'}
>>> print(dict["R2D2"])
droid
Note how the dictionary looks when you print it out. You see a list of key-value pairs, separated by a colon “:”. Moreover, whereas the list is enclosed by square brackets when printed out, the dictionary is enclosed by curly braces when printed out. Nevertheless, both of them allow you to access a particular term with square brackets such as [latex]\texttt{list[2]}[/latex] or [latex]\texttt{dict["R2D2"]}[/latex].
It turns out, it is a pretty good way to store a genome, namely with the chromosome names (deflines) as the key, and the sequence as the value.
genome = {}
sequences = SeqIO.parse(genomeFasta,"fasta")
for record in sequences:
genome[record.id] = record.seq
5.4.2 Part II: Storing a GFF to a list
Next, let’s review creating a list and appending values to it. In this case, we will append (chrom,start,stop) genomic locations corresponding to CDS exons:
coords = []
GFF = open(gffFile,'r')
for line in GFF:
if "#" not in line:
chrom,source,seqtype,start,stop,score,strand,frame,attributes = line.strip().split("\t")
if "CDS" in seqtype and name in attributes:
coords.append((chrom,int(start),int(stop)))
A couple of things to note. First, we have a gene name in mind, stored in the variable “name”. We expect this to be a transcript ID because a gene ID could end up printing too many lines (one for each transcript associated with that gene). If we want to extract coding regions, it should be on a per-transcript basis. Second, we note that we are converting the positions to integers upon reading in. By default they are stored as strings, so the conversion is important. Third, note that we are excluding lines with a hash [latex]\texttt{#}[/latex] because these are typically comments that do not contain the required number of columns.
5.4.3 Part III: Find a gene of interest in Drosophila melanogaster
Let’s use NCBI Protein or NCBI Gene to find a your favorite gene. Let’s then BLAST it to Drosophila (by setting Drosophila as the taxa), and find the best Drosophila ortholog.
Next, let’s find a [latex]\textbf{flybase transcript ID}[/latex] for one of the isoforms of the gene by navigating the flybase database and Gbrowse from flybase. Note that a [latex]\textbf{flybase transcript ID}[/latex] looks like this: FBtr001828
Go to Flybase (http://flybase.org/) and search for a gene of interest, then paste your gene name into the “Jump to Gene” box on the upper left.
$ wget http://hendrixlab.cgrb.oregonstate.edu/teaching/flybase_r5.56.gff3
To use this script, you’ll need to download a flybase annotation. Download this one with wget:
$ wget http://hendrixlab.cgrb.oregonstate.edu/teaching/flybase_r5.56.gff3
Next, let’s look at the gff file. Let’s use less, but turn off word-wrap with the -S option.
$ less -S flybase_r5.56.gff3
Next, you can create a symbolic link to the genome file you created in Lab 4 (section 3.5) to your current directory with a command like this, but you may need to updated it based on where the file is:
$ ln -s ../Lab4/dm3.fa genome.fa
In this command, the first file “../Lab4/dm3.fa” is your original genome file. The second file name “genome.fa” is the name of the symbolic link you will be creating (but you can name it what you want). Now, let’s take a look at the script.
import sys
import re
from Bio import SeqIO
from Bio.Seq import Seq
# this section takes care of reading in data from user
usage = "Usage: " + sys.argv[0] + " "
if len(sys.argv) != 4:
print(usage)
sys.exit()
# read the input files/args.
genomeFasta = sys.argv[1]
gffFile = sys.argv[2]
name = sys.argv[3]
# read the fasta file into a dictionary.
genome = {}
sequences = SeqIO.parse(genomeFasta,"fasta")
for record in sequences:
genome[record.id] = record.seq
# initialize some variables
revcomp = False
coords = []
# read through the gffFile
GFF = open(gffFile,'r')
for line in GFF:
if "#" not in line:
chrom,source,seqtype,start,stop,score,strand,frame,attributes = line.strip().split("\t")
if "CDS" in seqtype and name in attributes:
coords.append((chrom,int(start),int(stop)))
if strand == "-":
revcomp = True
# reverse the order of the exons if on "-" strand
coords.sort(reverse = revcomp)
# collect the coding exons of the transcript
ORF = Seq('')
for chrom,start,stop in coords:
CDS = genome[chrom][start-1:stop]
if revcomp:
CDS = genome[chrom][start-1:stop].reverse_complement()
# concatenate the coding sequence
ORF += CDS
# transcribe,translate, and print
RNA = ORF.transcribe()
Protein = RNA.translate()
print(Protein)
Finally, after you copy the text of the above script into a file called “extractGeneAndProtein.py”, and put into a scripts directory you can run the script using the transcript ID for your favorite gene. Note, here we are using the genome FASTA file created in a previous lab, Lab 4 (section 3.5), but with the symbolic link created above.
$ python scripts/extractGeneAndProtein.py genome.fa flybase_r5.56.gff3 FBtr0089362