Continuing from my last post… I am building the table for mtDNA sequences from GenBank. GenBank sequences are the main source for defining branches on the maternal tree. Therefore, they help us understand not only the origins of a branch but the current structure of the tree — what we know and its limits.

GenBank Sample Requirement

From the first post, we have a user story, which states what the user wants: As a user with mtDNA results, I would like results and origin information from GenBank samples.

Encyclopedia of mtDNA Origins - Geno 2 and Genbank Requirements

GenBank Background

GenBank is the National Institutes of Health (NIH) database for genetic sequence information.

GenBank includes thousands of complete or nearly complete mtDNA sequences. These come from academic researchers and citizen science donations. There are now at least 31,000 complete human mtDNA samples in the database.

GenBank samples are useful for two reasons. First, samples come from full-sequence methods such as Sanger sequencing and next-generation sequencing, so there is the potential to find new branches. Second, many samples have demographic information.

GenBank samples also have disadvantages. Some samples are incomplete, which may be due to partial testing or a partial failure of testing. Many samples come from medical studies or lack demographic information. Finally, samples are stored in FASTA format.

Encyclopedia of mtDNA Origins - FASTA format at GenBank

FASTA format is useful for many tools, but it is not easy for a reader to scan. Converting sequences to lists of variants compared to the RSRS reference sequence is much more useful.
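
To make the idea concrete, here is a minimal Python sketch of turning a FASTA record into a variant list. It assumes the sample and the reference are already aligned and the same length, and it reports substitutions only. Real tools such as MitoTool also have to deal with insertions, deletions, and alignment, so treat this as an illustration and not as how MitoTool works. The sequences and the sample name are made up; they are not real RSRS or GenBank data.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {header: "".join(parts) for header, parts in records.items()}


def list_substitutions(reference, sample):
    """List substitutions as reference base + 1-based position + sample base."""
    if len(reference) != len(sample):
        raise ValueError("this sketch needs pre-aligned, equal-length sequences")
    variants = []
    for position, (ref_base, base) in enumerate(zip(reference, sample), start=1):
        # Skip matching positions and unread positions (N).
        if base != ref_base and base != "N":
            variants.append(f"{ref_base}{position}{base}")
    return variants


# Toy data for illustration only; not real RSRS or GenBank sequences.
toy_reference = "GATCACAGGT"
toy_fasta = ">toy_sample_1\nGATTA\nCAGNT\n"

toy_records = parse_fasta(toy_fasta)
toy_variants = list_substitutions(toy_reference, toy_records["toy_sample_1"])
print(toy_variants)
```

A list like this is far easier to scan than the raw sequence, which is the point of the conversion.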

Converting from FASTA to variant lists with MitoTool

This conversion is possible with the MitoTool website and the downloadable version of the program.

Encyclopedia of mtDNA Origins - MitoTool

While there is an online version, running many thousands of sequences is best done with the desktop version. To build an initial dataset, I downloaded zip files of the sequences from Phylotree.org. These also include many sequences that are not in GenBank but that Dr. van Oven has extracted from published journal articles. In a future update, I will download the samples in GenBank that are not on the Phylotree.org website.

I ran the samples through the program in batches of 2,000, using the Whole mtDNA tab with the RSRS option.
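
For anyone who wants to script the batching step, a rough Python sketch follows. It assumes the sequences are already loaded as (header, sequence) pairs. The function names, the stand-in records, and the output file names are all hypothetical and not part of MitoTool or Phylotree.org.

```python
def batch_records(records, batch_size=2000):
    """Yield consecutive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def to_fasta(records):
    """Format (header, sequence) pairs as FASTA text."""
    return "".join(f">{header}\n{sequence}\n" for header, sequence in records)


# Hypothetical stand-in records; real input would be the downloaded sequences.
demo_records = [(f"sample_{i}", "GATCACAGGT") for i in range(4500)]
demo_batches = list(batch_records(demo_records))
print([len(batch) for batch in demo_batches])

# Each batch could then be written to its own file, for example:
# for number, batch in enumerate(demo_batches, start=1):
#     with open(f"batch_{number:03d}.fasta", "w") as handle:
#         handle.write(to_fasta(batch))
```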

Encyclopedia of mtDNA Origins - Samples Processed by MitoTool

Note that the desktop version of MitoTool currently only goes up to Phylotree Build 16. MitoTool saves its results as XML files that can be opened with Excel.

Encyclopedia of mtDNA Origins - MitoTool Results Opened in Excel

There are five columns: Name, Haplogroup, Missing, Private, and Variants. Name is identifying information; if the sample is from GenBank, it includes the GenBank accession number. Haplogroup is the Phylotree Build 16 haplogroup. Where more than one haplogroup determination is possible, there is a comma-separated list. Missing contains mutations that are expected based on the path from the RSRS sequence to the haplogroup and subclade but are not found in the sample. Private contains mutations that are present in the sample but are not part of that path. They are not private in the sense of unique: in most cases, too few people have been sequenced to show whether a mutation is unique to one person. Variants is the RSRS-based haplotype.
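
One way to picture the Missing and Private columns is as two set differences between the variants expected on the haplogroup path and the variants seen in the sample. The Python sketch below assumes that reading of the columns and handles simple substitutions only. It is not MitoTool's actual algorithm, and the variant names are invented for the example.

```python
def variant_position(variant):
    """Extract the numeric position from a simple substitution such as 'C200T'."""
    return int("".join(ch for ch in variant if ch.isdigit()))


def missing_and_private(expected_path, sample_variants):
    """Return (expected but absent, present but not expected), ordered by position."""
    expected = set(expected_path)
    observed = set(sample_variants)
    missing = sorted(expected - observed, key=variant_position)
    private = sorted(observed - expected, key=variant_position)
    return missing, private


# Invented variants for illustration; not a real haplogroup path.
toy_path = ["A100G", "C200T", "G300A"]
toy_sample = ["A100G", "G300A", "T450C"]

toy_missing, toy_private = missing_and_private(toy_path, toy_sample)
print("Missing:", toy_missing)
print("Private:", toy_private)
```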

To clean up the data, I replaced all blank fields in the Missing and Private columns with the word none. I then added a column for Hg IDs. I put additional information about samples in its own field. Finally, I cleaned the Name column so that it contained only GenBank accession numbers.

Where there was more than one possible haplogroup, I moved the sample to a separate file to be manually reviewed later.
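
I did this cleanup by hand in Excel, but the same steps could be scripted. The Python sketch below is one possible version and not the process I actually used. It assumes each row is a dictionary keyed by the five column names, that an accession number looks like one or two capital letters followed by five to eight digits, and that the Name field contains it as a separate word. The rows and accession numbers are made-up placeholders.

```python
import re

# Assumed accession shape: one or two letters followed by five to eight digits.
ACCESSION_PATTERN = re.compile(r"\b[A-Z]{1,2}\d{5,8}\b")


def clean_row(row):
    """Return a cleaned copy of one result row (a dict keyed by column name)."""
    cleaned = dict(row)
    for column in ("Missing", "Private"):
        # Blank or whitespace-only fields become the word "none".
        if not cleaned.get(column, "").strip():
            cleaned[column] = "none"
    match = ACCESSION_PATTERN.search(cleaned.get("Name", ""))
    if match:
        # Keep only the accession number in the Name column.
        cleaned["Name"] = match.group(0)
    return cleaned


def split_by_haplogroup_count(rows):
    """Separate rows with one haplogroup from rows listing several."""
    single, needs_review = [], []
    for row in rows:
        if "," in row.get("Haplogroup", ""):
            needs_review.append(row)
        else:
            single.append(row)
    return single, needs_review


# Placeholder rows for illustration; the accession numbers are not real samples.
toy_rows = [
    {"Name": "XX000001 toy sample", "Haplogroup": "T1b",
     "Missing": "", "Private": "T450C", "Variants": "A100G T450C"},
    {"Name": "XX000002 toy sample", "Haplogroup": "H1, H1a",
     "Missing": "C200T", "Private": " ", "Variants": "A100G"},
]

toy_cleaned = [clean_row(row) for row in toy_rows]
toy_single, toy_review = split_by_haplogroup_count(toy_cleaned)
print(len(toy_single), "ready,", len(toy_review), "for manual review")
```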

Building the GenBank Sample Database

As with the Geno 2 content type, I am creating GenBank as an Advanced Content Type. Most fields match those in the MitoTool file. I have again created both Build 16 and Build 17 fields as relationships to the mtDNA Stories. In addition, I created another relationship field, authorship. This one links the sample to the entry in the site's Journal Article Archive for the journal article where it was published.

Encyclopedia of mtDNA Origins - GenBank Fields (P1)

Encyclopedia of mtDNA Origins - GenBank Fields (P2)

GenBank Sample Templates

For the GenBank section, there are two sets of templates. The first set consists of the Get Genbank Data template and an update to the mtDNA Stories template.

The Get Genbank Data template returns matching rows for the results table.
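
The real template is built in the site's content system, but the logic it performs is simple to sketch. The Python below is only an illustration of "return the matching rows": it assumes each sample is a dictionary, and the key names (hg_id, haplogroup, publication, missing, private) and the toy samples are hypothetical stand-ins for the actual fields.

```python
def get_genbank_rows(samples, haplogroup):
    """Return display rows for samples assigned to the given haplogroup."""
    rows = []
    for sample in samples:
        if sample["haplogroup"] == haplogroup:
            rows.append({
                "Hg ID": sample["hg_id"],
                "Publication": sample["publication"],
                "Missing Variants": sample["missing"],
                "Additional Variants": sample["private"],
            })
    return rows


# Hypothetical samples for illustration only.
toy_samples = [
    {"hg_id": "T1b-0001", "haplogroup": "T1b", "publication": "Toy et al.",
     "missing": "none", "private": "T450C"},
    {"hg_id": "H1-0001", "haplogroup": "H1", "publication": "Toy et al.",
     "missing": "C200T", "private": "none"},
]

toy_table = get_genbank_rows(toy_samples, "T1b")
print(toy_table)
```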

Encyclopedia of mtDNA Origins - Get Genbank Data

The second, the updated mtDNA Stories template, creates the table and calls the first template. It uses the WP JQuery DataTable plugin to add DataTable features.

Encyclopedia of mtDNA Origins - Add GenBank to the mtDNA Story Template

The second set of templates creates individual pages for sample details. In the results table, the Hg ID links to a details page. This page includes the GenBank ID, a link to the external source of the information on the sample, the publication information for the sample, and extended information about the haplotype.

Because I made GenBank an Advanced Content Type, I created a PODs Page Template to build the details page. This is done with a little PHP (I know just enough to make me dangerous).

Encyclopedia of mtDNA Origins - GenBank Page Template

The page template references a second template, single_genbank_template.

Encyclopedia of mtDNA Origins - single_genbank_template

Looking at Results

Now to see how it looks. Here is the GenBank section on the page for T1b. The columns shown on this page are Hg ID, Publication, Missing Variants, and Additional Variants.

Encyclopedia of mtDNA Origins - GenBank Section

This is a sample details page. On it are full details about the sample with links to publication and source information.

Encyclopedia of mtDNA Origins - GenBank Sample Details

One more requirement down. To be continued…