Continuing from my last post… I am building the table for mtDNA sequences from GenBank. GenBank sequences are the main source for defining branches on the maternal tree. Therefore, they help us understand not only the origins of a branch but the current structure of the tree — what we know and its limits.
GenBank Sample Requirement
From the first post, we have a user story, which states what the user wants: As a user with mtDNA results, I would like results and origin information from GenBank samples.
GenBank includes thousands of complete or nearly complete mtDNA sequences. These come from academics and citizen science donations. There are now at least 31,000 complete human mtDNA samples in the database.
GenBank samples are useful for several reasons. First, the samples come from full-sequence methods such as Sanger sequencing and next-generation sequencing, so they have the potential to reveal new branches. Second, many samples have demographic information.
GenBank samples also have disadvantages. Some samples are incomplete, either because only part of the genome was tested or because part of the test failed. Many samples come from medical studies or do not have demographic information. Finally, samples are stored in FASTA format.
FASTA format is useful for many tools, but it is not easy for a reader to scan. Converting sequences to lists of variants compared to the RSRS reference sequence is much more useful.
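To make the FASTA point concrete, here is a minimal Python sketch that reads FASTA text into a dictionary of header and sequence. It is only an illustration and not part of the MitoTool workflow. The function name `parse_fasta` and the sample record are made up for this example. It does not call variants against the RSRS; that step needs an alignment, which is what MitoTool handles.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} from FASTA-formatted lines.

    A header line starts with '>'; every following line up to the
    next header is part of that record's sequence.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


# Hypothetical record, shortened for illustration only.
example = [
    ">HYPOTHETICAL01 Homo sapiens mitochondrion, complete genome",
    "GATCACAGGTCTATCACCCT",
    "ATTAACCACTCACGGGAGCT",
]
for name, sequence in parse_fasta(example).items():
    print(name.split()[0], len(sequence), "bases")
```

Even parsed this way, a full mtDNA record is roughly 16,500 letters, which is why a short list of differences from the reference is much easier to read.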
Converting from FASTA to variant lists with MitoTool
This conversion is possible with either the MitoTool website or the downloadable version of the program.
While there is an online version, running many thousands of sequences is best done with the desktop version. To build an initial data-set, I downloaded zip files of the sequences from Phylotree.org. This set also includes many sequences that are not in GenBank but that Dr. Van Oven has extracted from published journal articles. In a future update, I will download the samples in GenBank that are not on the Phylotree.org website.
I ran the samples through the program in groups of 2,000, using the Whole mtDNA tab with the RSRS option.
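Splitting a long list of samples into groups of 2,000 can be sketched in a few lines of Python. This is an assumption about how one might prepare the batches, not a description of how MitoTool works internally. The name `batches` and the sample IDs are hypothetical.

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence[str], size: int = 2000) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Hypothetical sample identifiers, for illustration only.
sample_ids = [f"sample_{n}" for n in range(4500)]
for number, group in enumerate(batches(sample_ids), start=1):
    print(f"batch {number}: {len(group)} samples")
```

The last group is simply whatever is left over, so no sample is dropped when the total is not a multiple of 2,000.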
Note that the desktop version of MitoTool currently only goes up to Phylotree Build 16. MitoTool results save as XML files that can be opened with Excel.
There are five columns: Name, Haplogroup, Missing, Private, and Variants. Name is identifying information. If the sample is from GenBank, it includes the GenBank Accession number. Haplogroup is the Phylotree Build 16 haplogroup. Where more than one haplogroup determination is possible, there is a comma-separated list. Missing contains mutations that are expected based on the path from the RSRS sequence to the haplogroup and subclade but were not found in the sample. Private contains mutations that are present in the sample and not part of the path from the RSRS sequence to the haplogroup and subclade. They are not private in the sense of unique: in most cases, too few people have been sequenced to show whether a mutation is unique to one person. Variants is the RSRS-based haplotype.
To clean up the data, I replaced all blank fields in the Missing and Private columns with none. I then added a column for Hg IDs. I put additional information about samples in its own field. Finally, I cleaned the Name column so that it contained only GenBank Accession numbers.
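The same cleanup could be scripted instead of done by hand. The sketch below is a hedged Python illustration. The function `clean_row`, the accession pattern, and the sample row are all assumptions made for this example. The pattern (one or two capital letters followed by five to eight digits, with any version suffix dropped) covers common nucleotide accession shapes but is not an official definition.

```python
import re
from typing import Dict

# Assumed accession shape: 1-2 capital letters + 5-8 digits, e.g. "AB123456".
# A trailing version such as ".1" is matched but not kept.
ACCESSION = re.compile(r"\b([A-Z]{1,2}\d{5,8})(?:\.\d+)?\b")


def clean_row(row: Dict[str, str]) -> Dict[str, str]:
    """Return a cleaned copy of one result row.

    - Blank Missing/Private fields become the word "none".
    - Name is reduced to the accession number when one is found;
      otherwise Name is left unchanged.
    """
    cleaned = dict(row)
    for column in ("Missing", "Private"):
        if not (cleaned.get(column) or "").strip():
            cleaned[column] = "none"
    match = ACCESSION.search(cleaned.get("Name") or "")
    if match:
        cleaned["Name"] = match.group(1)
    return cleaned


# Hypothetical row; the accession and variants are made up.
row = {
    "Name": "AB123456.1 Homo sapiens mitochondrion",
    "Haplogroup": "T1b",
    "Missing": "",
    "Private": "  ",
    "Variants": "A73G",
}
print(clean_row(row))
```

Rows whose Name does not match the pattern are left alone, so samples that are not from GenBank keep their original label.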
Where there was more than one possible haplogroup, I moved the sample to a separate file to be manually reviewed later.
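Because ambiguous calls appear as a comma-separated list in the Haplogroup column, setting them aside is easy to automate. This is a minimal Python sketch under that assumption. The function name and the sample rows are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def split_by_haplogroup_count(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Separate rows with exactly one haplogroup from rows needing review.

    Rows with several comma-separated haplogroups, or with none at all,
    go to the review list.
    """
    resolved: List[Row] = []
    review: List[Row] = []
    for row in rows:
        calls = [h.strip() for h in row.get("Haplogroup", "").split(",") if h.strip()]
        if len(calls) == 1:
            resolved.append(row)
        else:
            review.append(row)
    return resolved, review


# Hypothetical rows for illustration only.
rows = [
    {"Name": "AB123456", "Haplogroup": "T1b"},
    {"Name": "AB123457", "Haplogroup": "H1a, H1b"},
]
resolved, review = split_by_haplogroup_count(rows)
print(len(resolved), "resolved;", len(review), "for manual review")
```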
Building the GenBank Sample Database
Like the Geno 2 content type, I am creating GenBank as an Advanced Content Type. Most fields match those in the MitoTool file. I have again created both Build 16 and Build 17 fields as relationships to the mtDNA Stories. In addition, I created another relationship field, authorship. This one links the sample to the journal article in which it was published, as listed in the Journal Article Archive on the site.
GenBank Sample Templates
For the GenBank section, there are two sets of templates. The first set consists of the Get Genbank Data template and an update to the mtDNA Stories Template.
The Get Genbank Data template returns matching rows for the results table.
The second creates the table and calls the first template. It uses the WP JQuery DataTable plugin to add DataTable features.
The second set of templates creates specific pages for sample details. In the results table, the Hg ID links to a details page. This page includes the GenBank ID, a link to the external source of the information on the sample, the publication information for the sample, and extended information about the haplotype.
Because I made GenBank an Advanced Content Type, I created a PODs Page Template to build the details page. This is done with a little PHP; I know just enough to be dangerous.
The page template references the second single_genbank_template template.
Looking at Results
Now to see how it looks. Here is the GenBank section on the page for T1b. The columns shown on this page are Hg ID, Publication, Missing Variants, and Additional Variants.
This is a sample details page. On it are full details about the sample with links to publication and source information.
One more requirement down. To be continued…