It was the beginning of the 21st century. Family Tree DNA and Oxford Ancestors were just getting started. Use of the human male sex chromosome DNA, the Y-chromosome DNA, for population history research was taking off for academic researchers.
Life was good.
Only, there was a problem. Those budding academic researchers were using seven different trees to map the paternal tree of humanity. Yah, seven. New Y-SNPs weren't even being regularly checked between different research groups. That means that nobody was completely sure which Y-SNPs belonged on the same branches of any of those seven different trees.
Life was good, but it was wild.
To make sense of it all, representatives from the major research groups came together and collaborated on a single tree using a single nomenclature. They published their work, A Nomenclature System for the Tree of Human Y-Chromosomal Binary Haplogroups, in the February 2002 edition of Genome Research. This collaboration became the Y Chromosome Consortium (YCC).
These are some of the more important terms used in the paper. Don't worry about memorizing definitions; I will re-explain them as I go along.
Alu insertion polymorphism – Alu elements are a family of abundant transposable elements, named for the AluI restriction enzyme that cuts within them. These are small fragments of DNA that have been copied from their original location in our DNA into a new one. Yes, I know that sounds weird. On the Y-chromosome, an Alu insertion can be a binary polymorphism.
Binary polymorphism – A genetic change with two possible states. That is positive or negative — derived or ancestral.
Deletion – When one or more base pairs is subtracted from the genetic code. For example, AGA might be subtracted from GATAGATA with the result now reading GATTA.
Haplogroup – A branch on the Y-chromosome Tree defined by one or more binary polymorphisms.
Haplotype – A set of Y-STR values that defines a sublineage within a haplogroup.
Insertion – When one or more extra base pairs is added to the genetic code. For example, AAA might be added to GATAGATA with the result now reading GATAAAAGATA.
Y-SNP – A single nucleotide polymorphism on the Y-chromosome. This is a genetic change of exactly one base pair to another value, for example A changes to C. This is a type of binary polymorphism.
Y-STR – A short tandem repeat marker. They can be simple repeats like GATA or more complex. Generally, they vary by repeat count and not a binary state.
Y-chromosome – The human male sex chromosome.
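The insertion and deletion examples in the glossary can be reproduced with plain string slicing. This is only a sketch: the helper names and the zero-based positions are my own assumptions for illustration, not anything from the paper.

```python
def delete_bases(sequence, start, length):
    """Remove `length` bases beginning at zero-based position `start`."""
    return sequence[:start] + sequence[start + length:]


def insert_bases(sequence, position, bases):
    """Insert `bases` in front of zero-based position `position`."""
    return sequence[:position] + bases + sequence[position:]


reference = "GATAGATA"

# Deletion: take out the AGA that sits at positions 3 to 5.
deleted = delete_bases(reference, 3, 3)

# Insertion: add AAA between the two GATA repeats.
inserted = insert_bases(reference, 4, "AAA")

print(deleted)   # GATTA
print(inserted)  # GATAAAAGATA
```

Either change can be scored as a binary polymorphism: a sample either carries the altered sequence or it does not.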
The Y chromosome contains the largest nonrecombining block in the human genome.
The Y-chromosome is the human male sex chromosome. Most of it passes from father to son almost unchanged generation after generation.
By virtue of its many polymorphisms, it is now the most informative haplotyping system, with applications in evolutionary studies, forensics, medical genetics, and genealogical reconstruction.
Here, we can understand polymorphism as any change to the DNA. It includes SNPs and STRs, but it is not limited to them.
However, the emergence of several unrelated and nonsystematic nomenclatures for Y-chromosomal binary haplogroups is an increasing source of confusion.
Those are the seven different trees I mentioned earlier.
To resolve this issue, 245 markers were genotyped in a globally representative set of samples, 74 of which were males from the Y Chromosome Consortium cell line repository.
They tested more than 74 samples for 245 Y-chromosome binary polymorphisms. I will note that I think we have removed some of these markers from our 2017/2018 trees for reliability issues. Yes, 245 was a tiny number of markers to build a whole tree.
A single most parsimonious phylogeny was constructed for the 153 binary haplogroups observed. A simple set of rules was developed to unambiguously label the different clades nested within this tree. This hierarchical nomenclature system supersedes and unifies past nomenclatures and allows the inclusion of additional mutations and haplogroups yet to be discovered.
They built a single unified tree. With the 245 polymorphisms tested, the tree had 153 haplogroups. Here, we understand a haplogroup to be any branch on the tree. They also decided on a set of rules for future tree updates and changes.
In recent years, an explosion in data from the nonrecombining portion of the Y chromosome (NRY) in human populations has been witnessed. This explosion has been driven, in part, by the many recently discovered polymorphisms on the NRY. There has been a keen interest in using polymorphisms on the NRY to examine questions about paternal genetic relationships among human populations since the mid-1980s (Casanova et al. 1985). In more recent years, a use has been found for these polymorphisms in DNA forensics (Jobling et al. 1997), genealogical reconstruction (Jobling 2001), medical genetics (Jobling and Tyler-Smith 2000) and human evolutionary studies (Hammer and Zegura 1996). The low level of polymorphism on the NRY hindered research for many years. By the end of 1996, there were fewer than 60 known polymorphisms on the NRY. Most of these (∼80%) were long-range polymorphisms (detectable by pulsed-field electrophoresis), conventional restriction fragment length polymorphisms (RFLPs), or short tandem repeats (STRs).
Interest in using the non-recombining part of the Y-chromosome dates to the mid-1980s. However, until around 1996 there were not enough known binary polymorphisms to work with.
Until 1997, there were only 11 known binary polymorphisms that could be genotyped by PCR-based methods (Hammer 1994; Seielstad et al. 1994; Hammer and Horai 1995; Whitfield et al. 1995; Santos et al. 1995; Jobling et al. 1996; Underhill et al. 1996). These included single nucleotide polymorphisms (SNPs), an Alu insertion polymorphism, and a deletion.
A binary polymorphism is one that gives us a yes/no answer. That is the positive/negative == derived/ancestral type that we in genetic genealogy sometimes wrongly call SNPs. At the beginning of 1997, only 11 were known for the Y-chromosome.
Then, in 1997, Underhill et al. (1997) published 19 new PCR-based binary polymorphisms that were discovered by a novel and efficient mutation detection method known as denaturing high-performance liquid chromatography (DHPLC). This method has since been used to discover more than 200 SNPs and small insertions/deletions (indels) on the NRY (Shen et al. 2000, Underhill et al. 2000; Hammer et al. 2001).
In 1997, Dr. Peter Underhill of Stanford found nineteen new binary polymorphisms. In the following three years, more than 200 binary polymorphisms were found.
These polymorphisms are particularly useful because of their low rate of parallel and back mutation, which makes them suitable for identifying stable paternal lineages that can be traced back in time over thousands of years. As the number of known binary polymorphisms increased, so did the number of publications and the number of different systems used to name these binary haplogroups.
Binary polymorphisms are useful for building the paternal tree. Indeed.
Currently, there are at least seven different nomenclature systems in use, making it very difficult to compare results from one publication to the next. Our purpose here is twofold: (1) to construct a highly resolved tree of NRY binary haplogroups by genotyping most published PCR-based markers on a common set of samples, and (2) to describe a new nomenclature system that is flexible enough to allow the inevitable changes that will result from the discovery of new mutations and NRY lineages. We hope that the nomenclature presented here will be adopted by the community at large and will improve communication in this highly interdisciplinary field.
The seven different nomenclatures and trees were a problem.
Results and Discussion
NRY Haplogroup Tree and Haplogroup Nomenclature
We constructed a comprehensive haplogroup tree for the human NRY by genotyping most of the known polymorphisms on the NRY in a single set of samples (74 male Y Chromosome Consortium [YCC] cell lines). Some polymorphisms known to be variable in other DNAs showed no variation in the YCC panel; therefore, additional samples were included to improve the resolution of the phylogeny. This served to increase the number of polymorphic sites mapped onto the haplogroup tree to 237. Two mutational events occurred at each of eight sites. However, these recurrent mutations were found on different haplogroup backgrounds and thus were distinguishable events. The 245 mutational events gave rise to 153 NRY haplogroups. The single most parsimonious tree for these 153 NRY haplogroups is shown in Figure 1, with mutational events shown along the branches.
They tested the samples for the binary polymorphisms and placed them on the tree. Please take a moment to check out the 2002 YCC Tree.
The tree was drawn as asymmetrically as possible by sorting the descendants of each interior node so that the bottom-most descendant had the greatest number of immediate descendants.
At each branching point, the branch with the greatest number of immediate descendants was placed at the bottom, so branches with fewer descendants were listed first. This historic placement is why R1a comes before R1b.
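Here is a minimal sketch of the ordering rule quoted from the paper, where the child with the most immediate descendants ends up at the bottom. The dictionary layout and the subclade counts are made up for illustration; they are not the real 2002 topology.

```python
def order_children(tree):
    """Sort each node's children so the one with the most immediate
    descendants comes last (the bottom of a top-to-bottom drawing)."""
    return {
        node: sorted(children, key=lambda child: len(tree.get(child, [])))
        for node, children in tree.items()
    }


# Hypothetical toy tree: each key is a node, each value its immediate children.
toy_tree = {
    "R1": ["R1b", "R1a"],
    "R1a": ["R1a1"],
    "R1b": ["R1b1", "R1b2", "R1b3"],
}

ordered = order_children(toy_tree)
print(ordered["R1"])  # ['R1a', 'R1b']
```

Python's `sorted` is stable, so children with the same number of descendants keep their original relative order.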
The position of the root in Figure 1 (indicated by an arrow) was determined by outgroup comparisons. In other words, whenever possible, homologous regions on the NRY of closely related species (e.g., chimpanzees, gorillas, and orangutans) were sequenced to determine the ancestral states at human polymorphic sites (Underhill et al. 2000, Hammer et al. 2001). The root of the tree falls between a clade defined by M91 and a clade defined by a set of markers: SRY10831a, M42, M94, and M139. The NRY tree in Figure 1 can be seen as a series of nested monophyletic clades (i.e., a set of lineages related by a shared, derived state at a single or set of sites).
They set the root of the tree using the Y-chromosomes of our closest primate cousins.
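The idea behind an outgroup comparison can be sketched in a few lines. This assumes, as a simplification of my own, that the outgroup still carries the ancestral allele at the site; the allele values and sample names are hypothetical.

```python
def call_state(sample_allele, outgroup_allele):
    """Call a sample ancestral if it matches the outgroup allele, else derived."""
    return "ancestral" if sample_allele == outgroup_allele else "derived"


outgroup_allele = "G"  # hypothetical allele seen in the outgroup species
samples = {"sample_1": "G", "sample_2": "A"}  # hypothetical human samples

calls = {name: call_state(allele, outgroup_allele) for name, allele in samples.items()}
print(calls)  # {'sample_1': 'ancestral', 'sample_2': 'derived'}
```

The paper compared several related species where possible, which guards against a single outgroup having changed at the same site.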
In both science and genetic genealogy, the terms clade and haplogroup are often used interchangeably.
To devise a nomenclature system at a reasonable scale, we assigned a capital letter to several of the major clades, beginning with the letter A (for the haplogroup above the position of the root in Fig. 1) and continuing through the alphabet to the letter R.
They set the main haplogroups with capital letters.
The letter Y was assigned to the most inclusive haplogroup comprising haplogroups A–R.
Yes, that is Y and not Y-chromosome Adam at the root.
Deciding which clades are to receive the highest labeling level can only be, to some extent, arbitrary. Here, we label with single capital letters those clades that seem to us to represent the major divisions of human NRY diversity. Only 19 letters have been assigned to clades to allow for the possible expansion and further resolution of this phylogeny (the implications of which are discussed below).
Note that there aren't many of the IJK type joins that we now have. The discovery of binary polymorphisms (mostly SNPs) that defined those joins came much later.
We propose here two complementary nomenclatures. The first is hierarchical and uses selected aspects of set theory to enable clades at all levels to be named unambiguously. The capital letters (A–R) used to identify the major clades constitute the front symbols of all subsequent subclades (Fig. 1). Unlabeled clades can be named as the “join” of two subclades; for example, clade CR includes all chromosomes that share the derived state of the M168 and P9 polymorphisms.
This is the start of haplogroup longhand names.
Note that this is distinct from the set theoretic “union,” which, in the above example, would not define a monophyletic clade. Lineages that are not defined on the basis of a derived character represent interior nodes of the haplogroup tree and are potentially paraphyletic (i.e., they are comprised of basal lineages and monophyletic subclades). Thus, we suggest the term “paragroup” rather than haplogroup to describe these lineages. Paragroups are distinguished from haplogroups (i.e., monophyletic groupings) by using the * (star) symbol, which represents chromosomes belonging to a clade but not its subclades. For example, paragroup B* belongs to the B clade; however, it does not fall into haplogroup B1 or B2. As illustrated in Figure 2, internal nodes are highly sensitive to changes in tree topology. Thus, the * symbol cautions that a given paragroup name may refer to different sets of chromosomes in succeeding versions of the phylogeny.
Paragroups are used on both the ISOGG and the YFull trees.
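A paragroup is simply what is left of a clade once every named subclade has been taken out. Here is a small sketch using Python sets; the sample IDs are hypothetical and only the B, B1, and B2 labels come from the paper's example.

```python
def paragroup_members(clade_members, subclades):
    """Return the samples that belong to a clade but to none of its subclades."""
    assigned = set().union(*subclades.values())
    return clade_members - assigned


# Hypothetical sample IDs that all carry the mutation defining B.
b_members = {"s1", "s2", "s3", "s4"}
b_subclades = {"B1": {"s1"}, "B2": {"s2", "s3"}}

b_star = paragroup_members(b_members, b_subclades)
print(b_star)  # {'s4'}
```

If a new subclade B3 is discovered that contains s4, B* becomes empty in the next version of the tree. That is the instability the star symbol warns about.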
Subclades nested within each major haplogroup defined by a capital letter are named using an alternating alphanumeric system. For example, within haplogroup E, there are three basal haplogroups that are named E1, E2, and E3, and the underived paragroup becomes E*. Nested clades within each of these haplogroups are named in a similar way, except that lower-case letters are used instead of numerals. Again, paragroups are labeled with an * symbol, and the remaining haplogroups are labeled with an “a,” “b,” “c,” etc. This naming system continues to alternate between numerals and lower-case letters until the most terminal branches are labeled (tip haplogroups). Therefore, the name of each haplogroup contains the information needed to find its location on the tree.
This is the longhand that is still in use by the ISOGG tree.
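Because the longhand alternates numerals and lower-case letters, a name can be unpacked into its full path down the tree. The parser below is a sketch of my own, not a tool from the paper, and E3b1 is used only as an illustrative name.

```python
import re

# One capital letter, then numerals and lower-case letters in strict alternation.
LINEAGE_PATTERN = re.compile(r"[A-Z](?:\d+(?:[a-z]\d+)*[a-z]?)?")


def lineage_path(name):
    """Return every clade encoded in a longhand name, from the major letter down."""
    core = name.rstrip("*")  # a trailing star marks a paragroup, not a deeper level
    if not LINEAGE_PATTERN.fullmatch(core):
        raise ValueError(f"not a valid lineage-based name: {name!r}")
    path = [core[0]]
    for token in re.findall(r"\d+|[a-z]", core[1:]):
        path.append(path[-1] + token)
    return path


print(lineage_path("E3b1"))  # ['E', 'E3', 'E3b', 'E3b1']
print(lineage_path("B*"))    # ['B']
```

This is what the authors mean when they say the name contains the information needed to find its location on the tree. It is also why a change high in the tree forces every name below it to change.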
Alternatively, haplogroups can be named by the “mutations” that define lineages rather than by the “lineages” themselves. Thus, we propose a second nomenclature that retains the major haplogroup information (i.e., 19 capital letters) followed by the name of the terminal mutation that defines a given haplogroup. We distinguish haplogroup names identified “by mutation” from those identified “by lineage” by including a dash between the capital letter and the mutation name. For example, haplogroup H1a would be called H-M36 (Fig. 2). When multiple phylogenetically equivalent markers define a haplogroup, the one typed is used. For example, if M39 but not M138 were typed within haplogroup H1, then H1c becomes H-M39. If multiple equivalent markers were typed, this notation system omits some marker information, and a statement of which additional markers were typed should be included in the Methods section. Note that the mutation-based nomenclature has the important property of being more robust to changes in topology (Fig. 2).
This is the haplogroup shorthand that is in use at Family Tree DNA and at YFull. The authors were very right that it is more robust than the longhand naming system.
Note though that both systems were rolled out at the same time in the same 2002 paper. Both names are for haplogroups.
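The conversion from longhand to shorthand can be sketched as a lookup. The marker table below holds only the two H examples quoted from the paper; the function name and the data layout are my own assumptions.

```python
# Longhand name -> phylogenetically equivalent markers that define it.
# Only the two examples quoted from the paper are included.
DEFINING_MARKERS = {
    "H1a": ["M36"],
    "H1c": ["M39", "M138"],
}


def mutation_based_name(lineage, typed_markers):
    """Build the shorthand name from the major letter and a marker that was typed."""
    major_letter = lineage[0]
    for marker in DEFINING_MARKERS[lineage]:
        if marker in typed_markers:
            return f"{major_letter}-{marker}"
    raise ValueError(f"no defining marker for {lineage} was typed")


print(mutation_based_name("H1a", {"M36"}))  # H-M36
print(mutation_based_name("H1c", {"M39"}))  # H-M39
```

When both M39 and M138 were typed, this sketch reports the first one listed. The paper asks authors to state the additional markers in their Methods section in that case.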
While it is straightforward to name monophyletic clades, it is more challenging to devise a simple and flexible system to name underived interior nodes. This is especially important to facilitate the naming of haplogroups in studies where not all markers are typed, and to provide a standard set of names for previously described haplogroups (and paragroups). For instances where not all markers within a clade are typed, we introduce a bracketing system that encloses an “x” (for “excluding”) and the lineages that have been shown to be absent. This system can be applied equally well to the lineage-based and mutation-based nomenclatures. The following examples portray the lineage-based nomenclature first, followed by the mutation-based nomenclature. Lineages (or markers) excluded from a haplogroup are listed within parentheses after the name of the haplogroup (or the last derived marker in the case of the mutation-based nomenclature). For example, if M82-derived chromosomes are typed with all downstream markers, then the underived chromosomes belong to H1* or H-M82* (Fig. 3A). However, if M82-derived chromosomes are typed only with M36, then the underived chromosomes belong to H1*(xH1a) or H-M82*(xM36) (Fig. 3B). If we apply this bracketing method to the naming of Underhill et al.'s (2000) paraphyletic haplogroup VI, then its label becomes F*(xK) or F-M89*(xM9) (Table 1). In the more extreme case of a study genotyping only the YAP and M3 markers, chromosomes ancestral for both markers would be named Y*(xDE,Q3) or Y*(xYAP,M3), where Y refers to the most inclusive haplogroup encompassing the total cladogram. See Table 1 for application of this bracketing system to lineage-based names of previously published haplogroups.
When using the mutation-based nomenclature, the adoption of this bracketing system is optional, as long as full lineage-based names of haplogroups have been given elsewhere in the manuscript (e.g., in the form of a table or a tree). The lineage- and mutation-based nomenclatures each has advantages and disadvantages, and each can be used where most appropriate.
If you are doing this today, I suggest avoiding the longhand-based exclusions.
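The bracketing system is easy to mimic with string formatting. The helper below is a sketch of my own that reproduces the examples quoted from the paper; it does not check the names against any tree.

```python
def bracketed_name(base, excluded=()):
    """Return the starred name, adding an (x...) suffix for anything shown to be absent."""
    if not excluded:
        return f"{base}*"
    return f"{base}*(x{','.join(excluded)})"


print(bracketed_name("H1"))                 # H1*
print(bracketed_name("H1", ["H1a"]))        # H1*(xH1a)
print(bracketed_name("H-M82", ["M36"]))     # H-M82*(xM36)
print(bracketed_name("Y", ["YAP", "M3"]))   # Y*(xYAP,M3)
```

The marker-based exclusions such as (xM36) stay readable even after a tree update renames H1a, which is the reason for preferring them over the longhand ones.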
Cross-Referencing to Previous Nomenclatures
A number of investigators have developed nomenclature systems based on overlapping subsets of the markers typed here. To facilitate comparisons among seven previously published nomenclatures and our present proposed nomenclature, Figure 1 and Table 1 illustrate direct comparisons among these different systems. These nomenclature systems are extremely inconsistent (i.e., nonisomorphic) in how they define haplogroups. Moreover, when there is consistency between two systems (e.g., between Underhill et al.'s haplogroup V and Hammer et al.'s haplogroup 1F), different names are used for the same haplogroups. All of the major human NRY nomenclature schemes used thus far have included paraphyletic groupings (see Fig. 1), and these paragroups can be misinterpreted as being necessarily ancestral to “downstream” haplogroups containing derived characters. Three major benefits of the proposed system are (1) its ability to distinguish between underived interior nodes (paragroups) and monophyletic clades (haplogroups), (2) its flexibility in naming haplogroups at different levels of the phylogenetic hierarchy, and (3) its ability to accommodate new haplogroups as new mutations are discovered (see below). If broadly accepted and utilized, this system also will serve to standardize the names of NRY haplogroups in the literature.
Their naming system was designed to allow for change over time. This has proven to be a very good thing.
Caveats and Changes in Nomenclature
In addition to the long-term challenges posed by any attempt to form a stable nomenclature system, there are several caveats that should be raised relating to the way the current tree topology was inferred. First, it is important to point out that not all polymorphisms were genotyped in all individuals. Indeed, continued genotyping of these polymorphisms may result in slight changes in the topology of the tree in Figure 1.
The new tree was just a beginning.
It is also possible that some mutational events that were assumed to be unique actually are recurrent on the tree (i.e., there are undetected multiple hits at some additional sites).
Some binary polymorphisms have been shown to happen in so many places that we do not use them.
More importantly, because it is extremely difficult to devise a nomenclature system that is both informative in a phylogenetic sense and impervious to the need for renaming groups as new polymorphisms are discovered, a set of guidelines is needed to minimize the impact of future structural changes in the tree.
There should be rules for how the tree is changed.
To facilitate the evolution of the present nomenclature, we make a number of proposals. Firstly, a nomenclature committee comprising some of the current participants in the YCC will receive requests from investigators who wish new binary markers or haplogroups to be incorporated into the nomenclature, and will decide on the changes to be made to the existing system. At any one time, the current nomenclature and the committee's contact details will be made available on the following URL: http://ycc.biosci.arizona.edu.
The YCC wanted to act as a governing body for maintaining a stable Y-chromosome tree.
Consequently, we recommend that if investigators wish to use new markers prior to their incorporation into the nomenclature, they distinguish between consensus and novel parts of the clade labels by use of a forward slash. For example, a new mutation (μ) that divides clade D1 in two creates D1/-μ and D1/-M15*. This makes it clear to the reader which parts of the label are specific to that study and which can be cross-referenced to other publications. This will minimize confusion should two contemporaneous papers introduce novel markers within the same clade. In this manner, information from VNTR and STR haplotypes also can be incorporated; a standard nomenclature for Y-STRs already is available (Gill et al. 2001).
They asked for researchers to mark new tree changes in their papers as such. I don't think I have read a paper where this suggestion was followed.
Because new versions of the YCC nomenclature will be published annually to reflect changes in the tree topology resulting from newly discovered mutations, we suggest that each paper cite the particular version of the YCC NRY tree that was used (e.g., YCC NRY Tree 2002).
Annual updates to the YCC NRY tree were planned. This did not happen.
The cladistic nomenclature of human mtDNA diversity adopted by many groups some years ago has greatly advanced studies of maternal lineages and the communication of their conclusions (Richards et al. 1998).
The maternal mitochondrial DNA (mtDNA) tree was much more usable than any of the Y-DNA trees in use.
By contrast, recent dramatic advances in the resolution of paternal lineages have resulted in multiple nomenclature systems that have hampered communication among NRY researchers and the scientific community at large.
They thought fewer than 300 binary polymorphisms on the whole tree was dramatic.
Here, we introduce a strictly phylogenetic (cladistic) nomenclature for human NRY variation based on the phylogeny of 153 paternal lineages. This system is flexible in its ability to assign haplogroup names at different levels of the phylogenetic hierarchy. The phylogeny of the human NRY lies at the heart of a multidisciplinary enterprise in which unambiguous communication is vital. The nomenclature proposed here along with guidelines for revisions, represent an important resource to those interested in medical, forensic, and evolutionary genetics alike.
This is just up-selling their effort.
YCC Cell Lines
The YCC is a collaborative group involved in an effort to detect and study genetic variation on the human NRY. The YCC was initiated in 1991 by Michael Hammer and Nathan Ellis with the following goals: (1) to establish a repository of lymphoblastoid cell lines (YCC cell line repository) derived from a sample of males representing worldwide populations, (2) to provide DNA isolated from these cell lines to investigators searching for polymorphisms on the NRY, and (3) to establish a common database containing the results of typing DNAs from the Repository cell lines at as many Y-specific polymorphic sites as possible (YCC Newsletter: http://www.ycc.biosci.arizona.edu/ycc1.html). Lymphoblastoid cell lines were established at the New York Blood Center from blood donated by volunteers who gave informed consent. Additional cell lines were donated by Luca Cavalli-Sforza, Trefor Jenkins, Judy Kidd, and Ken Kidd; or were purchased from the Coriell Institute. See Table 2 for a list of the YCC cell lines, as well as associated geographic, ethnic, and linguistic information.
Unlike the DNA samples we submit for testing, cell lines are grown from blood cells and provide an ongoing source of DNA. As a community, we should note that these cell lines should still exist.
Other DNA Samples
In constructing the tree, a great deal of phylogenetic information was retained from previous studies. When markers from different laboratories mapped on the same branch of the tree, an attempt was made to determine the order of mutational events. Toward this end, a variety of samples was provided by each of the participating laboratories, all of which were obtained with informed consent. These samples represented known haplogroups that were not present in the YCC cell line DNAs and thus served to map many additional markers on the haplogroup tree.
Some of the samples had to come from individual labs. These likely would not have been cell lines.
Genotyping SNPs and Indels
The protocols for genotyping many of the 237 polymorphic sites analyzed have been published (see Underhill et al. 2000, 2001; Hammer et al. 2001, and references therein); some of these assays were converted from conventional RFLPs and DNA sequence data (e.g., Jobling 1994; Hammer et al. 1997; Pandya et al. 1998; Bergen et al. 1999; Shinka et al. 1999; Bao et al. 2000). The remainder will be published in future manuscripts. Recurrent mutations, observed at SRY10831, 12f2, MSY2, M116, M64, M108, P37, and P41 are counted as distinct polymorphisms. Supplementary Table 1 (available as an online supplement at http://www.genome.org) lists all published markers included in this survey and primer information.
Most of the testing was done with RFLP. This 2002 effort was not by Sanger sequencing.
The terms “haplogroup” and “haplotype” have various, overlapping definitions in the literature. Here, we use the terminology of de Knijff (2000) in which “haplogroup” refers to NRY lineages defined by binary polymorphisms. The term “haplotype” is reserved for all sublineages of haplogroups that are defined by variation at STRs on the NRY (Y-STRs). Mutations labeled with the prefix “M” (standing for “mutation”) were published by Underhill et al. (2000, 2001). Many of the mutations with the prefix “P” (standing for “polymorphism”) were described by Hammer et al. (1998, 2001). The eight recurrent mutational events are indicated by their mutation name followed by a or b.
These are the original and official definitions for haplogroup and haplotype for Y-chromosome population genetics and genetic genealogy.
Nathan Ellis (Memorial Sloan-Kettering Cancer Center), Michael Hammer (University of Arizona).
Michael Hammer (University of Arizona), Matthew E. Hurles (McDonald Institute for Archaeological Research), Mark A. Jobling (University of Leicester), Tatiana Karafet (University of Arizona), Turi E. King (University of Leicester), Peter de Knijff (Leiden University), Arpita Pandya (University of Oxford), Alan Redd (University of Arizona), Fabrício R. Santos (University of Oxford and Universidade Federal de Minas Gerais), Chris Tyler-Smith (University of Oxford), Peter Underhill (Stanford University), and Elizabeth Wood (University of Arizona). Mark Thomas (University College London) provided information on the order of the M17/SRY10831b mutations.
Luca Cavalli-Sforza (Stanford University), Nathan Ellis (Memorial Sloan-Kettering Cancer Center), Michael Hammer (University of Arizona), Trefor Jenkins (University of Witwatersrand), Judy Kidd (Yale University), Ken Kidd (Yale University).
Peter Forster (McDonald Institute for Archaeological Research), Michael Hammer (University of Arizona), Matthew E. Hurles (McDonald Institute for Archaeological Research), Mark A. Jobling (University of Leicester), Peter de Knijff (Leiden University), Chris Tyler-Smith (University of Oxford), Peter Underhill (Stanford University).
Stephen Zegura (University of Arizona), Matthew Kaplan (University of Arizona).
This work was supported by grants from the National Science Foundation (OPP-9806759) and the National Institute of General Medical Sciences (GM53566) to MH; from the NIH (GM28428 and GM55273) to PAU and LCS, from the BBSRC to AP; from the Leverhulme Trust to FRS; and from the CRC to CTS. MAJ is a Wellcome Trust Senior Fellow in Basic Biomedical Science (grant number 057559). The Y Chromosome Consortium thank Colin Renfrew and the McDonald Institute for Archaeological Research for running a workshop attended by the members of the nomenclature committee at which many issues were resolved in a collaborative spirit.
For the work that went into this first standard and tree, we thank every collaborator.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.
There you have it. The YCC in 2002 created a single unified tree for all binary polymorphisms from Y-chromosome DNA. They set up a dual naming system. Those naming systems later became known as longhand and shorthand. In a later post, I will go over the different ways the tree can be changed as new samples are tested and new binary polymorphisms are discovered.