PCA Based Ethnic Origins – Part 3, How it is used

As I described in Part 2, Principal Component Analysis is a method that can show how alike or different populations of people are. Here, I will show how this method is reversed to attempt to find someone’s ethnic origins.

First, let’s be very clear about what we are doing. There is an old logic example that goes like this.

  • If it is a cat, then it is an animal. True.
  • If it is an animal, then it is a cat. False.

If you reason from the second statement, you will sometimes be right. However, you will often be wrong. Finding ethnic origins based on PCA is similar.

  • If someone is from Italy (barring recent immigration), then their genetic markers represent the population of Italy. True.
  • If someone has genetic markers matching the population of Italy, they are from Italy. False.

I will come back to this more in future posts. It is important though to understand that this methodology is attempting to do something that is fundamentally flawed.

Now, back to explaining the PCA to ethnic origins percentages process. This time, I am going to zoom in on the part of the chart that shows people from Sardinia.

Figure: PCA – Sardinia Results

As a reminder, these are the color codes:

  • Green – Sardinia
  • Blue – North Italy
  • Yellow – Central Italy
  • Purple – South Italy including Sicily

In the most basic form, PCA to ethnic origins programs look at how someone fits with the known groups and try to match them to one or more of those groups. Most of the Sardinia samples group together. That will make it easy for an ethnic origin program to classify them as from Sardinia.

If someone with an unknown origin were compared, and they fit clearly into the Sardinia group, the program would predict them as Sardinian.
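The matching step can be pictured with a small sketch. This is not how any particular company's program works. It assumes only two principal components and a simple "nearest group centre" rule, and every coordinate below is made up for illustration.

```python
from math import dist

# Hypothetical PC1/PC2 coordinates for reference samples.
# These numbers are invented; they are not read off the chart.
REFERENCE = {
    "Sardinia": [(-4.0, 1.0), (-4.2, 0.8), (-3.8, 1.2)],
    "North Italy": [(1.0, 2.0), (1.2, 2.2), (0.8, 1.8)],
    "South Italy": [(2.0, -1.0), (2.2, -1.2), (1.8, -0.8)],
}


def centroid(points):
    """Mean position of a group of (pc1, pc2) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def classify(sample, reference=REFERENCE):
    """Return the label of the reference group whose centre is nearest."""
    centres = {label: centroid(pts) for label, pts in reference.items()}
    return min(centres, key=lambda label: dist(sample, centres[label]))


# An unknown person who lands inside the Sardinia cluster.
print(classify((-3.9, 0.9)))  # Sardinia
```

Note that the function only reports which reference group the sample sits closest to. It says nothing about where the person is actually from.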

Even within the comparison set, though, not every Sardinian sample fits clearly into the Sardinian group. There are two other types.

First are the ones who are halfway between the Sardinia group and the other groups. These might be classed by the program as only 50% Sardinian.
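One way a percentage like that could fall out of a PCA chart is by measuring how far along the line between two group centres a sample sits. The sketch below assumes that simple two-group, two-component picture. The function name and the coordinates are hypothetical, and real programs likely use more components and more elaborate statistics.

```python
def mixing_fraction(sample, centre_a, centre_b):
    """Fraction of the way from centre_b to centre_a, clamped to [0, 1].

    1.0 means the sample sits at group A's centre, 0.0 at group B's.
    """
    ax, ay = centre_a
    bx, by = centre_b
    sx, sy = sample
    # Vector from B to A, and its squared length.
    dx, dy = ax - bx, ay - by
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        raise ValueError("group centres coincide")
    # Project the B-to-sample vector onto the B-to-A vector.
    t = ((sx - bx) * dx + (sy - by) * dy) / length_sq
    return max(0.0, min(1.0, t))


# Made-up group centres and a sample sitting midway between them.
sardinia = (-4.0, 1.0)
north_italy = (1.0, 2.0)
halfway = (-1.5, 1.5)
print(mixing_fraction(halfway, sardinia, north_italy))  # 0.5
```

The 0.5 here describes a position on a chart. It does not tell us that the person has one Sardinian parent and one Northern Italian parent.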

Second, there is at least one sample from Sardinia that clusters with the Northern Italy group. This person's ethnic origins calculation might show that they are Northern Italian, a result they would likely see as wrong.

This is true going the other way too. There are people from Southern Italy who cluster fully with the Sardinians. This means the program would likely report that they are 100% Sardinian.
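To make the two-way mismatch concrete, here is a small sketch that compares where each reference person is known to be from with the group their sample sits closest to. The group centres, the samples, and the nearest-centre rule are all assumptions made up for this illustration.

```python
from math import dist

# Hypothetical group centres on PC1/PC2 (invented numbers).
CENTRES = {
    "Sardinia": (-4.0, 1.0),
    "North Italy": (1.0, 2.0),
    "South Italy": (2.0, -1.0),
}

# (known origin, made-up PC1/PC2 position) for a few reference people.
SAMPLES = [
    ("Sardinia", (-4.1, 1.1)),
    ("Sardinia", (0.9, 2.1)),      # Sardinian who clusters with North Italy
    ("North Italy", (1.2, 1.9)),
    ("South Italy", (2.1, -0.9)),
    ("South Italy", (-3.9, 0.9)),  # Southern Italian who clusters with Sardinia
]


def nearest_group(position):
    """Label of the group centre closest to this position."""
    return min(CENTRES, key=lambda label: dist(position, CENTRES[label]))


# Collect every case where the known origin and the predicted group disagree.
mismatches = [
    (origin, nearest_group(position))
    for origin, position in SAMPLES
    if nearest_group(position) != origin
]

for origin, predicted in mismatches:
    print(f"From {origin}, but reported as {predicted}")
```

Both people flagged here are in the reference set itself, which is the point: the mismatch is not limited to customers with unknown origins.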

These are the problems one will find when using PCA to find the ethnic origins of people from a single population.

Can they be fixed? Partly yes. There are easy fixes and hard fixes.

The hard part is fixing the fundamental problem with the current analysis.

  • Q: What is that?
  • A: The SNPs (markers) on the current chips are too old.

One way or another, to get genealogically relevant and consistent ethnic origins percentages from PCA, the testing companies need to use markers young enough to be genealogically relevant. This method will never work as is. The companies can add another thousand population sets from around the world, and the problems above will still be problems, even for people in the origin population sets themselves.

Microarray-chip-based tests, like those sold by Ancestry, 23andMe, and Living DNA, sample our genomes at known variants. Most of those variants are found in at least 5% of the human population around the world. Some are selected to be found in between 1% and 5% of the human population. That means they have been around for a long time. The variants on a microarray chip date to the first farmers who spread agriculture 10,000 years ago, to the Out-of-Africa travellers 50,000 to 70,000 years ago, and even back to early stone age ancestors who predated modern humans.

The easy part is this. There will always be some people who match up better with a neighboring population than with their own. Adding more and more comparison populations may make this worse, not better. The fix is to accurately describe each population and show maps that reflect genetic reality. As long as companies continue to push artists’ concept maps to people rather than genetic reality, customers will be confused and dissatisfied.

Next time, I will write more about this easy fix.


