Creating a Y-STR Clustered Heatmap

by Wesley Johnston
begun 10 Apr 2021, last updated 23 May 2021 - Update note on STR signatures

I developed the Y-STR Clustered Heatmap in 2019. I first documented it in an article for the "Journal of One-Name Studies" which is expected to be published in Jul-Sep 2021. This web page deals with what I have learned and modified since that article. -- Wesley Johnston

What I Have Learned

Family Tree DNA does NOT use a standard step-wise genetic distance. Particularly for multi-valued markers, FTDNA uses their own method. Roberta Estes explained the differences on her DNAeXplained-Genetic Genealogy blog. Thus far, I have found no tool that matches the FTDNA genetic distance in every case. Most tools come within 1 or 2 of the FTDNA GD. So the relationships shown in the heatmap shift slightly, but not enough to greatly alter its benefits.
The process needs more automation, both to make it easier to generate and to reduce human error. My original method was done completely in Excel. It required tedious repetitive clerical steps, with small differences at each step, followed by yet another series of repetitive clerical steps. All of this greatly increased the opportunity for making an error. So, while I will still use Excel as the home for the process, I will use other tools if they can automate key parts of the process.
Combining STR and SNP information makes full automation likely not possible. Heatmap apps and clustering apps rely solely on the contents of the cells in the matrix. They can make an initial clustered heatmap, based solely on the STR genetic distances. But they cannot include additional considerations of the SNP clusters, unless I can figure out some way to make that information automatable in a way that balances with the STR clustering.
In some projects, we have been able to identify STR signatures: large numbers of STRs for which the SNP cluster members all have exactly the same value for all those STRs. I want to include consideration of this STR signature in the clustering of the heatmap and make visible which other kits not yet Big Y tested share this signature as I do the clustering.
The more I explore this, the more I realize that STR signatures of known SNPs in the Y-Haplotree are crucial to accuracy, and I have yet to figure out the best way to combine these. In the Johnstons of Annandale project, we use Bill Howard's Revised Correlation Coefficient (RCC) method of generating phylogenetic trees. Unlike the original version of my heatmap, RCC does not rely on genetic distance calculation. It is a holistic measure of the entire set of STRs. I think Bill Howard's RCC implicitly includes some aspects of STR signatures. We have found most (but not all) clusters to conform well with Y-Haplotree positioning. The estimated dates for recent branch splits (mid-1700s to present) seem less reliable than dates in the 1500 to mid-1700s range. I think this is because the groupings in that earlier range rely less on recent mutations and have a more solid STR signature. But I have not analyzed it to see if that conjecture is valid. (One other note: RCC seems reliable only for kits within a common SNP level in the Y-Haplotree. The kits in the core group of the Johnstons of Annandale project are all I-Y8830.)

Toward a New More-Automated Process

I have yet to fully define the new process. But here are the rough steps that I have chosen to use at this point in the development.

Gather the Y-STR results into a spreadsheet, one kit per row.
Convert all multi-valued STR markers into separate single-valued markers.
Automated creation of the genetic distance matrix
This was THE most tedious and error-prone part of my original process. It is now done very quickly using Colin Ferguson's modified version of Dean McGee's Y-Utility. Note that this web page can be saved to your computer and run from there rather than online. Do keep in mind that although "The target of the Hybrid mutation model is to match the method used by FTDNA", the match is not exact: it is very close, but for some pairs of kits it is off by 1. So most of the values in the GD matrix will have the same genetic distance as the FTDNA GD -- but some will be off by 1.
1. In the Excel worksheet of raw results, select and copy the kit labels and STR values for each kit. Do not include the STR marker labels.
2. Paste the copied text into the blank window at Colin Ferguson's modified version of Dean McGee's Y-Utility, or into your downloaded copy of the web page.
3. On the web page, UNcheck "FTDNA" and "TMRCA". Then click "Execute".
4. The Genetic Distance matrix will pop up in a new window.
5. Use CTRL + A to select the entire GD Matrix and then CTRL + C to copy all of it.
6. Paste the copied information into an empty Excel worksheet. You will then have to format it by removing everything but the GD Matrix and its labels. Copy the row labels and paste them special with the transpose option as the column labels, and then align them vertically. You can then delete all the other column label rows.
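For readers who want to see what a genetic distance matrix contains, here is a minimal Python sketch of a plain step-wise GD calculation. It is NOT the FTDNA method and NOT the Hybrid mutation model of the Y-Utility, so its values can differ from both. It assumes multi-valued markers are written with hyphens (such as "11-14") and are split into separate single-valued markers, as in the step above. The kit labels, marker values, and function names are all hypothetical.

```python
def split_multi_valued(value):
    """Split a marker value such as '11-14' into [11, 14]; 13 becomes [13]."""
    return [int(part) for part in str(value).split("-")]


def flatten_kit(values):
    """Turn one kit's marker values into a flat list of single values."""
    flat = []
    for value in values:
        flat.extend(split_multi_valued(value))
    return flat


def stepwise_gd(a, b):
    """Plain step-wise distance: sum of absolute differences per marker."""
    if len(a) != len(b):
        raise ValueError("kits must have the same number of marker values")
    return sum(abs(x - y) for x, y in zip(a, b))


def gd_matrix(kits):
    """Return (labels, matrix) where matrix[r][c] is the GD between two kits."""
    labels = list(kits)
    flat = {label: flatten_kit(kits[label]) for label in labels}
    matrix = [[stepwise_gd(flat[r], flat[c]) for c in labels] for r in labels]
    return labels, matrix


# Hypothetical kits with four markers each; the third marker is multi-valued.
kits = {
    "Kit A": [13, 24, "11-14", 12],
    "Kit B": [13, 25, "11-14", 12],
    "Kit C": [14, 24, "11-15", 13],
}
labels, matrix = gd_matrix(kits)
for label, row in zip(labels, matrix):
    print(label, row)
```

The result is symmetric with zeros on the diagonal, which is the same shape as the GD Matrix pasted into Excel in the steps above.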
First Clustering of the GD Matrix by GD from modal kit
The first clustering of the GD Matrix makes use of the "modal" values. The app calculates the modal value for each STR marker and then calculates the genetic distance of each kit from the generated "modal" kit. You won't see the generated modal kit -- just the genetic distance that each kit is from the modal kit.
1. Set the cell at the intersection of the modal row and the modal column to zero. Sort the GD matrix on the modal column as the first sort and the kit label as the second sort.
2. In an empty new worksheet, copy the sorted GD matrix from the prior step, and paste it special with the transpose option.
3. Sort the GD matrix on the modal column as the first sort and the kit label as the second sort.
4. In an empty new worksheet, copy the sorted GD matrix from the prior step, and paste it special with the transpose option.
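The net effect of the sort, transpose, sort, transpose sequence above is that the rows and the columns end up in the same order: by GD from the modal kit, then by kit label. A minimal Python sketch of that reordering follows. It assumes the matrix is held as a list of lists with a matching list of labels, and that the modal entry is labeled "Modal"; the function name, labels, and values are hypothetical.

```python
def sort_by_modal(labels, matrix, modal_label="Modal"):
    """Reorder rows AND columns by (GD from modal kit, kit label)."""
    m = labels.index(modal_label)
    order = sorted(range(len(labels)), key=lambda i: (matrix[i][m], labels[i]))
    new_labels = [labels[i] for i in order]
    # Apply the same ordering to rows and to columns so the matrix stays aligned.
    new_matrix = [[matrix[r][c] for c in order] for r in order]
    return new_labels, new_matrix


# Hypothetical GD matrix, with the modal cell already set to zero.
labels = ["Kit C", "Kit A", "Modal", "Kit B"]
matrix = [
    [0, 4, 3, 2],  # Kit C
    [4, 0, 1, 2],  # Kit A
    [3, 1, 0, 1],  # Modal
    [2, 2, 1, 0],  # Kit B
]
sorted_labels, sorted_matrix = sort_by_modal(labels, matrix)
for label, row in zip(sorted_labels, sorted_matrix):
    print(label, row)
```

Kit A and Kit B are both at GD 1 from the modal kit, so the kit label breaks the tie, just as the second sort key does in Excel.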
Final Clustering of the GD Matrix
The final clustering is where the SNP clusters and STR signatures have to be maintained. The only way to do this that I have found is to do it manually. The KEY thing to remember is that in order to move a person in the matrix, you have to move BOTH his row and his column to assure that his genetic distances from all the other kits stay in alignment with those other kits.
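The rule about moving BOTH the row and the column can be illustrated with a short Python sketch. It does not decide where a kit belongs (that judgment about SNP clusters and STR signatures remains manual); it only shows one way to move a kit so that the matrix stays aligned, plus a simple check. The function names, labels, and values are hypothetical.

```python
def move_kit(labels, matrix, kit, new_position):
    """Move one kit to new_position, moving its row and its column together."""
    order = list(range(len(labels)))
    order.insert(new_position, order.pop(labels.index(kit)))
    new_labels = [labels[i] for i in order]
    new_matrix = [[matrix[r][c] for c in order] for r in order]
    return new_labels, new_matrix


def is_aligned(matrix):
    """A correctly aligned GD matrix is symmetric with a zero diagonal."""
    n = len(matrix)
    return all(matrix[i][i] == 0 for i in range(n)) and all(
        matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n)
    )


# Hypothetical GD matrix after the first clustering.
labels = ["Modal", "Kit A", "Kit B", "Kit C"]
matrix = [
    [0, 1, 1, 3],
    [1, 0, 2, 4],
    [1, 2, 0, 2],
    [3, 4, 2, 0],
]
# Move Kit C up to sit directly after the modal kit.
moved_labels, moved_matrix = move_kit(labels, matrix, "Kit C", 1)
print(moved_labels)
print(is_aligned(moved_matrix))
```

If only the row were moved, the diagonal would no longer be all zeros and the check would fail, which is a quick way to catch a misaligned move.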

Contact Information

Send E-mail to wwjohnston01@yahoo.com