- Family Tree DNA does NOT use a standard step-wise genetic distance. Particularly for multi-valued markers, FTDNA uses its own method. Roberta Estes explained the differences on her DNAeXplained-Genetic Genealogy blog. Thus far, I have found no tool that matches the FTDNA genetic distance (GD) in every case. Most tools produce a value within 1 or 2 of the FTDNA GD. So, the relationships that the heatmap highlights shift slightly, but not enough to greatly alter the benefits.
- The process needs more automation, both to make generation easier and to reduce human error. My original method was done completely in Excel. It required tedious, repetitive clerical steps, with small differences at each step, followed by yet another series of repetitive clerical steps. All of this greatly increased the opportunity for error. So, while I will still use Excel as the home for the process, I will use other tools if they can automate key parts of it.
- Combining STR and SNP information likely makes full automation impossible. Heatmap apps and clustering apps rely solely on the contents of the cells in the matrix. They can make an initial clustered heatmap based solely on the STR genetic distances, but they cannot take the SNP clusters into account unless I can figure out some way to make that information automatable in a way that balances with the STR clustering.
- In some projects, we have been able to identify STR signatures: large numbers of STRs for which the SNP cluster members all have exactly the same value. I want the heatmap clustering to take this STR signature into account and, as I do the clustering, to make visible which other kits that are not yet Big Y tested share the signature.
- The more I explore this, the more I realize that STR signatures of known SNPs in the Y-Haplotree are crucial to accuracy, and I have yet to figure out the best way to combine these. In the Johnstons of Annandale project, we use Bill Howard's Revised Correlation Coefficient (RCC) method of generating phylogenetic trees. Unlike the original version of my heatmap, RCC does not rely on genetic distance calculation. It is a holistic measure of the entire set of STRs. I think Bill Howard's RCC implicitly includes some aspects of STR signatures. We have found most (but not all) clusters to conform well with Y-Haplotree positioning. The estimated dates for recent branch splits (mid-1700s to present) seem less reliable than dates in the 1500 to mid-1700s range. I think this is because the groupings in that range rely less on recent mutations and have a more solid STR signature. But I have not analyzed it to see if that conjecture is valid. (One other note: RCC seems reliable only for kits within a common SNP level in the Y-Haplotree. The core group of the Johnstons of Annandale kits are all I-Y8830.)
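
To make two of the ideas in the list above concrete, here is a minimal Python sketch, not the method actually used in the projects. It assumes single-valued markers only and a plain step-wise distance (the sum of absolute repeat-count differences), so it will not reproduce the FTDNA GD, especially for multi-valued markers. The kit names, the repeat values, and the helper names (`stepwise_distance`, `find_signature`, `shares_signature`) are all hypothetical and exist only for illustration.

```python
from typing import Dict, List

# Hypothetical representation of a kit: marker name -> repeat count.
# Assumption: single-valued markers only; multi-valued markers are not handled.
Haplotype = Dict[str, int]


def stepwise_distance(a: Haplotype, b: Haplotype) -> int:
    """Simple step-wise distance: sum of absolute repeat differences
    over the markers that both kits have tested."""
    shared = a.keys() & b.keys()
    return sum(abs(a[m] - b[m]) for m in shared)


def distance_matrix(kits: Dict[str, Haplotype]) -> Dict[str, Dict[str, int]]:
    """Square matrix of pairwise distances, the raw input for a heatmap."""
    return {i: {j: stepwise_distance(kits[i], kits[j]) for j in kits} for i in kits}


def find_signature(cluster: List[Haplotype]) -> Haplotype:
    """Markers on which every kit in a SNP cluster has exactly the same value."""
    first = cluster[0]
    return {m: v for m, v in first.items()
            if all(k.get(m) == v for k in cluster[1:])}


def shares_signature(kit: Haplotype, signature: Haplotype) -> bool:
    """True if the kit matches the signature on every signature marker."""
    return all(kit.get(m) == v for m, v in signature.items())


# Made-up example values, for illustration only.
kits = {
    "A": {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11},
    "B": {"DYS393": 13, "DYS390": 24, "DYS19": 15, "DYS391": 10},
    "C": {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 10},
    "D": {"DYS393": 14, "DYS390": 23, "DYS19": 14, "DYS391": 11},
}

matrix = distance_matrix(kits)

# Assume kits A and B are the Big Y tested members of one SNP cluster.
signature = find_signature([kits["A"], kits["B"]])

# Kits not yet Big Y tested that still carry the cluster's STR signature.
candidates = [name for name in ("C", "D") if shares_signature(kits[name], signature)]
```

In this made-up data, kit D is as close to kit A by step-wise distance as kit B is, yet D does not carry the signature, while kit C does. That is the kind of information the distance matrix alone cannot show, and a flag like `candidates` could be kept as an extra column beside the matrix in Excel.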