DNA testing makes it possible for those interested to easily and inexpensively take a DNA test. Due to privacy concerns as well as business model, however, only partial data is made available to customers. As a result, it is difficult to piece the raw data together to build up information. Making sense of the data is a challenge.
This paper is an initial effort at drawing deeper information from raw data provided by DNA testing. Two methods are presented to model the data and how it can then be interconnected. Steps to reduce complexity of interconnection are also presented. Both methods yield a connectivity matrix, though each is reduced differently. They are then fed as input to the widely used, open source graph drawing program, Graphviz.
As an Ancestry.com subscriber, the method presented stems from my desire to make sense of the somewhat disconnected raw data from Ancestry into easier to understand information. Ancestry provides two basic categories of information:
As mentioned, the challenge is that this is partial data. By limiting data access and not connecting customer trees in the fashion of WikiTree.org, customers must maintain subscriptions to continue their work. While some work will be identical performed in parallel with other customers, i.e., relatives, doing the same, this business model increases Ancestry revenue. To remain competitive, Ancestry uses some of the income for further document acquisition and digitization. So, there is a trade-off of pros and cons, both helping and hindering genealogical research.
A more practical challenge from the software developer's point of view is that there is no remote access to Ancestry's database. All information must be acquired through their web site. Web sites, by their nature, are designed for people as opposed to programmatic interaction. With these limitations in mind, the next steps are an approach to model the data and then present it in a useful manner.
This method is a useful way to view all direct matches and common matches between them and the primary person. Because it shows all data, it can also be a little overwhelming to make use of. This method in conjunction with the one presented later can make powerful allies, however, in determining where a DNA match is likely to reside in the family tree.
The technique is easily performed by hand, but becomes challenging with more people. The steps are simply:
The software version of the technique is also straightforward, and revolves around use of an adjacency matrix from graph theory. A square matrix is constructed such that rows and columns are named with each direct DNA match. Each element of the matrix is a zero if those two people do not share a common match, and a one if they do.
An example describes it better. Suppose we have a graph of four people whose DNA connects them like this:
A | B | C | D | |
A | 1 | 1 | 1 | 0 |
B | 1 | 1 | 1 | 1 |
C | 1 | 1 | 1 | 1 |
D | 0 | 1 | 1 | 1 |
Notice that along the diagonal there is redundant information indicating that A is related to A, B to B, and so on. This seemingly redundant data turns out to be helpful when reducing graph complexity. Because rows B and C are identical—and they wouldn't be, if the diagonal elements weren't 1—they can be collapsed into one node:
There are two important items in a DNA match: name and amount of shared cM's (centiMorgans). As a result, modeling a DNA match is simple. The larger challenge is modeling the relationship between people sharing DNA. It is important to note that the goal is not to build a family tree, but to create a graph showing who shares DNA segments with who. The graph is similar to a family tree, but not exactly that. By using cM levels, this technique can likely be enhanced to become a bit more like a tree. But for this first effort, simple results are the goal.
Ancestry DNA match lists are familiar to customers. Matches in 2021 look like the following.
Because direct database access is not available, two approaches are possible. One, is to use programmatic retrieval of web pages, parse them, and reformat name/cM data as desired. That is unquestionably the best way because it involves no human interaction. However, it works only for Ancestry pages as presented on the web today and is work intensive for the programmer. Because this is a spare time effort, I opted for the second and lazier approach. Each page of Common Matches is copy/pasted into a plain text file. Then, a program converts the data into a list like the following, where names in this paper are partially obscured with Xs:
XXXMurphy XXXMarkowski|3445 XXXMarkowski|3444 XXXntag888|1013 XXXBrosnan|359 XXXFraser|210 XXXieleenj|181 XXXNolan|172 XXXBrosnan|95 XXXFitzgerald|75 XXXconway|68 XXXBrosnan|68 XXXFreeman|65 --B.|65 --C.|56 XXXAdair|55 ...
An advantage to this is that DNA data from any source can be used as long as it is eventually presented in the form above. The program expects files in that format as input, and each file name must exactly match the name as entered. That is, suppose the file above is for direct matches with fictional customer Joe Smith. If a file exists with common matches between Joe Smith and XXXMurphy, the first name in the list, then that file of common matches must be named XXXMurphy, in this example. If such a file does not exist, XXXMurphy is ignored because he is such a distant match that he has no common matches between the Joe Smith and other matches. It isn't that he is uninteresting overall, but that this match can provide no DNA interconnection data and is unhelpful here.
Once all the people are read in with name and cM level stored, interconnections are modeled. This is performed just as in the small example presented earlier. Looking at intermediate output as the program runs on a small example,
Everyone 0: XXXN|396 1: XXXTobiansky|842 2: XXXGode|361 3: XXXFinnegan|690 4: XXXC.|618 5: XXXGode|671 6: XXXBrosnan|359 7: XXXntag888|1013At this point the program has read in direct matches, then read in common matches for each direct one. In the following steps, names are replaced with numbers above for easier construction of the adjacency matrix. Next, the complete matrix is constructed. For readability, zeros are replaced with dashes.
Matrix Raw 0: ['00' 111111--] 1: ['01' 111111--] 2: ['02' 111-11--] 3: ['03' 11-1----] 4: ['04' 111-11--] 5: ['05' 111-11--] 6: ['06' ------11] 7: ['07' ------11]
The matrix is then sorted in preparation for combining identical rows, i.e., people with identical common matches, into a single row. Again, this is why it is important to set all diagonal elements to 1 in the original matrix, above.
Matrix Sorted 0: ['06' ------11] 1: ['07' ------11] 2: ['03' 11-1----] 3: ['02' 111-11--] 4: ['04' 111-11--] 5: ['05' 111-11--] 6: ['00' 111111--] 7: ['01' 111111--]Finally, the reduced matrix is created, the final output of the program, where duplicate rows and columns have been combined.
Matrix Reduced 0: ['00,01' -11-] 1: ['02,04,05' 1---] 2: ['03' 1---] 3: ['06,07' ----]In reality, the program does not output what is shown above. It outputs the connectivity matrix in the dot language, a simple formal language that the open source graph visualization package, Graphviz, uses. Currently, it uses Graphviz's circo program, but no problems are posed if more elaborate dot output is generated for any of the Graphviz programs.
The examples above illustrate a small number of relatives, from 350 to 3000 cM. A more typical example, below, ranges from 40 to 3000 cM, making apparent the appeal of software.
This is hobbyist software without time or resources to knock off the rough edges! Please feel free to contribute if you are a programmer and are so inclined. Install python and Graphviz if not already on your computer. Then, to use the software:
That is by far the most work intensive part. With all these files in a directory, the easy part is to generate the graphs.
An optional -t name argument will highlight a target person and all matches in common.
Note: the software is preliminary and has shortcomings. Ancestry does not require that displayed names are unique. I discovered I have DNA matches with two men who chose the identical display name. At the moment, the only way around this is to manually rename one of them and use that same new name when referenced by other matches. No doubt there are other shortcomings yet to be discovered.
The following graphs are purposely not legible, in case Ancestry customers who match me don't want their data published. That is ok, however, because it is only the shapes of the graphs that matter for this discussion.
The program, dna, accepts command line arguments so that you can specify low and high thresholds of shared cM values. For example, suppose I use a lower threshold of 40 cM, meaning I only want to see extended family, the command is
python dna -c -l 40 mike_markowskiwhich generates a large graph of my extended family out to 40 cM shared DNA.
Suppose, however, that I filter out my sons and their many links similar to mine. Children share about 3500 cM's with parents, so I cap the cM's to 3000. Similarly, instead of going out to 40 cM, I only go out to 100 cM for this graph to keep it small:
python dna -c -l 100 -h 3000 mike_markowskiNow, my maternal and paternal trees are clearly delineated.
python dna -c -l 37 -h 200 mike_markowskiNotice that there are three groups rather than the four expected, one for each grandparent.
Trying to split trees further to my great-grandparents becomes much more uncertain due to the relatively small number of people who have taken DNA tests.
python dna -c -l 37 -h 60 mike_markowskiThere are eight subtrees as expected.
This method tries to eliminate redundant paths, making it easier to study interconnections. The graph produced is much smaller than those created by the exhaustive Method 1. As before, an example is the easiest way to describe it. Suppose we have the following graph of DNA matches:
Method 2's directed graph is reduced to this:
As before, an adjacency matrix is used to capture connections between common matches. The technique up to the point of creating the initial matrix in Method 1 is identical. The difference is how the matrix is simplified. In this case, paths are compared between every two nodes. In the illustration above where a link is shown between nodes A and D, the algorithm notes that the path A-B,C-D is length 2 and the path A-D is length 1. The longest path is always used because this eliminates the long, sweeping links running from top to bottom. Because this results in a sparse matrix, the method is very fast compared to Method 1.
The software is run identically to Method 1 and is selected with dna program's -d, for digraph, argument. Again, everything about the program initially is, in fact, exactly what Method 1 does to create the adjacency matrix. It is only the matrix simplification and the method that Graphviz plotting that changes.
After running
python dna -d -l 40 mike_markowskithe output, again purposely blurred, shows my two sons at the very top followed by a split with paternal to left, maternal to right. Because the maternal grandparent lines represent two continents, the subtrees are neat and orderly without crossover.
My paternal side, however, represents the small Irish farming village described earlier. The farther out (less cM in common) that the graph goes, the greater crossover and tangling is seen between my grandparents' lines.
Threshold range of 80 cM–200 cM once results in nearly, but not quite, four graphs representing grandparent lines.
The goal is for the programs to useful tools, helpful in creating a more complete family tree. They present the same information as Ancestry web pages but in different ways. An advantage is that data from multiple web pages is presented in one or two pictures, coalescing more information in one place. A major use is to help locate a general area in the family where a new match might exist. Often, a DNA match tells you that you are related to someone, but the path is a mystery. While a graph can't necessarily provide all the answers, with any luck it will at least narrow down areas of search.
For business concerns and, to a lesser extent, privacy concerns, Ancestry limits access to its data. The simple device of an adjacency matrix allows customers to more efficiently visualize the partial DNA data provided by Ancestry. This effort can be extended, providing better graphing and representation of data. The work described here makes no effort to create a family tree because having only DNA data results in a tree with uncertainty. The uncertainty could be represented by “clouds” of unknown detail that connect known segments of the tree. Ideally, the result would be a fully correct family tree, with unknown areas represented by clouds similar to computer network diagrams.
This document was generated using the LaTeX2HTML translator Version 2021 (Released January 1, 2021)
The command line arguments were:
latex2html -nonavigation -split 0 -show_section_numbers commonMatches.tex
The translation was initiated on 2021-06-08