How to Create and Tweak Surname Connection Graphs
by Wesley Johnston
Begun Febaruary 1, 2014 - Last updated May 13, 2024 - Add link to GEPHI graph

I have several community/parish reconstruction projects, in which I have created a database of the trees of all of the families in the records of the community. When these databases had relatively few surnames, it was relatively easy for me to manually create surname connection graphs, either in a graphic program or in a spreadsheet. But as the number of surnames and their inter-connections grew, the complexity of manually generating a surname connection graph became very challenging. I would add the most frequent surname pairs first to create a starting structure, and I would then add in the surname pairs for all of one of those original surnames. Adding just one more surname pair soon resulted in massive overhaul of the visual graph, in order to keep it understandable. What had worked for the pairs up to that point could no longer fit the new pair, even with moving nodes around. Another dimension became necessary, so that I had to come up with creative ways (e.g. asterisks) of showing that a cluster of surnames were related to each other by other than visible lines.

So I saw that I needed some automated tool to generate surname connection graphs. I had already realized in my manual process that standardized spellings of the surnames were needed, since the records were full of variants that really were the same family. So I had generated a spreadsheet of the surname pairs (grooms and brides) and then a standardized-spelling version. I wanted to input this standardized set of surname pairs into a tool that would generate an undirected connection graph with each node labeled as a surname. Ideally, the resulting graph would show multiple connections between a surname pair either with multiple lines between them or with a thicker line. I also realized that the default graphs were not going to work for presentation on a web page, since they were so cluttered. So I needed to have some way to tweak or subset the graphs, in order to provide better understanding of what they were revealing.

This came to a focus when I saw a thread posted on the Genealogy-DNA e-mail list by Ann Turner. I replied to her note, posing my need. A number of very helpful responses led me to examine several tools: NetDraw, igraph in R, yEd, and Pajek. Since this entire effort is about visualization, I have a heavy bias in favor of tools that are GUI and not command-driven. I also want a tool that is supported and kept current and has very good manuals and help. The best GUI and manuals for a simple start were for yEd, although Pajek looked like a good "someday I might try this" command-driven tool with lots of power and a good manual but a learning curve that I did not want to tackle until I had first invested an effort into seeing if yEd would do what I need.

It was not far into this effort that I realized that others would like to be able to do the same thing that I want -- a relatively simple way of generating and tweaking good quality surname connection graphs. So I began this web page February 1, 2014 to document my efforts.

-- Wesley Johnston


Contents
  1. Manually-Created Graphs
  2. yEd
  3. Data Preparation
  4. External Link: Using GEPHI to Create a Surname Connection Graph - In 2024, I figured out how to use GEPHI to create a surname connection graph. This is a link to step by step instructions for doing that.

Manually-Created Graphs

Here are four exampes from three different community projects. Click on the link to view the image in each case. [Links to the projects are in square brackets.]


yEd

The yEd software can be downloaded for free at the yWorks yEd page. Since I have my surname pairs in an Excel spreadsheet, the relevant section of the yEd Graph Editor Manual is Import of Excel Files.

I will now give the instructions for how I created a graph and then for how I created the web page to which the above link points. You first need to use the yWorks yEd link above to download and install yEd on your computer.

I generated a spreadsheet of 659 marriages - pairs of surnames, as described below in the Data Preparation section. Importing the spreadsheet into yEd was easily done. And then choosing the "Organic" layout and using its defaults generated a graph that was very much what I had wanted to be able to create. But it showed no surnames. So I posted a message on the yEd forum and someone provided the answer which definitely solved the problem. And I produced a new graph that is very much to my liking, though definitely something I would also like to tweak for visual reasons.

Here is a link to view the result. When you open this, you will see a popup in the yEd Live app with the URL of the file being opened. Copy that URL since yEd live for some reason does not load it despite what the popup says. Go ahead and click Load in the popup, then when the popup goes away, click on the three stacked lines at top left and then on "Open". Then click on "From Local File" and paste the URL in there to load it.

You can navigate around the graph and zoom in and out. The arrows connecting surnames point from the man to the woman. Each line represents one or more marriages. If you clikc on a node for a surname and then click on the "Neighborhood" box at the left, it will show you only the marriages that include that surname.

Instructions for Creating a Graph

  1. Step 1 - Import the data from the spreadsheet into yEd, specifying the node labels as the surnames.

    • Step 1a - Having yEd see your input data

      Run yEd. And in the "Welcome to yEd" window, click on the "Open..." icon. Navigate to your input spreadsheet, and click "Open".

      This will open yEd's "MS Excel Import" window, where you can see all the tabs of all the worksheets in your spreadsheet. Click on the tab that has your input data, so that your input data is showing.

      Note on the left of this window that you are in the "Data" tab of controls and that "Edge List" is the chosen option for "Edge Representation". The default "Property Names in First Row" is checked, which you should uncheck if you just have data with no column headings.

    • Step 1b - Telling yEd the specifics of your input data

      Use your cursor to highlight the column letters for both columns of your marriage surnames. Then click the "Adopt" button for "Data Range" in the "Edit List" section on the left.

      Use your cursor to highlight the column letter for the left column of your marriage surnames. Then click the "Adopt" button for "Column of Source IDs" in the "Edit List" section on the left.

      Use your cursor to highlight the column letter for the right column of your marriage surnames. Then click the "Adopt" button for "Column of Target IDs" in the "Edit List" section on the left. The source and target determine the direction of arrows in the graph (which is a directed and not an undirected graph). If your left column is husband and your right column is wife, then each marriage will be shown as the node with the husband's surname having an arrow pointing to the node with the wife's surname.

      The result of the above steps will look like this (click on the image for full size):

      raw spreadsheet
    • Step 1c - Telling yEd the specifics of how you want the nodes to be constructed and labeled in your graph

      Now click on the "Presentation" tab on the left of the "MS Excel Import" window. In the "Nodes" section's "Label Text", change the default "None" to "Node Label". Check the box for "Fit Size to Label". And change "Template" to "Rectangular Shape". In the Layout section, change "Circular" to "Organic".

  2. Step 2 - Click "Ok" at the bottom of the "MS Excel Import" window to generate your first graph.

    It's that easy.

  3. Step 3 - Use the "Layout" pulldown menu's "Organic" option to play around with the various control settings and see what graphs look like when you change them.

    You can always use the "Edit" menu's "Undo" option to go back to a prior graph if you like that better than the new one.

    For the graph I have posted in the link above, here are the control settings I used in the "Smart Organic Layout" window. I cannot claim to understand all of them, but this is what generated the graph you see on the web page (if I recall correctly).

    • Visual Tab: Preferred Scope (All), Edge Length (5), Consider Node Labels (checked); Minimal Node Distance (100.0), Avoid Node/Edge Overlaps (checked), Compactness (1.0), Use Natural Clustering (checked), Natural Clustering Quality (1.0)
    • Restrictions Tab: Output Area (Unrestricted)
    • Grouping: Group Layout Policy (Layout Groups), Use Automatic Group Comppaction (checked).
    • Algorithm: Quality/Time Ration (1.0), Maximal Duration (300), Activate Deterministic Mode (unchecked)

Instructions for Exporting the Graph and Setting it up on the Web - not sure these 2014 instructions are still correct

  1. Step 1 - Export the graph as an HTML Flash export.

    In yEd's "File" pull-down menu, click "Export". Navigate to where you want the five files [the exported graph (as a graphml file), the html file and the three supporting viewer files] to be created. Choose "HTML Flash Viewer (*.html) as the "Files of type". Then specify the name with which you want the graph and HTML to be saved and click "Save". Use the default 100% when the viewer specifications window pops up and click OK. And it is done. At this point, you can actually click on the generated HTML file on your hard drive to see how the viewer looks.

  2. Step 2 - Upload all five files to your server.

    The html file and graph will have the name you specified. The other three files are these:

    • crossdomain.xml
    • GraphMLViewer.js
    • GraphMLViewer.swf
  3. Step 3 - Add a link to the HTML file from whatever page you want.

    The manual for GraphML Viewer has instructions for how to embed the graph viewer in another web page, but for maximum size, it is best to set it up as above and link to it.


Data Preparation

What you want as input to yEd is a spreadsheet with one line for every marriage, with the man in the left column and the woman in the right column (or vice versa if you prefer). (These really were co-parentages in some cases, but the GEDCOM representation lumps them all together for data-manipulation as marriages. And I will refer to them in the same way, since they were actual linkages of one surname to another.)

If you have all of your data in a spreadsheet that is the index or abstract of all the marriages, then as long as there is a column for the man's surname and another column with the woman's surname, you might think that you do not have to do any data preparation at all and can just tell the software which columns you want to use when you import the file. But really you still do have to do some data preparation. So copy your master spreadsheet to an input spreadsheet, and then work in that input spreadsheet. Delete any marriage that do not have surnames for one or both spouses. And standardize the spelling of variants of surnames, so that you will really be able to see how families connected with each other (see more on this below). Then use pass the input spreadsheet to the graphing software.

But if your data is in a database, you have a bit more work to do, since you need to pass the data from the database into a spreadsheet.

My original data preparation was for the purpose of identifying all the surname pairs. The master location of the database for all of my projects is a database (one for each project) on Ancestry.com. This could be donwloaded as a GEDCOM file and imported into offline family tree software products. But what I really needed was a list of all the marriages, as the source of the surname pairs needed for graphing. And I also needed to standardize the variant spellings, in order to accurately reflect the connections between surnames.

  1. Step 1 - GEDCOM to Spreadsheet

    The first step is to create a list of surname pairs from all the marriages in the database.

    I did this with Legacy Family Tree software -- although not in the most logical way. The most logical place would be for Legacy to allow you to output the Master Marriage List to a CSV file. But they only allow it to output to hardcopy or (in the deluxe version) to a PDF file. So print it to a PDF file. Then go into the PDF file (with Adobe Reader) and select all and copy it all and paste it as text into a spreadsheet.

    What you will wind up with in the spreadsheet looks like this (click on the image for full size):

    raw spreadsheet
  2. Step 2 - Spreadsheet Cleanup

    Clearly, the raw form of the pasted information needs to be molded into surname pairs. And there are a number of problems to resolve in doing that. The irrelvant lines (such as lines 1-3) need to be removed. And the husbands and wives need to be put into separate columns. And the surnames have to be extracted and the given names and commas discarded. There are multiple ways to do this, and someone with more Excel skill than I have may know a more efficient way to do some of these tasks, but the following steps will work. I actually make a copy of my worksheet after doing each of these steps, so that the next step is done on the copy and the original forumlas are not lost, in case I have to go back and do something later -- such as capturing new marriage data from the updated database or from an entirely different database. Once you have entered all the formulas, you really do not want to simply delete them; so copy that worksheet to a new one and delete or overwrite the formulas in the new worksheet.

    • Step 2A - Eliminating Irrelevant Lines

      This is actually quite simple to do, even on a list of several thousand marriages. Click on the A at the top of the A column, to highlight the whole column. Then insert a column to the left of the data column. (I use the keysboard ALT + I, but you can also do this on the HOME menu's Cells section, where you can choose "Insert" and its "Insert Cells option.) This will move the data to column B and insert a blank column A. In cell A1, enter 1. Then with the cursor still in cell A1, use the keyboard and holding down the Shift and Ctrl keys, also press the End key. Then, still holding down the Shift and Ctrl keys, use the left arrow key, to move the cursor left so that only column A is highlighted (down to the last line on which there is data in column B). Now either use the keyboard and holding down the Alt key, press E, then I, then S (letting go of each one but still holding down the Alt key) or else on the HOME menu, in the Editing section, click Fill and then its Series option. Both methods will pop up the series box. Make sure the step value is 1, and then click OK. This will fill column A with the numbers 1, 2, 3, ... up to your last marriage line. Now sort both columns A and B (DATA / Sort) by column B. Now scroll down to where you find a cluster of lines saying "Husband Wife" and delete all those lines. Then scroll further down to the cluster of "Marriage List ..." lines, and delete all those lines. And then scroll down and delete all the "Produced by Legacy ..." lines. Now sort both columns again but this time make column A the one that you sort on. Finally delete column A, so that the data moves back into column A. And you have now removed all the irrlevant lines without losing the original order of your data lines.

    • Step 2B - Separating husbands and wives into different columns

      This is definitely something that requires some Excel skill. And it is not going to be easy when there are marriage without given names or surnames of one of the spouses. Start by taking a look at several of the marriage lines. Notice where the commas and spaces are. The format is husband surname, comma, space, husband given names, space, wife surname, comma, space, wife given names. There is also the unique "Unknown Unknown" line; simply delete that line, since it cannot yield a surname pair and complicates the creation of a general purpose formula for handling each line. Clearly, there are two different patterns in the marriages: one for marriages with all the names and one where the husband's given name is blank (e.g. "ALLER, LAKE, Ann"). So the method has to take both of these into account. We are actually saved some work by the fact that all we really want from each person is their surname.

      We will generate some new columns of formulas. Then when we are done, we will copy and paste only the values back into their same places. And then we will delete all but the surname pair columns. We have to search for the first comma and then put everything to the left of that into the husband surname column. Then we have to take everything to the right of that comma and search for the second comma and then take everything to the left of that comma into a new column and then search that column to find the last space and then take everything to the right of that into the wife's surname column. Just to be safe, I will also use the TRIM function, in case there are any unnoticed leading or trailing spaces. As I said, this takes some skill in Excel.

      I'll use columns B and C to capture the husband's surname, which is the easiest part. In column B1, enter the formula "=FIND(",",A1)" (without the outside quotes). This finds the position of the leftmost comma for the first marriage. In column C1, enter the formula "=TRIM(LEFT(A1,B1-1))" (again, without the outer quotes). This captures the husband's surname.

      I'll need several columns to capture the wife's surname. In column D1, enter the formula "=TRIM(RIGHT(A1,LEN(A1)-B1-1))". This captures everything to the right of the space after the leftmost comma in the original marriage text in cell A1. (The TRIM() is probably unnecessary but does not hurt anything.) In column E1, enter "=FIND(",",D1)", which finds the location of the rightmost comma. In column F1, enter "=TRIM(LEFT(D1,E1-1))". This captures everything from D1 that is to the left of the comma. Now comes the real golden bullet, capturing the last word of the remaining text that is now in F1. In cell G1, enter "=RIGHT(F1,LEN(F1)-FIND("|",SUBSTITUTE(F1," ","|",LEN(F1)-LEN(SUBSTITUTE(F1," ","")))))". This captures the wife's surname unless the husband had no given names, in which case it gives "#VALUE!". So finally in cell H1, enter "=IFERROR(G1,F1)". This uses the wife's surname from column G1 if it is present and from column F1 if G1 is the error value.

      So you now have the husband's surname in C1 and the wife's in H1. But you need to copy all these formula cells for every line of marriages in your spreadsheet. So use your mouse to highlight cells B1 through H1. And then on the HOME menu, click copy. Then use the keyboard and holding down the Shift key, press and hold the Ctrl key and then the End key. This will highlight all the cells in columns B to H in your spreadsheet. Let go of those keys, and on the HOME menu at the far left click the little triangle under "Paste" and then click the "Paste Special" option. This will open the Paste Special pop up box. Click the circle next to "Formulas", and then click OK.

      You'd like to rely on the results, but you really need to scroll through all the lines of the spreadsheet, looking for anomalies and manually correcting them. For example, a two-word surname will have been choppped down to one word by this method. So watch for a VAN COURLAER surname in the original data, which will have been chopped to COURLAER and correct it to VAN COURLAER. If there is a line with one of the surnames Unknown, simply delete that line. Or there may have been some lines with more than two commas. There are almost always some special cases that the general solution does not handle. So just correct those as you find them. This is a very important step; so do it carefully or you may wish later that you had.

      Now that all the surname data is as correct as you can make it, it is time to replace the formulas with their computed values, thus locking in the surnames for the husbands and wives. At the top highlight all of columns A through H. Then on the HOME menu, click COPY. Then click the little triangle under paste and again click on Paste Special. This time, in the Paste Special pop-up box, click the circle next to "Values", and then click OK. Now delete all the columns except C (the husbands' surnames which will end up as column A) and H (the wives' surnames which will end up as column B).

  3. Step 3 - Standardize name spellings.

    Spelling, even in modern records, can be inconsistent. Prior to the late 1800's, it could be extremely variable. We want the surname conncetion graph to show how families are related. So we do not want all the variants for each surname. We want a standardized spelling for those names that occur with variants. So you have to go through all the husband surnames and replace the variants for a name with the spelling that you have decided will be the standard for purposes of the graph. I mark all standardized spellings with an asterisk after the name, so that I know that these include variants, which I can examine if any later questions come up. Be sure to look for variants where the first letter as changed (e.g. Tillingham and Dillingham).

    The list is sorted by husband surname, since that is how the original PDF file showed it. So after standardizing all the husband surnames, you need to sort the marriages by wife surname and repeat the standardization, making sure that you standardize wife names with the same standard spelling as you used for the husbands.

    Finally, sort the list back into order by husband surname, with the wife surname as the second level of the sort. And at long last, you have your surname pairs. Different software packages may require further configuration of the data, in order to be read by the software package. But you now have the data that you need, which can be reformatted as necessary.


Copyright © 2024 by Wesley Johnston
All rights reserved


Click here for my main family history page.