I am particularly distressed by the failure of the genealogical computing community to deal with the complete re-design of the GEDCOM database format. GEDCOM is really two things in one:
The second aspect really depends on the first: the design of the database is the most fundamental aspect of GEDCOM.
The reality is that the database design of GEDCOM is based on the limitations of the technology of the early 1980's -- almost 30 years ago. The resulting design sacrificed the principles of good database design (especially third normal form) in order to make things work within the limits of the technology of that time. We all did our database designs that way in those days: we had to live within the limits of the technology, or our systems would not work, no matter how perfectly their design followed good database design principles. So we de-normalized and made things work. Various commercial genealogical computing products have added features on top of the GEDCOM design, but the fundamental flaws of the underlying GEDCOM design doom any on-going add-on efforts to frustration and also violate the transferability of the information that is the second goal of GEDCOM.
But that was then and this is now. Those days and their severe technological limitations are gone. And yet we are still using essentially the GEDCOM database design that was developed for the technology of the 1980's. And we are paying for it in ways that we should no longer have to. Until there is a fundamental redesign of the GEDCOM database structure, we will continue to be unable to exploit modern technology fully for what it can do for genealogical computing, and we will continue to be unable to share our databases fully, without significant loss of information, especially of added media. Now that many of those implementation limitations are no longer a problem, we really need a design for lineage-linked databases that supports all of the relationships in which people existed in their lives, since all of those relationships were potential sources of records and potential sources of learning more about the people through information far beyond bare-bones blood pedigrees.
So I have put together this web page that addresses the topic in detail.
What is wrong with the current design?
Good database design relies on normalization of the data. There are various levels, but the most important is the third level. Database design that conforms to this design principle is said to be in third normal form or 3NF. This form eliminates redundancy and assures integrity. The downside of implementing this design is that views of the data that combine attributes from different relationships of data have to be run every time you want to see the data that way. In the past, this was a prohibitively high cost in the amount of time it would take to display a screen with the information that you wanted to see. So in the 1980's, databases were not designed to conform to 3NF but instead were de-normalized so that the data was stored redundantly but could quickly be displayed.
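The trade-off described above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table names, column names, the `event_display` view and the sample rows are all invented for the example and are not part of GEDCOM. Each place name is stored once, and the combined "screen" is a view whose join runs every time it is read.

```python
import sqlite3

# Hypothetical normalized schema: none of these names come from GEDCOM.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE place  (place_id  INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE event (
    event_id  INTEGER PRIMARY KEY,
    person_id INTEGER NOT NULL REFERENCES person(person_id),
    place_id  INTEGER NOT NULL REFERENCES place(place_id),
    kind      TEXT    NOT NULL
);
-- The combined display is a view: the joins run each time it is queried.
CREATE VIEW event_display AS
    SELECT event.event_id AS event_id,
           person.name    AS person,
           event.kind     AS kind,
           place.name     AS place
    FROM event
    JOIN person ON person.person_id = event.person_id
    JOIN place  ON place.place_id   = event.place_id;

INSERT INTO person VALUES (1, 'Person A'), (2, 'Person B');
INSERT INTO place  VALUES (1, 'Chicago');
INSERT INTO event  VALUES (1, 1, 1, 'birth'), (2, 2, 1, 'marriage');
""")


def display_rows(conn):
    """Return (person, kind, place) rows as a user would see them."""
    return conn.execute(
        "SELECT person, kind, place FROM event_display ORDER BY event_id"
    ).fetchall()


def rename_place(conn, place_id, new_name):
    """Change the place name in exactly one row of one table."""
    conn.execute(
        "UPDATE place SET name = ? WHERE place_id = ?", (new_name, place_id)
    )
```

Calling `rename_place(conn, 1, ...)` changes a single stored value, yet every row returned by `display_rows(conn)` shows the new name. In a de-normalized design, the same correction would have to be made in every record that repeats the text.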
The 4-Level Location Name Nightmare
The reality is that someone lived in a house or worked in a place that had an address and was in a city. These were the two fundamental ways that they thought about where they lived. Certainly, the state and country were significant. And if they were in a rural area in the United States, the county was important. But essentially you had a place that stayed where it was while the identification at various levels altered.
The problems of denormalized location data in the existing GEDCOM design lie in the forcing of every location field to have four levels: "Chicago, Cook, Illinois, USA" -- based on cities within the United States. This led to all sorts of torturing of location data on a rack of de-normalized design. Here are some examples:
So if you are doing a population study of a particular town or parish over the course of 300 years, how do you represent the fact that while the houses never moved, all of these different political boundaries and name changes took place?
Instead of a database design issue, it became a pseudo-religious issue: many people proposed or imposed THE ONE TRUE WAY that they saw this should be done. But the reality is that the problem existed entirely because of the design of the database that did not normalize the data. If the data were fully normalized (to 3NF), there would be a unique identifier for a place, and the boundaries and names of the political entities in which it sat would be held in separate relations that could be looked up using the place identifier and the date.
But the technology of the time did not permit rapid joins (the technical term for a lookup) of database relations, nor did anyone want to make the significant effort to create the database relations of the necessary historical information. That has changed, and the technology of today does permit rapid lookups, and the database relations of such historical information are now add-ons in some of the best commercial genealogical software products. But we are still limited because the GEDCOM database design does not support this.
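The lookup by place identifier and date described above might be sketched as follows, again using Python's sqlite3 module. Everything in this example is an assumption made for illustration: the `place` and `jurisdiction` tables, their columns, and the place and county names (which are deliberately fictional) are not taken from GEDCOM or from any real gazetteer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- A place is just a stable identifier plus a description of the spot itself.
CREATE TABLE place (
    place_id    INTEGER PRIMARY KEY,
    description TEXT NOT NULL
);
-- Names and boundaries live in a separate relation, valid for a span of years.
CREATE TABLE jurisdiction (
    place_id   INTEGER NOT NULL REFERENCES place(place_id),
    level      TEXT    NOT NULL,   -- e.g. 'parish', 'county'
    name       TEXT    NOT NULL,
    first_year INTEGER NOT NULL,
    last_year  INTEGER NOT NULL
);
-- Fictional sample data: the house never moves, the county changes in 1750.
INSERT INTO place VALUES (1, 'Farmhouse on Mill Lane');
INSERT INTO jurisdiction VALUES
    (1, 'parish', 'St. Example', 1600, 1899),
    (1, 'county', 'Northshire',  1600, 1749),
    (1, 'county', 'Eastshire',   1750, 1899);
""")


def jurisdictions(conn, place_id, year):
    """Return {level: name} for the jurisdictions covering a place in a year."""
    rows = conn.execute(
        "SELECT level, name FROM jurisdiction "
        "WHERE place_id = ? AND ? BETWEEN first_year AND last_year",
        (place_id, year),
    ).fetchall()
    return dict(rows)
```

With this layout, an event records only `place_id` and a date. Asking which county held the records for the farmhouse in 1700 versus 1800 is then a query, not a decision made once at data-entry time.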
Should you use "USA" or "United States of America"? How about "Michigan" or "MI"? Did you know that the official name of the state of Rhode Island is "State of Rhode Island and Providence Plantations"? And do you really want to enter that in the state field every time something happened in Rhode Island in your database?
Since the denormalized form of the location data requires that you re-enter the data every time, it is highly likely that at some point your fingers will not hit the right keys. Most commercial genealogical database software now is smart and starts prompting you with name choices that you have already entered, so that you can choose the same one that you entered before. But it is inevitable that at some point you will make a mistake and thus wind up with two different forms of the same name. And the database will not recognize that these are the same, so that if you want to retrieve a list of everyone who lived at that place, you have to know that you need to search for both forms of the way the name was entered. In a normalized database, this problem would be minimized: you enter the place identifier only, and all of the other levels are in separate relations, so that they are not entered over and over and over again.
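A small sketch of the retrieval problem, in plain Python. The people and the data layout are invented for illustration; the point is only to contrast matching on retyped text with matching on a place identifier.

```python
# De-normalized: the full place text is retyped on every event.
events_denormalized = [
    ("Person A", "Detroit, Wayne, Michigan, USA"),
    ("Person B", "Detroit, Wayne, MI, USA"),        # abbreviation variant
    ("Person C", "Detriot, Wayne, Michigan, USA"),  # typing slip
]


def residents_by_text(events, place_text):
    """Find people by exact match on the retyped place string."""
    return [person for person, text in events if text == place_text]


# Normalized (hypothetical layout): the name is stored once, events hold an id.
places = {1: "Detroit, Wayne, Michigan, USA"}
events_normalized = [
    ("Person A", 1),
    ("Person B", 1),
    ("Person C", 1),
]


def residents_by_id(events, place_id):
    """Find people by place identifier, independent of how the name is spelled."""
    return [person for person, pid in events if pid == place_id]
```

Searching the de-normalized list for "Detroit, Wayne, Michigan, USA" finds only Person A, while the search by identifier finds all three. A wrong identifier can still be entered, of course, which is why the problem is minimized and not eliminated.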
Some people shoe-horned English counties into the level of American counties and then had to put the parish into the city name field and then had nowhere to put the village name within the parish -- and their state field was left blank. So the village of Blackborough in the parish of Kentisbeare, Devon would be entered as "Kentisbeare, Devon, , England" or perhaps "Blackborough (Kentisbeare), Devon, , England".
But other people recognized that English counties were really at the same structural level as US states and not US counties. Thus they entered the four levels as "Blackborough, Kentisbeare, Devon, England".
So major inconsistencies arose as to how different people represented the same place within the four levels of the GEDCOM design. And the whole GEDCOM goal of the sharing of databases ran into problems of merging a database with one naming convention into another that used a different naming convention.
That does not mean that it should be omitted. It is critically important to know what county holds the records of that place for a given time.
But when you address a letter, you do not enter county. And so some people feel uncomfortable simply entering "Fresno, Fresno, California, USA" and instead enter "Fresno, Fresno County, California, USA". While this works for them, as long as they are consistently using it in all cases, it creates real problems when merging two databases.
House addresses simply did not exist within the GEDCOM location structure. Nor did wards in cities nor townships in counties. So if you wanted to somehow record this information, you had to either place the information in a generic data field or else shoe-horn it into the 4-level location field structure.
Placing the information in a generic data field provided no simple way to answer the question of who lived in the same house.
Shoe-horning the information into the 4-level structure led to all sorts of corruptions of the purity of the four levels that were unique to the person doing the transcription. You see this, for example, in Ancestry.com's attempts to include Ward in census data location names: "Chicago Ward 33, Cook, Illinois". But it shows up in many ways in various trees that you will see, for example, "Lot 5 Conc 7 Whitby Township, Ontario, Ontario, Canada".
GEDCOM has created a nightmare situation in which important information (houses, townships, wards) does not exist within the design and thus cannot be easily retrieved without corrupting the design and making the database a nightmare to merge with another database that does not use the same convention. And even the fields that do exist are used in different ways by different people. If the data is stored properly in normalized form that includes addresses and other attributes and relations, then it can be retrieved in ways that do not have to be designed into the system in advance.
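To make the "who lived in the same house" question concrete, here is a minimal sketch with Python's sqlite3 module. The `address` and `residence` tables, the street names and the people are all hypothetical; this is one possible layout under those assumptions, not a proposal for the actual redesign.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (
    address_id INTEGER PRIMARY KEY,
    label      TEXT NOT NULL
);
CREATE TABLE residence (
    person     TEXT    NOT NULL,
    address_id INTEGER NOT NULL REFERENCES address(address_id),
    year       INTEGER NOT NULL
);
-- Invented sample data.
INSERT INTO address VALUES (1, '12 Example Street'), (2, '14 Example Street');
INSERT INTO residence VALUES
    ('Person A', 1, 1900),
    ('Person B', 1, 1900),
    ('Person C', 2, 1900),
    ('Person D', 1, 1910);
""")


def housemates(conn, person):
    """People recorded at the same address in the same year as `person`."""
    rows = conn.execute(
        """
        SELECT DISTINCT other.person
        FROM residence AS mine
        JOIN residence AS other
          ON other.address_id = mine.address_id
         AND other.year = mine.year
         AND other.person <> mine.person
        WHERE mine.person = ?
        ORDER BY other.person
        """,
        (person,),
    ).fetchall()
    return [row[0] for row in rows]


def ever_lived_at(conn, address_id):
    """Everyone recorded at an address in any year."""
    rows = conn.execute(
        "SELECT DISTINCT person FROM residence "
        "WHERE address_id = ? ORDER BY person",
        (address_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

Neither query had to be anticipated when the tables were designed. Both fall out of the address being a relation of its own and not a fragment of text in a generic field.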
Restriction to only Parent-Child and Parent-Spouse Human Relationships
The relationships that immerse people -- and which we find in the records -- are far more numerous than the parent-child and parent-spouse relationships. But those are the only relationships that GEDCOM supports. You can compute other familial relationships from these. But you have no way to represent, in a retrievable fashion as a full relationship in the sense of database relations, a vast array of other relationships, of which several were particularly important in a person's life:
In addition, there are human relationships that appear in records that you want to capture but which you cannot fit into your existing database, since they depend on distant linkages that you cannot determine:
While there probably were technological reasons for not designing the GEDCOM database to include these relationships, the original design also had a purely lineage-linked focus. After my 2003 article "Non-blood Relationship Searches" appeared in "Genealogical Computing" magazine, a reader wrote a letter to the editor to express dismay that such a subject would even be discussed, since she had this narrow ancestors-only focus -- which really blinded her to understanding the lives of her ancestors. But there may have been a good deal of this same attitude in the original GEDCOM design, which the technological limitations of 1984 made it possible to cast into concrete that still encases and constricts us today.
We really need a database design that supports all of the key relationships in which people existed in their lives, since all of those relationships were potential sources of records and potential sources of learning more about the people through information far beyond bare-bones blood pedigrees.
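One way such a design might hold arbitrary relationships is a single relation that links two people and names the kind of link. The sketch below uses Python's sqlite3 module. The table layout, the relationship kinds ('godparent of', 'witness for', 'employer of') and the sample people are illustrative assumptions only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Any person-to-person link is a row; the kind is data, not a fixed field.
CREATE TABLE relationship (
    from_person INTEGER NOT NULL REFERENCES person(person_id),
    to_person   INTEGER NOT NULL REFERENCES person(person_id),
    kind        TEXT    NOT NULL,
    source      TEXT              -- record in which the link was found
);
-- Invented sample data.
INSERT INTO person VALUES (1, 'Person A'), (2, 'Person B'), (3, 'Person C');
INSERT INTO relationship VALUES
    (1, 2, 'godparent of', 'baptismal record'),
    (3, 2, 'witness for',  'marriage record'),
    (1, 3, 'employer of',  NULL);
""")


def relationships_of(conn, person_id):
    """All recorded links in which a person appears, on either side."""
    return conn.execute(
        """
        SELECT p_from.name, r.kind, p_to.name
        FROM relationship AS r
        JOIN person AS p_from ON p_from.person_id = r.from_person
        JOIN person AS p_to   ON p_to.person_id   = r.to_person
        WHERE r.from_person = ? OR r.to_person = ?
        ORDER BY r.rowid
        """,
        (person_id, person_id),
    ).fetchall()
```

Because the kind of relationship is stored as data, a new kind found in a record needs no change to the design. It also does not require knowing how the two people are linked by blood, if they are at all.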
....................STILL UNDER CONSTRUCTION.............................
Absence of Geographical Relationships
Where people lived in relationship to each other was important in their lives. Perhaps Cornelius met Maggie because they lived across the street from each other or maybe they were living in different apartments in the same building or worked together at the same place or attended the same church. Or maybe the baptismal record of a child shows that the child was born at "None of these geographical relationships are supported in GEDCOM. You can enter the information in a generic data field, but there is no way that you can easily search the GEDCOM database to discover which people were together in these ways.
There are also census-based relationships that may be more complicated to implement: is in the same household as (a servant, a boarder, ...), is in the same building as, is on the same or next census page as, is in the same enumeration district as.
Clearly if there were an address field, many of these geographical relationships could be discovered. But there are censuses and other records that do not include addresses but do include the information that two people were in some geographical relationship with each other, and it would be useful to include that connection between those people rather than to ignore it, as the current GEDCOM design does.
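Even without addresses, census position alone can yield some of these connections. The following plain-Python sketch assumes a hypothetical `CensusEntry` record with a district, page and household number, and assumes that household numbers are unique within an enumeration district. The entries themselves are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CensusEntry:
    person: str
    district: str   # enumeration district
    page: int       # census page number within the district
    household: int  # household number, assumed unique within a district


# Invented sample entries.
ENTRIES = [
    CensusEntry("Cornelius", "ED 1", 4, 17),
    CensusEntry("Maggie", "ED 1", 5, 21),
    CensusEntry("Servant X", "ED 1", 4, 17),
    CensusEntry("Person Y", "ED 2", 4, 17),
]


def _entry_for(entries, person):
    """Return the entry for a person (raises StopIteration if absent)."""
    return next(e for e in entries if e.person == person)


def same_household(entries, person):
    """Others enumerated in the same household of the same district."""
    me = _entry_for(entries, person)
    return sorted(
        e.person for e in entries
        if e.person != person
        and (e.district, e.household) == (me.district, me.household)
    )


def same_or_adjacent_page(entries, person):
    """Others on the same, previous or next page of the same district."""
    me = _entry_for(entries, person)
    return sorted(
        e.person for e in entries
        if e.person != person
        and e.district == me.district
        and abs(e.page - me.page) <= 1
    )
```

Here Cornelius and Maggie turn up as near neighbors on adjacent pages, although no street address appears anywhere in the data.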
....................STILL UNDER CONSTRUCTION.............................
Relationships of media items to people and events and places
....................STILL UNDER CONSTRUCTION.............................