August 29, 2016 by Kenneth Fisher
There has been a long standing debate over the use of a natural primary key vs an artificial one. I thought I would throw in my two cents.
First of all why is it important? The primary key is used for two major purposes.
The primary key is one of the candidate (unique) keys possible for a table. It uniquely identifies a given row of data. This is particularly important in case you need to update or delete a specific row.
The primary key is frequently used in related tables to tie the two tables together. Typically this is a Foreign Key relationship.
- The smaller the key the less space taken up in the other tables.
- The smaller the key the smaller the related index. The smaller the related index the fewer reads are required. Joins are typically based on these indexes. So the smaller the key the faster the joins.
- Because this column or columns will be used as a relationship in other tables any changes to these values will also have to be performed in all of the tables that contain them.
The Clustered index
The primary key is frequently the clustered index. It certainly doesn’t have to be but since that’s the default it almost always is. That being the case let’s also look at what makes a particularly useful clustered index.
- Ever increasing (or decreasing depending on the direction of the index). This decreases fragmentation and increases insert speeds.
- Narrow – Same reasoning as above. Smaller = faster.
- Unique – Primary keys have to be unique anyway but this helps with the Narrow requirement. If the clustered index wasn’t unique a 4 byte uniqueifier has to be added.
- Static – Every time a value clustered index has to be changed the data has to be moved. This of course slows things down and again can lead to fragmentation.
Ok, now that we know what we are working with what are the pros and cons for natural and artificial keys?
A natural key is one constructed of data that already exists in the table. For example using latitude and longitude in a table of addresses. Or the social security number in a table of employees. (Before you say anything, yes, the social security number is a horrible primary key. Be patient.)
The pros and cons
- The data actually means something. If instead of using an integer artificial key for a list of states (for example) you use a two character field (TX, AL, CA, etc) then when you see a value in another table it actually means something.
- Potential space savings. All things being equal there is no need to add additional space to a table to hold an artificial key.
- Can be hard to pick a good one. Things that might seem obvious (like a social security number) make horrible primary keys. Remember they must be unique (SSN isn’t) and shouldn’t change (SSN can). (See I told you we would get there if you waited.)
- Frequently (but not always) costs additional space. Even though the artificial key is an additional column it’s frequently smaller than a natural key. Because of this when it’s repeated in other tables additional space will be taken up. Also remember if the primary key is also the clustered index then it’s going to exist in every other index.
- Remember smaller = faster and again, natural keys are usually bigger.
An artificial key (sometimes called a surrogate) is one that was created specifically to be the unique identifier of a row. Identity columns are the most common form of artificial keys. They are unique, ever increasing (typically) and have absolutely nothing to do with the rest of the data in the row.
The pros and cons
- Small! Everything that I said above about being smaller. FYI common mistake is to use INT everywhere. A SmallINT or even TinyINT is more than big enough for a lot of lookup tables.
- The chances of an artificial key value changing is next to nothing. I’m not saying it can’t happen. But it’s highly unlikely.
- Ever increasing. Identity columns are ever increasing (or decreasing) which helps if this is also the clustered index.
- Identity columns are also usually unique. They don’t have to be but usually are unless something strange happens.
- Because the artificial key doesn’t mean anything on it’s own, if it’s used in a foreign key you are more likely to have to join to the lookup table.
- They have absolutely no relation to the rest of the data on the table breaking third normal form. Honestly I don’t consider that to be a big deal but some people do so I thought I would mention it.
So based on all of the above which do you pick? It’s certainly easier to pick an artificial key. Although as a favor to me still pay attention to the datatype you use. If you really need > 32k (smallint, 2 bytes) entries for your employee table your company is huge. That said my personal opinion is that, like many other things, it depends. If you can find an appropriate small natural key then frequently that’s better. Not to mention the fact that if the table won’t have any children (the key won’t be used in any other tables as a foreign key) then size isn’t as big an issue and it’s easier to find a natural key.
Also don’t forget that the primary key and the clustered index don’t have to be the same thing. So using something like a modified date as the clustered index and something else as the primary key could be perfectly reasonable. In fact I have one whole database where the clustered indexes are the modified date and there are no primary keys. It’s a special case, but over the nears I’ve noticed that special cases seem to be the norm.