Natural vs Artificial Primary Keys

August 29, 2016 by Kenneth Fisher

There has been a long standing debate over the use of a natural primary key vs an artificial one. I thought I would throw in my two cents.

First of all why is it important? The primary key is used for two major purposes.

Uniqueness

The primary key is one of the candidate (unique) keys possible for a table. It uniquely identifies a given row of data. This is particularly important in case you need to update or delete a specific row.

Relationships

The primary key is frequently used in related tables to tie the two tables together. Typically this is a Foreign Key relationship.

The smaller the key the less space taken up in the other tables.
The smaller the key the smaller the related index. The smaller the related index the fewer reads are required. Joins are typically based on these indexes. So the smaller the key the faster the joins.
Because this column or columns will be used as a relationship in other tables any changes to these values will also have to be performed in all of the tables that contain them.

The Clustered index

The primary key is frequently the clustered index. It certainly doesn’t have to be but since that’s the default it almost always is. That being the case let’s also look at what makes a particularly useful clustered index.

Ever increasing (or decreasing depending on the direction of the index). This decreases fragmentation and increases insert speeds.
Narrow – Same reasoning as above. Smaller = faster.
Unique – Primary keys have to be unique anyway but this helps with the Narrow requirement. If the clustered index wasn’t unique a 4 byte uniqueifier has to be added.
Static – Every time a value clustered index has to be changed the data has to be moved. This of course slows things down and again can lead to fragmentation.

Ok, now that we know what we are working with what are the pros and cons for natural and artificial keys?

Natural

A natural key is one constructed of data that already exists in the table. For example using latitude and longitude in a table of addresses. Or the social security number in a table of employees. (Before you say anything, yes, the social security number is a horrible primary key. Be patient.)

The pros and cons

Pros:

The data actually means something. If instead of using an integer artificial key for a list of states (for example) you use a two character field (TX, AL, CA, etc) then when you see a value in another table it actually means something.
Potential space savings. All things being equal there is no need to add additional space to a table to hold an artificial key.

Cons:

Can be hard to pick a good one. Things that might seem obvious (like a social security number) make horrible primary keys. Remember they must be unique (SSN isn’t) and shouldn’t change (SSN can). (See I told you we would get there if you waited.)
Frequently (but not always) costs additional space. Even though the artificial key is an additional column it’s frequently smaller than a natural key. Because of this when it’s repeated in other tables additional space will be taken up. Also remember if the primary key is also the clustered index then it’s going to exist in every other index.
Remember smaller = faster and again, natural keys are usually bigger.

Artificial

An artificial key (sometimes called a surrogate) is one that was created specifically to be the unique identifier of a row. Identity columns are the most common form of artificial keys. They are unique, ever increasing (typically) and have absolutely nothing to do with the rest of the data in the row.

The pros and cons

Pros:

Small! Everything that I said above about being smaller. FYI common mistake is to use INT everywhere. A SmallINT or even TinyINT is more than big enough for a lot of lookup tables.
The chances of an artificial key value changing is next to nothing. I’m not saying it can’t happen. But it’s highly unlikely.
Ever increasing. Identity columns are ever increasing (or decreasing) which helps if this is also the clustered index.
Identity columns are also usually unique. They don’t have to be but usually are unless something strange happens.

Cons:

Because the artificial key doesn’t mean anything on it’s own, if it’s used in a foreign key you are more likely to have to join to the lookup table.
They have absolutely no relation to the rest of the data on the table breaking third normal form. Honestly I don’t consider that to be a big deal but some people do so I thought I would mention it.

Summary

So based on all of the above which do you pick? It’s certainly easier to pick an artificial key. Although as a favor to me still pay attention to the datatype you use. If you really need > 32k (smallint, 2 bytes) entries for your employee table your company is huge. That said my personal opinion is that, like many other things, it depends. If you can find an appropriate small natural key then frequently that’s better. Not to mention the fact that if the table won’t have any children (the key won’t be used in any other tables as a foreign key) then size isn’t as big an issue and it’s easier to find a natural key.

Also don’t forget that the primary key and the clustered index don’t have to be the same thing. So using something like a modified date as the clustered index and something else as the primary key could be perfectly reasonable. In fact I have one whole database where the clustered indexes are the modified date and there are no primary keys. It’s a special case, but over the nears I’ve noticed that special cases seem to be the norm.

Category: Index, Microsoft SQL Server, Performance, Referential Integrity, SQLServerPedia Syndication | Tags: indexes, microsoft sql server, Performance, Referential Integrity

6 thoughts on “Natural vs Artificial Primary Keys”

Gabriel says:

August 29, 2016 at 8:33 AM

One potential risk with artificial keys is that you might not catch duplicates. For example if you have an identity column on your employee table, you could insert the exact same row two or more times without realizing.

If you’re running a data warehouse and your employee table is tracking changes, then the artificial (a.k.a. surrogate) key works great for that purpose. Otherwise, you may need to add a unique constraint/index somewhere to ensure that you’re not getting unwanted duplicates.

- Kenneth Fisher says:
  
  August 29, 2016 at 8:58 AM
  
  That is a really excellent point. One of those things I “know” but never think of directly.
  
GDPR – After the restore | SQL Studies says:

March 19, 2018 at 8:01 AM

[…] Again, not a lawyer, not an expert, but I’m going to guess no. This does mean that in the Natural vs Artificial Keys argument the artificial keys side just got a big bump. Because to be honest, I’m going to […]

SQL Server to Oracle/Oracle to SQL Server Migrations, Part 2 | Toptal says:

October 29, 2019 at 6:50 AM

[…] a long-standing divide in the database community between two camps: the patriots of natural keys and the proponents of […]

Byron Mayes says:

January 27, 2021 at 12:51 PM

Always use the unique field or fields as your primary key. Auto identities has no meaning to the data row, unless it is a historical or log file.

- Kenneth Fisher says:
  
  January 27, 2021 at 4:03 PM
  
  That’s certainly one way to go. Of course if there is no unique combination of fields, or for whatever reason the unique combination includes a column that has to be nullable then you are going to have a problem.