Right sizing int : T-SQL Tuesday #136

March 9, 2021 by Kenneth Fisher

Brent Ozar (blog|twitter) is our host this month, which will make this the most popular #tsql2sday to date I’m sure. He’s asked us to talk about our favorite (or least favorite) data type. I can actually guess what a few people are going to write about but that’s neither here nor there. I actually want to talk about a group of data types more than an individual one.

Did you know that there are four different integer datatypes? It’s not just int.

Data type	Range	Storage
bigint	-2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807)	8 Bytes
int	-2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647)	4 Bytes
smallint	-2^15 (-32,768) to 2^15-1 (32,767)	2 Bytes
tinyint	0 to 255	1 Byte

If you look at the above chart you’ll see that each of the four is not only different in size, but dramatically different in the number of values it can store. Most people just use int everywhere. And while int will work the vast majority of the time, that doesn’t mean it’s the correct datatype. A very simple example would be using the int datatype for an identity column in a table where you are going to be adding millions of rows a day. In fact I recently read an article where a social media site did this. Unsurprisingly in a very short amount of time they hit the 2 billion row limit for positive values (because of course they didn’t start at the min negative value) and crashed. This would have been a time where using bigint would have been more appropriate. Just so you are aware, at one million rows a second it would take almost three hundred thousand years to fill a bigint. And that’s just the positive values. Now you pay for that in space. bigint is 4 bytes larger than int translating into a little less than 4mb per million rows increase. It doesn’t seem like much but at the point where you need to switch to bigint you are paying 16gb (~4.2 billion rows * 4 bytes more than int / 1024 / 1024 = ~8000). And that’s just where you are starting.

Now, speaking of space, let’s go the other direction. Let’s say you have a data warehouse with lookup tables for employees and another for sales districts. You decide to use an integer identity column. Now, 4 billion employees seems like a bit much even for a large company. How about 64 thousand? It’s highly unlikely that you’ll ever have that many employees even accounting for attrition and growth. So smallint is a reasonable possibility. There is also tinyint but that would only let us have 256 employees which even for a small company over time might not be enough. So smallint it is. Of course at 64k rows that’s a grand savings of 125kb. (~64 thousand rows * 2 bytes less than int / 1024). That’s not really that much hu? And for our sales districts table we could probably use that tinyint which will net us a grand total of 768 bytes. Not even 1 kb. (256 rows * 3 bytes less than int). So why bother?

Well, let’s say our multi billion row table is our fact table. Just by changing these two lookup values we’ve saved more than we lost by switching to bigint. Now try that with 10 or 20 lookup tables. The savings can add up pretty dramatically.

Category: Microsoft SQL Server, SQLServerPedia Syndication, T-SQL Tuesday | Tags: data types, T-SQL Tuesday

4 thoughts on “Right sizing int : T-SQL Tuesday #136”

Sean Redmond says:

March 10, 2021 at 12:54 AM

This is one of pet peeves with the developers: datatypes.

We have a table called, say, address.cantons (guess where I live) which has 28 entries. We are not expecting any new cantons in the medium term. The primary key was assigned an INT. The difference in size is so minute, I was told, that it was not relevant. Sure enough, for that table, the difference was 28×3 bytes. Not even a page.

Ah yes, I counter, but you forget that there are columns with a foreign-key constraint on that same key. So, for every million rows in the FK table, 3 million bytes are being added to the pages in the database for no good reason. 3 million bytes is almost 366 extra pages, which add nothing and reduce performance.

And this also applies to dates. Every table has an UpdatedDate column & a CreatedDate column. We don’t need accuracy at the level of 10^-8 seconds. We need it at minutes, maybe seconds. For this case, SMALLDATETIME is perfect. It is only 4 bytes (rather than 8 bytes for DATETIME2(7)). Since there are 2 columns, the savings are then 8 bytes per row and per million entries is almost a thousand pages.

This may seem pedantic but this is an easy win. Using the smallest datatypes that do the job means that SQL Server has fewer pages with which it needs to work with. The data is more efficiently stored and therefore more efficiently retrieved.

Kenneth Fisher says:

March 10, 2021 at 10:32 AM

I absolutely agree. And right sizing your datatypes is not just an int thing, it’s all datatypes. My only concern about using smalldatetime (usually) is that it ends in 2079 which while ~60 years away is only ~60 years away and may actually be an issue.

Reply
- Sean Redmond says:
  
  March 11, 2021 at 1:30 AM
  
  Even DATETIME2(0) is an improvement. And DATETIME2(0) would be suitable if timezones were a factor, but they are not.
  Our databases have a lifetime of about 10-15 years as a rule before the following product is released, so the 2079 deadline is not too much of a problem. The datatype just needs to be altered when the new database is being designed.
  
  Reply
  - Kenneth Fisher says:
    
    March 11, 2021 at 9:15 AM
    
    Certainly reasonable :). It really depends on what the date is being used for. If you are doing something like sales dates then 2079 is more than far enough out. End of life type dates for say insurance or other products may cause you a problem. It all comes down to knowing your date and your data types 🙂

Sean Redmond says:

March 10, 2021 at 12:54 AM

This is one of pet peeves with the developers: datatypes.

We have a table called, say, address.cantons (guess where I live) which has 28 entries. We are not expecting any new cantons in the medium term. The primary key was assigned an INT. The difference in size is so minute, I was told, that it was not relevant. Sure enough, for that table, the difference was 28×3 bytes. Not even a page.

Ah yes, I counter, but you forget that there are columns with a foreign-key constraint on that same key. So, for every million rows in the FK table, 3 million bytes are being added to the pages in the database for no good reason. 3 million bytes is almost 366 extra pages, which add nothing and reduce performance.

And this also applies to dates. Every table has an UpdatedDate column & a CreatedDate column. We don’t need accuracy at the level of 10^-8 seconds. We need it at minutes, maybe seconds. For this case, SMALLDATETIME is perfect. It is only 4 bytes (rather than 8 bytes for DATETIME2(7)). Since there are 2 columns, the savings are then 8 bytes per row and per million entries is almost a thousand pages.

This may seem pedantic but this is an easy win. Using the smallest datatypes that do the job means that SQL Server has fewer pages with which it needs to work with. The data is more efficiently stored and therefore more efficiently retrieved.

- Kenneth Fisher says:
  
  March 10, 2021 at 10:32 AM
  
  I absolutely agree. And right sizing your datatypes is not just an int thing, it’s all datatypes. My only concern about using smalldatetime (usually) is that it ends in 2079 which while ~60 years away is only ~60 years away and may actually be an issue.
  
  - Sean Redmond says:
    
    March 11, 2021 at 1:30 AM
    
    Even DATETIME2(0) is an improvement. And DATETIME2(0) would be suitable if timezones were a factor, but they are not.
    Our databases have a lifetime of about 10-15 years as a rule before the following product is released, so the 2079 deadline is not too much of a problem. The datatype just needs to be altered when the new database is being designed.
    
    - Kenneth Fisher says:
      
      March 11, 2021 at 9:15 AM
      
      Certainly reasonable :). It really depends on what the date is being used for. If you are doing something like sales dates then 2079 is more than far enough out. End of life type dates for say insurance or other products may cause you a problem. It all comes down to knowing your date and your data types 🙂