tim laqua dot com Thoughts and Code from Tim Laqua

15 Oct 2009

Estimating the Size of a Table in SQL Server 2008

I have read this (http://msdn.microsoft.com/en-us/library/ms178085.aspx) article at least 6, maybe 7 times in the past - and every time I say to myself "this is ridiculous - someone has to have written a script to do this by now," and every time I google for hours and fail to find anything. So I finally gave up and wrote something to do it. Note that I've only verified it on 100% fixed-width tables. I compared its output against a 600+ million row table and it came out somewhere around 3% high - fine with me, as I'd rather over-estimate space requirements than under-estimate.

First, we have to create a table type to hold our table column definitions. The data type is required; Length is required for any type that allows a length (MAX is not handled for any type), and the worst-case size is always used - so if you know you won't be filling those VARCHAR(255)s, feel free to use a smaller, more average number. IsKey is a BIT value that tells the proc to use that column as part of the index key. The rest is all calculated.

CREATE TYPE TableColumnDefinition AS TABLE
	([DataType] varchar(16) NOT NULL
	,[Length] int
	,[IsKey] bit
	,[DataLength] AS
		CASE 
			WHEN [DataType] = 'BIT' THEN 0.125
			WHEN [DataType] = 'TINYINT' THEN 1
			WHEN [DataType] = 'SMALLINT' THEN 2
			WHEN [DataType] IN ('INT', 'SMALLMONEY', 'REAL', 'SMALLDATETIME') THEN 4
			WHEN [DataType] IN ('BIGINT', 'DATETIME', 'MONEY') THEN 8
			WHEN [DataType] = 'UNIQUEIDENTIFIER' THEN 16
			WHEN [DataType] = 'NVARCHAR' THEN [Length] * 2 + 2
			WHEN [DataType] = 'VARCHAR' THEN [Length] + 2	-- DOESN'T SUPPORT MAX
			WHEN [Datatype] = 'DATE' THEN 3 
			WHEN [Datatype] = 'TIME' THEN 5 -- Worst case
			WHEN [Datatype] = 'DATETIME2' THEN 8 -- Worst case
			WHEN [Datatype] = 'DATETIMEOFFSET' THEN 10 -- Worst case
			WHEN [DataType] = 'CHAR' THEN [Length]
			WHEN [DataType] = 'NCHAR' THEN [Length] * 2 + 2
			WHEN [DataType] = 'BINARY' THEN [Length]
			WHEN [DataType] = 'VARBINARY' THEN [Length] + 2
			WHEN [Datatype] = 'TIMESTAMP' THEN 8
			WHEN [DataType] = 'XML' THEN [Length]	-- Estimate an average size in bytes, 2GB max
			WHEN [DataType] IN ('DECIMAL', 'NUMERIC') THEN 
				CASE 
					WHEN [Length] BETWEEN 1 AND 9	THEN 5
					WHEN [Length] BETWEEN 10 AND 19 THEN 9
					WHEN [Length] BETWEEN 20 AND 28 THEN 13
					WHEN [Length] BETWEEN 29 AND 38 THEN 17
				END
			WHEN [DataType] = 'FLOAT' THEN
				CASE
					WHEN [Length] BETWEEN 1 AND 24 THEN 4
					WHEN [Length] BETWEEN 25 AND 53 THEN 8
					WHEN [Length] IS NULL THEN 8	-- Default n is 53
				END
			ELSE 0
		END
	,[IsVariableLength] AS
		CASE 
			WHEN [DataType] IN ('VARCHAR', 'NVARCHAR', 'VARBINARY') THEN 1
			ELSE 0
		END)
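If you want to sanity-check the CASE expression outside SQL Server, here is a quick Python sketch of the same worst-case byte sizes (the function name and structure are mine, not part of the type definition):

```python
# Worst-case on-disk byte sizes mirroring the [DataLength] CASE expression above.
# length is the declared length, or the precision for DECIMAL/NUMERIC/FLOAT.
FIXED_SIZES = {
    'BIT': 0.125, 'TINYINT': 1, 'SMALLINT': 2,
    'INT': 4, 'SMALLMONEY': 4, 'REAL': 4, 'SMALLDATETIME': 4,
    'BIGINT': 8, 'DATETIME': 8, 'MONEY': 8, 'TIMESTAMP': 8,
    'UNIQUEIDENTIFIER': 16,
    'DATE': 3, 'TIME': 5, 'DATETIME2': 8, 'DATETIMEOFFSET': 10,  # worst cases
}

def data_length(data_type, length=None):
    """Worst-case storage size in bytes for one column, as the proc sees it."""
    data_type = data_type.upper()
    if data_type in FIXED_SIZES:
        return FIXED_SIZES[data_type]
    if data_type in ('CHAR', 'BINARY', 'XML'):
        return length                     # XML: caller supplies an average byte estimate
    if data_type in ('NVARCHAR', 'NCHAR'):
        return length * 2 + 2             # as in the CASE; the +2 slightly over-estimates NCHAR
    if data_type in ('VARCHAR', 'VARBINARY'):
        return length + 2                 # 2 bytes of variable-length overhead; MAX unsupported
    if data_type in ('DECIMAL', 'NUMERIC'):
        for max_precision, size in ((9, 5), (19, 9), (28, 13), (38, 17)):
            if length <= max_precision:
                return size               # storage steps up with precision
    if data_type == 'FLOAT':
        return 4 if (length or 53) <= 24 else 8  # default n is 53
    return 0
```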

Now for the proc, which attempts to replicate the methodology outlined in the aforementioned MSDN article. The results came out at or over the actual value, which was exactly what I needed, so I went with it and moved on. I'm completely open to corrections, suggestions, improvements, etc. - just leave a comment and I'll integrate them (assuming they work ;-).

/*************************************************************************************
***
*** Procedure:  		[Utility_EstimateClusteredIndexSize]
*** Purpose:		    Estimates the amount of space needed for a given clustered index
*** Methodology:		http://msdn.microsoft.com/en-us/library/ms178085.aspx
***			
***			
*** Author:		      Tim Laqua
*** Date Created:	  2009-10-15 
*** 
*** Revision History
*** Date		Author			Description
*** 2009-10-15	tl				Created
*************************************************************************************/
CREATE PROCEDURE [dbo].[Utility_EstimateClusteredIndexSize]
(
	 @Num_Rows FLOAT
	,@ClusteredIndexIsUnique BIT
	,@Fill_Factor FLOAT
	,@TableColumn TableColumnDefinition READONLY
)
AS 
 
SET NOCOUNT ON
 
DECLARE
	 @Num_Cols FLOAT = 0					-- total number of columns (fixed-length and variable-length)
	,@Fixed_Data_Size FLOAT = 0			-- total byte size of all fixed-length columns
	,@Num_Variable_Cols FLOAT	= 0			-- number of variable-length columns
	,@Max_Var_Size FLOAT	= 0				-- maximum byte size of all variable-length columns
	,@Variable_Data_Size FLOAT = 0
 
DECLARE @Var TABLE
	([Name] VARCHAR(64)
	,[Value] FLOAT
	,[DisplayValue] AS
		CASE
			WHEN LOWER([Name]) LIKE '%size%' OR LOWER([Name]) LIKE '%space%' THEN
				CASE 
					WHEN [Value] < 1024 THEN CAST(CAST(ROUND([Value], 3) AS NUMERIC(16,0)) AS VARCHAR(16)) + ' B'
					WHEN [Value] < 1024*1024 THEN CAST(CAST(ROUND([Value]/1024, 3) AS NUMERIC(16,3)) AS VARCHAR(16)) + ' KB'
					WHEN [Value] < 1024*1024*1024 THEN CAST(CAST(ROUND([Value]/1024/1024, 3) AS NUMERIC(16,3)) AS VARCHAR(16)) + ' MB'
					ELSE CAST(CAST(ROUND([Value]/1024/1024/1024, 3) AS NUMERIC(16,3)) AS VARCHAR(16)) + ' GB'
				END
			ELSE CAST(FLOOR([Value]) AS VARCHAR(16))
		END)
 
 
-- Leaf level data
SELECT 
	 @Num_Variable_Cols = COUNT(1)
	,@Max_Var_Size = ISNULL(SUM([DataLength]), 0)
FROM @TableColumn WHERE [IsVariableLength] = 1
 
SELECT 
	 @Num_Cols = COUNT(1) + @Num_Variable_Cols
	,@Fixed_Data_Size = CEILING(ISNULL(SUM([DataLength]), 0))
FROM @TableColumn WHERE [IsVariableLength] = 0
 
 
INSERT INTO @Var VALUES ('@Fixed_Data_Size', @Fixed_Data_Size)
 
IF @ClusteredIndexIsUnique <> 0
BEGIN
	SET @Num_Cols += 1
	IF @Num_Variable_Cols > 0
		SET @Num_Variable_Cols += 1
	IF @Max_Var_Size > 0
		SET @Max_Var_Size += 4
END
 
DECLARE @Null_Bitmap INT = CEILING(2 + ((@Num_Cols + 7) / 8))
INSERT INTO @Var VALUES ('@Null_Bitmap', @Null_Bitmap)
 
-- if there are variable columns
IF @Num_Variable_Cols > 0
	SET @Variable_Data_Size = 2 + (@Num_Variable_Cols * 2) + @Max_Var_Size 
INSERT INTO @Var VALUES ('@Num_Cols', @Num_Cols)
INSERT INTO @Var VALUES ('@Num_Variable_Cols', @Num_Variable_Cols)
INSERT INTO @Var VALUES ('@Max_Var_Size', @Max_Var_Size)
 
DECLARE @Row_Size FLOAT = CEILING(@Fixed_Data_Size + @Variable_Data_Size + @Null_Bitmap + 4 )
INSERT INTO @Var VALUES ('@Row_Size', @Row_Size)
 
DECLARE @Rows_Per_Page FLOAT = FLOOR(8096 / (@Row_Size + 2))
INSERT INTO @Var VALUES ('@Rows_Per_Page', @Rows_Per_Page)
 
DECLARE @Free_Rows_Per_Page FLOAT = FLOOR(8096 * ((100 - @Fill_Factor) / 100) / (@Row_Size + 2) )
INSERT INTO @Var VALUES ('@Free_Rows_Per_Page', @Free_Rows_Per_Page)
 
DECLARE @Num_Leaf_Pages FLOAT = CEILING(@Num_Rows / (@Rows_Per_Page - @Free_Rows_Per_Page) )
INSERT INTO @Var VALUES ('@Num_Leaf_Pages', @Num_Leaf_Pages)
 
DECLARE @Leaf_space_used FLOAT = 8192 * @Num_Leaf_Pages
INSERT INTO @Var VALUES ('@Leaf_space_used', @Leaf_space_used)
 
-- Index Information
DECLARE
	 @Num_Key_Cols FLOAT = 0			-- total number of key columns (fixed-length and variable-length)
	,@Fixed_Key_Size FLOAT = 0			-- total byte size of all fixed-length key columns
	,@Num_Variable_Key_Cols FLOAT = 0	-- number of variable-length key columns
	,@Max_Var_Key_Size FLOAT = 0		-- maximum byte size of all variable-length key columns
 
 
SELECT 
	 @Num_Variable_Key_Cols = COUNT(1)
	,@Max_Var_Key_Size = ISNULL(SUM([DataLength]), 0)
FROM @TableColumn 
WHERE 
	[IsVariableLength] = 1 
	AND [IsKey] = 1
 
SELECT 
	 @Num_Key_Cols = COUNT(1) + @Num_Variable_Key_Cols
	,@Fixed_Key_Size = ISNULL(SUM([DataLength]), 0)
FROM @TableColumn 
WHERE 
	[IsVariableLength] = 0
	AND [IsKey] = 1
 
IF @ClusteredIndexIsUnique <> 0
BEGIN
	SET @Num_Key_Cols += 1
	IF @Num_Variable_Key_Cols > 0
		SET @Num_Variable_Key_Cols += 1
	IF @Max_Var_Key_Size > 0
		SET @Max_Var_Key_Size += 4
END
INSERT INTO @Var VALUES ('@Num_Variable_Key_Cols', @Num_Variable_Key_Cols)
INSERT INTO @Var VALUES ('@Max_Var_Key_Size', @Max_Var_Key_Size)
INSERT INTO @Var VALUES ('@Num_Key_Cols', @Num_Key_Cols)
 
-- Who would seriously have a nullable key column?  ridiculous.
DECLARE @Index_Null_Bitmap FLOAT = 2 + ((@Num_Key_Cols + 7) / 8) 
INSERT INTO @Var VALUES ('@Index_Null_Bitmap', @Index_Null_Bitmap)
 
DECLARE @Variable_Key_Size FLOAT = 0
 
IF @Max_Var_Key_Size > 0
	SET @Variable_Key_Size = 2 + (@Num_Variable_Key_Cols * 2) + @Max_Var_Key_Size 
INSERT INTO @Var VALUES ('@Variable_Key_Size', @Variable_Key_Size)
 
DECLARE @Index_Row_Size FLOAT = @Fixed_Key_Size + @Variable_Key_Size + @Index_Null_Bitmap + 1 + 6
INSERT INTO @Var VALUES ('@Index_Row_Size',@Index_Row_Size )
 
DECLARE @Index_Rows_Per_Page  FLOAT = FLOOR(8096 / (@Index_Row_Size + 2))
INSERT INTO @Var VALUES ('@Index_Rows_Per_Page', @Index_Rows_Per_Page)
 
DECLARE @Non_Leaf_Levels FLOAT = 1
 
-- Log base @Non_Leaf_Levels hack
WHILE (@Num_Leaf_Pages / @Index_Rows_Per_Page) > POWER(@Index_Rows_Per_Page,@Non_Leaf_Levels)
	SET @Non_Leaf_Levels += 1
 
SET @Non_Leaf_Levels += 1
INSERT INTO @Var VALUES ('@Non_Leaf_Levels', @Non_Leaf_Levels)
 
-- Summation hack
DECLARE 
	 @Num_Index_Pages FLOAT = 0
	,@Summand INT = @Non_Leaf_Levels
 
WHILE @Summand > 0
BEGIN
	SET @Num_Index_Pages += CEILING(@Num_Leaf_Pages/POWER(@Index_Rows_Per_Page, @Summand))
	SET @Summand -= 1
END
INSERT INTO @Var VALUES ('@Num_Index_Pages', @Num_Index_Pages)
 
DECLARE @Index_Space_Used FLOAT = 8192 * @Num_Index_Pages 
INSERT INTO @Var VALUES ('@Index_Space_Used', @Index_Space_Used)
 
DECLARE @Clustered_Index_Size FLOAT = @Leaf_Space_Used + @Index_Space_used
INSERT INTO @Var VALUES ('@Clustered_Index_Size', @Clustered_Index_Size)
 
SELECT * FROM @Var

Now for an example of how to run this doodad. This particular estimate was for a staging table - I needed the UNIQUEIDENTIFIER as the key, and the CreatedTime field had to be part of the clustered index for the partition scheme.

DECLARE @TableColumn TableColumnDefinition
 
INSERT INTO @TableColumn
VALUES 
	 ('UNIQUEIDENTIFIER', NULL, 1)
	,('DATETIME', NULL, 1)
	,('INT', NULL, 0)
	,('INT', NULL, 0)
	,('INT', NULL, 0)
	,('INT', NULL, 0)
	,('INT', NULL, 0)
	,('INT', NULL, 0)
	,('INT', NULL, 0)
	,('INT', NULL, 0)
	,('INT', NULL, 0)
 
EXEC [dbo].[Utility_EstimateClusteredIndexSize]	
 @Num_Rows = 750000		-- number of rows in the table
,@ClusteredIndexIsUnique = 1 	-- is it?
,@Fill_Factor = 90			-- page fill factor
,@TableColumn = @TableColumn	-- definition of table columns

As for the output, I wanted it to emit all the variables used so I could validate each step against the manual method. The results look something like this:

Name                                                             Value                  DisplayValue
---------------------------------------------------------------- ---------------------- -------------------
@Fixed_Data_Size                                                 60                     60 B
@Null_Bitmap                                                     5                      5
@Num_Cols                                                        12                     12
@Num_Variable_Cols                                               0                      0
@Max_Var_Size                                                    0                      0 B
@Row_Size                                                        69                     69 B
@Rows_Per_Page                                                   114                    114
@Free_Rows_Per_Page                                              11                     11
@Num_Leaf_Pages                                                  7282                   7282
@Leaf_space_used                                                 59654144               56.891 MB
@Num_Variable_Key_Cols                                           0                      0
@Max_Var_Key_Size                                                0                      0 B
@Num_Key_Cols                                                    3                      3
@Index_Null_Bitmap                                               3.25                   3
@Variable_Key_Size                                               0                      0 B
@Index_Row_Size                                                  34.25                  34 B
@Index_Rows_Per_Page                                             223                    223
@Non_Leaf_Levels                                                 2                      2
@Num_Index_Pages                                                 34                     34
@Index_Space_Used                                                278528                 272.000 KB
@Clustered_Index_Size                                            59932672               57.156 MB
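The arithmetic behind that output is easy to double-check outside SQL Server. Here is a sketch of the same calculation in Python, with variable names mirroring the proc's and the inputs taken from the sample run above:

```python
from math import ceil, floor

# Leaf level: 11 fixed-width columns totalling 60 bytes, unique clustered
# index, fill factor 90, 750,000 rows (the sample run above).
num_rows, fill_factor = 750_000, 90
fixed_data_size = 60
num_cols = 11 + 1                                 # +1: the proc counts a uniqueifier slot
null_bitmap = ceil(2 + (num_cols + 7) / 8)        # the proc rounds the whole expression up
row_size = fixed_data_size + 0 + null_bitmap + 4  # no variable-length columns
rows_per_page = floor(8096 / (row_size + 2))
free_rows_per_page = floor(8096 * ((100 - fill_factor) / 100) / (row_size + 2))
num_leaf_pages = ceil(num_rows / (rows_per_page - free_rows_per_page))
leaf_space_used = 8192 * num_leaf_pages

# Non-leaf levels: 3 key columns (2 declared + uniqueifier), 24 bytes fixed.
num_key_cols, fixed_key_size = 3, 24
index_null_bitmap = 2 + (num_key_cols + 7) / 8    # 3.25 - not rounded up here
index_row_size = fixed_key_size + 0 + index_null_bitmap + 1 + 6
index_rows_per_page = floor(8096 / (index_row_size + 2))

# The "log base hack": how many non-leaf levels the B-tree needs.
non_leaf_levels = 1
while num_leaf_pages / index_rows_per_page > index_rows_per_page ** non_leaf_levels:
    non_leaf_levels += 1
non_leaf_levels += 1

# The "summation hack": number of pages at each non-leaf level.
num_index_pages = sum(ceil(num_leaf_pages / index_rows_per_page ** level)
                      for level in range(1, non_leaf_levels + 1))
index_space_used = 8192 * num_index_pages
clustered_index_size = leaf_space_used + index_space_used

print(row_size, rows_per_page, num_leaf_pages, clustered_index_size)
# prints: 69 114 7282 59932672
```

Note that the five-byte null bitmap is one byte larger than a strict reading of the MSDN formula (which takes the integer part of 2 + ((Num_Cols + 7) / 8) rather than rounding the whole thing up) - one of the reasons the script tends to come out a little over the real size.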

So now you may ask, "but what about estimating non-clustered indexes and heaps?" Good question, as this is strictly the methodology for estimating the size of a clustered index. I ran this for a fact table with two non-clustered indexes in addition to the clustered index by simply defining each index as if it were a table for the purposes of this proc (included columns become non-key columns). All in all, it came out +3.31% on 600+ million rows, so I called the methodology good enough 😉

Comments (11) Trackbacks (0)
  1. Hi,

    I tried to run this and it's giving many errors. First I executed TableColumnDefinition and then I executed the procedure; it's throwing many exceptions. Can you please tell me how to run this?

    Thanks in advance.

    • Sure, if you can let me know what parameters you’re passing to the procedure, I can take a look and try to figure out why it’s not working as expected

  2. Hi Tim,
    How does the stored procedure Utility_EstimateClusteredIndexSize know how many indexes I have on a specific table?
    How does it calculate a table with six indexes (one clustered and five non-clustered)?
    What about partitioned tables?
    Thanks,
    Oshik.

    • For the multiple indexes, estimate each one individually. For non-clustered indexes, I just pretend it’s a table and included columns are non-key columns – it seems to come out pretty close in my tests. (You can just test it out against existing indexes). As for partitioned tables and partition-aligned indexes, I’ve never done anything special to account for the fact that they’re partitioned. As a matter of fact, all the examples I mentioned above were indeed partitioned tables and partition-aligned indexes.

    • Hi Tim,
      I think I'm missing something, so let me clarify my request:
      I have a partitioned table with 600 million rows and 4 indexes.
      The first is a clustered index (three key columns),
      the second a non-clustered index (non-unique, one column),
      the third a non-clustered index (non-unique, one column),
      and the fourth a non-clustered index (non-unique, one column).
      My table has 20 columns (varchar, int, datetime, etc.).
      Do you mean I need to create a TYPE (setting IsKey to 1 for each index's columns) with my table definition and run it four times?
      Thanks,
      Oshik.

      • Yes, you have to run it four times (once for each index).

        The Clustered Index needs all the columns specified, because the clustered index basically IS the table (that’s why you can’t have “included columns” in clustered indexes). As for the nonclustered indexes, you only need to specify one key column for each – if you don’t have any included columns in the nonclustered indexes, you don’t need anything other than the key column(s). Also note, for your non-unique nonclustered indexes, set @ClusteredIndexIsUnique to 0 – it should really be called @IsIndexUnique.

        We’re not really calculating the size of a table per-se because a table is simply the sum of the size of its indexes (let’s not talk about heaps – there’s no index there, but it certainly takes up space) – we’re just calculating the size of each index.

  3. This is great Tim. Thanks for making this calculation so much easier.

  4. Thanks for the script. Assuming I’m not missing something, I think you can simplify generating the rows for the INSERT INTO @TableColumn with the following query:

    DECLARE @TableColumn TableColumnDefinition

    DECLARE
    @table_name sysname
    , @schema_name sysname
    SELECT
    @table_name = 'TerribleTable'
    , @schema_name = 'mySchema'

    ; WITH TABLE_COLUMNS AS
    (
    SELECT
    T.name AS [DataType]
    , SC.max_length AS [Length]
    , SC.name As column_name
    , ST.name AS table_name
    , schema_name(ST.schema_id) AS [schema_name]
    , ST.object_id
    FROM
    sys.tables ST
    INNER JOIN
    sys.columns SC
    ON SC.object_id = ST.object_id
    INNER JOIN
    sys.types T
    ON T.system_type_id = SC.system_type_id
    )
    , PRIMARY_KEYS AS
    (
    SELECT DISTINCT
    I.object_id
    , IC.column_id
    , I.fill_factor
    , I.is_primary_key
    , C.name AS column_name
    FROM
    sys.indexes I
    INNER JOIN
    sys.index_columns IC
    ON IC.object_id = I.object_id
    AND IC.index_id = I.index_id
    INNER JOIN
    sys.columns C
    ON C.object_id = I.object_id
    AND C.column_id = IC.column_id
    WHERE
    I.is_primary_key = 1
    )
    INSERT INTO @TableColumn
    SELECT
    TC.[DataType]
    , TC.[Length]
    , PK.is_primary_key AS [IsKey]
    FROM
    TABLE_COLUMNS TC
    LEFT OUTER JOIN
    PRIMARY_KEYS PK
    ON PK.object_id = TC.object_id
    AND PK.column_name = TC.column_name
    WHERE
    TC.table_name = @table_name
    AND TC.[schema_name] = @schema_name

  5. sp_spaceused or sp_helpdb is also a good way to measure table growth, and from there you can basically determine growth levels in an exponential manner.

  6. I believe the script has a bug regarding non-unique clustered indexes.
    If I am not missing something here, the following IF condition should use ( = 0 ) instead of ( <> 0 ). Does that make sense to you?

    IF @ClusteredIndexIsUnique <> 0
    BEGIN
    SET @Num_Key_Cols += 1
    IF @Num_Variable_Key_Cols > 0
    SET @Num_Variable_Key_Cols += 1
    IF @Max_Var_Key_Size > 0
    SET @Max_Var_Key_Size += 4
    END

