Estimating the Size of a Table in SQL Server 2008
I have read this article (http://msdn.microsoft.com/en-us/library/ms178085.aspx) at least six, maybe seven times, and every time I tell myself "this is ridiculous; someone has to have written a script to do this by now." And every time, I google for hours and come up empty. So I finally gave up and wrote something to do it myself. Note that I've only verified it against 100% fixed-width tables. I compared its output to a 600+ million row table and it came out roughly 3% high, which is fine with me: I'd rather over-estimate space requirements than under-estimate them.
First, we have to create a table type to hold our table column definitions. The data type is required; Length is required for anything that takes a length (MAX is not handled for any type); and the worst-case size is always used, so if you know you won't be filling those VARCHAR(255)s, feel free to use a smaller, more representative number. IsKey is a BIT that tells the proc to treat the column as part of the index key. The rest is all computed.
CREATE TYPE TableColumnDefinition AS TABLE
(
    [DataType] varchar(16) NOT NULL
    ,[Length] int
    ,[IsKey] bit
    ,[DataLength] AS
        CASE
            WHEN [DataType] = 'BIT' THEN 0.125
            WHEN [DataType] = 'TINYINT' THEN 1
            WHEN [DataType] = 'SMALLINT' THEN 2
            WHEN [DataType] IN ('INT', 'SMALLMONEY', 'REAL', 'SMALLDATETIME') THEN 4
            WHEN [DataType] IN ('BIGINT', 'DATETIME', 'MONEY') THEN 8
            WHEN [DataType] = 'UNIQUEIDENTIFIER' THEN 16
            WHEN [DataType] = 'NVARCHAR' THEN [Length] * 2 + 2
            WHEN [DataType] = 'VARCHAR' THEN [Length] + 2 -- DOESN'T SUPPORT MAX
            WHEN [DataType] = 'DATE' THEN 3
            WHEN [DataType] = 'TIME' THEN 5 -- Worst case
            WHEN [DataType] = 'DATETIME2' THEN 8 -- Worst case
            WHEN [DataType] = 'DATETIMEOFFSET' THEN 10 -- Worst case
            WHEN [DataType] = 'CHAR' THEN [Length]
            WHEN [DataType] = 'NCHAR' THEN [Length] * 2 -- Fixed-length, so no 2-byte offset overhead
            WHEN [DataType] = 'BINARY' THEN [Length]
            WHEN [DataType] = 'VARBINARY' THEN [Length] + 2
            WHEN [DataType] = 'TIMESTAMP' THEN 8
            WHEN [DataType] = 'XML' THEN [Length] -- Estimate an average size in bytes, 2GB max
            WHEN [DataType] IN ('DECIMAL', 'NUMERIC') THEN
                CASE
                    WHEN [Length] BETWEEN 1 AND 9 THEN 5
                    WHEN [Length] BETWEEN 10 AND 19 THEN 9
                    WHEN [Length] BETWEEN 20 AND 28 THEN 13
                    WHEN [Length] BETWEEN 29 AND 38 THEN 17
                END
            WHEN [DataType] = 'FLOAT' THEN
                CASE
                    WHEN [Length] BETWEEN 1 AND 24 THEN 4
                    WHEN [Length] BETWEEN 25 AND 53 THEN 8
                    WHEN [Length] IS NULL THEN 8 -- Default n is 53
                END
            ELSE 0
        END
    ,[IsVariableLength] AS
        CASE
            WHEN [DataType] IN ('VARCHAR', 'NVARCHAR', 'VARBINARY') THEN 1
            ELSE 0
        END
)
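As a quick sanity check, you can exercise the computed columns before using the type for real; the expected values in the comments below just trace the CASE expression above:

-- Sanity check: DataLength and IsVariableLength are computed from DataType and Length
DECLARE @Check TableColumnDefinition
INSERT INTO @Check
VALUES
    ('VARCHAR', 50, 0)     -- expect DataLength = 52 (50 + 2 bytes of overhead), IsVariableLength = 1
    ,('DECIMAL', 18, 1)    -- expect DataLength = 9 (precision 10-19), IsVariableLength = 0
    ,('DATETIME', NULL, 1) -- expect DataLength = 8, IsVariableLength = 0
SELECT [DataType], [Length], [DataLength], [IsVariableLength] FROM @Check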
Now for the proc that attempts to replicate the methodology outlined in the aforementioned MSDN article. The results came out at or slightly above the actual values, which was exactly what I needed, so I went with it and moved on. I'm completely open to corrections, suggestions, improvements, etc. Just leave a comment and I'll integrate them (assuming they work ;-).
/*************************************************************************************
***
*** Procedure: [Utility_EstimateClusteredIndexSize]
*** Purpose: Estimates the amount of space needed for a given clustered index
*** Methodology: http://msdn.microsoft.com/en-us/library/ms178085.aspx
***
***
*** Author: Tim Laqua
*** Date Created: 2009-10-15
***
*** Revision History
*** Date Author Description
*** 2009-10-15 tl Created
*************************************************************************************/
CREATE PROCEDURE [dbo].[Utility_EstimateClusteredIndexSize]
(
@Num_Rows FLOAT
,@ClusteredIndexIsUnique BIT
,@Fill_Factor FLOAT
,@TableColumn TableColumnDefinition READONLY
)
AS
SET NOCOUNT ON
DECLARE
@Num_Cols FLOAT = 0 -- total number of columns (fixed-length and variable-length)
,@Fixed_Data_Size FLOAT = 0 -- total byte size of all fixed-length columns
,@Num_Variable_Cols FLOAT = 0 -- number of variable-length columns
,@Max_Var_Size FLOAT = 0 -- maximum byte size of all variable-length columns
,@Variable_Data_Size FLOAT = 0
DECLARE @Var TABLE
(
    [Name] VARCHAR(64)
    ,[Value] FLOAT
    ,[DisplayValue] AS
        CASE
            WHEN LOWER([Name]) LIKE '%size%' OR LOWER([Name]) LIKE '%space%' THEN
                CASE
                    WHEN [Value] < 1024 THEN CAST(CAST(ROUND([Value], 3) AS NUMERIC(16,0)) AS VARCHAR(16)) + ' B'
                    WHEN [Value] < 1024*1024 THEN CAST(CAST(ROUND([Value]/1024, 3) AS NUMERIC(16,3)) AS VARCHAR(16)) + ' KB'
                    WHEN [Value] < 1024*1024*1024 THEN CAST(CAST(ROUND([Value]/1024/1024, 3) AS NUMERIC(16,3)) AS VARCHAR(16)) + ' MB'
                    ELSE CAST(CAST(ROUND([Value]/1024/1024/1024, 3) AS NUMERIC(16,3)) AS VARCHAR(16)) + ' GB'
                END
            ELSE CAST(FLOOR([Value]) AS VARCHAR(16))
        END
)
-- Leaf level data
SELECT
@Num_Variable_Cols = COUNT(1)
,@Max_Var_Size = ISNULL(SUM([DataLength]), 0)
FROM @TableColumn WHERE [IsVariableLength] = 1
SELECT
@Num_Cols = COUNT(1) + @Num_Variable_Cols
,@Fixed_Data_Size = CEILING(ISNULL(SUM([DataLength]), 0))
FROM @TableColumn WHERE [IsVariableLength] = 0
INSERT INTO @Var VALUES ('@Fixed_Data_Size', @Fixed_Data_Size)
-- Per the MSDN methodology, a non-unique clustered index carries a 4-byte uniqueifier,
-- which counts as an extra variable-length column
IF @ClusteredIndexIsUnique = 0
BEGIN
    SET @Num_Cols += 1
    SET @Num_Variable_Cols += 1
    SET @Max_Var_Size += 4
END
DECLARE @Null_Bitmap INT = CEILING(2 + ((@Num_Cols + 7) / 8)) -- MSDN truncates the remainder; CEILING errs toward over-estimation
INSERT INTO @Var VALUES ('@Null_Bitmap', @Null_Bitmap)
-- if there are variable columns
IF @Num_Variable_Cols > 0
SET @Variable_Data_Size = 2 + (@Num_Variable_Cols * 2) + @Max_Var_Size
INSERT INTO @Var VALUES ('@Num_Cols', @Num_Cols)
INSERT INTO @Var VALUES ('@Num_Variable_Cols', @Num_Variable_Cols)
INSERT INTO @Var VALUES ('@Max_Var_Size', @Max_Var_Size)
DECLARE @Row_Size FLOAT = CEILING(@Fixed_Data_Size + @Variable_Data_Size + @Null_Bitmap + 4 )
INSERT INTO @Var VALUES ('@Row_Size', @Row_Size)
DECLARE @Rows_Per_Page FLOAT = FLOOR(8096 / (@Row_Size + 2))
INSERT INTO @Var VALUES ('@Rows_Per_Page', @Rows_Per_Page)
DECLARE @Free_Rows_Per_Page FLOAT = FLOOR(8096 * ((100 - @Fill_Factor) / 100) / (@Row_Size + 2) )
INSERT INTO @Var VALUES ('@Free_Rows_Per_Page', @Free_Rows_Per_Page)
DECLARE @Num_Leaf_Pages FLOAT = CEILING(@Num_Rows / (@Rows_Per_Page - @Free_Rows_Per_Page) )
INSERT INTO @Var VALUES ('@Num_Leaf_Pages', @Num_Leaf_Pages)
DECLARE @Leaf_space_used FLOAT = 8192 * @Num_Leaf_Pages
INSERT INTO @Var VALUES ('@Leaf_space_used', @Leaf_space_used)
-- Index Information
DECLARE
@Num_Key_Cols FLOAT = 0 -- total number of key columns (fixed-length and variable-length)
,@Fixed_Key_Size FLOAT = 0 -- total byte size of all fixed-length key columns
,@Num_Variable_Key_Cols FLOAT = 0 -- number of variable-length key columns
,@Max_Var_Key_Size FLOAT = 0 -- maximum byte size of all variable-length key columns
SELECT
@Num_Variable_Key_Cols = COUNT(1)
,@Max_Var_Key_Size = ISNULL(SUM([DataLength]), 0)
FROM @TableColumn
WHERE
[IsVariableLength] = 1
AND [IsKey] = 1
SELECT
@Num_Key_Cols = COUNT(1) + @Num_Variable_Key_Cols
,@Fixed_Key_Size = ISNULL(SUM([DataLength]), 0)
FROM @TableColumn
WHERE
[IsVariableLength] = 0
AND [IsKey] = 1
-- Same uniqueifier adjustment for the key of a non-unique clustered index
IF @ClusteredIndexIsUnique = 0
BEGIN
    SET @Num_Key_Cols += 1
    SET @Num_Variable_Key_Cols += 1
    SET @Max_Var_Key_Size += 4
END
INSERT INTO @Var VALUES ('@Num_Variable_Key_Cols', @Num_Variable_Key_Cols)
INSERT INTO @Var VALUES ('@Max_Var_Key_Size', @Max_Var_Key_Size)
INSERT INTO @Var VALUES ('@Num_Key_Cols', @Num_Key_Cols)
-- Who would seriously have a nullable key column? Ridiculous.
DECLARE @Index_Null_Bitmap FLOAT = 2 + ((@Num_Key_Cols + 7) / 8) -- MSDN truncates here too; keeping the fraction slightly over-estimates
INSERT INTO @Var VALUES ('@Index_Null_Bitmap', @Index_Null_Bitmap)
DECLARE @Variable_Key_Size FLOAT = 0
IF @Max_Var_Key_Size > 0
SET @Variable_Key_Size = 2 + (@Num_Variable_Key_Cols * 2) + @Max_Var_Key_Size
INSERT INTO @Var VALUES ('@Variable_Key_Size', @Variable_Key_Size)
DECLARE @Index_Row_Size FLOAT = @Fixed_Key_Size + @Variable_Key_Size + @Index_Null_Bitmap + 1 + 6
INSERT INTO @Var VALUES ('@Index_Row_Size',@Index_Row_Size )
DECLARE @Index_Rows_Per_Page FLOAT = FLOOR(8096 / (@Index_Row_Size + 2))
INSERT INTO @Var VALUES ('@Index_Rows_Per_Page', @Index_Rows_Per_Page)
DECLARE @Non_Leaf_Levels FLOAT = 1
-- Log base @Non_Leaf_Levels hack
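-- (i.e., 1 + CEILING(log base @Index_Rows_Per_Page of (@Num_Leaf_Pages / @Index_Rows_Per_Page)),
--  computed by looping until the tree fans out far enough instead of calling LOG())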
WHILE (@Num_Leaf_Pages / @Index_Rows_Per_Page) > POWER(@Index_Rows_Per_Page,@Non_Leaf_Levels)
SET @Non_Leaf_Levels += 1
SET @Non_Leaf_Levels += 1
INSERT INTO @Var VALUES ('@Non_Leaf_Levels', @Non_Leaf_Levels)
-- Summation hack
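-- @Num_Index_Pages = SUM for Level = 1..@Non_Leaf_Levels of CEILING(@Num_Leaf_Pages / POWER(@Index_Rows_Per_Page, Level))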
DECLARE
@Num_Index_Pages FLOAT = 0
,@Summand INT = @Non_Leaf_Levels
WHILE @Summand > 0
BEGIN
SET @Num_Index_Pages += CEILING(@Num_Leaf_Pages/POWER(@Index_Rows_Per_Page, @Summand))
SET @Summand -= 1
END
INSERT INTO @Var VALUES ('@Num_Index_Pages', @Num_Index_Pages)
DECLARE @Index_Space_Used FLOAT = 8192 * @Num_Index_Pages
INSERT INTO @Var VALUES ('@Index_Space_Used', @Index_Space_Used)
DECLARE @Clustered_Index_Size FLOAT = @Leaf_space_used + @Index_Space_Used
INSERT INTO @Var VALUES ('@Clustered_Index_Size', @Clustered_Index_Size)
SELECT * FROM @Var
Now for an example of how to run this doodad. This particular estimate was for a staging table: I needed the UNIQUEIDENTIFIER as the key, and the CreatedTime field had to be part of the clustered index for the partition scheme.
DECLARE @TableColumn TableColumnDefinition
INSERT INTO @TableColumn
VALUES
('UNIQUEIDENTIFIER', NULL, 1)
,('DATETIME', NULL, 1)
,('INT', NULL, 0)
,('INT', NULL, 0)
,('INT', NULL, 0)
,('INT', NULL, 0)
,('INT', NULL, 0)
,('INT', NULL, 0)
,('INT', NULL, 0)
,('INT', NULL, 0)
,('INT', NULL, 0)
EXEC [dbo].[Utility_EstimateClusteredIndexSize]
@Num_Rows = 750000 -- number of rows in the table
,@ClusteredIndexIsUnique = 1 -- is it?
,@Fill_Factor = 90 -- page fill factor
,@TableColumn = @TableColumn -- definition of table columns
As for the output, I wanted it to emit every variable used along the way so I could validate against the manual method. The results look something like this:
Name                     Value        DisplayValue
------------------------ ------------ ------------
@Fixed_Data_Size         60           60 B
@Null_Bitmap             5            5
@Num_Cols                11           11
@Num_Variable_Cols       0            0
@Max_Var_Size            0            0 B
@Row_Size                69           69 B
@Rows_Per_Page           114          114
@Free_Rows_Per_Page      11           11
@Num_Leaf_Pages          7282         7282
@Leaf_space_used         59654144     56.891 MB
@Num_Variable_Key_Cols   0            0
@Max_Var_Key_Size        0            0 B
@Num_Key_Cols            2            2
@Index_Null_Bitmap       3.125        3
@Variable_Key_Size       0            0 B
@Index_Row_Size          34.125       34 B
@Index_Rows_Per_Page     224          224
@Non_Leaf_Levels         2            2
@Num_Index_Pages         34           34
@Index_Space_Used        278528       272.000 KB
@Clustered_Index_Size    59932672     57.156 MB
So now you may ask, "but what about estimating non-clustered indexes and heaps?" Good question, as this is the methodology for estimating the size of a clustered index. I ran it for a fact table with two non-clustered indexes in addition to the clustered index by defining each index as if it were a table for the purposes of this proc (included columns are just non-key columns). All in all, it came out +3.31% on 600+ million rows, so I called the methodology good enough 😉
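For example, a hypothetical non-unique nonclustered index with one INT key column and two included columns (the column types and row count here are invented for illustration) would be estimated like so:

DECLARE @NCIndex TableColumnDefinition
INSERT INTO @NCIndex
VALUES
    ('INT', NULL, 1)       -- the key column
    ,('INT', NULL, 0)      -- included column
    ,('DATETIME', NULL, 0) -- included column
EXEC [dbo].[Utility_EstimateClusteredIndexSize]
    @Num_Rows = 600000000
    ,@ClusteredIndexIsUnique = 0 -- non-unique
    ,@Fill_Factor = 90
    ,@TableColumn = @NCIndex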
November 12th, 2009 - 06:02
Hi,
I tried to run this and it's giving so many errors. First I executed the TableColumnDefinition type, then I executed the procedure, and it's throwing so many exceptions. Can you please tell me the way to run this?
Thanks in advance.
November 12th, 2009 - 14:30
Sure, if you can let me know what parameters you're passing to the procedure, I can take a look and try to figure out why it's not working as expected.
November 16th, 2009 - 08:50
Hi Tim,
how does the stored procedure Utility_EstimateClusteredIndexSize know how many indexes I have on a specific table?
How does it calculate a table with six indexes (one clustered and five nonclustered)?
What about partitioned tables?
Thanks,
Oshik.
November 16th, 2009 - 17:42
For the multiple indexes, estimate each one individually. For non-clustered indexes, I just pretend it’s a table and included columns are non-key columns – it seems to come out pretty close in my tests. (You can just test it out against existing indexes). As for partitioned tables and partition-aligned indexes, I’ve never done anything special to account for the fact that they’re partitioned. As a matter of fact, all the examples I mentioned above were indeed partitioned tables and partition-aligned indexes.
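One way to do that sanity check (a sketch; dbo.MyTable is a placeholder) is to compare the estimate against the actual page counts for each existing index:

-- Actual size of every index on an existing table, for comparison with the estimates
SELECT
    i.name AS index_name
    ,SUM(ps.used_page_count) * 8192 AS actual_bytes
FROM sys.dm_db_partition_stats ps
INNER JOIN sys.indexes i
    ON i.object_id = ps.object_id
    AND i.index_id = ps.index_id
WHERE ps.object_id = OBJECT_ID('dbo.MyTable')
GROUP BY i.name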
November 19th, 2009 - 15:15
Hi Tim,
I think I'm missing something, so let me clarify my request:
I have a partitioned table with 600 million rows and 4 indexes:
the first is a clustered index (three key columns),
the second a non-clustered index (non-unique, one column),
the third a non-clustered index (non-unique, one column),
and the fourth a non-clustered index (non-unique, one column).
My table has 20 columns (varchar, int, datetime, etc.).
Do you mean I need to create a TYPE with my table definition (setting IsKey to 1 for each index column) and run it four times?
Thanks,
Oshik.
November 19th, 2009 - 15:27
Yes, you have to run it four times (once for each index).
The clustered index needs all the columns specified, because the clustered index basically IS the table (that's why you can't have "included columns" in clustered indexes). As for the nonclustered indexes, you only need to specify one key column for each; if you don't have any included columns in a nonclustered index, you don't need anything other than the key column(s). Also note, for your non-unique nonclustered indexes, set @ClusteredIndexIsUnique to 0 (it should really be called "@IsIndexUnique").
We're not really calculating the size of a table per se, because a table is simply the sum of the sizes of its indexes (let's not talk about heaps; there's no index there, but they certainly take up space). We're just calculating the size of each index.
June 9th, 2010 - 14:05
This is great Tim. Thanks for making this calculation so much easier.
April 27th, 2011 - 17:46
Thanks….
August 11th, 2011 - 12:19
Thanks for the script. Assuming I'm not missing something, I think you can simplify generating the rows for the INSERT INTO @TableColumn with the following query:
DECLARE @TableColumn TableColumnDefinition
DECLARE
    @table_name sysname
    , @schema_name sysname
SELECT
    @table_name = 'TerribleTable'
    , @schema_name = 'mySchema'
; WITH TABLE_COLUMNS AS
(
    SELECT
        T.name AS [DataType]
        , SC.max_length AS [Length] -- NB: max_length is in bytes, so NCHAR/NVARCHAR lengths come back already doubled
        , SC.name AS column_name
        , ST.name AS table_name
        , schema_name(ST.schema_id) AS [schema_name]
        , ST.object_id
    FROM
        sys.tables ST
    INNER JOIN
        sys.columns SC
        ON SC.object_id = ST.object_id
    INNER JOIN
        sys.types T
        ON T.user_type_id = SC.user_type_id -- join on user_type_id; system_type_id can match multiple alias types
)
, PRIMARY_KEYS AS
(
    SELECT DISTINCT
        I.object_id
        , IC.column_id
        , I.fill_factor
        , I.is_primary_key
        , C.name AS column_name
    FROM
        sys.indexes I
    INNER JOIN
        sys.index_columns IC
        ON IC.object_id = I.object_id
        AND IC.index_id = I.index_id
    INNER JOIN
        sys.columns C
        ON C.object_id = I.object_id
        AND C.column_id = IC.column_id
    WHERE
        I.is_primary_key = 1
)
INSERT INTO @TableColumn
SELECT
    TC.[DataType]
    , TC.[Length]
    , ISNULL(PK.is_primary_key, 0) AS [IsKey] -- columns outside the PK come back NULL from the outer join
FROM
    TABLE_COLUMNS TC
LEFT OUTER JOIN
    PRIMARY_KEYS PK
    ON PK.object_id = TC.object_id
    AND PK.column_name = TC.column_name
WHERE
    TC.table_name = @table_name
    AND TC.[schema_name] = @schema_name
January 9th, 2012 - 23:32
sp_spaceused or sp_helpdb is also a good way to measure table growth; take measurements over time and you can basically determine the growth trend.
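For reference, a minimal example (the table name is a placeholder):

-- Reports reserved, data, and index space for an existing table
EXEC sp_spaceused 'dbo.MyTable'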
January 2nd, 2014 - 12:03
I believe the script has a bug regarding non-unique clustered indexes.
If I am not missing something here, the following IF condition should be replaced with (= 0) instead of (<> 0). Does it make sense to you?
IF @ClusteredIndexIsUnique <> 0
BEGIN
SET @Num_Key_Cols += 1
IF @Num_Variable_Key_Cols > 0
SET @Num_Variable_Key_Cols += 1
IF @Max_Var_Key_Size > 0
SET @Max_Var_Key_Size += 4
END