How much space can I expect to recover from a rebuild after dropping a column?
In my last article - Identify Tables With Dropped Columns - we saw how to identify tables that have dropped columns.
Today, we are going to look at a way to estimate, approximately, how much space we can expect to recover if we rebuild our table.
If you want to read the 1st article I published on this subject, you can read it here - What happens when we drop a column on a SQL Server table? Where’s my space?
Why can this be important?
Rebuilding a table is, by default, an offline operation! This means the table will be locked and other sessions can’t read from it while the rebuild is running.
However, if you are running Enterprise Edition, you can use the ONLINE = ON
option to minimize this: the table won’t be locked for the whole duration of the rebuild.
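As a minimal sketch, an online rebuild looks like this (using the dbo.Client table from the examples later in this article):
-- Enterprise Edition (or Azure SQL MI): the table stays readable during the rebuild
ALTER TABLE dbo.Client REBUILD WITH (ONLINE = ON);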
That is why it’s important to estimate how much space we can recover from a rebuild after dropping a column. Your mileage will vary: if you expect to recover dozens or even hundreds of GB, it may be worth doing the rebuild sooner rather than later. But if it is just a few dozen MB and that won’t make a big difference, you can defer it to a later maintenance window.
Picking up where we left off
NOTE: I haven’t mentioned this before, but these scripts don’t work on Azure SQL DB. They do work on Azure SQL Managed Instance.
In the last article, I ended with the following code block as an option to find the dropped columns.
SELECT OBJECT_NAME(p.object_id) AS table_name,
       c.[name] AS column_name,
       pc.partition_column_id,
       pc.is_dropped
FROM sys.system_internals_partitions p
JOIN sys.system_internals_partition_columns pc
    ON p.partition_id = pc.partition_id
LEFT JOIN sys.columns c
    ON p.object_id = c.object_id
    AND pc.partition_column_id = c.column_id
WHERE pc.is_dropped = 1
This doesn’t output a lot of columns on purpose. We just needed the table_name and column_name columns (the latter is always NULL in this case) where is_dropped = 1 to know which objects have dropped columns.
But…there is more!
If we explore the columns that belong to the sys.system_internals_partition_columns system view, we can see that it returns a total of 28 columns (at least in SQL Server 2022).
As mentioned before, this system view is not documented, which means we can only infer some information based on each column’s name and contents.
The max_length column
To calculate the space being wasted by the dropped column, we need to know the size that column occupies in each record. For fixed-length data types, that is given by the max_length column.
The number of rows
To get the total KB that our dropped column is still occupying, we need to multiply the max_length column by the number of records in the table. For that, we use the rows column from the sys.system_internals_partitions system view.
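Putting these two ingredients together, a minimal sketch that fetches just those values for every dropped column could look like this:
-- The two ingredients of the estimate: bytes per row (max_length) of the
-- dropped column, and the record count (rows) of each partition
SELECT pc.max_length,
       p.[rows]
FROM sys.system_internals_partitions p
JOIN sys.system_internals_partition_columns pc
    ON p.partition_id = pc.partition_id
WHERE pc.is_dropped = 1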
A practical example
Taking our Client table, where we dropped the DoB column, let’s see if this math works.
A column of data type datetime occupies 8 bytes. We can see that in the max_length column.
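If you want to double-check the size of a fixed-length data type, sys.types exposes the same max_length value:
-- datetime is fixed-length; max_length reports 8 (bytes)
SELECT name, max_length
FROM sys.types
WHERE name = 'datetime'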
Let’s say our table has 50,000 rows.
This means 50000 (rows) * 8 (bytes) = 400000 (bytes).
To convert that to KB, we divide the result by 1,000. Strictly speaking, 1 KB is 1,024 bytes, but since this is only an estimate, the round number keeps the math simple: we get 400 KB.
Knowing that each data page holds 8 KB, we can say that we are wasting 50 pages (400 / 8).
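Here is that back-of-the-envelope math as a standalone T-SQL snippet, using the values assumed above:
-- The example's assumptions: 50000 rows, 8 bytes per row (datetime)
DECLARE @rows bigint = 50000;
DECLARE @max_length int = 8;

SELECT @rows * @max_length            AS total_bytes, -- 400000
       @rows * @max_length / 1000     AS total_kb,    -- 400 (rough KB)
       @rows * @max_length / 1000 / 8 AS total_pages; -- 50 (8 KB pages)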
Confirming it
To confirm this is true, I have added some columns to our query.
SELECT
    OBJECT_NAME(p.object_id) AS table_name,
    COALESCE(cx.[name], c.[name]) AS column_name,
    pc.partition_column_id,
    pc.is_dropped,
    /*
    New columns added
    */
    pc.max_length,
    p.[rows],
    pc.max_length * p.[rows] AS total_bytes,
    (pc.max_length * p.[rows]) / 1000 AS total_kb,
    (pc.max_length * p.[rows]) / 1000 / 8 AS total_pages
FROM sys.system_internals_partitions p
JOIN sys.system_internals_partition_columns pc
    ON p.partition_id = pc.partition_id
/* Resolve the column name through the index metadata */
LEFT JOIN sys.index_columns ic
    ON p.object_id = ic.object_id
    AND ic.index_id = p.index_id
    AND ic.index_column_id = pc.partition_column_id
LEFT JOIN sys.columns c
    ON p.object_id = c.object_id
    AND ic.column_id = c.column_id
/* Fallback: resolve the column name directly by partition_column_id */
LEFT JOIN sys.columns cx
    ON p.object_id = cx.object_id
    AND pc.partition_column_id = cx.column_id
WHERE p.object_id = OBJECT_ID('dbo.Client')
    AND p.index_id IN (0, 1)
    AND pc.is_dropped = 1
The before and the after
With the previous query, we already have an idea of the number of pages, and therefore the amount of KB, that we may be able to recover after a rebuild.
However, to add a bit more detail to our investigation, let’s put the current size and the expected size after rebuilding side by side.
For this, we need to know the actual number of pages our table is using. For that, we can use the dynamic management function (DMF) sys.dm_db_index_physical_stats, which returns the page_count column. As it is a DMF, we will use a CROSS APPLY to add it to our query.
SELECT
    OBJECT_NAME(p.object_id) AS table_name,
    COALESCE(cx.[name], c.[name]) AS column_name,
    pc.partition_column_id,
    pc.is_dropped,
    pc.max_length,
    p.[rows],
    pc.max_length * p.[rows] AS total_bytes,
    (pc.max_length * p.[rows]) / 1000 AS total_kb,
    (pc.max_length * p.[rows]) / 1000 / 8 AS total_pages,
    ddips.page_count AS current_page_count /* New column added */
FROM sys.system_internals_partitions p
JOIN sys.system_internals_partition_columns pc
    ON p.partition_id = pc.partition_id
LEFT JOIN sys.index_columns ic
    ON p.object_id = ic.object_id
    AND ic.index_id = p.index_id
    AND ic.index_column_id = pc.partition_column_id
LEFT JOIN sys.columns c
    ON p.object_id = c.object_id
    AND ic.column_id = c.column_id
LEFT JOIN sys.columns cx
    ON p.object_id = cx.object_id
    AND pc.partition_column_id = cx.column_id
CROSS APPLY sys.dm_db_index_physical_stats(DB_ID(), p.object_id, p.index_id, DEFAULT, 'LIMITED') AS ddips
WHERE p.object_id = OBJECT_ID('dbo.Client')
    AND p.index_id IN (0, 1)
    AND pc.is_dropped = 1
With this, we can see that we have 180 data pages.
Now, all we need to do is some math 😁
SELECT
    OBJECT_NAME(p.object_id) AS table_name,
    COALESCE(cx.[name], c.[name]) AS column_name,
    pc.partition_column_id,
    pc.is_dropped,
    pc.max_length,
    p.[rows],
    pc.max_length * p.[rows] AS total_bytes,
    (pc.max_length * p.[rows]) / 1000 AS total_kb,
    (pc.max_length * p.[rows]) / 1000 / 8 AS total_pages,
    ddips.page_count AS current_page_count,
    /*
    New columns added
    */
    ddips.page_count - (pc.max_length * p.[rows] / 1000 / 8) AS new_page_count,
    ddips.page_count * 8 AS current_data_size_kb,
    (ddips.page_count - (pc.max_length * p.[rows] / 1000 / 8)) * 8 AS new_data_size_kb
FROM sys.system_internals_partitions p
JOIN sys.system_internals_partition_columns pc
    ON p.partition_id = pc.partition_id
LEFT JOIN sys.index_columns ic
    ON p.object_id = ic.object_id
    AND ic.index_id = p.index_id
    AND ic.index_column_id = pc.partition_column_id
LEFT JOIN sys.columns c
    ON p.object_id = c.object_id
    AND ic.column_id = c.column_id
LEFT JOIN sys.columns cx
    ON p.object_id = cx.object_id
    AND pc.partition_column_id = cx.column_id
CROSS APPLY sys.dm_db_index_physical_stats(DB_ID(), p.object_id, p.index_id, DEFAULT, 'LIMITED') AS ddips
WHERE p.object_id = OBJECT_ID('dbo.Client')
    AND p.index_id IN (0, 1)
    AND pc.is_dropped = 1
The final proof
To confirm that our math works, let’s rebuild the table to get rid of the extra space left by the drop action, and then check how many data pages the table is using.
-- Let's clear the dropped column and tidy things up
ALTER TABLE dbo.Client REBUILD;
To check the current number of data pages, we can run the following query:
SELECT
    OBJECT_NAME(p.object_id) AS table_name,
    COALESCE(cx.[name], c.[name]) AS column_name,
    pc.partition_column_id,
    pc.is_dropped,
    pc.max_length,
    p.[rows],
    ddips.page_count AS current_page_count
FROM sys.system_internals_partitions p
JOIN sys.system_internals_partition_columns pc
    ON p.partition_id = pc.partition_id
LEFT JOIN sys.index_columns ic
    ON p.object_id = ic.object_id
    AND ic.index_id = p.index_id
    AND ic.index_column_id = pc.partition_column_id
LEFT JOIN sys.columns c
    ON p.object_id = c.object_id
    AND ic.column_id = c.column_id
LEFT JOIN sys.columns cx
    ON p.object_id = cx.object_id
    AND pc.partition_column_id = cx.column_id
CROSS APPLY sys.dm_db_index_physical_stats(DB_ID(), p.object_id, p.index_id, DEFAULT, 'LIMITED') AS ddips
WHERE p.object_id = OBJECT_ID('dbo.Client')
    AND p.index_id IN (0, 1)
As we can see in the results, the current_page_count is now 130, which matches the new_page_count column from the previous query (180 - 50 = 130).
How accurate is this?
You should use this script and take the results with a pinch of salt.
My idea wasn’t to build a one-size-fits-all script, but rather to check whether the math would add up and whether it would be enough to rely on in specific situations.
Feel free to pick the script and adapt it as you wish.
Also, check below for some of the reasons why it isn’t a one-size-fits-all.
This works well for fixed-length columns with no compression and…
The attentive reader may be wondering “Ok, this was on a HEAP!”.
- What if the table has a CLUSTERED index?
- What if our column that was dropped doesn’t have a fixed-length?
- Or if it was nullable?
- Will these calculations work the same way if we are using compression?
- And, what happens if I have NONCLUSTERED indexes using the column?
Let me answer all five questions.
“Will these calculations work the same way if we are using compression?”
Let me address this one first because it is a show-stopper for this script.
If the table is compressed, the script can (and likely will) return odd results, as it doesn’t take into account that the column’s data type may occupy less space once compressed. This leads to abnormal values, and can ultimately produce negative ones.
Therefore, the answer is no. The way the script is built, it won’t be a good fit for compressed tables.
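Before trusting the estimate, a quick sanity check on the compression state of the table’s partitions can save you from odd numbers (a sketch using the dbo.Client example):
-- data_compression_desc returns NONE, ROW, PAGE, etc. per partition
SELECT p.partition_number,
       p.data_compression_desc
FROM sys.partitions p
WHERE p.object_id = OBJECT_ID('dbo.Client')
    AND p.index_id IN (0, 1)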
“What if the table has a CLUSTERED index?”
As long as the table isn’t compressed, the same script can still be used to get a good guess. You can test this yourself: create a CLUSTERED index on the table and run the scripts again (see the sketch below).
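For example, assuming a hypothetical ClientId column on our example table:
-- ClientId is an assumed column name, purely for illustration
CREATE CLUSTERED INDEX CIX_Client ON dbo.Client (ClientId);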
“What if the column that was dropped doesn’t have a fixed length?”
When a column doesn’t have a fixed length and it’s already dropped, you won’t have any metadata to give you an approximate size (an average record size). In this case, your best bet is probably to decide on a value for the average space occupied by that column and use it in the calculation.
Example:
If we have a VARCHAR(50) column, we will see max_length showing 50. However, if we know that on average we only store 20 characters, we can use that to get an estimate.
If you don’t have that information, I would suggest using half the size (25 in this example). At this point, it is as good a guess as any (see the sketch below).
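As a sketch, you could plug that assumed average into the same formula, in place of max_length (here assuming 25 bytes for our hypothetical VARCHAR(50) column):
-- @assumed_avg_bytes stands in for max_length, since the metadata of a
-- dropped variable-length column can't tell us the real average size
DECLARE @assumed_avg_bytes int = 25;

SELECT p.[rows],
       @assumed_avg_bytes * p.[rows] / 1000 / 8 AS estimated_pages
FROM sys.system_internals_partitions p
WHERE p.object_id = OBJECT_ID('dbo.Client')
    AND p.index_id IN (0, 1);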
“Or if it’s nullable?”
If the column has a fixed length, being nullable or not doesn’t matter. It occupies the same number of bytes as when the column holds a value different from NULL.
NOTE: Unless you use Sparse Columns, but that is a different story with some restrictions/caveats.
“And NONCLUSTERED indexes?”
We can’t drop a column that is part of a NONCLUSTERED index without disabling/dropping that index first (see the sketch below).
This script is written to do the calculations on the actual data (the HEAP or CLUSTERED index), not on a copy of the data (a NONCLUSTERED index).
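For illustration, assuming a hypothetical index named IX_Client_DoB existed on the DoB column, the drop sequence would have been:
-- IX_Client_DoB is an assumed index name, purely for illustration
DROP INDEX IX_Client_DoB ON dbo.Client;
ALTER TABLE dbo.Client DROP COLUMN DoB;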
Wrap up
In this article, we extended the script that finds tables with dropped columns, adding more detail so we can calculate how much space is being wasted after a column is dropped.
We also covered a number of caveats, depending on the table structure, that can keep this script from producing the expected values.
I hope you find this example educational and that you can use it to do some testing by yourself.
Thanks for reading.