Tuesday, August 7, 2007

Handling NULL values in SQL Server 2005

In the simplest terms, a NULL value represents an unknown value. It’s unknown in the sense that the value is: missing from the system, may not be applicable in the current situation, or might be added later. NULL values are different than any other value and are sometimes hard to compare and handle.

I think there will always be a debate as to whether NULL values should exist in a normalized OLTP environment. (Read a previous article for database normalization tips.) Academics typically argue that you should always normalize your schema to 3rd normal form, and this should take care of your data redundancies and missing values. However, it is typically impractical to normalize your SQL Server database all the way to 3rd normal form. It looks good on paper and would work great in an ideal world, but it usually doesn’t perform well because of the extra joins involved in accessing data you need.

So it begins to make sense to have a little bit of data redundancy and placeholders for missing data in the SQL Server database. (Using data redundancy to increase database performance is outside the scope of this article.) The issues with using these NULL values are: You have to handle them a little bit differently than other values; and there are some small performance implications.

A closer look at handling NULLs

Here’s a run-through of different scenarios that you will likely encounter in a production environment where NULL values are allowed.

Note: All NULL comparisons in this article are done under the context of the ANSI_NULLS ON setting, which is a database option that determines how NULL comparisons are handled. Under ANSI_NULLS ON, two NULL values will never be equal to each other because the two separate values are unknown. With ANSI_NULLS OFF, two separate NULL values will evaluate to equal values.

The following script loads the SalesHistory table that I will use for the examples. Take special notice of the CASE statement in the WHILE loop. The CASE statement assigns the value NULL for every other record inserted into the SalesHistory table where the product to be inserted is Computer. This is allowed because the Product column in the SalesHistory table allows NULL values. Also notice that NULL values are assigned to variables using the assignment (=) operator. While NULL values are assigned this way, NULL values are not interrogated in such a way.

IF OBJECT_ID('SalesHistory')>0    
DROP TABLE SalesHistory;
GO
CREATE TABLE [dbo].[SalesHistory]
(
[SaleID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
[Product] [varchar](10) NULL,
[SaleDate] [datetime] NULL,
[SalePrice] [money] NULL
)
GO

DECLARE @i SMALLINT, @Product VARCHAR(10)
SET @i = 1

WHILE (@i <=100)
BEGIN
SET @Product = CASE WHEN @i%2 = 0 THEN 'Computer' ELSE NULL END

INSERT INTO SalesHistory(Product, SaleDate, SalePrice)
VALUES (@Product, DATEADD(mm, @i, '3/11/1919'), DATEPART(ms, GETDATE()) + (@i + 57))

INSERT INTO SalesHistory(Product, SaleDate, SalePrice)
VALUES('BigScreen', DATEADD(mm, @i, '3/11/1927'), DATEPART(ms, GETDATE()) + (@i + 13))

INSERT INTO SalesHistory(Product, SaleDate, SalePrice)
VALUES('PoolTable', DATEADD(mm, @i, '3/11/1908'), DATEPART(ms, GETDATE()) + (@i + 29))

SET @i = @i + 1
END
GO

Now that I have data in my SalesHistory table, I can run queries to look at how NULL values are handled.

This query will return all rows from the SalesHistory table, where the Product column contains the value NULL by use of the IS NULL statement.

SELECT *
FROM SalesHistory
WHERE Product IS NULL

This query will not return any rows. This is due to the handling of NULL values in SQL Server. A NULL value is not equal to any other value, so I cannot use the equality operator here. This can be a bit confusing until you get used to it. Remember to use IS NULL when you are testing to see if a column or variable contains the value NULL.

SELECT *
FROM SalesHistory
WHERE Product = NULL

This query returns all rows from the SalesHistory table where the Product column does not contain the value NULL by using the IS NOT NULL statement. This is the direct inverse of the IS NULL statement.

SELECT *
FROM SalesHistory
WHERE Product IS NOT NULL

This query returns all records from the SalesHistory table. The query uses the ISNULL function on the Product column. For every NULL value in the Product column, the ISNULL function replaces that value with the literal string ‘Missing’. I use this function a lot when I need to compare columns with possible NULL values to other values. The value that you are substituting must be of the same data type as the field that may contain the NULL value.

SELECT ISNULL(Product, 'Missing')
FROM SalesHistory

In this query, I am aggregating the number of products sold and the sale price of the products by the Product column. NULL values are distinct values but are grouped together when aggregations occur on them, which can be a bit confusing. Notice the single record in the resultset for the NULL product.

SELECT Product, COUNT(*) AS ProductCount, SUM(SalePrice) AS ProductSales
FROM SalesHistory
GROUP BY Product

Any time a NULL value is present in a column that an aggregation is performed on, such as a COUNT, AVG, or SUM function, those values will be ignored and therefore not included in the functions result.

The following query will not compile because the fields that are referenced are missing from the SalesHistory table; however, it will work fine for showing the functionality of the COALESCE function. The COALESCE function returns the first non-NULL value in the field list that it accepts. The COALESCE function is very useful in a join operation or in queries where you are comparing a single value from a list of possible fields to a single value.

SELECT COALESCE(Product, ProductDescription, Product, 'Missing Info')
FROM SalesHistory

This query uses the NULLIF function, which returns the value NULL if the two values passed into the function are the same value.

SELECT *, NULLIF(Product, NULL)
FROM SalesHistory

Operations on NULL values

Since NULL values are unknown, operations on them typically require some extra processing. The following query returns a unique list of Products from the SalesHistory table, including the NULL record. As the list of values are returned, I am concatenating the literal string ‘Sold’ to the end of the Product value. On the record that returns the NULL value, this concatenation will not work correctly.

SELECT Product + ' Sold'
FROM SalesHistory
GROUP BY Product

I can use the ISNULL function from above to replace the NULL value with an empty string so that I can concatenate the ‘Sold’ value to the end. This example probably wouldn’t be that useful in a production situation, but it does illustrate that you need to take special care when you encounter NULLs in string operations. You should also take care of calculations on numeric fields that allow NULL values.

SELECT ProductType = ISNULL(Product,'Unknown') + ' Sold:' , COUNT(*) AS ProductCount
FROM SalesHistory
GROUP BY Product

Consider this when defining table structure and constraints

In SQL Server, if a UNIQUE constraint is defined upon a NULLABLE column, only one NULL value will be allowed in that column. It makes sense to me that the column should allow more than one NULL value because NULL values are distinct values. This is something to consider when you are defining your table structure and constraints. Perhaps Microsoft will fix this in a future version of SQL Server.

No comments: