Saturday, July 27, 2013

Different methods to execute large query in small batches

Today I’ll show you some different methods to run a DML operation like UPDATE in small batches for a large table.  Though I’m going to show only UPDATE statements, methods are applicable to DELETE and INSERTS as well.   When you want  to run an UPDATE  query(or other DML operations) over a very huge table having  over a billion rows or even  100s of Million it is recommended that you should not update them at one go.  DML queries on a very large number of rows can cause performance issues, transaction log might get full, affects concurrent users, eats up lot of server resources, runs for hours and many other.   Good practice is that you should break the number of records in small batches of few thousand and update them.  This way you will use minimum server resources and most importantly you will prevent the transaction log file from getting full.

First lets create one table and insert some sample data.

--------------------------------------------------------------------------------------------------------

-- Create the table
CREATE TABLE #A
(
ID INT,
CREATED_DATE  DATE,
METHOD VARCHAR(10)
)

-- Insert 10000 rows in the table
;WITH MANGAL AS
(
SELECT 1 AS ID, CONVERT(DATE,'20000101') AS CREATED_DATE
UNION ALL
SELECT ID+1, DATEADD(DD, 1, CREATED_DATE)
FROM MANGAL
WHERE ID <= 9999

)
INSERT INTO #A(ID, CREATED_DATE)
SELECT ID, CREATED_DATE
FROM MANGAL

OPTION (MAXRECURSION 10000);

-- Check sample data
SELECT *
FROM #A

--------------------------------------------------------------------------------------------------------

This is not the most efficient way to generate some random data using a recursive CTE but I like it so I’m using it.

I. Using a TOP and GO:

This is the simplest method to run a query in a small batches.  I frequently use this method in development when I want to update some records quickly without thinking much.  Note the GO 10 in the query.  An integer after GO will execute the preceding query specified number of times.  In this case update statement will execute 10 times(I know there are 10000 rows in the table and I’m using TOP 1000, simple math).

------------------------------------------------------------------------------------------------------
UPDATE TOP (1000) #A
SET METHOD = 'I'
WHERE METHOD IS NULL
GO 10

-- Verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------

All the queries are quite simple  and they are self explanatory so I’m not going to explain them in detail.

II. Using ROWCOUNT and GO:

Same query but without the TOP operator.  Here I’m using a ROWCOUNT function to limit the number of rows in the batch.  ROWCOUNT causes SQL Server to stop processing the query after the specified number of rows are returned.

------------------------------------------------------------------------------------------------------
SET ROWCOUNT 1000
UPDATE #A
SET METHOD = 'II'
WHERE METHOD <> 'II'
GO 10

SET ROWCOUNT 0

-- Verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------ IMPORTANT:  Using SET ROWCOUNT will not affect DELETE, INSERT, and UPDATE statements in a future release of SQL Server. Avoid using SET ROWCOUNT with DELETE, INSERT, and UPDATE statements in new development work, and plan to modify applications that currently use it. For a similar behavior, use the TOP syntax.

III. Using TOP and GOTO:

Now instead of GO I’ll use GOTO to run the batch multiple times.  The GOTO statement causes the execution of the T-SQL batch to stop processing the following commands to GOTO and processing continues from the label where GOTO points.  Here I’ll use GOTO to keep processing the particular label until it’s @@ROWCOUNT becomes zero.

------------------------------------------------------------------------------------------------------ UPDATE_BATCH:

UPDATE TOP (3000) #A
SET METHOD = 'III'
WHERE METHOD <> 'III'
IF @@ROWCOUNT > 0  GOTO UPDATE_BATCH

-- Verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------

IV. Using ROWCOUNT and GOTO:

Same GOTO concept with ROWCOUNT.

------------------------------------------------------------------------------------------------------

SET ROWCOUNT 3500
UPDATE_BATCH:

UPDATE  #A
SET METHOD = 'IV'
WHERE METHOD <> 'IV'
IF @@ROWCOUNT > 0  GOTO UPDATE_BATCH

SET ROWCOUNT 0
-- verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------

V. Using WHILE and BREAK:

Here I’m taking the help of an infinite WHILE loop ( WHILE 1=1 ) to run the batch multiple times and using the BREAK to exit the loop when @@ROWCOUNT becomes zero.

------------------------------------------------------------------------------------------------------

WHILE (1=1)
BEGIN

    BEGIN TRANSACTION
       
        UPDATE TOP(3333) #A
        SET METHOD = 'V'
        WHERE METHOD <> 'V'
   
        IF @@ROWCOUNT = 0
            BEGIN
            COMMIT TRANSACTION
            BREAK
        END
    COMMIT TRANSACTION
END

-- Verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------

Same can be implemented using ROWCOUNT as I already shown.

VI. Using a Sequence column and WHILE:

Now an entirely different method.  Not using GO, GOTO or BREAK.  This method is more systematic where you have more control on how query is going to execute.  Note that when you use TOP or ROWCOUNT you actually have no control on which rows are going to get updated.  Sorting is completely dependent on the query plan created by the query engine.

------------------------------------------------------------------------------------------------------

DECLARE @C INT
DECLARE @I INT
DECLARE @L INT
DECLARE @MAX INT

SET @C = 1
SET @I = 2200
SET @L = @C + @I
SET @MAX = (SELECT MAX(ID) FROM #A)

SELECT @C AS C, @I AS I, @L AS L, @MAX AS MAX

WHILE @C < @MAX
BEGIN 

    UPDATE #A
    SET METHOD = 'VI'
    WHERE ID >= @C
    AND ID < @L

SET @C = @L -- OR SET @C = @C + @I
SET @L = @L + @I

END

-- Verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------

VII. Using a date range and WHILE:

On similar lines if we don’t have a reliable sequence column or we want to use a date column to update the records by month or year.

------------------------------------------------------------------------------------------------------

DECLARE @C DATE
DECLARE @I DATE
DECLARE @MAX DATE

SELECT    @C = MIN(CREATED_DATE),
        @MAX = MAX(CREATED_DATE)        
FROM #A

SET @I =  DATEADD(YEAR, DATEDIFF(YEAR, 0, @C)+1, 0)

--SELECT @C AS C, @I AS I , @MAX AS MAX

WHILE @C <= @MAX

BEGIN
    UPDATE #A
    SET METHOD = 'VII'
    WHERE CREATED_DATE >= @C
    AND CREATED_DATE < @I

SET @C = @I
SET @I = DATEADD(YEAR, 1, @I)

--SELECT @C, @I
END

-- Verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------

VIII.  Using WHILE without GO, GOTO or BREAK:

One last method from me without using GO, GOTO and BREAK but relatively simpler compare to last 2 methods.  Because different people will advise you not to use GO, GOTO and BREAK in SQL.

------------------------------------------------------------------------------------------------------

DECLARE @ROWCOUNT INT = 1
DECLARE @C INT = 2800

WHILE (@ROWCOUNT > 0)
BEGIN
    UPDATE TOP (@C) #A
    SET METHOD = 'VIII'
    WHERE METHOD <> 'VIII'

SET @ROWCOUNT = @@ROWCOUNT
END

-- Verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------

Same can be done using ROWCOUNT.

IX. Final

Well this is the query I found some 4-5 years back when I was searching for same thing on Google.  I don’t remember where I saw it or who wrote it.

------------------------------------------------------------------------------------------------------
DECLARE @C INT

SET ROWCOUNT 1000
WHILE (1=1)

BEGIN
   BEGIN TRANSACTION
  
   UPDATE #A
   SET METHOD = 'X'
   WHERE METHOD <> 'X'

   SET @C = @@ROWCOUNT

   IF @@Error <> 0
    BEGIN
    Print 'Problem Updating the records'
    ROLLBACK TRANSACTION
    BREAK
    END

   IF @C = 0
   BEGIN
      COMMIT TRANSACTION
      BREAK
   END
   COMMIT TRANSACTION
END
SET ROWCOUNT 1000
-- Verify result
SELECT *
FROM #A

------------------------------------------------------------------------------------------------------

Some notes:  Both TOP and ROWCOUNT can be parameterized.  We can pass a variable instead of direct number.  But of course since ROWCOUNT is soon going to be deprecated from future SQL Server versions one should avoid using it.  I was being lazy not to use transactions (commits and rollback ) wherever possible in above examples but one should use them to add more control, clarity to code and also for better understanding. 

I have not really tested that which query is better, I leave that up to you.  Idea was to share different methods to execute a query in small batches.  But I would go for method 6 or 7 on production environment.  Do let me know your comments, suggestions and what do think of all these methods.

Namaste
Mangal