Sometimes the most maddening bugs aren’t syntax errors or missing indexes — they’re the ones that look completely fine until you stare at a clock at 11:45 PM wondering why your job is still running.
The Problem
I have recently been working on tuning a job that purges data older than 13 months from various tables, using GETDATE() to compute the cutoff. Pretty straightforward, right? Well, originally the job consisted of six statements like so:
```sql
DELETE FROM tablename
WHERE column_with_date <= DATEADD(MONTH, -13, GETDATE());
```

Except the tables had become so large that the generic deletes were filling up tempdb (SQL Server's shared workspace for temporary operations like sorting, spooling, and row versioning). No bueno. So I decided to rewrite the job as six separate stored procedures instead. While redundant, this lets us modify the behavior of each delete independently should we ever need to. More importantly, we could batch the deletes to keep tempdb from filling up.
So I went about creating a generic template to run a delete in batches. I think I based the code on a script on DBA Stack Exchange.
```sql
SET NOCOUNT ON;

DECLARE @rowcount INT;
SET @rowcount = 1;

WHILE @rowcount > 0
BEGIN
    BEGIN TRANSACTION;

    DELETE TOP (500000) FROM database.dbo.tablename
    WHERE column <= DATEADD(MONTH, -13, GETDATE())
    OPTION (MAXDOP 1);

    SET @rowcount = @@ROWCOUNT;

    COMMIT TRANSACTION;
END
```

One of the stored procedures began using parallelism and causing high CPU usage, so I added the MAXDOP 1 query hint. MAXDOP 1 forces the query to use a single CPU thread, preventing it from spawning parallel workers and hammering the server. All of the tables have a timestamp column, so GETDATE() works great as a filter. Except not all of the timestamp columns were formatted the same way.
All of the tables except one used the following format:
```
2018-01-15
```

Whereas the one table used the following format:

```
2018-01-15 10:30:45.000
```

The difference is that one table stores date and time, not just the date. Why does that matter? When the job step executed against that table, it got stuck in a loop. The code was written to grab 500,000 records based on the date and TIME 13 months ago. With every iteration, GETDATE() would return a slightly later time, the cutoff would shift forward by milliseconds, and the batch would never fully drain; the job would run forever (well, until midnight anyway, when the date rolled over and the loop finally resolved itself).
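To make the moving target concrete, here is a minimal sketch (table and column names are placeholders): each evaluation of GETDATE() inside the loop produces a slightly later cutoff, so rows written just behind the boundary keep sliding into scope.

```sql
-- The cutoff recomputed inside the loop drifts forward with the clock.
SELECT DATEADD(MONTH, -13, GETDATE()) AS cutoff_now;
WAITFOR DELAY '00:00:01';
SELECT DATEADD(MONTH, -13, GETDATE()) AS cutoff_one_second_later;

-- One way to remove the drift entirely: capture the cutoff once,
-- before the loop, so every batch compares against the same boundary.
DECLARE @cutoff DATETIME = DATEADD(MONTH, -13, GETDATE());
```

Pinning the cutoff in a variable is an alternative guard; the cast-to-DATE fix described in the next section is what I actually shipped.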
Disclaimer: it took me about a week to finally realize what was happening.
The Fix
The root cause is a non-sargable comparison. Wrapping a column in CONVERT(VARCHAR(10), column_timestamp, 110) prevents SQL Server from using an index on that column — the engine has to evaluate the function for every row before it can compare. The correct fix is to cast both sides to DATE, which is sargable and index-friendly:
```sql
DELETE TOP (500000) FROM database.dbo.tablename
WHERE CAST(column_timestamp AS DATE) <= CAST(DATEADD(MONTH, -13, GETDATE()) AS DATE)
OPTION (MAXDOP 1);
```

Casting to DATE strips the time component from both the column value and the cutoff, so every record from the target day is captured in a single pass. The loop drains cleanly, the job finishes, and nobody gets a 1 AM alert.
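A quick sanity check of what the cast actually does (the literal value is illustrative):

```sql
-- CAST to DATE drops the time portion, so both sides compare at day granularity.
SELECT CAST('2018-01-15 10:30:45.000' AS DATETIME) AS with_time,
       CAST(CAST('2018-01-15 10:30:45.000' AS DATETIME) AS DATE) AS date_only;
-- date_only comes back as 2018-01-15 regardless of the stored time of day.
```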
Key Takeaways
- Date vs. datetime matters for delete loops. If your cutoff includes a time component, each iteration shifts the target window and the batch may never catch up. Cast to DATE on both sides.
- Non-sargable WHERE clauses hurt more than readability. Wrapping a column in a function like CONVERT() blocks index seeks. Use CAST(column AS DATE) on the column, or better yet, cast both sides as shown above.
- MAXDOP 1 is a useful safety valve for maintenance jobs. Batch deletes running in parallel can hammer CPU; pinning them to a single thread keeps the impact predictable.
- Batch your large deletes. A single unbounded DELETE on a massive table will balloon tempdb and potentially block everything else. Looping in chunks of 500k (tuned to your environment) is the safer pattern.
I probably should have caught the datetime mismatch on day one. I didn’t. But I can guarantee I’ll check the column data type before writing the next delete loop — and now so can you.
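For what it's worth, that check is one metadata query away. A sketch, with hypothetical table and column names:

```sql
-- Inspect the filter column's data type before writing the delete loop.
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'tablename'           -- hypothetical table name
  AND COLUMN_NAME = 'column_with_date';  -- hypothetical column name
-- A DATA_TYPE of 'date' vs 'datetime' tells you up front whether
-- a time component is lurking in your cutoff comparison.
```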