SQL Tricks – Remove Interior Spaces
This week, I was given a set of data to import to a database. After the import, I noticed that the source data had lots of unnecessary spaces all over the place. I can, of course get rid of leading and trailing spaces by applying RTRIM() and LTRIM() to my import, but I also had space padding inside some of the names themselves. For example, there might be four spaces between the first and last names. How it got that way is kind of complicated, and not particularly relevant, but I knew that having such ugly data in my database would bug me.
I figured this was well-trodden turf, but I didn’t like the first couple of google results I got. They either didn’t keep stripping spaces after the first few, or they were unnecessarily complicated. I was pretty sure that I could implement a recursive CTE to fix my names.
Let’s set up a test. (Hat tip to this random name generator.)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
-- CREATE TABLE Names ( NameID INT IDENTITY(1,1) NOT NULL, NameText NVARCHAR(100) ) INSERT Names (NameText) VALUES ('Twirlleaf Shadowscare') , ('Barleytoter Winslow') , ('Oakbreeze Azazel') , ('Demonreek Hazelcurse') , ('Moonglow Catweed') , ('Gumpfarmer Ó Meadhra Jr.') , ('Hicktrigger Cookson') , ('Cornhauler O''Hannagain') , ('Elegia Draco') , ('Spiderbug Newtbite') |
My solution is this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
-- ;WITH cteRemoveSpaces AS ( SELECT NameID , REPLACE(NameText, ' ', ' ') AS NameText , PATINDEX('% %', REPLACE(NameText, ' ', ' ')) AS idx FROM Names UNION ALL SELECT NameID , REPLACE(NameText, ' ', ' '), PATINDEX('% %', REPLACE(NameText, ' ', ' ')) FROM cteRemoveSpaces WHERE PATINDEX ('% %', NameText) > 0 ) ... do work with data .... |
Now, what is this doing?
Quick refresher: the recursive common table expression has three parts: a first select statement, known as the “anchor”, a UNION ALL statement, and then a 2nd expression that uses the CTE itself. As with most things SQL, there’s a good tutorial on SQLServerCentral.com.
I’m relying on two functions:
REPLACE(NameText, ‘ ‘, ‘ ‘) : It might be hard to tell, but the first quoted bit has two spaces, and the second has one. This replaces two spaces back to back with one space.
PATINDEX(‘% %’, ____) : This one has two spaces between the percent sign. It returns the first position at which the underline has two spaces, or zero if there is no instance of two spaces. I’m using it to show a column named [idx], which is going to be zero for the records that I really want (since I want the records where there are no occurrences of two spaces back-to-back).
Let’s look at what comes out of the CTE for just a single record:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
-- ;WITH cteRemoveSpaces AS ( SELECT NameID , REPLACE(NameText, ' ', ' ') AS NameText , PATINDEX('% %', REPLACE(NameText, ' ', ' ')) AS idx FROM Names UNION ALL SELECT NameID , REPLACE(NameText, ' ', ' '), PATINDEX('% %', REPLACE(NameText, ' ', ' ')) FROM cteRemoveSpaces WHERE PATINDEX ('% %', NameText) > 0 ) SELECT NameID, NameText, idx FROM cteRemoveSpaces WHERE NameID = 1 |
We get
It took the CTE 4 iterations, but we finally get a row with just a single space. Notice that this row has idx = 0.
To get my complete record set, I just make idx = 0 a condition. I’ll put it in there where clause here. In my real project, it was part of a more complex INSERT statement, and I put it in the JOIN. Either way.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
-- ;WITH cteRemoveSpaces AS ( SELECT NameID , REPLACE(NameText, ' ', ' ') AS NameText , PATINDEX('% %', REPLACE(NameText, ' ', ' ')) AS idx FROM Names UNION ALL SELECT NameID , REPLACE(NameText, ' ', ' '), PATINDEX('% %', REPLACE(NameText, ' ', ' ')) FROM cteRemoveSpaces WHERE PATINDEX ('% %', NameText) > 0 ) SELECT NameID, NameText, idx FROM cteRemoveSpaces WHERE idx = 0 |
And there we have it!
If you’re wondering about performance, I used this method in my project over 60,000 rows with anywhere up to 10 spaces per row, and it ran in a few seconds, which includes both the select of the data and the insert into the target table. Perfectly fine for my needs.
I’ve just started to upload some of my little tips and tricks to a github repository. You can check them out at https://github.com/bens4lsu/SQL-Patterns. At the moment that I click POST on this entry, this is the only one there, but I’ve got a big folder from which to go through and make additions. I’ll do that, as I make sure they aren’t completely outdated.