SQL Server HTML to Plain Text Function
This week, I had a requirement to convert some HTML text in a SQL column to plaintext and store that plaintext in a different database. I found this solution for html to text, but it’s using a WHILE loop and declaring variables, and we all know that loops make SQL server work row-by-row rather than exploiting it’s data-set processing power, right? (To be fair, it’s a very old post, and it’s likely that my function wasn’t a possibility when it was written.)
So I used the same string functions from that post, made some other improvements (converting HTML line feeds with text ones, using nvarchar data types instead of varchar) and wrapped it up into a function. If you want to grab the function and go, here it is. We’ll follow up with an explainer and a test of our function in case there’s confusion on how to put it into use.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 |
-- IF EXISTS (SELECT 1 FROM sys.objects WHERE name = 'Html_To_Plaintext_ITVF' AND type = 'IF' AND schema_id = SCHEMA_ID('dbo')) DROP FUNCTION dbo.Html_To_Plaintext_ITVF GO CREATE FUNCTION [dbo].[Html_To_Plaintext_ITVF] (@HTMLText nvarchar(max), @linefeed nvarchar(10)) RETURNS TABLE AS /***************************************************************************************************************/ /* */ /* 2021-12-30 - Ben Schultz - Initial version of function. Replaces <br> and </p> with line feeds, then */ /* strips any other html tags. */ /* */ /***************************************************************************************************************/ RETURN WITH cteTagsToReplaceWithLF AS ( SELECT a.tag , ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS R FROM ( VALUES ('</p>'), ('</tr>'), ('<br>'), ('<br />'), ('</li>') ) a(tag) ) , cteWithLineFeeds(n, Html) AS ( SELECT 1, REPLACE(@HTMLText, cteTags.tag, @linefeed) FROM cteTagsToReplaceWithLF cteTags WHERE cteTags.R = 1 UNION ALL SELECT n + 1, REPLACE(ctelf.Html, cteTags.tag, @linefeed) FROM cteWithLineFeeds ctelf JOIN cteTagsToReplaceWithLF cteTags ON ctelf.n = cteTags.R ) , cteHtml (i, HtmlText) AS ( SELECT TOP 1 0, Html FROM cteWithLineFeeds ORDER BY n DESC UNION ALL SELECT i + 1, CONVERT(nvarchar(MAX), STUFF(HtmlText, CHARINDEX(N'<', HtmlText), CHARINDEX(N'>', HtmlText, CHARINDEX(N'<', HtmlText)) - CHARINDEX(N'<', HtmlText) + 1, '')) FROM cteHtml WHERE CHARINDEX('<', HtmlText) > 0 AND CHARINDEX('>', HtmlText, CHARINDEX('<', HtmlText)) > 0 AND CHARINDEX('>', HtmlText, CHARINDEX('<', HtmlText)) - CHARINDEX('<', HtmlText) > 0 ) SELECT TOP 1 LTRIM(RTRIM(HtmlText)) AS PlainText FROM cteHtml ORDER BY i DESC GO |
Now, let’s look at what’s going on. The first thing we see is a CTE, cteTagsToReplaceWithLF.
1 2 3 4 5 6 |
-- SELECT a.tag , ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS R FROM ( VALUES ('</p>'), ('</tr>'), ('<br>'), ('<br />'), ('</li>') ) a(tag) |
You can just run this on any database, and you’ll get a two-column result. The first column is a sequential number, and the second column is one of the tags that we spelled out there. These are the tags that we’re going to replace with linefeeds.
The 2nd CTE in the list does the work of replacing our html tags with linefeed characters.
You will notice that I made the linefeed replacement character a parameter. Depending on the application, you might want to send a LF character (use CHAR(13) or CHAR(0x0D)) or a CR + LF combination (CHAR(10) + CHAR(13), or CHAR(0x0A) + CHAR(0x0D)). In some cases the author of the HTML may have inserted line feeds at natural points to make the HTML more readable, and in those cases, you might just want to make the replacement linefeed an empty string and let the HTML linefeeds rule.
1 2 3 4 5 6 7 8 9 10 |
-- SELECT 1, REPLACE(@HTMLText, cteTags.tag, @linefeed) FROM cteTagsToReplaceWithLF cteTags WHERE cteTags.R = 1 UNION ALL SELECT n + 1, REPLACE(ctelf.Html, cteTags.tag, @linefeed) FROM cteWithLineFeeds ctelf JOIN cteTagsToReplaceWithLF cteTags ON ctelf.n = cteTags.R |
Here, we use the good old recursive CTE to handle our replacements. We need the CTE to run for each of the html Tags in cteTagsToReplaceWithLF. The top SELECT statement in here does it for the first row, and the bottom statement continues the REPLACE for each subsequent row. Here, I’m using the column name [n] as the row number (index) column, so the results get numbered as the replacements happen.
Trivia: I think this is the only example that I have where I used a join in the bottom query of a recursive CTE.
The final CTE works similarly, but instead of replacing tags with linefeeds, it removes “<", ">“, and all of the content between the two.
1 2 3 4 5 6 7 8 9 10 11 12 |
-- SELECT TOP 1 0, Html FROM cteWithLineFeeds ORDER BY n DESC UNION ALL SELECT i + 1, CONVERT(nvarchar(MAX), STUFF(HtmlText, CHARINDEX(N'<', HtmlText), CHARINDEX(N'>', HtmlText, CHARINDEX(N'<', HtmlText)) - CHARINDEX(N'<', HtmlText) + 1, '')) FROM cteHtml WHERE CHARINDEX('<', HtmlText) > 0 AND CHARINDEX('>', HtmlText, CHARINDEX('<', HtmlText)) > 0 AND CHARINDEX('>', HtmlText, CHARINDEX('<', HtmlText)) - CHARINDEX('<', HtmlText) > 0 |
Our top SELECT statement gets the one row that we want to work with from cteWithLineFeeds. The “TOP 1” makes sure we only get one row from that prior CTE, and the “ORDER BY n DESC” makes sure we get the last row. So we’re starting with that one HTML string with all of our linefeed replacements done.
The bottom SELECT uses the CHARINDEX and STUFF functions from the post that I originally linked to find the “<" and ">“, and do the work to get rid of them and their innards. It looks confusing, but it’s repetitive.
- CHARINDEX(‘<', HtmlText) gives us the position of the first encountered "<" character.
- CHARINDEX(‘>’, HtmlText, CHARINDEX(‘<', HtmlText)) gives us the position of the first ">” character that occurs after the first “<" character.
- In spots, you see these two values subtracted and then added to 1, which gives the length of the string that needs to be removed.
After that, the final SELECT does the same trick with TOP 1 and ORDER BY to make sure we get the row back where all of the removals have been completed.
Now, let’s test our function. First, some setup data.
1 2 3 4 5 |
CREATE TABLE MyHtml (ID int, HTML nvarchar(MAX)) INSERT MyHtml (ID, HTML) VALUES (1, '<html><body><p>Artists that belong in the Rock and Roll Hall of Fame:</p><ul><li>Big Mama Thornton</li><li>Duran Duran</li><li>Pixies</li><li>Jane''s Addiction</li><li>Outcast</li></ul></body></html>'), (2, '<html><body><p>Attention: it has come to our notice that you have submitted an invalid application. This is thorugh<br>no fault of your own, but you must reapply<br>using our most up to date form.</p></body></html>') |
Now suppose we want to select out the content of the HTML column, but converted to plain text. The SELECT statement is super easy:
1 2 3 4 |
SELECT mh.ID, pt.PlainText FROM MyHtml mh CROSS APPLY dbo.Html_To_Plaintext_ITVF (mh.HTML, '**') pt OPTION(MAXRECURSION 0) |
A couple of quick notes:
- If your have more than 100 replacements in your HTML, you’ll need the OPTION(MAXRECURSION _) bit. Zero gives your code the right to run limitlessly, and you shouldn’t switch the number to zero until you’re really sure of everything else. Better yet, you can change it to a number up to about 32,000 and it will run until it gets to that recursion level and then error out.
- I used “**” as my linefeed character here so that it would be easy to see in the results. In the real world, you probably want a line feed CHAR(13), a carriage reuturn/line feed combination CHAR(10) + CHAR(13), or an empty string.
The code for the function along with some other queries that live in my bag of tricks are available for download on https://github.com/bens4lsu/SQL-Patterns.