SQL Server HTML to Plain Text Function

This week, I had a requirement to convert some HTML text in a SQL column to plaintext and store that plaintext in a different database. I found this solution for html to text, but it’s using a WHILE loop and declaring variables, and we all know that loops make SQL server work row-by-row rather than exploiting it’s data-set processing power, right? (To be fair, it’s a very old post, and it’s likely that my function wasn’t a possibility when it was written.)

So I used the same string functions from that post, made some other improvements (converting HTML line feeds with text ones, using nvarchar data types instead of varchar) and wrapped it up into a function. If you want to grab the function and go, here it is. We’ll follow up with an explainer and a test of our function in case there’s confusion on how to put it into use.

Now, let’s look at what’s going on. The first thing we see is a CTE, cteTagsToReplaceWithLF.

You can just run this on any database, and you’ll get a two-column result. The first column is a sequential number, and the second column is one of the tags that we spelled out there. These are the tags that we’re going to replace with linefeeds.

one row per tag, with an additional column for row number

The 2nd CTE in the list does the work of replacing our html tags with linefeed characters.

You will notice that I made the linefeed replacement character a parameter. Depending on the application, you might want to send a LF character (use CHAR(13) or CHAR(0x0D)) or a CR + LF combination (CHAR(10) + CHAR(13), or CHAR(0x0A) + CHAR(0x0D)). In some cases the author of the HTML may have inserted line feeds at natural points to make the HTML more readable, and in those cases, you might just want to make the replacement linefeed an empty string and let the HTML linefeeds rule.

Here, we use the good old recursive CTE to handle our replacements. We need the CTE to run for each of the html Tags in cteTagsToReplaceWithLF. The top SELECT statement in here does it for the first row, and the bottom statement continues the REPLACE for each subsequent row. Here, I’m using the column name [n] as the row number (index) column, so the results get numbered as the replacements happen.

Trivia: I think this is the only example that I have where I used a join in the bottom query of a recursive CTE.

The final CTE works similarly, but instead of replacing tags with linefeeds, it removes “<", ">“, and all of the content between the two.

Our top SELECT statement gets the one row that we want to work with from cteWithLineFeeds. The “TOP 1” makes sure we only get one row from that prior CTE, and the “ORDER BY n DESC” makes sure we get the last row. So we’re starting with that one HTML string with all of our linefeed replacements done.

The bottom SELECT uses the CHARINDEX and STUFF functions from the post that I originally linked to find the “<" and ">“, and do the work to get rid of them and their innards. It looks confusing, but it’s repetitive.

  • CHARINDEX(‘<', HtmlText) gives us the position of the first encountered "<" character.
  • CHARINDEX(‘>’, HtmlText, CHARINDEX(‘<', HtmlText)) gives us the position of the first ">” character that occurs after the first “<" character.
  • In spots, you see these two values subtracted and then added to 1, which gives the length of the string that needs to be removed.

After that, the final SELECT does the same trick with TOP 1 and ORDER BY to make sure we get the row back where all of the removals have been completed.

Now, let’s test our function. First, some setup data.

Now suppose we want to select out the content of the HTML column, but converted to plain text. The SELECT statement is super easy:

A couple of quick notes:

  • If your have more than 100 replacements in your HTML, you’ll need the OPTION(MAXRECURSION _) bit. Zero gives your code the right to run limitlessly, and you shouldn’t switch the number to zero until you’re really sure of everything else. Better yet, you can change it to a number up to about 32,000 and it will run until it gets to that recursion level and then error out.
  • I used “**” as my linefeed character here so that it would be easy to see in the results. In the real world, you probably want a line feed CHAR(13), a carriage reuturn/line feed combination CHAR(10) + CHAR(13), or an empty string.

The code for the function along with some other queries that live in my bag of tricks are available for download on https://github.com/bens4lsu/SQL-Patterns.

Posted in MS SQL, Transact SQL | Leave a comment

Comments are closed.