Clean out all bad characters from a string.

10

March 25, 2015 by Kenneth Fisher

I have a customer who is having a problem with a load. He is getting a fair number of bad characters including some unicode characters. The unicode characters in particular are making it so those values won’t go into a varchar column. This isn’t an uncommon problem so when he came to me I said (without thinking) “Well just clear out the bad characters.” Unfortunately after having a few minutes to think about it I realized that’s easier said than done. I did some research and the best I could find were here and here. In one case each string is parsed one character at a time and in the other each string is cleared by using a while loop that clears out any bad character one at a time until none are left. Neither really satisfied me. Both are RBAR (row by agonizing row) and neither is particularly fast. So I wanted a set based method.

To start with I created some test data.

-- Create a table with a bunch of rows and one column 
-- with string data.
SELECT CAST(t.name + ' ' + c.name AS nvarchar(max)) AS StringToFix
	INTO BadStringList
FROM sys.tables t
CROSS JOIN sys.all_columns c

ALTER TABLE BadStringList ADD Id INT NOT NULL IDENTITY (1,1) 
     CONSTRAINT pk_BadStringList PRIMARY KEY


-- Put in one random (probably bad) character into about 
-- 2 percent of the rows.  Then do it 75 times
; WITH MyCTE AS (
SELECT TOP (2) percent StringToFix, STUFF(StringToFix, 
		CAST(rand((len(StringToFix) * datepart(ms,getdate()))^2) * len(StringToFix) AS Int) + 1, 1,
		NCHAR(CAST(rand((len(StringToFix) * datepart(ms,getdate()))^2) * 65025 AS Int))) AS Stuffed
FROM BadStringList
ORDER BY NEWID()
)
UPDATE MyCTE
SET StringToFix = Stuffed
GO 75

Here is my clean up code. Note it doesn’t actually clean the bad data out of the source, it produces a result with clean data.

DECLARE @Pattern varchar(50) = '%[^a-zA-Z0-9_''{}"() *&%$#@!?/\;:,.<>]%';

WITH FixBadChars AS (SELECT StringToFix, StringToFix AS FixedString, 1 AS MyCounter, Id
				FROM BadStringList
				UNION ALL
				SELECT StringToFix, Stuff(FixedString, PatIndex(@Pattern, 
					FixedString COLLATE Latin1_General_BIN2), 1, '') AS FixedString, 
					MyCounter + 1, Id
				FROM FixBadChars
				WHERE FixedString COLLATE Latin1_General_BIN2 LIKE @Pattern)
SELECT StringToFix, FixedString, MyCounter, Id
FROM FixBadChars
WHERE MyCounter = 
		(SELECT MAX(MyCounter) 
		FROM FixBadChars Fixed
		WHERE Fixed.Id = FixBadChars.Id)
OPTION (MAXRECURSION 1000);

This returned ~170k cleaned rows in under 30 seconds.

So here is what it does. Let’s start with the pattern I’m using.

DECLARE @Pattern varchar(50) = '%[^a-zA-Z0-9_''{}"() *&%$#@!?/\;:,.<>]%';

A basic start is this ‘[a-z]’ which will match a single character a-z. ^ in pattern matching is a NOT, so [^a-z] matches anything that is NOT a-z. Then we add in all the other characters [^a-zA-Z0-9_”{}”() *&%$#@!?/\;:,.] which will match any character that is not an upper or lower a-z, a didget 0-9, or one of the symboles listed. This is our approved list. Fair warning I was not able to figure out how to get it to work with []s in the list. Maybe someone who is better at pattern matching than me can figure it out. Last but not least add %’s on either end and you have pattern that will find a string that has any character that is not in the approved list.

Next I’m using a recursive CTE to loop through the string. Notice that the anchor is a test against the string to be fixed to see if there are any more bad characters. I also have a counter so that in the main query I can pull the entry for each string with the largest counter. That being the fixed row. I’m using a subquery to get the fixed row, joining on my primary key Id. In my case the query also works fine if you use StringToFix (assuming that it is unique). Basically you just need a unique per row value to join on. It helps if it’s indexed (30 seconds vs 45 for my run).

WHERE MyCounter = 
		(SELECT MAX(MyCounter) 
		FROM FixBadChars Fixed
		WHERE Fixed.StringToFix = FixBadChars.StringToFix)

Two more minior settings.

COLLATE Latin1_General_BIN2

The COLLATE is necessary because otherwise some unicode characters get missed by the PATINDEX command. I’m using SQL_Latin1_General_CP1_CS_AS as my default collation but any collation should work.

OPTION (MAXRECURSION 1000)

The recursive CTE will recurse once for each bad character in a string. So the deepest the query will recurse is the maximum number of bad characters in a single string. Realistically the default MAXRECURSION of 100 is probably sufficient in this case but I believe in better safe than sorry.

And that’s it. A non-RBAR way to clean a string. Jeff Moden would be so proud.

10 thoughts on “Clean out all bad characters from a string.

  1. davidbainbridge83 says:

    Are the characters bad or were they just coded that way?

  2. Oooh, I want to try your way. I wrote a simple one last week in two seconds but yeah… it was RBAR. Rowset was small, so no biggee, but I’d like to put it head-to-head with this one!

  3. Herbert Tobisch says:

    Jeff, I did not test this, but:
    your solution needs one function call for each “bad” character to remove; so why not just perform a recursive function call without the recursive CTE stuff ?

  4. Brian Miller says:

    Kenneth
    In the quest to find the most efficient (and quickest) ways to clean strings, like you I look to see if there is a set based solution. However I have often found that for most of the data sets I come across, a simple row by row solution can be very quick. I copied the code from your article and generated sample data (limiting it to 170,000 rows) and seeded it with bad characters again using your code. I then ran your solution on my server and achieved comparable times to yourself ranging from 35 seconds to 60 seconds. (CPU times from 29125 ms to 32641 ms). I then compared it with a Row by Row Solution using a CLR function (udf_CLR_RemoveChars) I had written as part of suite of string processing SQL Server functions. Using the same dataset I achieved a processing time of typically 3 to 4 seconds and using only 2700 to 3000 ms CPU Time: a 11 times speed increase. From my experience CLR functions are very efficient in processing strings in SQL server and should be considered where allowed. FYI: please find the code to my function below written in VB.net

    Imports System
    Imports System.Data
    Imports System.Data.SqlClient
    Imports System.Data.SqlTypes
    Imports Microsoft.SqlServer.Server

    Partial Public Class UserDefinedFunctions
    _
    Public Shared Function udf_CLR_RemoveChars(ByVal strIn As String, ByVal strRemove As String, ByVal strReplace As String) As String

    ‘ For Removing Specified chars contained in ‘strRemove’ from inputted string: ‘strIn’

    ‘ ‘strReplace’ can be specified to replace any single or consecutive removed chars with a single instance of ‘strReplace’ (use empty Sting to ignore)

    If strIn Is Nothing Then Return Nothing
    If strRemove Is Nothing Then Return strIn

    ‘ make sure strRemove has Square brakets
    strRemove = “[” & strRemove & “]”
    Dim booRemovedLastChar As Boolean = False

    Dim objStrBuilder As New System.Text.StringBuilder()

    For i As Integer = 0 To strIn.Length – 1
    If strIn.Substring(i, 1) Like strRemove Then
    booRemovedLastChar = True
    Else
    If booRemovedLastChar Then objStrBuilder.Append(strReplace)
    objStrBuilder.Append(strIn.Substring(i, 1))
    booRemovedLastChar = False
    End If
    Next

    Return objStrBuilder.ToString()
    End Function
    End Class

    • Excellent solution! And to be fair this type of situation is perfect for CLR. There is only one very important reason I didn’t use CLR in my solution.

      I don’t know it 🙂

      I mean I know the basics from the SQL Server side. Given your code I can set it up to run in SQL Server. I just couldn’t write the VB.net part.

      Thanks!

  5. Jeff Moden says:

    Gosh… Careful Kenneth. The Recursive CTE is actually a form of RBAR that should be avoided here.

    • Yea, I remembered that recursive CTE is REBAR under the covers (after I’d posted) but this is the best I could come up with. It certainly performs better than any T-SQL solution I’ve seen so far 🙂

  6. David Sumlin says:

    Ken, no need to know CLR…just download SQL# from http://www.sqlsharp.com and use the prebuilt (and free) regex functionality. Easy peasy, and fast.

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 6,758 other subscribers

Follow me on Twitter

Archives

ToadWorld Pro of the Month November 2013