SQL Server and Regular Expression may finally have settled their differences.
It is now possible to use Regular Expressions without leaving Management Studio.
I personally have wanted this functionality for a long time. Text-mining has always been a huge part of my job and regular expressions (in spite of the very true cartoon above) are a pretty useful tool. I’d sat at SQL Bits and been excited by the announcement that R and Python were coming to SQL Server via EXEC sp_execute_external_script. When the time was right I would use R to write Regular Expressions in SQL.
The time became right this week which meant I had the joy of letting someone else get there first. That someone was Jeffrey Yao and I used his excellent blog to get started. The blog really is good but it didn’t go as far as I wanted; namely it didn’t show how to extract multiple matches of, say, phone numbers.
Here is Mr Yao’s code with one adaptation that I have noted.
USE TestDB; -- assume this is our testing database IF OBJECT_ID('dbo.tblRegEx', 'U') IS NOT NULL DROP TABLE dbo.tblRegEx; CREATE TABLE dbo.tblRegEx( id INT IDENTITY, a VARCHAR(300), b VARCHAR(300)); GO INSERT INTO dbo.tblRegEx( a, b) VALUES('hello hello hello world', 'my phone number is 321-111-1111'), ('this this is is really fun', 'this number (604) 991-9111 is my cell phone; (704) 991-9111 is my other cell phone'), --only Difference to Jeffrey Yao blog is the extra phone number ('no duplicate here', 'no phone number here either, just my license# 111 111 2222'), ('multiple blank lines --this is 6th line', '222 333-4444 is my cell phone');
My goal was to get a list of phone numbers in a relational table that led back to tblRegex:
This was how I did it:
--usage example DECLARE @Json NVARCHAR(MAX); EXEC dbo.usp_Table_RegexExtract @PKField = 'id', @InputField = 'b', --name of field with scrapable data @Table = 'dbo.tblRegex', @RegexPattern = '(\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b)', @json = @Json OUTPUT; SELECT * FROM OPENJSON(@Json) WITH ( MyID INT 'strict $.LocalID', Input NVARCHAR(MAX) '$.InputFieldValue', OrderInd INT '$.OrderInd', ExtractedValue NVARCHAR(MAX) '$.FinalExtraction' );
These three lines of code contain: R script, box-fresh SQL Server features like JSON and Regular Expressions and this, in the SQL Server world is pretty darn sexy. (The converse of this statement is also true.)
Let’s take a look under the hood.
CREATE PROCEDURE dbo.usp_Table_RegexExtract @PKField NVARCHAR(128), --name of primary key field @InputField NVARCHAR(128), --name of field containing data to be scraped @Table NVARCHAR(128), @RegexPattern NVARCHAR(4000) = '(\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b)', @json NVARCHAR(MAX) = NULL OUTPUT AS /*extracts values from a field based on the regex pattern provided and returns the [nested] output as JSON */ SET NOCOUNT ON; BEGIN DECLARE @SQLStatement NVARCHAR(4000) = N'SELECT #PKField as id, #InputField AS b from #Table'; SELECT @SQLStatement = REPLACE(@SQLStatement, '#PKField', @PKField), @SQLStatement = REPLACE(@SQLStatement, '#InputField', @InputField), @SQLStatement = REPLACE(@SQLStatement, '#Table', @Table); DECLARE @RStatement NVARCHAR(4000)
The beginning of the stored procedure simply takes in a table name, a field name and a primary key field; it then puts them together to form a simple select statement; for example: SELECT id, b FROM dbo.tblRegEx; it also takes in a regular expression pattern, in this case the one for a US telephone number.
I used this SQL Statement to get the column (vector if you will) of data to be interrogated. The next part is where the magic happens:
DECLARE @RStatement NVARCHAR(4000) = N'pattern = "' + @RegexPattern + '"; m <- gregexpr(pattern, inData$b, ignore.case = T, perl = T); #NB ignore case n <- rep(NA,length(inData$b)); n <- sapply(regmatches(inData$b, m), paste, collapse="|"); #NB sapply here I found @dizzy_pete wrote a good blog http://petewerner.blogspot.co.uk/2012/12/using-apply-sapply-lapply-in-r.html inData$c <- n; outData <- inData; outData;';
The variable @RStatement is being prepared specifically to take advantage of sp_execute_external_script and whilst the differences between Mr Yao and my scripts are small they are important (and actually took me some time to get to). The first is that, in order to get multiple matches we need to use gregexpr (with a g at the front). This leads to the next conundrum, how to deal with nested vectors. After a few false starts, I ended up using sapply and using a pipe to separate multiple entries (for more information please see the blog linked to in the code).
This was enough R code to go straight to execution but first I needed to build a table variable to hold the output:
DECLARE @RegExOut TABLE (LocalID INT PRIMARY KEY NOT NULL, InputFieldValue NVARCHAR(MAX), InitalExtraction NVARCHAR(MAX) ); INSERT INTO @RegExOut( LocalID, InputFieldValue, InitalExtraction ) EXEC sp_execute_external_script @language = N'R', @script = @RStatement, @input_data_1 = @SQLStatement, @input_data_1_name = N'inData', @output_data_1_name = N'outData';
This meant that SQL had an object in which to store the regular expression matches. The next challenge was to return these to the user. That is to use our output variable.
I did this by invoking another long awaited SQL function STRING_SPLIT(), culminating the code with FOR JSON PATH and assigning this to our output variable.
SELECT @json = ( SELECT rxo.LocalID, rxo.InputFieldValue, rxo.InitalExtraction, ROW_NUMBER() OVER(PARTITION BY rxo.LocalID ORDER BY rxo.LocalID) AS OrderInd, extraction.Value AS FinalExtraction FROM @RegExOut rxo CROSS APPLY STRING_SPLIT(rxo.InitalExtraction, '|') extraction WHERE extraction.Value IS NOT NULL FOR JSON PATH );
And that was that. Now I just needed to use it, which I did like so:
--usage example DECLARE @Json NVARCHAR(MAX); EXEC dbo.usp_Table_RegexExtract @PKField = 'id', @InputField = 'b', --name of field containing data to be scraped @Table = 'dbo.tblRegex', @RegexPattern = '(\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b)', @json = @Json OUTPUT; SELECT *FROM OPENJSON(@Json) WITH ( MyID INT 'strict $.LocalID', Input NVARCHAR(MAX) '$.InputFieldValue', OrderInd INT '$.OrderInd', ExtractedValue NVARCHAR(MAX) '$.FinalExtraction' );
SQL, JSON, R and Regex all in three lines. Hurrah!