Part 2: Predictive Modeling using R and SQL Server Machine Learning Services

Recap!

Part 1 of Predictive Modeling using R and SQL Server Machine Learning Services covered an overview of Predictive Modeling and the steps involved in building a Predictive Model. Using our sample dataset – Ski Resort rental data – we wanted to predict RentalCount for the year 2015, given the variables – Month, Day, Weekday, Holiday and Snow.

Part 1 covered:
– Getting data from a SQL Server database
– Preparing data for modeling
– Training models
– Comparing results and finalizing a model

We found that for this problem, predictions using the Decision Tree model were more accurate than the Linear Regression model. SQL Server Machine Learning Services (MLS) lets us train and test Predictive Models using R or Python, in the context of SQL Server. Thus, we can build T-SQL programs that contain embedded R/Python scripts that train on data stored in the database.

Deploy Machine Learning code with SQL Server

In this part, we will deploy the R code we wrote in Part 1 to SQL Server. To deploy, we will store the trained model in a database table and create a stored procedure that uses the model to predict. This stored procedure can then be invoked from applications.

1. Create Table for storing the model: Here, we create a table in SQL Server to store the trained model. The model will be used for prediction in step 3.

2001_CreateTableRentalRxModels.jpg
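In case the screenshot is hard to read, here is a minimal sketch of such a table. The table and column names follow the MS tutorial, so adjust them to match your own script:

--One row per trained model; the model itself is stored as varbinary(max)
CREATE TABLE dbo.rental_rx_models (
    model_name VARCHAR(30) NOT NULL DEFAULT('default model') PRIMARY KEY,
    model      VARBINARY(MAX) NOT NULL
);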

2. Create Stored Procedure for generating the model: This stored procedure wraps the R scripts we wrote in Part 1 in sp_execute_external_script, a system stored procedure introduced in SQL Server 2016. Before sp_execute_external_script can run, external scripts must be enabled on the instance, as shown below.
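Enabling it typically looks like this – note that RECONFIGURE is required for the change to take effect, and on some builds a restart of the SQL Server service is also needed:

--Allow sp_execute_external_script to run R/Python scripts
EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;
--Verify: run_value should show 1
EXEC sp_configure 'external scripts enabled';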

The function to generate the Decision Tree model – rxDTree – is part of the RevoScaleR package for R. The RevoScaleR package includes numerous other R functions for importing, transforming, and analyzing data at scale. Note that these functions run on Microsoft's R distribution, which is built on open-source R and engineered to leverage the multithreaded, multinode architecture of the host platform. This means that when R code executes within a SQL Server stored procedure, it can take advantage of parallel processing.

2002_SPToGenerateTrainedDTreeModel

2003_generatetraineddtreemodel.jpg
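If you are not following along with the scripts, here is a rough sketch of what the procedure in the screenshots does. Names follow the MS tutorial; treat this as an outline rather than a drop-in script:

--Train rxDTree inside SQL Server and return the serialized model
CREATE PROCEDURE dbo.generate_rental_rx_model (@trained_model VARBINARY(MAX) OUTPUT)
AS
BEGIN
    EXECUTE sp_execute_external_script
        @language = N'R'
      , @script = N'
            require("RevoScaleR");
            # Categorical columns must be factors before training
            rental_train_data$Month   <- factor(rental_train_data$Month);
            rental_train_data$Day     <- factor(rental_train_data$Day);
            rental_train_data$WeekDay <- factor(rental_train_data$WeekDay);
            rental_train_data$Holiday <- factor(rental_train_data$Holiday);
            rental_train_data$Snow    <- factor(rental_train_data$Snow);
            # Train the decision tree and serialize it for storage
            rx_model <- rxDTree(RentalCount ~ Month + Day + WeekDay + Snow + Holiday,
                                data = rental_train_data);
            trained_model <- rxSerializeModel(rx_model, realtimeScoringOnly = FALSE);'
      , @input_data_1 = N'SELECT RentalCount, Year, Month, Day, WeekDay, Snow, Holiday
                          FROM dbo.rental_data WHERE Year < 2015'
      , @input_data_1_name = N'rental_train_data'
      , @params = N'@trained_model varbinary(max) OUTPUT'
      , @trained_model = @trained_model OUTPUT;
END;
GO

--Generate the model and save it into the table from step 1
DECLARE @model VARBINARY(MAX);
EXEC dbo.generate_rental_rx_model @model OUTPUT;
DELETE FROM dbo.rental_rx_models WHERE model_name = 'rxDTree';
INSERT INTO dbo.rental_rx_models (model_name, model) VALUES ('rxDTree', @model);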

3. Create Stored Procedure for prediction: Now that we have the trained model, we can create an SP that uses the model to predict rental counts for new data. Again, we are using the R code covered in Part 1; this time, however, it runs inside a SQL stored procedure. A sketch follows the screenshots below.

2004_SPToPredictRentalCountUsingDTreeModel.jpg

2005_PredictRentalCountForTestDataUsingDTreeModel.jpg


2006_PredictRentalCountUsingDTreeModel.jpg
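As a sketch, the procedure loads the stored model, deserializes it in R and scores the new rows with rxPredict (the procedure name and result shape are illustrative):

--Predict rental counts for new data using the stored rxDTree model
CREATE PROCEDURE dbo.predict_rentalcount_new (@model_name VARCHAR(30))
AS
BEGIN
    DECLARE @rx_model VARBINARY(MAX) =
        (SELECT model FROM dbo.rental_rx_models WHERE model_name = @model_name);

    EXECUTE sp_execute_external_script
        @language = N'R'
      , @script = N'
            require("RevoScaleR");
            rental_model <- rxUnserializeModel(model);
            # Apply the same factor conversions used during training
            rental_score_data$Month   <- factor(rental_score_data$Month);
            rental_score_data$Day     <- factor(rental_score_data$Day);
            rental_score_data$WeekDay <- factor(rental_score_data$WeekDay);
            rental_score_data$Holiday <- factor(rental_score_data$Holiday);
            rental_score_data$Snow    <- factor(rental_score_data$Snow);
            OutputDataSet <- rxPredict(rental_model, data = rental_score_data);'
      , @input_data_1 = N'SELECT RentalCount, Month, Day, WeekDay, Snow, Holiday
                          FROM dbo.rental_data WHERE Year = 2015'
      , @input_data_1_name = N'rental_score_data'
      , @params = N'@model varbinary(max)'
      , @model = @rx_model
    WITH RESULT SETS ((RentalCount_Pred FLOAT));
END;
GO

--Invoke from any application via plain T-SQL
EXEC dbo.predict_rentalcount_new @model_name = 'rxDTree';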

Isn’t that just awesome? We now have a Predictive Model that can be used within applications to predict rental counts. Now that we have covered a sample project, in Part 3 of the series I will share my experience using SQL Server Machine Learning Services to solve a problem at my work.

Before we conclude Part 2,

Predict using Native Scoring (SQL Server 2017): In SQL Server 2017, Microsoft introduced a native PREDICT function. This means we do not need to run R/Python code in a SQL stored procedure to do the actual prediction. Native scoring uses native C++ libraries that read a trained model stored in binary format (in our case, in a SQL Server table) and generate scores for new input data. The catch is that the model must have been serialized with rxSerializeModel, which is why the Resources section below links to that function.

2007_CreateTableNativeModelSupport.jpg

2008_GenerateNativeModel.jpg

2010_PredictionUsingNativePredictFunction.jpg
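As a sketch, scoring with the native PREDICT function looks like this – I am assuming a row in our models table holding a model serialized by rxSerializeModel, and the names are illustrative:

--Native scoring: no R runtime involved at prediction time
DECLARE @model VARBINARY(MAX) =
    (SELECT model FROM dbo.rental_rx_models WHERE model_name = 'rxDTree_native');

SELECT d.Month, d.Day, d.RentalCount, p.RentalCount_Pred
FROM PREDICT(MODEL = @model, DATA = dbo.rental_data AS d)
WITH (RentalCount_Pred FLOAT) AS p;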

Resources:

Scripts for Part 2: https://drive.google.com/file/d/15fwujRipLg-k2ozOFb9G9PFfBO327zTa/view?usp=sharing
RevoScaleR: https://docs.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/revoscaler
SQL Server ML Tutorial: https://microsoft.github.io/sql-ml-tutorials/R/rentalprediction/step/3.html
Native Scoring: https://docs.microsoft.com/en-us/sql/advanced-analytics/sql-native-scoring?view=sql-server-2017
RxSerializeModel: https://docs.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/rxserializemodel
Forecasts and Prediction using SQL Server MLS: https://docs.microsoft.com/en-us/sql/advanced-analytics/r/how-to-do-realtime-scoring?view=sql-server-2017


Part 1: Predictive Modeling using R and SQL Server Machine Learning Services

To R or not to R?

A few months ago, I asked myself an important question – which language should I learn first, R or Python? From my research, I found that R is regarded as more old-school and harder to learn, while Python is more popular. It made perfect sense to learn R first 🙂

For the uninitiated, R is a programming language that makes statistical and mathematical computation easy, and is useful for machine learning/predictive analytics/statistics work.

Along the way, I found the following courses useful.
https://www.edx.org/course/introduction-to-r-for-data-science
https://www.edx.org/course/programming-in-r-for-data-science

The goal has always been to explore Machine Learning Services in SQL Server and dive deeper thereafter. This 3-part series is a walk-through/review of Microsoft’s tutorial on Predictive Modeling using R and SQL Server. The first part deals with preparing data, training a model and using it for prediction.

What is Predictive Modeling?

Predictive Modeling uses statistics to predict outcomes.

In our example, we will use Machine Learning Services for SQL Server 2017 to predict the number of rentals for a future date in a ski rental business. This brings up an important question – why do we want to predict? In this case, prediction will help the business be prepared from a stock, staff and facilities perspective. Prediction has numerous, life-changing applications – read about Microsoft’s work with Apollo Hospitals, India, on applying AI to predict cardiovascular disease.

For the ski rental prediction, we will use sample data provided by MS, SQL Server 2017 with Machine Learning Services, and the R Studio IDE. Please check the following URL and follow the simple instructions to set up your environment. If you have any questions, please use the Comments section to drop me a note.
https://microsoft.github.io/sql-ml-tutorials/R/rentalprediction

PredictiveModeling-Steps1

Steps involved in building the predictive model in SQL Server

  1. Getting data
  2. Preparing data
  3. Training models
  4. Comparing results and choosing a model
  5. Deploying the Machine Learning script to SQL Server

1. Getting Data: Okay, let’s get started. In the first step, we will restore the sample database – TutorialDB – using the database backup file provided by MS. After restoring the database using SSMS, we will take a look at the rental data we will use for training the model.

All scripts used will be provided in the Resources section at the bottom of the page.

002_RestoreSampleDatabase-TutorialDB

003_ExamineTrainingData

You will see that we have 453 rows of rental stats. The data is in place, so we are good to move on to Step 2.
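If you want to verify this yourself from SSMS, here are a couple of quick sanity checks (the table is dbo.rental_data in the MS tutorial):

USE TutorialDB;
SELECT COUNT(*) AS row_cnt FROM dbo.rental_data;  --should return 453
SELECT TOP 10 * FROM dbo.rental_data ORDER BY Year, Month, Day;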

2. Preparing Data: Now that the database is restored and the data is available in SQL Server, we will load the data into R and transform it. Open R Studio and execute the rental data load script. After loading the data, we will examine a few rows/observations and inspect the data types. We will then change the types of a few columns to factor, since they hold categorical values. A sketch of this step appears after the screenshots below.

004_loadrentaldatafromsqlserver.jpg

005_DataPreparation
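The tutorial performs this step interactively in R Studio. To keep the examples on this page runnable from SSMS, here is the same inspection and factor conversion sketched through sp_execute_external_script – the printed output appears in the Messages tab:

--Inspect the data and convert categorical columns to factors
EXECUTE sp_execute_external_script
    @language = N'R'
  , @script = N'
        rentaldata <- InputDataSet;
        print(head(rentaldata));  # first few observations
        str(rentaldata);          # column types before conversion
        # Month/Day/WeekDay/Holiday/Snow are categories, not numbers
        for (col in c("Month","Day","WeekDay","Holiday","Snow"))
            rentaldata[[col]] <- factor(rentaldata[[col]]);
        str(rentaldata);          # column types after conversion'
  , @input_data_1 = N'SELECT * FROM dbo.rental_data';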

3. Training Models: Now that the data is prepared, we will choose a model that best describes the dependency between the variables in our dataset. During training, we provide the variables along with the outcome so that the model can learn to predict the outcome. Here, we will compare predictions from two different models and choose the more accurate one as our predictive model.
For clarity, let me restate that we are trying to predict RentalCount for the year 2015, given the variables – Month, Day, Weekday, Holiday and Snow.

The challenge in Machine Learning lies in knowing what the various models mean and when a particular model is more suitable. MS recommends this cheat sheet as a guide.

006_SplitDatasetToTrainingAndTest

007_TrainingUsingLinearRegressionAndDecisionTreeModels

008_RentalDataPrediction

4. Comparing Results: Here, we compare the results to figure out which model predicted more accurately. The Decision Tree model performed better in this case, and we will use it to deploy our Machine Learning solution to SQL Server in Part 2. A consolidated sketch of the train/predict/compare flow follows the plot below.

009_RentalDataPlotDifferencePredictedAndActual
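For reference, here is that flow in one script, using rxLinMod and rxDTree as in the tutorial. The variable names are mine, and where the tutorial plots predicted versus actual values, this sketch reduces each model to a mean absolute error:

--Train both models on pre-2015 data, score 2015, compare accuracy
EXECUTE sp_execute_external_script
    @language = N'R'
  , @script = N'
        require("RevoScaleR");
        d <- InputDataSet;
        for (col in c("Month","Day","WeekDay","Holiday","Snow"))
            d[[col]] <- factor(d[[col]]);
        train <- d[d$Year <  2015, ];
        test  <- d[d$Year == 2015, ];
        f <- RentalCount ~ Month + Day + WeekDay + Snow + Holiday;
        model_lin  <- rxLinMod(f, data = train);
        model_tree <- rxDTree(f, data = train);
        pred_lin  <- rxPredict(model_lin,  data = test);
        pred_tree <- rxPredict(model_tree, data = test);
        # Lower mean absolute error = more accurate model
        OutputDataSet <- data.frame(
            lin_mae  = mean(abs(pred_lin[[1]]  - test$RentalCount)),
            tree_mae = mean(abs(pred_tree[[1]] - test$RentalCount)));'
  , @input_data_1 = N'SELECT * FROM dbo.rental_data'
WITH RESULT SETS ((lin_mae FLOAT, tree_mae FLOAT));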

Resources:

Scripts for Part 1 (zip file): https://drive.google.com/file/d/1Tzs4qzFXXL-NgxFlKQcuVHKwx90Imr8u/view?usp=sharing
SQL Server R tutorials: https://docs.microsoft.com/en-us/sql/advanced-analytics/tutorials/sql-server-r-tutorials?view=sql-server-2017
GitHub repo for rental prediction: https://microsoft.github.io/sql-ml-tutorials/R/rentalprediction/
Machine Learning cheat sheet, use with caution 🙂 https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice#the-machine-learning-algorithm-cheat-sheet

Export data from Cosmos DB to SQL Server

Cosmos DB is Microsoft’s latest NoSQL database, offering low latency, high scalability and geo-distribution to any Azure region. Read Microsoft documentation about Cosmos DB here or check out my blog about Introduction to Azure Cosmos DB. In its infancy, Cosmos DB was known as DocumentDB. Renaming was inevitable, as DocumentDB had evolved beyond just a document store.

Following is a simple note on how to export data from Cosmos DB to SQL Server. You may want a dump of data from a Cosmos DB collection to SQL Server for analysis, data integrity checks, troubleshooting a production issue or to derive insights.

There are a few ways to export data from Cosmos DB. The quickest is the DocumentDB / Cosmos DB Migration Tool, provided by Microsoft to migrate data to and from Cosmos DB, with support for sources such as MongoDB, JSON, CSV and SQL Server.

1. Use Azure Cosmos DB Migration tool to export data to json files:

  • Install the tool and launch it as Administrator (use an Azure VM for best performance). Please be mindful of the spike in RU consumption when exporting data from your collection. To avoid throttling, scale your collection up as required just before the export, or export during off-peak hours, if any. Leave a comment if you need any pointers on this.

LaunchCosmosDBMigrationTool
  • Choose the source – DocumentDB (the tool still refers to Cosmos DB by its old name, DocumentDB)

CosmosDBMigrationTool-Source

  • Specify the connection string. You can find the endpoint and key in the Keys section of your Cosmos DB account in the Azure portal. Please note that Cosmos DB is case-sensitive.

CosmosDBMigrationTool-ConnectionString

  • A peek into Advanced options
    • Include internal fields – for each document, Cosmos DB maintains a set of auto-generated internal/system fields such as _ts and _self. Their names start with an underscore, making them easy to distinguish from user fields. This option lets you include internal fields in the output – handy especially for _ts, which indicates when the document was last updated.
    • Number of retries on failure & Retry interval – set a reasonable number of retries. In this case, I have used the value 1.
    • Connection Mode – you may want to use Gateway mode to bypass firewall rules.

CosmosDBMigrationTool-AdvancedOptions

  • Specify target information. In this case, we want to export to a json document. We could either output to a local file or a blob.

CosmosDBMigrationTool-TargetInformation

  • Error Logging. Set these options to enable error logging.

CosmosDBMigrationTool-ErrorLogging

Hit Import, and if there are no errors, you will soon have a new json file with all the data from your collection. If you want only a subset of the data, all you need to do is modify the source query.
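If you prefer scripting the export, the migration tool also ships with a console version, dt.exe. A rough sketch of the equivalent command follows – the parameter names are as I recall them from the tool’s documentation, so verify against your installed version:

:: Export a Cosmos DB collection to a local json file (placeholders in angle brackets)
dt.exe /s:DocumentDB /s.ConnectionString:"AccountEndpoint=https://<account>.documents.azure.com:443/;AccountKey=<key>;Database=<database>" /s.Collection:<collection> /t:JsonFile /t.File:C:\Temp\export.json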

2. Import json files to SQL Server:

SQL Server 2016 introduced a JSON parsing function called OPENJSON. If none of your user databases have been upgraded to compatibility level 130 yet but you have a 2016 engine, context-switch to a system database to use OPENJSON.

DECLARE @productNutrition varchar(max);

--Read the json file into a variable using OPENROWSET
SELECT @productNutrition = BulkColumn
FROM OPENROWSET(BULK 'C:\Users\Smruthi\Downloads\Arjun\productnutrition_20180427.json', SINGLE_BLOB) AS jsonFile;

--Pass the variable containing json as a parameter to the OPENJSON function
SELECT *
FROM OPENJSON (@productNutrition)
WITH
(
    ProductID varchar(20) '$.id',
    ProductDescription varchar(100) '$.description',
    ProductGroup varchar(200) '$.foodGroup',
    ServingAmount float '$.servings[1].amount',
    ServingUnit varchar(10) '$.servings[1].description',
    nutrients nvarchar(max) AS JSON --returns the sub-object as raw JSON; note that JSON paths are case-sensitive
);

Give it a go yourself. If you have any question, leave a comment and I will be happy to assist.

If you are exploring Cosmos DB, consider reading my blog about Azure Cosmos DB – Partition and Throughput to get an overview of partitioning and scaling concepts.

jsonfileSample

openjson Example

Live Query Statistics – SQL Server 2016

SQL Server Management Studio (SSMS) 2016 has a handy new feature to View Live Query Statistics. This feature also works when connected to a 2014 instance using SSMS 2016.

Why use it?

Previously, similar (post-execution) stats were available by running:

SET STATISTICS PROFILE ON;

With the new feature, the main difference is that the stats refresh as the query progresses, providing a live view of what SQL Server is up to. This is a fantastic way to learn about query processing and execution plans, and it helps identify performance problems. Only yesterday, I used it to spot a “bad plan” where SQL Server was doing an Eager Spool – cross-joining two million-record tables and filling up tempdb.

How to use it?

To use the feature, simply click on “Include Live Query Statistics” button in the toolbar or under menu Query -> Include Live Query Statistics.
IncludeLiveQueryStatistics-SQLServer2016

Watch the magic unfold when you execute a query.

LiveQueryStatistics-Example.png

 

If you are not using SQL Sentry Plan Explorer already, please install the PRO version for free from SentryOne. I find it amazing, and it’s probably the best free tool out there for performance analysis. Install the SSMS add-in, and you can right-click on a plan and view it in SQL Sentry Plan Explorer. It becomes incredibly useful for complex plans that are cumbersome to navigate in SSMS. Not to mention the host of other info that’s presented nicely to make performance tuning enjoyable.

SQLSentryPlanExplorer-AddIn

SQLSentryPlanExplorer.jpg

Watch out for that..

Live Query Statistics is an interesting feature, but I had a nervous few moments when my long-running ETL stored procedure produced a plan that was far too big for SSMS to handle. I eventually had to cancel execution and view the plan from a DMV, as sketched below. So, while it is a nice feature, watch out for a few gotchas. Microsoft also advises that there may be a mild performance dip in the execution of your query while you use this feature.
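For the record, this is roughly the DMV route I fell back on. sys.dm_exec_query_statistics_xml (SQL Server 2016 SP1 onwards) returns the in-flight plan of a running session; on older builds it needs lightweight query profiling to be enabled:

--Fetch the live plan of a running session (replace 53 with the session_id of the query)
SELECT r.session_id, r.status, q.query_plan
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_query_statistics_xml(r.session_id) AS q
WHERE r.session_id = 53;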

Who can use it?

You need to have SHOWPLAN and VIEW SERVER STATE permissions to be able to use this feature effectively.
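Granting them is straightforward – the principal names below are just placeholders:

GRANT SHOWPLAN TO [DOMAIN\SomeUser];           --database-scoped; run in each relevant database
GRANT VIEW SERVER STATE TO [DOMAIN\SomeLogin]; --server-scoped; run in master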

Overall, this is a great feature that puts information at your fingertips when you are scrambling to address that performance issue that is driving everyone nuts!

Database Mail shutting down. Mail queue has been stopped.

DB mail on our production servers started acting up last week. We are on SQL Server 2012 SP1. There were a couple of distinct sets of symptoms; following is what I did to get DB mail working again.

I hope you don’t spend 4 hours like I did to find a solution.

Symptom 1:

Mails are no longer sent, and sending a test mail does not work. When DB mail status is checked using the following SP, it returns “STARTED”.

USE msdb
GO
EXEC dbo.sysmail_help_status_sp

However, checking the status of the mail items themselves using the following query will show that they are “unsent”.

SELECT TOP 10 sent_status, *
FROM dbo.sysmail_allitems WITH (NOLOCK)
ORDER BY send_request_date DESC;

Stopping DB mail using the following SP does not work anymore; the command takes forever to execute and nothing happens.

USE msdb
GO
EXEC dbo.sysmail_stop_sp

If these are the symptoms,

  1. Kill the DatabaseMail.exe process from Task Manager on the server to stop DB mail.
  2. Check the status using sysmail_help_status_sp and ensure it shows “STOPPED”.
  3. Start DB mail using the following command.
    USE msdb
    GO
    EXEC dbo.sysmail_start_sp
  4. Check the status of the mail items. The sent_status column should now show “sent”.

Symptom 2:

The fix for “Symptom 1” worked on one server, but on another, mails were still not being sent. If the above fix doesn’t do it for you, stop DB mail and check the mail log in SSMS at

Management -> Database Mail -> View Database Mail Log

If the log shows the message – “DatabaseMail process is shutting down. Mail queue has been stopped” – check for mails stuck in the mail queue.

USE msdb
GO
SELECT COUNT(*) FROM ExternalMailQueue; --check whether this is non-zero

  1. If the above query returns a non-zero number, there are stuck mails. Use the following script to drain the queue. Please be informed that all these MAILS WILL BE LOST!

USE msdb
GO
ALTER QUEUE ExternalMailQueue WITH STATUS = ON;
SET NOCOUNT ON;
DECLARE @conversation_handle uniqueidentifier;
DECLARE @message_type nvarchar(256);
DECLARE @counter bigint = (SELECT COUNT(*) FROM ExternalMailQueue);
DECLARE @counter2 bigint = 0;
--Drain the queue one message at a time
WHILE (@counter2 < @counter)
BEGIN
    RECEIVE TOP (1) @conversation_handle = conversation_handle,
                    @message_type = message_type_name
    FROM ExternalMailQueue;
    SET @counter2 = @counter2 + 1;
END

2. Check the number of mails in the mail queue. It should now be zero.

Use msdb
GO
Select count(*) from ExternalMailQueue

3. Start DB mail using

USE msdb
GO
EXEC dbo.sysmail_start_sp

 4. Check status of mail items. The sent_status column should now show “sent”.

XQuery

XQuery support was introduced in SQL Server 2005. It’s a powerful feature, and when used together with a CTE, it helps keep code simple and clean. I use XML extensively in my code, and many a time I find myself looking up old code or searching the internet for the right syntax. I hope this post will serve as a single point of reference for the most common XQuery coding tasks.

Following is a simple script to read from an XML variable. The same code can be used to read from an XML column as well.

NOTE: I will append more XQuery code samples to this post.

DECLARE @xmlProduct XML = 
'<Catalog>
	<Product>
		<Name>Selle Italia Road</Name>
		<Code>SIR-1</Code>
		<Category>Saddle</Category>
		<Description>Sleek saddle from Selle Italia</Description>
		<UnitPrice>50</UnitPrice>
		<Currency>USD</Currency>
	</Product>
	<Product>
		<Name>Brooks Tourer</Name>
		<Code>BT-T</Code>
		<Category>Saddle</Category>
		<Description>Brooks leather saddle for touring</Description>
		<UnitPrice>120</UnitPrice>
		<Currency>USD</Currency>
	</Product>
</Catalog>
'
--Read from nodes
SELECT
c.value('Name[1]','varchar(50)') [ProductName],
c.value('Code[1]','varchar(50)') [Code],
c.value('Category[1]','varchar(50)') [Category],
c.value('UnitPrice[1]','varchar(50)') + CHAR(32) + c.value('Currency[1]','varchar(50)') [Price],
c.value('Description[1]','varchar(50)') [Description]
FROM 
@xmlProduct.nodes('//Catalog/Product') AS TAB(c)

Output:
XQueryOutput
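To start off those appends, here is one more pattern I reach for often – filtering rows with an XQuery predicate via exist(). It runs in the same batch as the script above, reusing the @xmlProduct variable:

--Return only products in the Saddle category
SELECT
c.value('Name[1]','varchar(50)') [ProductName]
FROM
@xmlProduct.nodes('/Catalog/Product') AS TAB(c)
WHERE c.exist('Category[. = "Saddle"]') = 1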

Retrieve unsaved query

I was working on a remote machine, and while I was away, the machine was restarted. I lost the scripts I had not saved – or so I thought – until I came up with this query. It was a life saver!

SELECT TOP 10
EST.text
FROM sys.dm_exec_query_stats EQS
CROSS APPLY sys.dm_exec_sql_text(EQS.sql_handle) EST
WHERE EST.text LIKE '%text%' --replace "text" with a phrase you remember from the lost script
ORDER BY EQS.last_execution_time DESC