Part 1 covered:
– Getting data from a SQL Server database
– Preparing data for modeling
– Training models
– Comparing results and finalizing a model
We found that for this problem, predictions using the Decision Tree model were more accurate than the Linear Regression model. SQL Server Machine Learning Services(MLS) lets us train and test Predictive Models using R or Python, in the context of SQL Server. Thus, we can build T-SQL programs that contain embedded R/Python scripts that train on data stored in the database.
Deploy Machine Learning code with SQL Server
In this part, we will deploy the R code we wrote in Part 1 to SQL Server. To deploy, we will store the trained model in database and create a stored procedure that predicts using the model. This stored procedure can be invoked from applications.
1. Create Table for storing the model: Here, we create a table in SQL Server to store the trained model. The model will be used for prediction in step 3.
2. Create Stored Procedure for generating the model: This stored procedure will use the R scripts we wrote in Part 1 utilizing sp_execute_external_script introduced in SQL Server 2016. To execute sp_execute_external_script, first enable external scripts by using the statement – sp_configure ‘external scripts enabled’, 1;
The function to generate Decision Tree model – rxDTree – is part of the RevoScaleR package for R. RevoScaleR package includes numerous other R functions for importing, transforming, and analyzing data at scale. Point to note is that the functions run on the RevoScaleR interpreter, built on open-source R. It is engineered to leverage the multithreaded and multinode architecture of the host platform, meaning when R code executes within a SQL Server SP, it utilizes parallel processing.
3. Create Stored Procedure for prediction: Now that we have the model output, we can create an SP that would use the model to predict rental count for new data. Again, we are using the R code covered in Part 1, only that this time we are using it in a SQL Stored Procedure.
Isn’t that just awesome? We have a Predictive Model that can be used within applications to predict rental count. Now that we have covered a sample project, in Part 3 of the series, I will share my experience using SQL Server Machine Learning Services to solve a problem at my work.
Before we conclude Part 2,
Predict using Native Scoring (SQL Server 2017*): In SQL Server 2017, Microsoft has introduced a native predict function. What this means is we do not need to run R/Python code in a SQL stored procedure to do the actual prediction. Native scoring uses native C++ libraries that reads a trained model stored in binary format (in our case in a SQL Server table), and generate scores for new input data.
A few months ago, I asked myself an important question – which language to learn first – R or Python? From my research, I found that R is regarded as more old-school and difficult to learn, but Python is more popular. It made perfect sense to learn R first 🙂
For the uninitiated, R is a programming language that makes statistical and mathematical computation easy, and is useful for machine learning/predictive analytics/statistics work.
The goal has always been to explore Machine Learning Services in SQL Server, and dive deeper thereafter. This 3-part series is a walk through/review of Microsoft’s tutorial on Predictive Modeling using R and SQL Server. The first part deals with preparing data, training a model and using it for prediction.
What is Predictive Modeling?
Predictive Modeling uses statistics to predict outcomes.
In our example, we will use Machine Learning Services for SQL Server 2017 to predict number of rentals for a future date in a ski rental business. This brings up an important question – why do we want to predict? In this case, prediction will help the business be prepared from a stock, staff and facilities perspective. Prediction has numerous, life-changing applications. Read about application of AI in predicting cardiovascular disease by Microsoft for Apollo Hospitals, India.
For the ski rental prediction, we will use test data provided by MS, SQL Server 2017 with Machine Learning Services, and R Studio IDE. Please check the following URL and follow the simple instructions to set up your environment. If you have any question, please use Comments section to drop me a note. https://microsoft.github.io/sql-ml-tutorials/R/rentalprediction
Steps involved in building the predictive model in SQL Server
Comparing results and choosing a model
Deploying the Machine Learning script to SQL Server
1. Getting Data: Okay, let’s get started. In the first step, we will restore sample database – TutorialDB – using the database backup file provided by MS. After restoring the database using SSMS, we will take a look at rental data we will use for training the model.
All scripts used will be provided in the Resources section at bottom of the page.
You will see that we have 453 rows of rental stats. Data is in place, so we are good to move on to Step 2.
2. Preparing Data: Now that database is restored and data available in SQL Server, we will load data to R and transform it. Open R Studio and execute the rental data load script. After loading data, we will examine a few rows/observations and inspect data types. We will then proceed to change types of a few columns to factor.
3. Training Models: Now that data is prepared, we will chose a model that best describes dependency between variables in our dataset. During training, we provide the variables along with the outcome so that our model can train to predict the outcome. Here, we will compare predictions by two different models and choose the more accurate one as our predictive model.
For clarity, let me state that we are trying to predict RentalCount for the year 2015, given the variables – Month, Day, Weekday, Holiday and Snow.
The challenge in Machine Learning is in knowing what various models mean, and when a particular model might be more suitable. MS recommends this cheat sheet as a guide.
After the 2018 Microsoft Ignite event, Microsoft announced a free trial of Azure Cosmos DB. For those who are eager to check out Cosmos DB, this is a great opportunity to familiarize yourself with the hottest No SQL database in the market. Cosmos DB is currently sitting pretty at Rank 29 in the DB Engines Ranking page. Quite an achievement considering that the product is “only 1 year old”.
There is no need for a credit card or subscription to avail this free trial. Please do note though that MS is likely to withdraw trial in a few months.
Step 1: Choosing an API/data model Once you click the above URL, you will be asked to pick an API and data model. Go ahead and pick SQL to check out document (JSON) data model if you are unsure. I want to check out Gremlin API for Graph, so I am choosing Graph in this example.
You will be asked to login with your Microsoft account. If you don’t have one, create a new account.
Step 2: Choose default options and click on Create container. Note that container will be created free of cost at a low throughput of 400 RU/s. If you are not familiar with throughput, read my article about Azure Cosmos DB – Partition and Throughput.
You now have a brand new container and the familiar Azure portal access. My trial container has read/write location of Central US.
Closer look: There is a lot going on here, so let us take a closer look.
My container is called Persons and it is in graphdb database. Cosmos DB uses Apache Tinkerpop Gremlin API for graph traversal. Currently, my container Persons is empty, so let us connect a sample application. If you are new to Graph DB, I recommend reading this free ebook from another popular Graph DB product – Neo4j. https://neo4j.com/graph-databases-book/
Step 3: Sample project Click on the Quick start blade to download a sample project and explore graph data.
The sample project already has the connection string set to the trial database, so we can execute the project right away. How awesome is that!
In Step 1, I chose Graph API and model, so my sample project has Gremlin API queries. Read more about Gremlin API here – https://aka.ms/gremlin.
Step 4: Execute sample project
Now it’s time to execute the project!
Step 5: Data Explorer
Looks like the program has added a number of graph documents to the container. Let us head over to Azure Portal and explore using Data Explorer blade. Click on the Execute Gremlin Query button to take a look at the data added by sample program. Now let us take a closer look at the output.
Closer look: A quick look shows that the query g.V() returned 4 nodes. Result can be viewed in either JSON or Graph mode.
If we click on the second node – Ben – and zoom in, we get a nice graphical view of ben’s relationships. Click and explore to view related nodes.
Switch to JSON mode to view result in JSON format. Pretty self-explanatory there 🙂
Step 6: Monitor activity Head over to Activity Log blade to take a look at activity in the Cosmos DB account. This could be downloaded as a .csv or exported to Event Hub for further analysis.
We took a look at the Cosmos DB Trial offered by Microsoft and got started with Graph API. Graph DB has a number of attractive use cases such as Fraud Detection and Recommendation Engine. Another popular Graph DB is Neo4j. I hope this article helped you understand how powerful Cosmos DB’s multi model support is. If you have any questions, drop me an email or add a comment. I will be happy to help. Add me on LinkedIn to stay connected – https://www.linkedin.com/in/arjunsivadasan/
In my previous article Introduction to Azure Cosmos DB, I mentioned Partition and Throughput only briefly. Adopting a good partition scheme is quintessential to setting up your Cosmos DB container for elastic scaling and blazing performance. This article will take a closer look at these two aspects to help fully utilize the storage and performance offerings of Cosmos DB.
Azure Cosmos DB containers store documents, graphs or tables. Containers (a.k.a. collections in the context of documents) are logical entities that could be distributed across multiple physical partitions or servers.
Physical and Logical Partitions
A physical partition is an internal Cosmos DB concept, essentially a fixed amount of SSD storage combined with a variable amount of compute power (CPU, memory and IO). The number of physical partitions of a container depends on its storage and throughput. For containers with shared throughput, number of partitions depends on RU/s assigned to the set of containers.
Request Unit (RU/s) – is the unit of throughput. 1 RU/s serves a get by self-link (internal property) or id of a 1 KB item.
When a collection is created, we can specify a fixed storage capacity of 10 GB or unlimited capacity. A fixed storage collection is limited in performance to a max of 10,000 RU/s.If we choose unlimited capacity, the collection created potentially has no max RU/s limit. Collections are supposedly unlimited in terms of storage and throughput, and physical partition management is handled by Cosmos DB behind the curtains. Note that for a multi-partition collection, we need to specify a partition key.
Data within a container having the same partition key value form a logical partition. The max storage limit of a logical partition is 10 GB, which means if data associated with a certain partition key value goes beyond 10 GB, the logical partition will be full and cannot grow any further. This is why adopting a good partition scheme is very important to avail the storage and performance guarantees of Azure Cosmos DB.
Azure Cosmos DB internally has a limit for the max throughput that can be provided by a physical partition – PRUmax. This value keeps changing based on factors such as hardware used and platform upgrades. For now, keep in mind that this happens behind the scenes.
Let us assume PRUmax = 10,000 RU/s. We create an unlimited Collection product at 20,000 RU/s initial throughput and productid as the partition key. Cosmos DB has to create at least 2 physical partitions to support the 20,000 RU/s throughput requested. Currently, the default seems to be 5. So, Cosmos DB creates a new collection with 5 physical partitions. The throughput requested will be equally assigned to these physical partitions. This means, the max throughput limit for each partition is 20,000/5 = 4000 RU/s.
As we add new documents, Cosmos DB allocates the key space of partition key hashes evenly and consistently across the 5 physical partitions. If the partition key is well chosen, writes will be distributed evenly across the partitions, each partition serving nearly 5000 RU/s and a cumulative of nearly 20,000 RU/s. This is ideal. In real world, it is possible that we chose a bad partition key.
What can go wrong?
Performance impact: If majority of the concurrent writes/reads pertain to a specific partition key value, we could have 1 physical partition maxing out the 5000 RU/s allocated to it (hot partition), while the other 4 partitions idling. When this happens, requests are bound to get rate-limited and we will get Http 429 response code.
Storage impact: Earlier in the article, I mentioned the concept of logical partition. All data having the same partition key form a logical partition. Logical partitions cannot be split across physical partitions. For the same reason, if the partition key chosen is of bad cardinality, we could potentially have skewed storage distribution. Say, 1 logical partition becomes fatter faster and hits the max limit of 10 GB, while the others are nearly empty. The physical partition housing the maxed out logical partition cannot split and could thus cause an application downtime.
Physical partition split
Azure Cosmos DB manages physical partitions seamlessly behind the scenes, if you chose your partition key smartly that is. Following are two scenarios when Cosmos DB will split a physical partition.
Storage limit of 10 GB: When a physical partition is full, Cosmos DB will split it into 2 new partitions assigning data corresponding to nearly half of the keys to each new partition. As mentioned previously, the split cannot happen if data in the physical partition in question have the same partition key value.
Increasing throughput: When throughput assigned is increased such that the existing number of physical partitions are insufficient to support it, Cosmos DB will add new physical partitions. In the above example, if the throughput is increased to 100,000 RU/s, Cosmos DB would add 5 new physical partitions.
Cosmos DB needs 100,000/PRUmax = 10 physical partitions to support the throughput setting.
What makes Cosmos DB an attractive high volume transaction database is the ease of scaling. When request rates are low, throughput could be lowered to keep costs down. Cosmos DB’s performance is predictable. For example, a read of a 1-KB document with session consistency always consumes 1 RU, regardless of number of concurrent requests or amount of data stored.
There are, however, two major design considerations to facilitate elastic scaling of Azure Cosmos DB.
Distribute requests and storage
Ideal candidate property for partition key will allow writes to be distributed across various distinct values. Requests to the same partition key should remain lower than the max throughput limit allocated to a partition. A good partition key will evenly distribute writes across all physical partitions and not cause hot partitions. In our example, productid is a good partition key, because it is unlikely that all concurrent requests will be focused on a specific product. If we were to chose the property productcategory as partition key, that could potentially cause hot partitions
Partition scope for queries and transactions
At one extreme, we can use the same partition key for all documents. At the other extreme, we can have unique partition key for each document. Both approaches have their limitations. Using the same partition key for all documents will limit scalability and cause a hot partition and inefficient utilization of throughput. Using unique partition keys will support high scalability, but result in a lot of cross-partition queries and prevent use of cross-document transactions. Occasional fan-out of queries is not too bad, but frequent fan-out will incur high RU consumption and result in rate-limiting.
Throughput can be estimated based on the number of expected reads/writes per second. 1 Request Unit (RU) corresponds to read of a 1-KB document containing 10 unique property values by self-link or id. Write, replace or delete will consume more RU/s.
Microsoft provides a Request Unit calculator that serves to arrive at a base throughput to assign when creating a new collection. Be prepared to fine tune the RU setting as you trot along, but this is a good starting point.
This URL ignites nostalgia 🙂
Azure Cosmos DB is a lot more versatile compared to the initial Document DB days. With added support for Mongo DB, Graph, Cassandra and Table APIs and multi-master and global distribution support, Cosmos DB is definitely the most exciting product in database technology at the moment. With the new Azure data products such as Azure Stream Analytics, Azure Data Bricks and HDInsight supporting out-of-the-box integration with Cosmos DB, it is fast becoming a good candidate for Big Data solutions.
Please feel free to reach out if you have questions. I’m always happy to discuss technology 🙂
Cosmos DB started as an internal project – “Project Florence” – at Microsoft in 2010. It is a globally distributed, highly scalable and performant multi-model database service built on Azure. It was built ground up with horizontal scalability and global distribution in view, so you can expect seamless geo-replication to Azure regions by clicking on a map and performance to the order of milliseconds.
Why should you use Cosmos DB?
Cosmos DB is the future of storage. Below are some reasons you may want to use it.
You use Microsoft stack already and find your favourite RDBMS – SQL Server – too slow
Your applications need to process large amounts of data at blazing speeds
Your data is not relational anymore, you want to use JSON document store or Graph DB or one of the many popular APIs Cosmos DB supports
Big Data applications – This is an evolving use case. Many Azure products such as Azure Stream Analytics, HDInsight and Data Bricks integrate with Cosmos DB, making it an ideal choice for persisting big data.
You are having to “over-provision” specs to cover peak loads on the database server
Your business spans multiple continents and you want easy and reliable replication
You trust Microsoft to do a better job with availability using Azure
Collections are to Cosmos DB what tables are to an RDBMS. This analogy is for easy understanding; don’t get hung up on details. There are many differences between collections and tables. A more generic term is Container, especially as Cosmos DB has evolved to support more APIs since the Document DB days.
After you create a Database in your Cosmos DB account, you can create your first collection. Below are a few aspects of a collection you need to be familiar with.
1. Request Units / Throughput: Request Unit (RU) is the currency of Cosmos DB.
1 RU = capacity to read* a single 1 KB item consisting of 10 unique property values *by self-link or id
A collection need not be a set of documents of the same schema. As Cosmos DB is schema-agnostic, you can very well store different document types in the same collection. That said, you may want to store document types that need the same amount of scalability and storage in the same collection. Cosmos DB is cost effective for high volume transaction systems because RU/s for a collection can be ramped up at peak hours and reduced to a minimum during hours of low trade.
When the provisioned number of RU/s is used up, client requests may experience throttling. Prolonged throttling indicates an insufficient throughput setting.
2. Partition: Each document in a collection has a partition key and an id
Partition key helps to identify the logical partition, and the id field helps to identify a specific document.
When you create a collection with an initial throughput more than 1000 RU/s, the resulting collection has unlimited scalability. Depending on the initial throughput assigned, Cosmos DB computes the number of physical partitions required to support the provisioned throughput. Logical partitions have a storage limit of 10GB. Due to this, a single logical partition (housing documents having a certain partition key value, such as Product Type= ‘Motorbike helmet’) should not amount to more than 10GB. Theoretically, the number of partitions in a collection can be infinite. Therefore, a good partition key must be a field having a suitable cardinality. With a bad partition key, you are bound to have uneven distributions of data and are likely to hit the 10 GB storage limit per partition.
3. Time to Live: TTL can be set at a collection or document level. Document level TTL takes precedence, obviously.
Simply put, if TTL for a document is set to 3600 seconds, the document expires when it is older than 3600 seconds. System field _ts (_ts or timestamp is set to Epoch time) indicates the last modified time of a document. Be careful when you set this at collection level. However, this feature helps to implement a sound purge policy.
Geo-replication becomes as simple as clicking on a graph with Cosmos DB’s global distribution feature. With more than 50 Azure regions, you could not only set multiple Read regions for global customers, but also specify several failover regions to help with disaster recovery.
Cosmos DB is Microsoft’s latest NoSQL database offering low latency, high scalability and geo-distribution to any Azure region. Read Microsoft documentation about Cosmos DB here or check out my blog about Introduction to Azure Cosmos DB. In its infancy, Cosmos DB was known as Document DB. Renaming was inevitable as Document DB evolved beyond just a document store.
Following is a simple note on how to export data from Cosmos DB to SQL Server. You may want a dump of data from a Cosmos DB collection to SQL Server for analysis, data integrity checks, troubleshooting a production issue or to derive insights.
There are a few methods to export data from Cosmos DB. The quickest one is to use Document DB / Cosmos DB Migration Tool. This is a tool provided by Microsoft to migrate data TO/FROM various sources such as MongoDB, JSON, csv and SQL Server to Cosmos DB.
1. Use Azure Cosmos DB Migration tool to export data to json files:
Install the tool and launch as Administrator (use an Azure VM for best performance). Please be mindful of spike in RU costs when exporting data from your collection. To avoid throttling, scale your collection up as required just before you do the export or export in off-peak hours, if any. Leave a comment if you need any pointer on this.
Specify connection string. You can find endpoint and key from Keys section of your CosmosDB account in Azure portal. Please note that CosmosDB is case-sensitive.
A peek into Advanced options
Include internal fields – for each document, Cosmos DB maintains a set of auto-generated internal/system fields such as _ts and _self. Their names start with an underscore, making it easy to differentiate them from user fields. This option lets you include internal fields in the output. This is handy, especially the _ts field, which indicates when the document was last updated.
Number of retries on failure & Retry interval – Set a reasonable number of retries. In this case, I have used the value 1.
Connection Mode – you want to use Gateway to get best performance and to bypass firewall rules.
Specify target information. In this case, we want to export to a json document. We could either output to a local file or a blob.
Error Logging. Set these options to enable error logging.
Hit Import and if there are no errors, you will soon have a new json file with all the data from your collection. If you want only a subset of data, all you need to do is modify your source query.
2. Import json files to SQL Server:
SQL Server 2016 introduced a JSON parse function called OPENJSON. If none of your user databases are upgraded to 2016 yet, but you have a 2016 engine, context-switch to a system database to use OPENJSON.
DECLARE @productNutrition varchar(max);
--Read from the json file using openrowset
SELECT @productNutrition = BulkColumn
FROM OPENROWSET(BULK'C:\Users\Smruthi\Downloads\Arjun\productnutrition_20180427.json', SINGLE_BLOB) JSON;
--Pass the variable containing json as parameter to OPENJSON function
FROM OPENJSON (@productNutrition)
ProductID varchar(20) '$.id',
ProductDescription varchar(100) '$.description',
ProductGroup varchar(200) '$.foodGroup',
ServingAmount float '$.servings.amount',
ServingUnit varchar(10) '$.servings.description',
nutrients nvarchar(max) as json --note that json is case sensitive
Give it a go yourself. If you have any question, leave a comment and I will be happy to assist.
SQL Server Management Studio (SSMS) 2016 has a handy new feature to View Live Query Statistics. This feature also works when connected to a 2014 instance using SSMS 2016.
Why use it?
Previously, similar stats were available by doing so:
SET STATISTICS "Detail" ON;
With the new feature, the main difference is that as the query progresses, stats are refreshed, thus providing a live coverage of what SQL Server is up to. This is a fantastic feature to learn about query processing and execution plans, and it helps to identify performance problems. Only yesterday, I used it to identify a “bad plan” where SQL was doing an Eager Spool by cross-joining two million record tables, and thus filling tempdb.
How to use it?
To use the feature, simply click on “Include Live Query Statistics” button in the toolbar or under menu Query -> Include Live Query Statistics.
Watch the magic unfold when you execute query.
If you are not using SQL Sentry Plan Explorer already, please install PRO version for free from Sentry One. I find it amazing and it’s probably the best free tool out there to do performance analysis. Install the SSMS add-in, and you can right click on plan and view plan in SQL Sentry Plan Explorer. It becomes incredibly useful for complex plans that are cumbersome to navigate in SQL Server. Not to mention the host of other info that’s presented nicely to make performance tuning enjoyable.
Watch out for that..
Live Query Statistics is an interesting feature, but I had a nervous few moments when my long-running ETL stored procedure began to produce a plan that was way too big for SSMS to handle. I had to eventually cancel execution and view plan from DMV. So, while it is a nice feature, watch out for a few gotchas. Microsoft also advises that there might be a mild performance dip in execution of your query when you are using this feature.
Who can use it?
You need to have SHOWPLAN and VIEW SERVER STATE permissions to be able to use this feature effectively.
Overall, this is a great feature that makes information available at finger tips when you are possibly scrambling to address that performance issue that is driving everyone nuts!