Analyzing Heart Disease risk using Key Influencers AI visual in Power BI

The Gartner Magic Quadrant for 2019, announced earlier this month, names Microsoft the leader in Analytics and Business Intelligence Platforms. Microsoft also coincidentally announced the public preview release of its first AI-driven visual for Power BI Key Influencers Рthis month, among a number of new features for Feb 2019. Inbuilt integration of Power BI with many Azure data products would catapult Power BI miles ahead of Tableau in the long run.

EN-CNTNT-GartnerMQ-BI2019.jpg

Key Influencers is the first of many AI visuals Microsoft would release I assume, in their efforts to democratize AI and make their customers look cool ūüôā In this article, we will go over the various features of this new visual using a publicly available dataset, and get familiar with interpreting the results. Download a copy of¬†Power BI Desktop file for the example¬†I am using in this article and try it out yourself using the free Power BI Desktop tool.

Key Influencers

Key Influencers is a powerful Power BI visual that lets us understand the factors that drive a metric. Power BI analyzes data, ranks the factors that matter, and displays them as key influencers. Under the hood, Power BI uses ML.NET to run logistic regression to calculate the key influencers. Logistic regression is a statistical model that compares different groups to each other, also taking into consideration the number of data points available for a factor.

As the visual is still in preview, there are a number of limitations. My first attempt to use Key Influencers using a survey responses dataset was rather unimpressive.

In my second attempt, I used the popular Heart Disease dataset from UCI to identify key influencers affecting heart disease, and achieved good results.

Heart Disease - Key Influencers Power BI.jpg

Limitations

Before we delve any further, let us take a look at the limitations that apply in the public preview phase of the visual. Pay attention here to avoid frustration as you explore the visual.

Following features are not supported:

  • Analyzing metrics that are aggregates/measures
  • Direct Query / Live Connection / Row Level Security – support
  • Consuming the visual in Power BI Embedded and Power BI mobile apps

Using the Key Influencers Visual

As a first time user, I found the Key Influencers visual intuitive and self-explanatory. It hardly takes a few minutes to set up the visual once you have clean data. Check out Microsoft documentation to understand all aspects of the visual. You could also download a copy of Power BI Desktop file for the example I am using in this article.

Note: Keep column names readable as this will help interpret the visual better

Getting Familiar

There are 2 tabs available within the visual – Key influencers and Top Segments.

The Key influencers tab displays the key factors affecting the metric value selected. In this case, the top factor that affects positive diagnosis of Heart Disease, based on our dataset, is Reversible Defect Thalassemia – increasing the risk of heart disease by 2.83 times when the value of Reversible Defect Thalassemia is 7.

On the right hand side, there is a column chart showing distribution of the selected factor. The check box at the bottom lets you display only influential factor values. We could click-select a different factor to see how it contributes to heart disease.

Heart Disease - Key Influencers Power BI - Getting Familiar.jpg


The Top segments tab displays different segments identified by Power BI within the population, for the metric value selected. Click-select a segment to view more details such as the factor values that define the segment, and how the segment compares against the average. We could also drill down further into the segment to split by additional fields.

Under the hood, Power BI uses ML.NET to run a decision tree to find interesting subgroups. The objective of the decision tree is to end up with a subgroup of datapoints that is relatively high in the metric we are interested in Рin our case, the patients who  are suspected to have heart disease.

Heart Disease - Key Influencers Power BI - Top Segment.jpg

 

Heart Disease - Key Influencers Power BI - Top Segment Details.jpg

First Impression

Considering that it is still in preview and is only going to get better, Key Influencers¬†ticks the right boxes. The rationale behind choosing a popular dataset, such as the Heart Disease dataset from UCI, for my example was to allow for comparison of results to Machine Learning models that are already publicly available. Power BI seems to identify influencers correctly and does a good job at presentation. I’m thoroughly impressed by this new feature.

Suggested Reading

If you enjoyed this article, consider reading my other articles on Azure data products.

https://sqlroadie.wordpress.com/2018/04/29/what-is-azure-cosmos-db/
https://sqlroadie.wordpress.com/2018/08/05/azure-cosmos-db-partition-and-throughput/
https://sqlroadie.wordpress.com/2019/02/17/azure-databricks-introduction-free-trial/

Resources:

Download the Power BI workbook used in the example – https://drive.google.com/open?id=13Pt25UPt7dOW3raZmavHHVl7gAStv5uy
Intro to Key Influencers by Microsoft: https://docs.microsoft.com/en-us/power-bi/visuals/power-bi-visualization-influencers
Power BI Feb 2019 feature summary – https://powerbi.microsoft.com/en-us/blog/power-bi-desktop-february-2019-feature-summary/

Heart Disease Data source
Donor:¬† David W. Aha (aha ‘@’¬†ics.uci.edu) (714) 856-8779
Creators:

  • Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
  • University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
  • University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
  • V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

Azure Databricks – Introduction (Free Trial)

Microsoft’s Azure Databricks is an advanced Apache Spark platform that brings data and business teams together. In this introductory article, we will look at what the use cases for Azure Databricks are, and how it really manages to bring technology and business teams together.

Databricks

Before we delve deeper into Databricks, it is good to have a general understanding of Apache Spark.

Apache Spark is an open-source, unified analytics engine for big data processing, maintained by the Apache Software Foundation. Spark and its RDDs were developed in 2012 in response to limitations of MapReduce. 

Key factors that make Spark ideal for big data processing are:

  • Speed – up to 100X faster
  • Ease of use – code in Java, Scala, Python, R and SQL
  • Generality – use SQL, streaming and complex analytics
Apache Spark Ecosystem.jpg
Pic courtesy: Microsoft

Databricks Рthe company Рwas founded by creators of Apache Spark. Databricks provides a web-based platform for working with Spark, with automated cluster management and IPython-style notebooks. It is aimed at unifying data science and engineering across the Machine Learning (ML) life cycle from data preparation, to experimentation and deployment of ML applications. Databricks, by virtue of its big data processing capabilities, also facilitates big data analytics. Databricks, as the name implies, thus lets you build solutions using bricks of data.

Azure Databricks

Azure Databricks combines Databricks and Azure to allow easy set up of streamlined workflows and an interactive work space that lets data teams and business collaborate. If you’ve been following data products on Azure, you’d be nodding your head along, imagining where Microsoft is going with this ūüôā

Azure Databricks enables integration across a variety of Azure data stores and services such as Azure SQL Data Warehouse, Azure Cosmos DB, Azure Data Lake Store, Azure Blob storage, and Azure Event Hub. Add rich integration with Power BI, and you have a complete solution.

Azure Databricks Overview
Pic courtesy: Microsoft

Why use Azure Databricks?

By now, we understand that Azure Databricks is an Apache Spark-based analytics platform that has big data processing capabilities and brings data and business teams together. How exactly does it do that, and why would someone use Azure Databricks?

  1. Fully managed Apache Spark clusters: With the serverless option, create clusters easily without having to set up your own data infrastructure. Dynamically auto-scale clusters up and down, and auto-terminate inactive clusters after a predefined period of inactivity. Share clusters with your teams, reduce time spent on infrastructure management and improve iteration time.

  2. Interactive workspace: Streamline data processing using secure workspaces, assign relevant permissions to different teams. Mix languages within a notebook Рuse your favorite out of R, Python, Scala and SQL. Explore, model and execute data-driven applications by letting Data Engineers prepare and load data, Data Scientists build models, and business teams analyze results. Visualize data in a few clicks using familiar tools like Matplotlib, ggplot or take advantage of the rich integration with Power BI.

  3. Enterprise security: Use SSO through Azure Active Directory integration to run complete Azure-based solutions. Roles-based access control enables fine-grained user permissions for notebooks, clusters, jobs, and data.

  4. Schedule notebook execution: Build, train and deploy AI models at scale using GPU-enabled clusters. Schedule notebooks as jobs, using runtime for ML that comes preinstalled and preconfigured with deep learning frameworks and libraries such as TensorFlow and Keras. Monitor job performance and stay on top of your game.

  5. Scale seamlessly: Target any amount of data or any project size using a comprehensive set of analytics technologies including SQL, Streaming, MLlib and GraphX. Configure number of threads, select number of cores and enable autoscaling to dynamically scale processing capabilities leveraging a Spark engine that is faster and performant through various optimizations at the I/O layer and processing layer (Databricks I/O).

Of course, all of this comes at a price. If this article has piqued your interest, hop over to Azure Databricks homepage and avail the 14 day free trial!

Azure Databricks - Free Trial 14 days.jpg

Suggested learning path:

  1. Read more about Azure Databricks – https://docs.microsoft.com/en-us/azure/azure-databricks/what-is-azure-databricks
  2. Create a Spark cluster and run a Spark job on Azure Databricks – https://docs.microsoft.com/en-us/azure/azure-databricks/quickstart-create-databricks-workspace-portal#clean-up-resources
  3. ETL using Azure Databricks – https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-extract-load-sql-data-warehouse
  4. Stream data into Azure Databricks using Event Hubs – https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-stream-from-eventhubs
  5. Sentiment analysis on streaming data using Azure Databricks – https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-sentiment-analysis-cognitive-services

I hope you found the article useful. Share your learning experience with me. My next article will be on Real-time analytics using Azure Databricks.

Azure Databricks - Real time analytics.jpg
Azure Databricks

Resources:

https://azure.microsoft.com/en-au/services/databricks/
https://databricks.com/product/azure
https://docs.microsoft.com/en-us/azure/azure-databricks/what-is-azure-databricks
https://docs.microsoft.com/en-us/azure/azure-databricks/quickstart-create-databricks-workspace-portal#clean-up-resources
https://databricks.com/blog/2019/02/07/high-performance-modern-data-warehousing-with-azure-databricks-and-azure-sql-dw.html

Part 2: Predictive Modeling using R and SQL Server Machine Learning Services

Recap!

Part 1 of Predictive Modeling using R and SQL Server Machine Learning Services¬†covered an overview of Predictive Modeling and the steps involved in building a Predictive Model. Using our sample dataset – Ski Resort rental data – we wanted to predict RentalCount¬†for the year 2015, given the variables ‚ÄstMonth, Day, Weekday, Holiday and Snow.

Part 1 covered:
– Getting data from a SQL Server database
– Preparing data for modeling
– Training models
– Comparing results and finalizing a model

We found that for this problem, predictions using the Decision Tree model were more accurate than the Linear Regression model. SQL Server Machine Learning Services (MLS) lets us train and test Predictive Models using R or Python, in the context of SQL Server. Thus, we can build T-SQL programs that contain embedded R/Python scripts that train on data stored in the database.

Deploy Machine Learning code with SQL Server

In this part, we will deploy the R code we wrote in Part 1 to SQL Server. To deploy, we will store the trained model in database and create a stored procedure that predicts using the model. This stored procedure can be invoked from applications.

1. Create Table for storing the model: Here, we create a table in SQL Server to store the trained model. The model will be used for prediction in step 3.

2001_CreateTableRentalRxModels.jpg

2. Create Stored Procedure for generating the model:¬†This stored procedure will use the R scripts we wrote in Part 1¬†utilizing¬†sp_execute_external_script¬†introduced in SQL Server 2016. To execute¬†sp_execute_external_script, first enable external scripts by using the statement –¬†sp_configure ‘external scripts enabled’, 1;

The function to generate Decision Tree model РrxDTree Рis part of the RevoScaleR package for R. RevoScaleR package includes numerous other R functions for importing, transforming, and analyzing data at scale. Point to note is that the functions run on the RevoScaleR interpreter, built on open-source R. It is engineered to leverage the multithreaded and multinode architecture of the host platform, meaning when R code executes within a SQL Server SP, it utilizes parallel processing.

2002_SPToGenerateTrainedDTreeModel

2003_generatetraineddtreemodel.jpg

3. Create Stored Procedure for prediction: Now that we have the model output, we can create an SP that would use the model to predict rental count for new data. Again, we are using the R code covered in Part 1, only that this time we are using it in a SQL Stored Procedure.

2004_SPToPredictRentalCountUsingDTreeModel.jpg

2005_PredictRentalCountForTestDataUsingDTreeModel.jpg

 

2006_PredictRentalCountUsingDTreeModel.jpg

Isn’t that just awesome? We have a Predictive Model that can be used within applications to predict rental count. Now that we have covered a sample project, in¬†Part 3¬† of the series, I will share my experience using SQL Server Machine Learning Services to solve a problem at my work.

Before we conclude Part 2,

Predict using Native Scoring (SQL Server 2017*): In SQL Server 2017, Microsoft has introduced a native predict function. What this means is we do not need to run R/Python code in a SQL stored procedure to do the actual prediction. Native scoring uses native C++ libraries that reads a trained model stored in binary format (in our case in a SQL Server table), and generate scores for new input data.

2007_CreateTableNativeModelSupport.jpg

2008_GenerateNativeModel.jpg

2010_PredictionUsingNativePredictFunction.jpg

Resources:

Scripts for Part 2: https://drive.google.com/file/d/15fwujRipLg-k2ozOFb9G9PFfBO327zTa/view?usp=sharing
RevoScaleR
https://docs.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/revoscaler
SQL Server ML Tutorial: https://microsoft.github.io/sql-ml-tutorials/R/rentalprediction/step/3.html
Native Scoring: https://docs.microsoft.com/en-us/sql/advanced-analytics/sql-native-scoring?view=sql-server-2017
RxSerializeModel: https://docs.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/rxserializemodel
Forecasts and Prediction using SQL Server MLS: https://docs.microsoft.com/en-us/sql/advanced-analytics/r/how-to-do-realtime-scoring?view=sql-server-2017

Part 1: Predictive Modeling using R and SQL Server Machine Learning Services

To R or not to R?

A few months ago, I asked myself an important question – which language to learn first – R or Python? From my research, I found that R is regarded as more old-school and difficult to learn, but Python is more popular. It made perfect sense to learn R first ūüôā

For the uninitiated, R is a programming language that makes statistical and mathematical computation easy, and is useful for machine learning/predictive analytics/statistics work.

Along the way, I found the following courses useful.
https://www.edx.org/course/introduction-to-r-for-data-science
https://www.edx.org/course/programming-in-r-for-data-science

The goal has always been to explore Machine Learning Services in SQL Server, and dive deeper thereafter. This 3-part series is a walk through/review of Microsoft’s tutorial on Predictive Modeling using R and SQL Server. The first part deals with preparing data, training a model and using it for prediction.

What is Predictive Modeling?

Predictive Modeling uses statistics to predict outcomes.

In our example, we will use Machine Learning Services for SQL Server 2017 to predict number of rentals for a future date in a ski rental business. This brings up an important question – why do we want to predict? In this case, prediction will help the business be prepared from a stock, staff and facilities perspective. Prediction has numerous, life-changing applications. Read about application of AI in predicting cardiovascular disease by Microsoft for Apollo Hospitals, India.

For the ski rental prediction, we will use test data provided by MS, SQL Server 2017 with Machine Learning Services, and R Studio IDE. Please check the following URL and follow the simple instructions to set up your environment. If you have any question, please use Comments section to drop me a note.
https://microsoft.github.io/sql-ml-tutorials/R/rentalprediction

PredictiveModeling-Steps1

Steps involved in building the predictive model in SQL Server

  1. Getting data
  2. Preparing data
  3. Training models
  4. Comparing results and choosing a model
  5. Deploying the Machine Learning script to SQL Server

1. Getting Data:¬†Okay, let’s get started. In the first step, we will restore sample database – TutorialDB – using the database backup file provided by MS. After restoring the database using SSMS, we will take a look at rental data we will use for training the model.

All scripts used will be provided in the Resources section at bottom of the page.

002_RestoreSampleDatabase-TutorialDB

003_ExamineTrainingData

You will see that we have 453 rows of rental stats. Data is in place, so we are good to move on to Step 2.

2. Preparing Data: Now that database is restored and data available in SQL Server, we will load data to R and transform it. Open R Studio and execute the rental data load script. After loading data, we will examine a few rows/observations and inspect data types. We will then proceed to change types of a few columns to factor.

004_loadrentaldatafromsqlserver.jpg

005_DataPreparation

3. Training Models: Now that data is prepared, we will chose a model that best describes dependency between variables in our dataset. During training, we provide the variables along with the outcome so that our model can train to predict the outcome. Here, we will compare predictions by two different models and choose the more accurate one as our predictive model.
For clarity, let me state that we are trying to predict RentalCount for the year 2015, given the variables – Month, Day, Weekday, Holiday and Snow.

The challenge in Machine Learning is in knowing what various models mean, and when a particular model might be more suitable. MS recommends this cheat sheet as a guide.

006_SplitDatasetToTrainingAndTest

007_TrainingUsingLinearRegressionAndDecisionTreeModels

008_RentalDataPrediction

Comparing results: Here, we compare results to figure out which model predicted more accurately. Decision Tree performed better in this case and we will use the model to deploy our Machine Learning Solution to SQL Server in Part 2

009_RentalDataPlotDifferencePredictedAndActual

Resources:

Scripts for Part 1 (zip file): https://drive.google.com/file/d/1Tzs4qzFXXL-NgxFlKQcuVHKwx90Imr8u/view?usp=sharing
SQL Server R tutorials: https://docs.microsoft.com/en-us/sql/advanced-analytics/tutorials/sql-server-r-tutorials?view=sql-server-2017
Gitghub repo for rental prediction: https://microsoft.github.io/sql-ml-tutorials/R/rentalprediction/
Machine Learning cheat sheet, use with caution ūüôā https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice#the-machine-learning-algorithm-cheat-sheet