Stroke Prediction

Every

seconds, someone in the United States has a stroke.

Every

minutes, someone dies of stroke.

Every year, over

people in the United States have a stroke.

in 6 deaths from cardiovascular disease was due to stroke.

About

What is this project about?

What is our project?

We created a predictive algorithm which receives input data about a user's medical history and general information, and then uses machine learning to predict if the user is likely or unlikely to experience a stroke in their lifetime.

What is our ultimate goal?

Strokes are one of the leading causes of death worldwide, and many common conditions raise a person's risk. Our goal is to warn the user about their likelihood of experiencing a stroke and provide further assistance.

What is our data set?

The data set is a compilation of medical conditions and general attributes for anonymous patients, along with whether each patient has suffered a stroke or not.

Classification or Regression?

Our problem is a classification problem because the target is binary: we are predicting whether or not a person is likely to have a stroke.

What does our MVP look like?

Our MVP is a simple survey-format prediction site. We prompt the user to enter general information about their life and use that information to help predict whether the person is at risk of having a stroke.

Why does this matter?

Strokes cause over 100,000 deaths each year. Being able to recognize the factors that could lead to the onset of a stroke would be extremely beneficial in avoiding the potential damage of a stroke.

Data Cleaning

Why Data Cleaning?

It is necessary that our data set be prepared and cleaned before any analysis can be run.

Removing Columns

Some columns are extraneous because they contribute little to the output. We removed these columns before analysis in order to streamline the data set. Specifically, we removed the "Residence Type" and "ID" columns, as we found that they had little or no correlation with the final stroke prediction result.
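As a rough illustration of dropping columns, here is a minimal pandas sketch. The sample rows and the exact column names (`id`, `Residence_type`) are assumptions made for the example, not taken from the project's code.

```python
import pandas as pd

# Tiny made-up sample; column names are assumed for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "Residence_type": ["Urban", "Rural", "Urban"],
    "age": [67, 45, 80],
    "stroke": [1, 0, 1],
})

# Drop the columns judged to add little predictive value.
df = df.drop(columns=["id", "Residence_type"])
print(list(df.columns))
```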

Dealing with Null Values

Real-life data, especially medical data, will never be 100% perfect. There may be gaps or blank values that must be dealt with before a model can train on the data. For example, we had null values in the body mass index column, which we handled by replacing them with the column mean. We also had gender values labeled "Other". To fix this issue, we simply removed the data points with the "Other" gender value.
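The two fixes described above might look like the following in pandas. This is a sketch on made-up rows; the column names `gender` and `bmi` and the order of the two steps are assumptions.

```python
import pandas as pd

# Made-up sample with one missing BMI and one "Other" gender row.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Female"],
    "bmi": [30.0, None, 25.0, 20.0],
})

# Remove the rows whose gender is "Other".
df = df[df["gender"] != "Other"].copy()

# Replace missing BMI values with the mean of the observed values.
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())
print(df)
```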

Encoding Categorical Data

As our data set contains both numerical and categorical columns, the categorical columns must be encoded. Most machine learning models operate only on numbers and cannot work directly with categorical data, which is stored as text (object) values. Encoding turns each category into columns of zeros and ones that the models can use.
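A small sketch of what this encoding does, using pandas `get_dummies` on made-up rows. The column name `ever_married` is an assumption, and the project itself may have used a different encoder.

```python
import pandas as pd

# Made-up sample with one categorical and one numerical column.
df = pd.DataFrame({
    "ever_married": ["Yes", "No", "Yes"],
    "age": [67, 45, 80],
})

# Each category becomes its own 0/1 column; numerical columns are untouched.
encoded = pd.get_dummies(df, columns=["ever_married"], dtype=int)
print(encoded)
```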

Exploratory Data Analysis

Why do we require EDA?

Exploratory data analysis (EDA) uses statistics and visualizations to analyze data sets and identify trends in them. Exploratory data analysis is essential for any business. It allows data scientists to examine the data before making any assumptions. It helps ensure that the results produced are valid and applicable to business outcomes and goals. It also helps us identify errors, outliers, and relationships in the data.

Does stroke risk increase with age?

What we observed from the plot

Stroke risk increases as people get older, but strokes can and do occur at any age.

Although age is a factor in the probability of having a stroke, diet and physical activity also play an important role.
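One way to check a trend like this numerically is to compute the stroke rate per age band. The sketch below uses made-up rows and arbitrary band edges, so the numbers it prints are illustrative only and are not the project's results.

```python
import pandas as pd

# Made-up sample; real data would come from the cleaned data set.
df = pd.DataFrame({
    "age": [25, 35, 55, 62, 70, 81],
    "stroke": [0, 0, 0, 1, 1, 1],
})

# Group ages into bands (edges chosen arbitrarily for the example).
bands = pd.cut(df["age"], bins=[0, 40, 60, 100], labels=["0-40", "41-60", "61+"])

# The mean of a 0/1 column is the share of patients with a stroke.
rate = df.groupby(bands, observed=True)["stroke"].mean()
print(rate)
```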

Does Smoking increase risk of stroke?

What we observe from the graph

Our data suggests that smoking doesn't affect the risk of stroke.
This may reflect a bias in the data: the dataset may contain many more people who have never smoked, which would skew the pie chart.

Even though our data suggests otherwise, medical research shows that smoking increases the risk of stroke and heart disease.
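The bias described above can be made concrete by comparing raw stroke counts with stroke rates per group. In this sketch the rows are invented purely to show the effect: the larger "never smoked" group has more strokes in absolute terms but a lower rate.

```python
import pandas as pd

# Invented rows: 8 people who never smoked, 2 who smoke.
df = pd.DataFrame({
    "smoking_status": ["never smoked"] * 8 + ["smokes"] * 2,
    "stroke": [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

# sum = number of strokes, count = group size, mean = stroke rate.
summary = df.groupby("smoking_status")["stroke"].agg(["sum", "count", "mean"])
print(summary)
```

A chart built from the `sum` column alone would suggest that non-smokers have more strokes, while the `mean` column points the other way.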

Do glucose levels affect risk of stroke?

What we observe in the graph

Our data shows that people with higher glucose levels are at higher risk of stroke.

Does a history of heart disease increase risk of stroke?

What we observe from the graph

We can see that people with previous heart disease have a higher chance of stroke.

Machine Learning Pipeline

What is a Machine Learning Pipeline?

Pipelines are a simple way to keep data preprocessing and modeling code organized. Specifically, a pipeline lets you bundle the preprocessing and modeling steps into a single object using the make_pipeline() function. The benefits of modeling with pipelines include cleaner, better-organized code, fewer bugs, easier deployment to production, and protection against data leakage.
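A minimal sketch of bundling steps with scikit-learn's `make_pipeline()`. The particular steps and the toy data are assumptions chosen for illustration; the project's real pipeline may contain different steps.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data: two numerical features, one missing value.
X = np.array([[20.0, 80.0], [30.0, np.nan], [60.0, 150.0], [75.0, 210.0]])
y = np.array([0, 0, 1, 1])

# Preprocessing and the model are fitted together as one object.
model = make_pipeline(
    SimpleImputer(strategy="mean"),  # fill missing values
    MinMaxScaler(),                  # scale to the 0-1 range
    LogisticRegression(),            # final estimator
)
model.fit(X, y)
pred = model.predict(X)
print(pred)
```

Because the imputer and scaler are fitted only on the data passed to `fit`, statistics from the test set cannot leak into training when the pipeline is fitted on the training split.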

Step 1

Data Cleaning & Feature Engineering

To clean our data, we removed the columns that would not add value to the application’s decision making, such as the Residence Type and ID columns. We also tried to remove any outliers to ensure our machine learning model would be accurate. For example, only one row in the gender column had the value ‘Other’, so we decided to remove that row. Our features are gender, age, hypertension (high blood pressure), history of heart disease, marital status, work type, average glucose level, and BMI (body mass index). The target variable is whether or not the person has had a stroke.

Step 2

Train - Test Split

We split the dataset into two groups: a training set and a testing set. 75% of the data went to the training set, from which the model learned patterns and associations in order to make accurate predictions. The other 25% went to the testing set, which was used to evaluate the effectiveness of the training and the accuracy of the model.
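A 75/25 split can be sketched with scikit-learn's `train_test_split`. The toy arrays are made up, and the `stratify` argument (which keeps the class ratio the same in both sets) is an assumption that the original project may or may not have used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20 of them in the positive class.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(len(X_train), len(X_test))
```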

Step 3

Pipeline - Imputation and Encoding

Imputation is a crucial step: our data set contains only 5110 rows, so we could not afford to delete the rows with null values. For the numerical columns, we computed the column mean and imputed it for the null values. For the categorical columns, we found the mode, or most frequently occurring string, and imputed that value for the null values. Once the null values were dealt with, we encoded the data set with a one-hot encoder to make all the data numerical.
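The imputation and encoding described here could be expressed with scikit-learn's `ColumnTransformer`, as in the sketch below. The column names and rows are assumptions for the example, and this is not necessarily how the project wired its own pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows with one missing value in each column.
X = pd.DataFrame({
    "bmi": [30.0, np.nan, 20.0, 25.0],
    "work_type": ["Private", np.nan, "Private", "Self-employed"],
})

preprocess = ColumnTransformer(
    [
        # Numerical column: fill gaps with the column mean.
        ("num", SimpleImputer(strategy="mean"), ["bmi"]),
        # Categorical column: fill gaps with the mode, then one-hot encode.
        ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                              OneHotEncoder(handle_unknown="ignore")),
         ["work_type"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

result = preprocess.fit_transform(X)
print(result)
```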

Step 4

Pipeline - Scaling and Oversampling

To help the model learn from the data faster, we scaled the encoded data to the range 0 to 1. In our dataset, 95.7% of the data was ‘no stroke’ whereas only 4.26% of the data was ‘stroke’. This caused a significant imbalance in the dataframe and led to errors in the model's output. To combat this problem, we oversampled the data using the SMOTE technique, in which synthetic samples are generated from the original minority-class samples and added to the minority class, bringing the ‘stroke’ to ‘no stroke’ ratio closer to 50-50.
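To show the idea behind these two steps, here is a hand-written NumPy sketch. It is a simplified illustration of min-max scaling and SMOTE-style interpolation on made-up data, not the library implementation the project would have used; the helper names `min_max_scale` and `smote_like` are hypothetical.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column to the [0, 1] range (assumes no constant column)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def smote_like(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the line between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # index 0 is the point itself
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the connecting line
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Made-up data: 6 'no stroke' rows (0) and 3 'stroke' rows (1).
X = np.array([[25, 85], [30, 90], [35, 95], [40, 100], [45, 105],
              [50, 110], [65, 200], [72, 220], [80, 240]], dtype=float)
y = np.array([0] * 6 + [1] * 3)

X_scaled = min_max_scale(X)
X_min = X_scaled[y == 1]
n_new = int((y == 0).sum() - (y == 1).sum())

new_points = smote_like(X_min, n_new)
X_bal = np.vstack([X_scaled, new_points])
y_bal = np.concatenate([y, np.ones(n_new, dtype=int)])
print(X_bal.shape, np.bincount(y_bal))
```

In practice, scaling statistics are usually computed on the training split only, and oversampling is applied to the training data only, so that the test set still reflects the real class balance.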

Step 5

Model Building

To train our machine learning models, we used several algorithms: Logistic Regression, K Nearest Neighbors, Random Forest, and a Neural Network.
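Training and comparing these four kinds of model might look like the sketch below. It uses synthetic data from `make_classification` and scikit-learn's `MLPClassifier` as a stand-in for the neural network; both are assumptions, since the project's actual data loading and network library are not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the cleaned data set (8 features, binary target).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K Nearest Neighbors": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Neural Network": MLPClassifier(max_iter=1000, random_state=0),
}

# Fit each model and record its accuracy on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```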

Step 6

Fine Tuning - GridSearchCV

To find the best-performing set of parameters, we used GridSearchCV with preset candidate values. It evaluates every combination in the grid with cross-validation, so we could find the best parameters for each machine learning model without testing each combination by hand.
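A minimal sketch of a grid search with scikit-learn's `GridSearchCV`. The parameter grid, the choice of a random forest, and the synthetic data are all assumptions made for the example; the project's preset values are not known.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate values (hypothetical); 2 x 2 = 4 combinations are tried.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```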

Models

Check our Models

Random Forest

The Random Forest algorithm aims for higher accuracy by taking the majority vote of an ensemble of decision trees rather than relying on a single model. Because each tree is trained slightly differently, their errors tend to differ, so the combined vote can be more accurate than any individual tree, as long as each tree is right more than 50% of the time and their mistakes are not strongly correlated.
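The effect of majority voting can be worked out directly under an idealized assumption: that every model is correct independently with the same probability. Real trees in a forest are correlated, so the actual gain is smaller; the function name `majority_accuracy` is hypothetical.

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a majority of n independent voters, each correct
    with probability p, gives the right answer (n should be odd)."""
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(needed, n + 1))

# Accuracy of the vote grows with ensemble size when p > 0.5 ...
for n in (1, 3, 11, 101):
    print(n, round(majority_accuracy(0.6, n), 3))

# ... and shrinks when each voter is worse than a coin flip.
print(round(majority_accuracy(0.4, 11), 3))
```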

Neural Network

The Neural Network algorithm is a network of artificial neurons, or nodes, loosely modeled on the human brain. The neurons pass the input through one or more hidden layers to generate an output, which in our case is binary (stroke or no stroke). Networks with many hidden layers are the basis of deep learning. Neural networks are used in artificial intelligence to recognize patterns and solve common problems.
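To make the idea of passing input through a hidden layer concrete, here is a small NumPy sketch of a single forward pass. The layer sizes, the ReLU activation, and the random untrained weights are all assumptions for illustration; this is not the project's network.

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """One hidden layer with ReLU, then a sigmoid output neuron."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)

rng = np.random.default_rng(0)
x = rng.random((5, 8))                        # 5 patients, 8 features
W1, b1 = rng.normal(size=(8, 4)), np.zeros(4)  # input -> 4 hidden neurons
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # hidden -> 1 output neuron

probs = forward(x, W1, b1, W2, b2)   # values between 0 and 1
labels = (probs >= 0.5).astype(int)  # binary prediction per patient
print(labels.ravel())
```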

Logistic Regression

The Logistic Regression algorithm predicts a binary outcome, which makes it a natural choice when deciding between two alternatives. Using prior observations in a data set, the model learns the relationship between the independent variables and the dependent variable. It applies the sigmoid function, also called the logistic function, which maps any real value to a number between 0 and 1 along an “S”-shaped curve, and then classifies the observation according to which side of a threshold (commonly 0.5) that number falls on.
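The sigmoid mapping and the thresholding step can be sketched in a few lines of plain Python. The 0.5 threshold and the function names are assumptions for the example.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return 1 if the sigmoid output reaches the threshold, else 0."""
    return int(sigmoid(z) >= threshold)

# z is the weighted sum of the features that the model has learned.
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), classify(z))
```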

Predictions

Age:

Do you have a history of hypertension:

Do you have a history of heart disease:

Average glucose level:

What is your BMI:

Gender:

Have you ever been married:

What is your current work type:

What is your smoking status:

Team

Meet our team

Brayden Borges


Heart Stroke Prediction

A Machine Learning Product from the Electric Zombies
