Prediction of patients' readmission to hospital

(Binary Classification Problem)

by Suvrat Jain

Overview

For this project, the dataset used is the "Diabetes 130-US hospitals" dataset from the UCI Machine Learning Repository. The dataset (diabetic_data.csv) represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks.

The dataset contains 50 explanatory variables describing the patient and the hospital outcome. Our task is to predict whether a patient discharged from the hospital will be readmitted within 30 days. The data is preprocessed and split into training and test sets for training and evaluation. The training and test sets have been taken from this folder.

Goal:

The primary objective of this project's predictive analysis is to build a binary classification model that predicts early (<30 days) readmission from a patient's features, i.e. to predict whether a patient will be readmitted to the hospital within 30 days of being discharged.

Workflow:

We will start by cleaning and preprocessing the provided data. Next, features are selected and the class imbalance in the dataset is handled using several techniques. Then various machine learning models are trained, evaluated on a validation set, and optimized for the best performance. The best of the optimized model(s) is/are finally evaluated on the held-out test set.

Table of Contents

Dataset description

Each row of the dataset represents information about a patient and their respective hospital related outcome.

Here is a short description of each of the dataset's features:

From the IDs_mapping.csv file, we can map the description of the coding for the following attributes as shown below:

  1. admission_type_id
    • 1 -- Emergency
    • 2 -- Urgent
    • 3 -- Elective
    • 4 -- Newborn
    • 5 -- Not Available
    • 6 -- NULL
    • 7 -- Trauma Center
    • 8 -- Not Mapped

  2. discharge_disposition_id
    • 1 -- Discharged to home
    • 2 -- Discharged/transferred to another short term hospital
    • 3 -- Discharged/transferred to SNF
    • 4 -- Discharged/transferred to ICF
    • 5 -- Discharged/transferred to another type of inpatient care institution
    • 6 -- Discharged/transferred to home with home health service
    • 7 -- Left AMA
    • 8 -- Discharged/transferred to home under care of Home IV provider
    • 9 -- Admitted as an inpatient to this hospital
    • 10 -- Neonate discharged to another hospital for neonatal aftercare
    • 11 -- Expired
    • 12 -- Still patient or expected to return for outpatient services
    • 13 -- Hospice / home
    • 14 -- Hospice / medical facility
    • 15 -- Discharged/transferred within this institution to Medicare approved swing bed
    • 16 -- Discharged/transferred/referred another institution for outpatient services
    • 17 -- Discharged/transferred/referred to this institution for outpatient services
    • 18 -- NULL
    • 19 -- Expired at home. Medicaid only, hospice.
    • 20 -- Expired in a medical facility. Medicaid only, hospice.
    • 21 -- Expired, place unknown. Medicaid only, hospice.
    • 22 -- Discharged/transferred to another rehab fac including rehab units of a hospital
    • 23 -- Discharged/transferred to a long term care hospital
    • 24 -- Discharged/transferred to a nursing facility certified under Medicaid but not certified under Medicare
    • 25 -- Not Mapped
    • 26 -- Unknown/Invalid
    • 27 -- Discharged/transferred to a federal health care facility
    • 28 -- Discharged/transferred/referred to a psychiatric hospital of psychiatric distinct part unit of a hospital
    • 29 -- Discharged/transferred to a Critical Access Hospital (CAH)
    • 30 -- Discharged/transferred to another Type of Health Care Institution not Defined Elsewhere

  3. admission_source_id
    • 1 -- Physician Referral
    • 2 -- Clinic Referral
    • 3 -- HMO Referral
    • 4 -- Transfer from a hospital
    • 5 -- Transfer from a Skilled Nursing Facility (SNF)
    • 6 -- Transfer from another health care facility
    • 7 -- Emergency Room
    • 8 -- Court/Law Enforcement
    • 9 -- Not Available
    • 10 -- Transfer from critical access hospital
    • 11 -- Normal Delivery
    • 12 -- Premature Delivery
    • 13 -- Sick Baby
    • 14 -- Extramural Birth
    • 15 -- Not Available
    • 17 -- NULL
    • 18 -- Transfer From Another Home Health Agency
    • 19 -- Readmission to Same Home Health Agency
    • 20 -- Not Mapped
    • 21 -- Unknown/Invalid
    • 22 -- Transfer from hospital inpt/same fac reslt in a sep claim
    • 23 -- Born inside this hospital
    • 24 -- Born outside this hospital
    • 25 -- Transfer from Ambulatory Surgery Center
    • 26 -- Transfer from Hospice

Importing libraries & packages

We start by importing the libraries & packages that will be used for analysis, modelling, and preprocessing in the project.
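A rough sketch of what the import cell might look like (the exact contents depend on the models and utilities used later in the project):

```python
# Core data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing, model selection and evaluation utilities (scikit-learn)
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Classifiers used later in the project
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Resampling techniques for handling the class imbalance (imbalanced-learn)
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
```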

Reading the dataset

The training and test datasets are loaded in two different dataframes df_train and df_test.

To ease preprocessing and cleaning, the two datasets are merged so that all preprocessing can be performed once, and they are split back into their original form after preprocessing.

The split is recovered after preprocessing using the encounter_id of each sample. Since every sample has a unique encounter id, the ids are stored in two arrays that keep track of which samples belong to the training set and which belong to the test set.
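A minimal sketch of this step, assuming the two splits are stored as diabetic_data_train.csv and diabetic_data_test.csv (the file names are assumptions):

```python
import pandas as pd

# Load the two splits into separate dataframes (file names are assumptions)
df_train = pd.read_csv("diabetic_data_train.csv")
df_test = pd.read_csv("diabetic_data_test.csv")

# Remember which encounter_id belongs to which split so the original split can be restored later
train_ids = df_train["encounter_id"].values
test_ids = df_test["encounter_id"].values

# Merge both splits so cleaning and preprocessing is done once on the full data
df = pd.concat([df_train, df_test], ignore_index=True)
print(df.shape)  # 101766 rows in total
```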

The training dataset has 76324 data samples and the test set has 25442.
Together, the whole dataset has 101766 data samples with 51 features.

Dataset exploration

We check the unique values for each of the attributes in order to find any unknown/missing data.

Our target variable here is readmitted. Certain features in the dataset are irrelevant for predicting a patient's readmission to the hospital and will not be used by the model.

The attributes race, gender, weight, payer_code, medical_specialty, diag_1, diag_2, and diag_3 have missing/unknown values.
We can find out the percentage of missing data for each of these attributes.
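One way to compute these percentages, assuming missing entries use the '?' placeholder found in this dataset (with 'Unknown/Invalid' also counted as missing for gender):

```python
# Columns known to contain placeholder values for missing data
cols_with_missing = ["race", "gender", "weight", "payer_code",
                     "medical_specialty", "diag_1", "diag_2", "diag_3"]

# '?' marks missing values in most columns; gender uses 'Unknown/Invalid'
placeholders = ["?", "Unknown/Invalid"]
missing_pct = df[cols_with_missing].isin(placeholders).mean() * 100
print(missing_pct.sort_values(ascending=False))
```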

We drop the following columns from our dataset because of the reasons stated:

We check the frequency of patient readmissions and map each of the 3 unique possible values to appropriate encoding.

We can clearly see that almost 53% of the encounters in the dataset were not readmitted, and only about 11% were readmitted within 30 days.

This is clearly a case of an imbalanced dataset which we will be handling ahead.

We create a new column Class for our target variable, in order to define what our two classes are.

All the <30 days readmitted patients are our positive classes (1) and all the patients who are not readmitted or are readmitted after 30 days are the negative classes (0).
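A sketch of this encoding (the integer codes chosen for readmitted are an assumption; Class is the column named above):

```python
# Frequency of the three raw values of the target: 'NO', '>30', '<30'
print(df["readmitted"].value_counts(normalize=True))

# Encode the raw labels as integers (the particular codes are an assumption)
df["readmitted"] = df["readmitted"].map({"NO": 0, ">30": 1, "<30": 2})

# Binary target: 1 = readmitted within 30 days, 0 = not readmitted or readmitted after 30 days
df["Class"] = (df["readmitted"] == 2).astype(int)
```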

Data preprocessing

We perform some preprocessing on the dataset in order to get it ready for training our model on.

All the missing values are replaced by NaN for imputation purposes.

race has 2273 missing values which will be replaced with 'Other'.

diag_1, diag_2, and diag_3 have 21, 358, and 1423 missing values respectively; these are imputed with scikit-learn's SimpleImputer using the most-frequent strategy.

gender has only 3 missing values, which are also imputed using the most-frequent strategy.
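A minimal sketch of the imputation step, assuming the missing markers have already been replaced with NaN as described above:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# race: fill missing values with the 'Other' category
df["race"] = df["race"].fillna("Other")

# diag_1, diag_2, diag_3 and gender: impute with the most frequent value in each column
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
cols = ["diag_1", "diag_2", "diag_3", "gender"]
df[cols] = imputer.fit_transform(df[cols])
```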

The age is a categorical variable with 10 unique categories.

The categories are mapped to integers from 1 to 10 to simplify the encoding later on.
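For example, the mapping might look like this (the bracket labels match the raw data; treating the brackets as ordinal integers is the choice described above):

```python
# Map the ten age brackets to ordinal integers 1-10
age_map = {"[0-10)": 1, "[10-20)": 2, "[20-30)": 3, "[30-40)": 4, "[40-50)": 5,
           "[50-60)": 6, "[60-70)": 7, "[70-80)": 8, "[80-90)": 9, "[90-100)": 10}
df["age"] = df["age"].map(age_map)
```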

The descriptions of the diag_1, diag_2, and diag_3 attributes are given here.
Using these descriptions, we group the diagnoses into 9 distinct categories and map them in the three columns.
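As an illustration, one common 9-category grouping of ICD-9 codes for this dataset (following Strack et al., 2014) can be implemented as below; the exact grouping used in this project may differ:

```python
import pandas as pd

def diag_to_category(code):
    """Map a raw ICD-9 code string to one of 9 broad diagnosis groups.

    The ranges follow the grouping commonly used with this dataset
    (Strack et al., 2014); the project's exact grouping may differ.
    """
    if pd.isna(code):
        return "Other"
    code = str(code)
    if code.startswith(("E", "V")):      # external causes / supplementary codes
        return "Other"
    if code.startswith("250"):           # all 250.xx codes are diabetes
        return "Diabetes"
    group = int(float(code))             # the 3-digit ICD-9 category
    if 390 <= group <= 459 or group == 785:
        return "Circulatory"
    if 460 <= group <= 519 or group == 786:
        return "Respiratory"
    if 520 <= group <= 579 or group == 787:
        return "Digestive"
    if 580 <= group <= 629 or group == 788:
        return "Genitourinary"
    if 710 <= group <= 739:
        return "Musculoskeletal"
    if 800 <= group <= 999:
        return "Injury"
    if 140 <= group <= 239:
        return "Neoplasms"
    return "Other"

for col in ["diag_1", "diag_2", "diag_3"]:
    df[col] = df[col].apply(diag_to_category)
```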

Exploratory Data Analysis

Let's explore the data and try to derive any useful insights that may be helpful in our analysis and model building ahead.

Feature Selection

After cleaning and preprocessing the data, we encode the categorical attributes & select the features to build our model with.

Numerical features are:

Categorical features are:

Numerical categorical features are:

To convert the categorical features to numbers, we use one-hot encoding: a new column is generated for each unique value of a feature, and the value in that column is 1 if the sample has that unique value and 0 otherwise.

To generate these one-hot encoded columns, we will use the get_dummies function.

To avoid generating redundant columns, we can use the drop_first option, which drops the first category of each column.

The get_dummies function does not encode numeric columns by default. To encode the numerical categorical features as well, we first convert them to strings.
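A sketch of the encoding step; the column lists below are illustrative placeholders for the feature lists defined above:

```python
# Illustrative column lists; the actual lists come from the feature selection above
cat_cols = ["race", "gender", "diag_1", "diag_2", "diag_3"]
num_cat_cols = ["admission_type_id", "discharge_disposition_id", "admission_source_id"]

# Cast the numerical categorical columns to strings so get_dummies treats them as categories
for col in num_cat_cols:
    df[col] = df[col].astype(str)

# One-hot encode; drop_first drops one redundant indicator column per feature
df_encoded = pd.get_dummies(df, columns=cat_cols + num_cat_cols, drop_first=True)
```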

The features we will be using to build our model are these encoded columns.

We retain the encounter_id column so we can split the datasets to training and test set as they originally were with the exact same samples.

Our cleaned and preprocessed dataset now has 101766 samples with 151 features and a target variable.

We split the dataframe into a training and test set based on the encounter ids we saved in their respective arrays earlier.

Now that we have split the dataset back to its original state, we can drop the encounter_id column from our training and test sets since it is an uninformative feature and not relevant to building our model.
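A sketch of the split, reusing the train_ids and test_ids arrays saved when the data was loaded:

```python
# Recover the original train/test membership via encounter_id
df_train = df_encoded[df_encoded["encounter_id"].isin(train_ids)].copy()
df_test = df_encoded[df_encoded["encounter_id"].isin(test_ids)].copy()

# encounter_id has served its purpose and carries no predictive information
df_train = df_train.drop(columns=["encounter_id"])
df_test = df_test.drop(columns=["encounter_id"])

print(len(df_train), len(df_test))  # expected: 76324 25442
```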

We have 76324 data samples in the training set and 25442 data samples in the test set post preprocessing and cleaning.

Imbalance dataset handling

Now that data preprocessing and cleaning are done, it might look like we are ready to train our model on the training set and analyse its performance. However, if we simply fit the training data into a predictive model, we are likely to obtain a model with high accuracy that is still not useful. Why?

This is because we have an imbalanced dataset with many more negatives than positives, so the model can score well by assigning almost every sample to the negative class, which defeats the purpose of our predictive model.

To handle this imbalance, we need to rebalance the data in some way to give the positives more weight.

We will be using 2 techniques for dealing with this imbalance:

Let us see the distribution of classes in the training set.

As clearly observed, there are only 8518 positive samples in the entire training data.

We split the training data into a train set and a validation set so we can build our model and assess its performance on the validation set. The validation set is split from the training set in the same proportion that the test set has to the entire dataset, so that the model can be assessed accurately.

We separate the input features and the target for the training and test sets.

We will be using SMOTE and RandomUnderSampler from the imblearn library to perform both the sampling operations.
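A minimal sketch of the split and both resampling operations (the validation fraction, variable names, and random_state are assumptions):

```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Separate features and target, then carve out a validation set
X = df_train.drop(columns=["Class", "readmitted"], errors="ignore")
y = df_train["Class"]
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Oversample the minority (positive) class with SMOTE
X_over, y_over = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

# Randomly drop negative samples until the classes are balanced
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_tr, y_tr)

print(y_over.value_counts())
print(y_under.value_counts())
```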

Oversampling (SMOTE)

Now we have a balanced number of positive and negative samples: 45138 of each.

Random Undersampling

Now we have a balanced number of positive and negative samples: 5744 of each.

We will use both the undersampled and the oversampled training data to train and evaluate our models and compare the difference in performance.

Model Building

Now that our dataset is completely ready to train our model, we train a few machine learning models.

We will be evaluating and optimizing these models and ultimately choosing the best model based on performance on the test set.

We will choose to train 3 types of classifier models and analyze their performance:

In order to evaluate our models, we utilize the following functions and metrics.

Now that we have handled the imbalance in our training data, we set our threshold at 0.5 for labelling a predicted sample as belonging to the positive class.
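A sketch of an evaluation helper along these lines (the function and variable names are assumptions):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

THRESHOLD = 0.5  # probability cut-off for labelling a prediction as positive

def evaluate(model, X, y):
    """Return accuracy, precision, recall and AUC for a fitted classifier."""
    proba = model.predict_proba(X)[:, 1]        # probability of the positive class
    preds = (proba >= THRESHOLD).astype(int)    # apply the 0.5 threshold
    return {
        "accuracy": accuracy_score(y, preds),
        "precision": precision_score(y, preds),
        "recall": recall_score(y, preds),
        "auc": roc_auc_score(y, proba),         # AUC is computed from the raw probabilities
    }
```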

Logistic Regression

We will train the model on both the undersampled and the oversampled training data and evaluate the performance of each.
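For instance, reusing the resampled data and the evaluation helper sketched above (max_iter is an assumption chosen so the solver converges on the one-hot encoded features):

```python
from sklearn.linear_model import LogisticRegression

# One model per resampling strategy, compared on the same validation set
lr_over = LogisticRegression(max_iter=1000).fit(X_over, y_over)
lr_under = LogisticRegression(max_iter=1000).fit(X_under, y_under)

print("oversampled: ", evaluate(lr_over, X_val, y_val))
print("undersampled:", evaluate(lr_under, X_val, y_val))
```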

Decision Tree Classifier

Random Forest Classifier

Analysis of baseline models

For analyzing our baseline models trained above, we create a dataframe with these models' evaluation results and plot them.

For our evaluation of the best model, we will be leveraging the Area under the ROC curve (AUC) metric.

Why?
AUC is a good performance metric to pick the best model since it captures the essential trade off between the true positive rate and the false positive rate. The AUC indicates a model’s ability to distinguish between positive and negative classes in a dataset.
An AUC of 1 indicates a perfect model whose predictions are all correct, while an AUC of 0.5 represents a model no better than random guessing. We aim for an AUC greater than 0.5 in order to accept our model as a decent classifier.

From the barplots shown above, we can make the following inferences:

1. In terms of accuracy, training on oversampled data gives the highest accuracy for the Logistic Regression, Decision Tree, and Random Forest classifiers. However, since our dataset is imbalanced, accuracy isn't the best measure to evaluate our models: a model can achieve high accuracy simply by correctly predicting the abundant negative samples, even without any resampling.

2. Models trained on undersampled data perform considerably better than those trained on oversampled data. We obtain an AUC of 0.665 for the Random Forest classifier trained on undersampled data.

3. Precision is highest for Logistic Regression at around 0.18, which means only about 18% of the samples predicted as positive are actually positive.

4. We obtain a recall of around 0.6 with the Decision Tree classifier, meaning that of all the samples that truly belong to the positive class, around 60% were correctly labelled by the classifier.

Model Selection: Hyperparameter Tuning

The next step is hyperparameter tuning for the models trained above. Hyperparameter tuning optimizes the design decisions made while building our machine learning models, for example the maximum depth of a Random Forest or the number of neighbors for a KNN classifier. All of these hyperparameters can be tuned to improve the model in some way.

We will be optimizing the hyperparameters for Logistic Regression, the Decision Tree classifier, and the Random Forest classifier.
KNN will not be optimized since it is time-consuming to train and its results were not the best compared to the other models.

We will be using a hyperparameter tuning technique called Randomized Search, in which a fixed number of randomly sampled combinations from the provided hyperparameter ranges are trained and evaluated, rather than exhaustively testing every combination as grid search would.

Logistic Regression: Hyperparameter Tuning

Since we obtained the best results for precision, recall, and AUC with undersampled data above, we will be using only undersampled data to tune the model.

In order to use our RandomizedSearchCV function, we need an evaluator metric to evaluate the set of hyperparameters. We will use the AUC metric for our evaluation.
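A sketch of the randomized search for logistic regression on the undersampled training data (the parameter grid, number of iterations, and cross-validation folds are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Candidate hyperparameters (an illustrative grid, not necessarily the one used)
param_dist = {
    "C": np.logspace(-3, 2, 20),   # inverse regularisation strength
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],       # liblinear supports both l1 and l2 penalties
}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_dist,
    n_iter=20,
    scoring="roc_auc",             # evaluate each candidate by AUC
    cv=5,
    random_state=42,
)
search.fit(X_under, y_under)
print(search.best_params_, search.best_score_)
```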

Let us compare our baseline model to the optimized model and check how the performance varies.

There are only minor differences in model performance.
The AUC for the baseline model was 0.664, which increased to 0.671; precision increased from 0.175 to 0.177, recall decreased from 0.565 to 0.555, and accuracy increased from 0.663 to 0.671.

We don't seem to observe any significant differences in performance.

Let us check the next model - Decision Tree Classifier

Decision Tree Classifier: Hyperparameter Tuning

The Decision Tree classifier does not show any significant change in performance with hyperparameter tuning.

Let us tune the next model - Random Forest Classifier and check if there is any improvement in the model.

Random Forest Classifier: Hyperparameter Tuning

We can visualise and plot the optimized and baseline model performance metrics:

Model Selection & Evaluation

Since both the Logistic Regression model and the Random Forest classifier have the best AUC of all the models, we choose these two as our best models.

Let us test our best models on the test set and evaluate the performance.

Conclusion

This project consisted of building a binary classifier to predict the probability that a patient would be readmitted to hospital within 30 days of discharge. On the held-out test dataset, our best model achieved an AUC of 0.665.

Since the complete dataset was highly imbalanced, the model's accuracy in classifying individual samples correctly is somewhat compromised.

Using the Random Forest Classifier model, we are able to catch about 55% of the readmissions, which performs much better than randomly selecting patients.