# MACHINE LEARNING

## Working with Multiple Types of Data in a Single Machine Learning Problem

Problem

You are given a patient’s medical history. You have all the test reports that contain numerical data (BP, sugar, heart rate, lipid profiles, etc.), radiology reports that contain images plus text reports (the doctor’s findings from the images), and nursing reports that contain subjective as well as objective information about the patient’s health. The reports are collected over time, so the patient’s data is time series data. Based on these reports, you have to work out 3 things:

a) What is the percentage chance that the patient will survive, and how many days will he be in the hospital? Which algorithm will you use and why?

b) What is the mortality rate, i.e. whether a person will die in the ICU in the following week or not? Which algorithm will you choose and why?

c) Define some extra features useful for your problem that are missing in the question. In case this is an AI-powered doctor, how will it communicate the current condition to the patient’s family?

. . .

## Solution

In this type of problem, it is advisable to be concise and to the point, and to answer exactly what the other person is looking for. A diagram also says a lot about how well you understand the problem, so draw one to depict your solution pipeline.

Types of Data in the problem:

1. Tabular Data: Test reports containing historical numerical data (BP, sugar, heart rate, lipid profiles, etc.).
2. Image Data: Radiology reports (medical imaging dataset).
3. Text Data: Nursing records plus the doctor’s findings from the image data (radiology metadata).

## Tabular Data

ML Approach: Classical Machine Learning

For the tabular data we use a classical machine learning approach. It starts with feature engineering, in which we can combine features to form new features or drop unimportant ones.

Since this is time series data, the data preparation step needs a window size. The window size is how many past observations you want to include as features. For example:

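If you have the last 5 reports and your window size is 5, then the final table has 5 columns for every single feature, one per report. For BP these could be named `BP_t-1`, `BP_t-2`, ..., `BP_t-5` (the names are only illustrative).

If instead the window size is 2, the final table has 2 columns per feature, such as `BP_t-1` and `BP_t-2`.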
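The windowing step can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the function name `make_lag_features`, the `_lag` column suffix and the sample values are hypothetical, and the sketch assumes the window keeps the most recent reports. In practice a dataframe library would usually do this work.

```python
from typing import Dict, List, Optional


def make_lag_features(
    series: Dict[str, List[float]], window: int
) -> Dict[str, Optional[float]]:
    """Flatten the last `window` reports of each measurement into one row.

    `series` maps a measurement name to its values in chronological order
    (oldest first). If fewer than `window` reports exist, the missing
    columns are filled with None.
    """
    row: Dict[str, Optional[float]] = {}
    for name, values in series.items():
        for lag in range(1, window + 1):
            # lag 1 is the most recent report, lag 2 the one before it, ...
            row[f"{name}_lag{lag}"] = values[-lag] if lag <= len(values) else None
    return row


# Hypothetical history for one patient: five reports, oldest first.
history = {
    "bp": [120.0, 125.0, 130.0, 128.0, 135.0],
    "sugar": [90.0, 95.0, 110.0, 105.0, 100.0],
}

row_w5 = make_lag_features(history, window=5)  # 5 columns per measurement
row_w2 = make_lag_features(history, window=2)  # 2 columns per measurement
print(row_w2)
```

With a window of 5 each measurement contributes five columns, and with a window of 2 only two, which is the trade-off between more history and a narrower table.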

## Image Data

ML Approach: Convolutional Neural Networks

For image data we use Convolutional Neural Networks (CNNs) to extract features or labels. This can be done in either of two ways:

1. Suppose we have a labelled medical dataset that highlights the parts of the lung affected by pneumonia, with 2 labels: the person has pneumonia or not. In this case we can do binary classification on the images and add the predicted label as a feature column in the tabular data.

2. If we do not have a labelled dataset, we can pass the image through a pretrained CNN, say DenseNet-121, to extract features from the image. Do not run the complete DenseNet-121 model; instead cut it at the final fully connected layer and take the feature vector that feeds it. In the standard DenseNet-121 architecture this vector has 1024 values, so a smaller vector such as 256 values needs an extra projection or dimensionality-reduction step. The next step is to add these features as columns in the tabular data.
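Both options end the same way: the CNN output becomes extra columns next to the tabular features. The sketch below shows only that last step and assumes the label or embedding has already been produced elsewhere. The helper names, the `img_` column prefixes and the sample numbers are hypothetical.

```python
from typing import Dict, List


def append_image_label(
    tabular_row: Dict[str, float], has_pneumonia: bool
) -> Dict[str, float]:
    """Option 1: add the classifier's predicted label as a single 0/1 column."""
    fused = dict(tabular_row)  # copy so the input row is not modified
    fused["img_pneumonia"] = 1.0 if has_pneumonia else 0.0
    return fused


def append_image_embedding(
    tabular_row: Dict[str, float], embedding: List[float]
) -> Dict[str, float]:
    """Option 2: add one column per value of the CNN feature vector."""
    fused = dict(tabular_row)
    for i, value in enumerate(embedding):
        fused[f"img_feat_{i}"] = value
    return fused


# Hypothetical tabular features for one patient.
tabular = {"bp_lag1": 135.0, "sugar_lag1": 100.0}

# Stand-in for a CNN feature vector; a real one would be much longer.
fake_embedding = [0.12, -0.40, 0.88]

with_label = append_image_label(tabular, has_pneumonia=True)
with_embedding = append_image_embedding(tabular, fake_embedding)
print(with_embedding)
```

The label option adds one column, while the embedding option adds as many columns as the feature vector has values, so the second may call for dimensionality reduction before training.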

## Text Data

ML Approach: Classical NLP and/or RNN

For text data we use classical NLP or an RNN. The text must first be split into words (tokens), a step called tokenization. The words then need to be encoded as integers or floating-point values for use as input to an ML algorithm. Here too we have two choices:

1. Use TfidfVectorizer: TF-IDF stands for Term Frequency-Inverse Document Frequency. TF-IDF scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents. Using TF-IDF we can convert words into features and append them to our tabular data for final classification.

2. If we have a dataset labelled with the sentiment of the reports, we can train an LSTM/RNN classifier and add its predicted label as a feature in the tabular data.
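To make the TF-IDF idea concrete, here is a small worked sketch in plain Python using the textbook formula: term frequency times log(N / document frequency). The nursing notes are made up, and the function names are hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant by default, so its numbers will differ from these.

```python
import math
from collections import Counter
from typing import Dict, List

PUNCTUATION = ".,;:!?()"


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def tfidf(docs: List[str]) -> List[Dict[str, float]]:
    """Return one {word: score} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # term frequency * inverse document frequency
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical nursing notes.
notes = [
    "Patient stable, no fever.",
    "Patient restless, fever overnight.",
    "Patient stable, eating well.",
]

scores = tfidf(notes)
print(scores[1])
```

A word such as "patient" appears in every note and scores zero, while "restless" appears in only one note and gets the highest weight there. Each distinct word can then become a column appended to the tabular features.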

Q 3-a: What is the percentage chance that the patient will survive, and how many days will he be in the hospital? Which algorithm will you use and why?

Answer: The chance of survival is a probability and the length of stay is a number of days, so this is a predictive modelling problem with a regression-style output. I am assuming that I have a labelled dataset.

As explained above, we can extract features from the different types of data, combine everything into tabular data, and finally go with classical machine learning. The algorithms I would use are XGBoost or LightGBM (Light Gradient Boosting Machine), as they combine many weak regressors/classifiers to give the final output.

Alternatively, we could build a fully connected artificial neural network and pass it the features to get the final answer on the percentage chance of survival and the number of days in hospital.

Q 3-b: What is the mortality rate, i.e. whether a person will die in the ICU in the following week or not? Which algorithm will you choose and why?

Answer: Whether a person will die in the ICU in the following week or not is a classification problem. For such classification problems, good choices are again XGBoost/LightGBM or artificial neural networks.
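The idea of combining many weak learners can be illustrated with a toy gradient-boosting classifier built from one-split decision stumps. This is not XGBoost or LightGBM, only a minimal pure-Python sketch of the same principle. The feature names (age, lactate), the data and the hyperparameters are invented for illustration.

```python
import math
from typing import List, Tuple

# (feature index, threshold, value if x <= threshold, value if x > threshold)
Stump = Tuple[int, float, float, float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X: List[List[float]], residuals: List[float]) -> Stump:
    """Find the single split that best fits the residuals (least squares)."""
    mean = sum(residuals) / len(residuals)
    best: Stump = (0, float("inf"), mean, mean)  # fallback: constant prediction
    best_err = float("inf")
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if err < best_err:
                best, best_err = (j, t, lv, rv), err
    return best


def stump_predict(stump: Stump, row: List[float]) -> float:
    j, t, lv, rv = stump
    return lv if row[j] <= t else rv


def fit_boosting(
    X: List[List[float]], y: List[int], n_rounds: int = 20, lr: float = 0.5
) -> List[Stump]:
    """Each round fits a stump to the log-loss gradient and adds it to the sum."""
    stumps: List[Stump] = []
    raw = [0.0] * len(X)  # raw scores (log-odds); 0 means probability 0.5
    for _ in range(n_rounds):
        # Negative gradient of log-loss with respect to the raw score is (y - p).
        residuals = [yi - sigmoid(s) for yi, s in zip(y, raw)]
        j, t, lv, rv = fit_stump(X, residuals)
        stump = (j, t, lr * lv, lr * rv)  # store leaf values scaled by the learning rate
        stumps.append(stump)
        raw = [s + stump_predict(stump, row) for s, row in zip(raw, X)]
    return stumps


def predict_proba(stumps: List[Stump], row: List[float]) -> float:
    """Probability of the positive class: sigmoid of the summed stump outputs."""
    return sigmoid(sum(stump_predict(s, row) for s in stumps))


# Invented data: [age, lactate]; label 1 = died in the ICU within the week.
X = [[45, 1.0], [50, 1.2], [62, 1.5], [70, 3.8],
     [81, 4.5], [58, 4.1], [77, 1.1], [66, 5.0]]
y = [0, 0, 0, 1, 1, 1, 0, 1]

stumps = fit_boosting(X, y)
print(round(predict_proba(stumps, [60, 4.0]), 3))
```

Each stump alone is a weak predictor, but the sum of their outputs moves the predicted probability further from 0.5 with every round. Production libraries add regularization, deeper trees and much faster split finding on top of this idea.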

Q 3-c: Define some extra features useful for your problem that are missing in the question. In case this is an AI-powered doctor, how will it communicate the current condition to the patient’s family?

Answer: The features missing in the question are the age, gender and previous medical history of the patient. Much of the treatment depends on the gender and age of the person.

As an AI-powered doctor, we can show the regression results to the family members so that they know what to expect. For example, if the chances of survival increased from 40% to 60% in the past 5 days, give them a positive note that the patient is improving!
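One possible way to turn the model's daily outputs into such a message is sketched below. The function name, the 5-point threshold for calling a change meaningful, and the wording of the message are all assumptions made for illustration, and a real system would need clinical review of any text shown to families.

```python
from typing import List


def summarize_trend(daily_survival_pct: List[float], min_change: float = 5.0) -> str:
    """Describe how the predicted survival chance moved over the period.

    `daily_survival_pct` holds one model prediction per day, oldest first.
    `min_change` (an assumed threshold) is the smallest difference, in
    percentage points, that is reported as improving or worsening.
    """
    if len(daily_survival_pct) < 2:
        return "Not enough data yet to describe a trend."
    first, last = daily_survival_pct[0], daily_survival_pct[-1]
    days = len(daily_survival_pct) - 1
    change = last - first
    if change >= min_change:
        trend = "improving"
    elif change <= -min_change:
        trend = "worsening"
    else:
        trend = "stable"
    return (
        f"The estimated chance of survival moved from {first:.0f}% to {last:.0f}% "
        f"over the past {days} days; the patient's condition appears to be {trend}."
    )


# Hypothetical predictions: six daily readings span five days.
message = summarize_trend([40, 45, 50, 52, 57, 60])
print(message)
```

Reporting the direction of change together with the two endpoint values keeps the message simple while still being grounded in the model's output.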