Boyang Su
“One of the first long-term cohort studies of its kind, the NIH's Framingham Heart Study is considered the crown jewel of epidemiology." ––– The National Institutes of Health
In the early 1900s, heart diseases has long been the #1 cause of death in the United States. The Framingham Heart Study, launched in 1948, is a long-term research project that was developed to identify risk factors for cardiovascular disease.
The project was named after Framingham, a town in Massachusetts chosen as the study site. The initial cohort included over 5,200 Framingham residents aged 30 to 62 without heart disease symptoms. Regular medical tests and lifestyle data were collected to observe disease development and connections to heart health over time.
A notable aspect of the study was its inclusion of over half women in the original cohort. This diverse representation prompted a shift in perspective, challenging the notion of heart disease as a man's disease. Acknowledging the necessity to encompass broader demographics, researchers introduced the Omni Cohort in 1994 of 507 men and women, including African-American, Hispanic, Asian, and Native American descent. A subsequent enrollment in 2003 further expanded the Omni participants, aiming to capture the heart health profiles of a more varied population (Reference).
The study's findings have contributed significantly to our understanding of heart disease, its risk factors, and the importance of prevention and early intervention. Notably, the implementation of treatments for hypertension, cholesterol reduction, and smoking cessation has contributed to a substantial 50-year decrease in cardiovascular deaths. Between 1969 and 2013, the United States witnessed a remarkable 67.5% reduction in deaths due to heart disease and a 77% reduction in stroke-related deaths. (Reference).
Over the past century, the study has yielded around 3,000 articles in leading medical journals (Reference). Being one of the earliest longitudinal cohort studies, Framingham has sparked the initiation of numerous longitudinal cohort studies worldwide, and epidemiological tools are now widely employed for comprehending chronic conditions. Furthermore, the study's influence has extended to diverse fields including cancer, stroke, diabetes, and genetics(Reference). The Framingham Heart Study's extensive contributions have left an indelible mark on scientific research and public health policies, fostering a better understanding of heart health and strategies for disease prevention.
In this Jupyter notebook, we'll delve into the Framingham dataset, analyzing its contents, performing exploratory data analysis, and leveraging linear and logistic regression.
Let's see what insights into cardiovascular disease risk factors we can uncover through our exploration!
This notebook should be run in the Workspace on BRH. To use the Workspace, you need to login and request for Workspace access here.
The Framingham dataset is provided by BioDataCatalyst (BDC) Data Commons. Before starting to run the code in the notebook on Workspace, you need to make sure you have authorized BDC Login through the profile page.
Next, we import necessary libraries:
import pandas as pd
import numpy as np
import pyreadstat
import seaborn as sns
import matplotlib.pyplot as plt
# Modules for regression
import statsmodels.api as sm # For statistical models
from statsmodels.sandbox.regression.predstd import wls_prediction_std # For prediction interval
from sklearn.preprocessing import StandardScaler # For feature scaling
from sklearn.linear_model import LogisticRegression # For logistic regression
from sklearn.model_selection import train_test_split # For data splitting
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report # For model evaluation
Import the data set. fram_df
is a pandas DataFrame, while fram_meta
is a metadata object. For more information on metadata object, please refer to here.
# Read data from "frmgham2.sav" file using pyreadstat library
fram_df, fram_meta = pyreadstat.read_sav("frmgham2.sav", apply_value_formats=True)
# Display the first few rows of the DataFrame
fram_df.head()
RANDID | SEX | TOTCHOL | AGE | SYSBP | DIABP | CURSMOKE | CIGPDAY | BMI | DIABETES | ... | CVD | HYPERTEN | TIMEAP | TIMEMI | TIMEMIFC | TIMECHD | TIMESTRK | TIMECVD | TIMEDTH | TIMEHYP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2448.0 | 1.0 | 195.0 | 39.0 | 106.0 | 70.0 | 0.0 | 0.0 | 26.97 | 0.0 | ... | 1.0 | 0.0 | 8766.0 | 6438.0 | 6438.0 | 6438.0 | 8766.0 | 6438.0 | 8766.0 | 8766.0 |
1 | 2448.0 | 1.0 | 209.0 | 52.0 | 121.0 | 66.0 | 0.0 | 0.0 | NaN | 0.0 | ... | 1.0 | 0.0 | 8766.0 | 6438.0 | 6438.0 | 6438.0 | 8766.0 | 6438.0 | 8766.0 | 8766.0 |
2 | 6238.0 | 2.0 | 250.0 | 46.0 | 121.0 | 81.0 | 0.0 | 0.0 | 28.73 | 0.0 | ... | 0.0 | 0.0 | 8766.0 | 8766.0 | 8766.0 | 8766.0 | 8766.0 | 8766.0 | 8766.0 | 8766.0 |
3 | 6238.0 | 2.0 | 260.0 | 52.0 | 105.0 | 69.5 | 0.0 | 0.0 | 29.43 | 0.0 | ... | 0.0 | 0.0 | 8766.0 | 8766.0 | 8766.0 | 8766.0 | 8766.0 | 8766.0 | 8766.0 | 8766.0 |
4 | 6238.0 | 2.0 | 237.0 | 58.0 | 108.0 | 66.0 | 0.0 | 0.0 | 28.50 | 0.0 | ... | 0.0 | 0.0 | 8766.0 | 8766.0 | 8766.0 | 8766.0 | 8766.0 | 8766.0 | 8766.0 | 8766.0 |
5 rows × 39 columns
The data set contains 39 columns and 11627 rows.
# Set option to display all columns in DataFrame
pd.set_option('display.max_columns', None)
# Print the shape (rows, columns) of the DataFrame
fram_df.shape
(11627, 39)
Next, we check the data types in each column.
fram_df.dtypes
RANDID float64 SEX float64 TOTCHOL float64 AGE float64 SYSBP float64 DIABP float64 CURSMOKE float64 CIGPDAY float64 BMI float64 DIABETES float64 BPMEDS float64 HEARTRTE float64 GLUCOSE float64 EDUC float64 PREVCHD float64 PREVAP float64 PREVMI float64 PREVSTRK float64 PREVHYP float64 TIME float64 PERIOD float64 HDLC float64 LDLC float64 DEATH float64 ANGINA float64 HOSPMI float64 MI_FCHD float64 ANYCHD float64 STROKE float64 CVD float64 HYPERTEN float64 TIMEAP float64 TIMEMI float64 TIMEMIFC float64 TIMECHD float64 TIMESTRK float64 TIMECVD float64 TIMEDTH float64 TIMEHYP float64 dtype: object
The data type for all columns is float. In order to understand what the column variables mean, we can look at a field of the metadata, column_names_to_labels
. More detailed explanation on variables of Framingham study can be found here.
fram_meta.column_names_to_labels
{'RANDID': 'Random ID', 'SEX': 'SEX', 'TOTCHOL': 'Serum Cholesterol mg/dL', 'AGE': 'Age (years) at examination', 'SYSBP': 'Systolic BP mmHg', 'DIABP': 'Diastolic BP mmHg', 'CURSMOKE': 'Current Cig Smoker Y/N', 'CIGPDAY': 'Cigarettes per day', 'BMI': 'Body Mass Index (kg/(M*M)', 'DIABETES': 'Diabetic Y/N', 'BPMEDS': 'Anti-hypertensive meds Y/N', 'HEARTRTE': 'Ventricular Rate (beats/min)', 'GLUCOSE': 'Casual Glucose mg/dL', 'EDUC': '0-11 years, HS or GED, Some Coll, Coll Grad+', 'PREVCHD': 'Prevalent CHD (MI,AP,CI)', 'PREVAP': 'Prevalent Angina', 'PREVMI': 'Prevalent MI (Hosp,Silent)', 'PREVSTRK': 'Prevalent Stroke (Infarct,Hem)', 'PREVHYP': 'Prevalent Hypertension', 'TIME': 'Days since Index Exam', 'PERIOD': 'Examination cycle', 'HDLC': 'HDL Cholesterol mg/dL', 'LDLC': 'LDL Cholesterol mg/dL', 'DEATH': 'Death indicator', 'ANGINA': 'Incident Angina Pectoris', 'HOSPMI': 'Incident Hospitalized MI', 'MI_FCHD': 'Incident Hosp MI-Fatal CHD', 'ANYCHD': 'Incident Hosp MI, AP, CI, Fatal CHD', 'STROKE': 'Incident Stroke Fatal/non-fatal', 'CVD': 'Incident Hosp MI or Stroke, Fatal or Non', 'HYPERTEN': 'Incident Hypertension', 'TIMEAP': 'Days Baseline-Inc Angina', 'TIMEMI': 'Days Baseline-Inc Hosp MI', 'TIMEMIFC': 'Days Baseline-Inc MI-Fatal CHD', 'TIMECHD': 'Days Baseline-Inc Any CHD', 'TIMESTRK': 'Days Baseline-Inc Stroke', 'TIMECVD': 'Days Baseline-Inc CVD', 'TIMEDTH': 'Days Baseline-Death', 'TIMEHYP': 'Days Baseline-Inc Hypertension'}
We observe that we have
Besides 'SEX' and 'EDUC', other categorical variables have 0
indicating "NO" and 1
indicating "YES". To ensure clarity and readability, we narrow our focus to the DataFrame containing only the columns relevant to our study. Furthermore, we create an additional column to provide an explanation for the 'SEX' column.
# Select specific columns from the DataFrame
fram_df = fram_df[['RANDID', 'SEX', 'AGE', 'SYSBP', 'DIABP', 'CURSMOKE', 'CIGPDAY', 'BMI', 'TOTCHOL', 'HDLC', 'LDLC', 'GLUCOSE', 'DIABETES',
'PERIOD', 'DEATH', 'ANYCHD', 'STROKE', 'CVD', 'HYPERTEN']]
# Create a mapping dictionary for converting numeric gender values to labels
sexMap = {1: 'Male', 2: 'Female'}
# Add a new 'GENDER' column based on the 'SEX' column using the mapping dictionary
new_col = fram_df['SEX'].map(sexMap)
fram_df['GENDER'] = new_col
Given the longitudinal nature of the study, each participant underwent three tests, spaced two years apart, unless they dropped out for various reasons. To facilitate analysis, we segment the fram_df
into three distinct DataFrames based on the PERIOD
column. The resulting DataFrames, named fram_1
, fram_2
, and fram_3
, correspondingly contain the data pertaining to the first, second, and third tests.
fram_1 = fram_df[fram_df['PERIOD']==1]
fram_2 = fram_df[fram_df['PERIOD']==2]
fram_3 = fram_df[fram_df['PERIOD']==3]
For each period, the RANDID
column should have no duplicates.
print('There are '+ str(fram_1['RANDID'].duplicated().sum())+' duplicates for period 1')
print('There are '+ str(fram_2['RANDID'].duplicated().sum())+' duplicates for period 2')
print('There are '+ str(fram_3['RANDID'].duplicated().sum())+' duplicates for period 3')
There are 0 duplicates for period 1 There are 0 duplicates for period 2 There are 0 duplicates for period 3
Exploratory Data Analysis (EDA) is a preliminary step in data analysis, involving visually and statistically examining a dataset. Through techniques such as summary statistics, data visualization, and distribution analysis, EDA helps to identify anomalies, trends, and potential correlations within the data. These results will help us in formulating hypotheses, refining data preprocessing strategies, and choosing the appropriate model.
Let's start with generating a summary statistics table for fram_df
. The .describe()
method calculates various statistics for each numerical column in the DataFrame. The .T
at the end transposes the table, making columns appear as rows and providing a more readable format where statistics are displayed vertically.
fram_df.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
RANDID | 11627.0 | 5.004741e+06 | 2.900877e+06 | 2448.00 | 2474378.000 | 5006008.00 | 7472730.00 | 9999312.0 |
SEX | 11627.0 | 1.568074e+00 | 4.953655e-01 | 1.00 | 1.000 | 2.00 | 2.00 | 2.0 |
AGE | 11627.0 | 5.479281e+01 | 9.564299e+00 | 32.00 | 48.000 | 54.00 | 62.00 | 81.0 |
SYSBP | 11627.0 | 1.363241e+02 | 2.279862e+01 | 83.50 | 120.000 | 132.00 | 149.00 | 295.0 |
DIABP | 11627.0 | 8.303776e+01 | 1.166014e+01 | 30.00 | 75.000 | 82.00 | 90.00 | 150.0 |
CURSMOKE | 11627.0 | 4.325277e-01 | 4.954479e-01 | 0.00 | 0.000 | 0.00 | 1.00 | 1.0 |
CIGPDAY | 11548.0 | 8.250346e+00 | 1.218689e+01 | 0.00 | 0.000 | 0.00 | 20.00 | 90.0 |
BMI | 11575.0 | 2.587735e+01 | 4.102640e+00 | 14.43 | 23.095 | 25.48 | 28.07 | 56.8 |
TOTCHOL | 11218.0 | 2.411624e+02 | 4.536803e+01 | 107.00 | 210.000 | 238.00 | 268.00 | 696.0 |
HDLC | 3027.0 | 4.936472e+01 | 1.562667e+01 | 10.00 | 39.000 | 48.00 | 58.00 | 189.0 |
LDLC | 3026.0 | 1.764670e+02 | 4.686339e+01 | 20.00 | 145.000 | 173.00 | 205.00 | 565.0 |
GLUCOSE | 10187.0 | 8.412487e+01 | 2.499378e+01 | 39.00 | 72.000 | 80.00 | 89.00 | 478.0 |
DIABETES | 11627.0 | 4.558356e-02 | 2.085892e-01 | 0.00 | 0.000 | 0.00 | 0.00 | 1.0 |
PERIOD | 11627.0 | 1.899286e+00 | 8.074072e-01 | 1.00 | 1.000 | 2.00 | 3.00 | 3.0 |
DEATH | 11627.0 | 3.033457e-01 | 4.597230e-01 | 0.00 | 0.000 | 0.00 | 1.00 | 1.0 |
ANYCHD | 11627.0 | 2.716092e-01 | 4.448086e-01 | 0.00 | 0.000 | 0.00 | 1.00 | 1.0 |
STROKE | 11627.0 | 9.125312e-02 | 2.879811e-01 | 0.00 | 0.000 | 0.00 | 0.00 | 1.0 |
CVD | 11627.0 | 2.493334e-01 | 4.326458e-01 | 0.00 | 0.000 | 0.00 | 0.00 | 1.0 |
HYPERTEN | 11627.0 | 7.432700e-01 | 4.368480e-01 | 0.00 | 0.000 | 1.00 | 1.00 | 1.0 |
Next, we can draw box plots for the numerical values to visualize the chart above.
# Melt the DataFrame for specific variables
df_melted_1 = fram_df.melt(var_name='Variables', value_name='Values', value_vars=['TOTCHOL', 'HDLC', 'LDLC', 'GLUCOSE'])
df_melted_2 = fram_df.melt(var_name='Variables', value_name='Values', value_vars=['AGE', 'SYSBP', 'DIABP', 'CIGPDAY', 'BMI'])
# Create a subplot with two columns for side-by-side plots
plt.figure(figsize=(10, 5))
# Plot the first box plot on the left side
plt.subplot(1, 2, 1)
sns.boxplot(x='Variables', y='Values', data=df_melted_1, palette='Pastel2')
plt.title('Box Plot for Serum Total Cholesterol, High Density\n Lipoprotein Cholesterol, Low Density Lipoprotein\n Cholesterol, and Casual serum glucose')
plt.xlabel('Variables')
plt.ylabel('Values')
# Plot the second box plot on the right side
plt.subplot(1, 2, 2)
sns.boxplot(x='Variables', y='Values', data=df_melted_2, palette='Pastel1')
plt.title('Box Plot for Age, Systolic Blood Pressure, Diastolic\n Blood Pressure, Number of cigarettes smoked each day,\n Body Mass Index, and Heart rate')
plt.xlabel('Variables')
plt.ylabel('Values')
# Adjust layout and display the plots
plt.tight_layout()
plt.show()
From the box plots above, several variables exhibit a long-tail distribution. This suggests there might be outliers, which we might want to remove later before we start building our models.
Next, we draw a few histograms to visualize the distributions of AGE
, BMI
, SYSBP
, TOTCHOL
. We will create a set of three side-by-side histograms, each representing the distribution during different periods, while also showing the distribution for males and females using bars with different colors.
# Create a figure with 1 row and 3 columns of subplots, sharing the same x and y axes
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharex=True, sharey=True)
# Arrange the DataFrames for each period and sort them by gender
data_frames = [fram_1.sort_values(by=['GENDER']), fram_2.sort_values(by=['GENDER']), fram_3.sort_values(by=['GENDER'])]
# Titles for the subplots
titles = ['Period 1', 'Period 2', 'Period 3']
# Loop through the data frames and corresponding axes
for i, df in enumerate(data_frames):
# Create a histogram plot with stacked bars for different genders, and overlay KDE
sns.histplot(data=df, x='AGE', hue='GENDER', palette='viridis', multiple='stack', edgecolor='white', ax=axes[i], bins=38)
# Set x and y labels for the current subplot
axes[i].set_xlabel('Age')
axes[i].set_ylabel('Frequency')
# Set the title for the current subplot
axes[i].set_title(titles[i])
# Adjust layout and display the subplots
plt.tight_layout()
plt.show()
We can see that the age distribution for both men and women is quite similar, showing a slight tendency to be skewed towards higher ages.
# Create a 1x3 grid of subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharex=True, sharey=True)
# List of DataFrames for each period, sorted by gender
data_frames = [fram_1.sort_values(by=['GENDER']), fram_2.sort_values(by=['GENDER']), fram_3.sort_values(by=['GENDER'])]
# Titles for the subplots
titles = ['Period 1', 'Period 2', 'Period 3']
# Loop through the DataFrames and corresponding axes
for i, df in enumerate(data_frames):
# Create a histogram plot of BMI, colored by gender
sns.histplot(data=df, x='BMI', hue='GENDER', palette='coolwarm', multiple='layer', edgecolor='white', ax=axes[i])
# Set x and y labels for the current subplot
axes[i].set_xlabel('Body Mass Index (weight in kg/height m^2)')
axes[i].set_ylabel('Frequency')
# Set the title for the current subplot
axes[i].set_title(titles[i])
# Adjust layout and display the subplots
plt.tight_layout()
plt.show()
In this case, we notice that the BMI distribution appears to be roughly normal, and it's evident that men generally have a higher BMI compared to women.
# Create a 1x3 grid of subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharex=True, sharey=True)
# List of DataFrames for each period, sorted by gender
data_frames = [fram_1.sort_values(by=['GENDER']), fram_2.sort_values(by=['GENDER']), fram_3.sort_values(by=['GENDER'])]
# Titles for the subplots
titles = ['Period 1', 'Period 2', 'Period 3']
# Loop through the DataFrames and corresponding axes
for i, df in enumerate(data_frames):
# Create a histogram plot of Systolic Blood Pressure, colored by gender
sns.histplot(data=df, x='SYSBP', hue='GENDER', palette='viridis', multiple='layer', edgecolor='white', ax=axes[i])
# Set x and y labels for the current subplot
axes[i].set_xlabel('Systolic Blood Pressure (mmHg)')
axes[i].set_ylabel('Frequency')
# Set the title for the current subplot
axes[i].set_title(titles[i])
# Adjust layout and display the subplots
plt.tight_layout()
plt.show()
The distribution of systolic blood pressure is quite similar for both women and men, and it is slightly positively-skewed.
# Create a 1x3 grid of subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharex=True, sharey=True)
# List of DataFrames for each period, sorted by gender
data_frames = [fram_1.sort_values(by=['GENDER']), fram_2.sort_values(by=['GENDER']), fram_3.sort_values(by=['GENDER'])]
# Titles for the subplots
titles = ['Period 1', 'Period 2', 'Period 3']
# Loop through the DataFrames and corresponding axes
for i, df in enumerate(data_frames):
# Create a histogram plot of Serum Total Cholesterol, colored by gender
sns.histplot(data=df, x='TOTCHOL', hue='GENDER', palette='magma', multiple='layer', edgecolor='white', ax=axes[i])
# Set x and y labels for the current subplot
axes[i].set_xlabel('Serum Total Cholesterol (mg/dL)')
axes[i].set_ylabel('Frequency')
# Set the title for the current subplot
axes[i].set_title(titles[i])
# Adjust layout and display the subplots
plt.tight_layout()
plt.show()
The graph shows that cholesterol levels for both men and women are approximately normally distributed. However, it's interesting to note that the increase in cholesterol levels by age is more noticeable in women compared to men. The graph also indicates a more substantial decline in the number of men compared to women. One plausible explanation is that men might be more susceptible to mortality resulting from high cholesterol levels. This could account for the lower average cholesterol level among men.
We aim to explore potential causative factors for cardiovascular disease. Initially, we will assemble a new DataFrame df
that includes the plausible risk factors and disease events. Then we construct a correlation heatmap to visualize the relationships between these variables. Starting from here, we will focus on data in fram_1
.
# Create a new column that combines 'ANYCHD' and 'STROKE', counting for all indicators of cardiovascular diseases
fram_1.loc[:,'ANYCHD_OR_STROKE'] = (fram_1['ANYCHD'] + fram_1['STROKE'] > 0).astype(int)
# Create new DataFrame
df = fram_1[['SEX','TOTCHOL','AGE','SYSBP','DIABP','CURSMOKE','BMI','DIABETES','GLUCOSE','HYPERTEN','ANYCHD_OR_STROKE']]
# Calculate the correlation matrix
correlation_matrix = df.corr()
# Create the heat map with a color range of -1 to 1
plt.figure(figsize=(8, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5, vmin=-1, vmax=1)
plt.title('Correlation Heat Map')
plt.show()
A correlation heatmap uses colors to show how variables relate to each other:
First, we analyze how gender relates to cardiovascular disease. A common way to explore the interactions between categorical variables is the contingency table, also known as a cross-tabulation or a contingency matrix. In a contingency table, each cell represents the count or frequency of observations that fall into a specific combination of categories from the variables being analyzed.
# Create a contingency table
contingency_table = pd.crosstab(df['SEX'], df['ANYCHD_OR_STROKE'])
# Rename the rows (SEX categories) and columns (ANYCHD_OR_STROKE categories)
contingency_table = contingency_table.rename(
index={1: 'Male', 2: 'Female'},
columns={0: 'No', 1: 'Yes'}
)
# Calculate the percentage of "Yes" (positive) outcomes
contingency_table['percentage'] = contingency_table['Yes'] / (contingency_table['Yes'] + contingency_table['No'])
# Print the resulting contingency table
print(contingency_table)
ANYCHD_OR_STROKE No Yes percentage SEX Male 1128 816 0.419753 Female 1830 660 0.265060
The table provides a clear indication that men have an approximately 15% higher likelihood of experiencing cardiovascular disease compared to women.
Next, we examine the potential factors contributing to the systolic blood pressure. Looking at the row for SYSBP
in the heat map, we first exclude the variables that we know are dependent on SYSBP
, namely DIABP
and HYPERTEN
. Other variables that indicate a potential correlation are TOTCHOL
, AGE
, BMI
. To delve deeper into these connections, we'll visualize them using scatterplots and density-based colored scatterplots.
# Set the style to 'whitegrid'
sns.set(style='whitegrid')
# Create scatter plots for each variable against 'SYSBP'
plt.figure(figsize=(15, 4))
# Plot 'TOTCHOL' against 'SYSBP'
plt.subplot(1, 3, 1)
x_jitter = 10 * np.random.randn(df['TOTCHOL'].size) # Add jitter to x-axis data
y_jitter = 4 * np.random.randn(df['SYSBP'].size) # Add jitter to y-axis data
sns.scatterplot(data=df, x=df['TOTCHOL'] + x_jitter, y=df['SYSBP'] + y_jitter, s=5, alpha=0.5)
plt.xlabel('Serum Total Cholesterol (mg/dL)')
plt.ylabel('Systolic Blood Pressure (mmHg)')
plt.xlim(100, 450) # Set x-axis limits
plt.ylim(75, 225) # Set y-axis limits
# Plot 'BMI' against 'SYSBP'
plt.subplot(1, 3, 2)
x_jitter = 1 * np.random.randn(df['BMI'].size) # Add jitter to x-axis data
y_jitter = 4 * np.random.randn(df['SYSBP'].size) # Add jitter to y-axis data
sns.scatterplot(data=df, x=df['BMI'] + x_jitter, y=df['SYSBP'] + y_jitter, s=5, alpha=0.5)
plt.xlabel('BMI (kg/m^2)')
plt.ylabel('Systolic Blood Pressure (mmHg)')
plt.xlim(10, 45) # Set x-axis limits
plt.ylim(75, 225) # Set y-axis limits
# Plot 'AGE' against 'SYSBP'
plt.subplot(1, 3, 3)
x_jitter = 1 * np.random.randn(df['AGE'].size) # Add jitter to x-axis data
y_jitter = 4 * np.random.randn(df['SYSBP'].size) # Add jitter to y-axis data
sns.scatterplot(data=fram_df, x=df['AGE'] + x_jitter, y=df['SYSBP'] + y_jitter, s=5, alpha=0.5)
plt.xlabel('Age')
plt.ylabel('Systolic Blood Pressure (mmHg)')
plt.xlim(30, 70) # Set x-axis limits
plt.ylim(75, 225) # Set y-axis limits
# Adjust the layout to avoid overlapping of titles
plt.tight_layout()
# Display the plots
plt.show()
# Create a figure with a grid of 1 row and 3 columns, specifying the figure size
plt.figure(figsize=(15, 4))
# Create the first subplot for 'TOTCHOL' vs 'SYSBP'
plt.subplot(1, 3, 1)
# Generate a density-based colored scatter plot using kdeplot
sns.kdeplot(data=df, x=df['TOTCHOL'], y=df['SYSBP'], fill=True, cmap='Blues', levels=20)
# Set x-axis and y-axis labels
plt.xlabel('Serum Total Cholesterol (mg/dL)')
plt.ylabel('Systolic Blood Pressure (mmHg)')
# Adjust x-axis and y-axis limits
plt.xlim(100, 450)
plt.ylim(75, 225)
# Create the second subplot for 'BMI' vs 'SYSBP'
plt.subplot(1, 3, 2)
# Generate a density-based colored scatter plot using kdeplot
sns.kdeplot(data=df, x=df['BMI'], y=df['SYSBP'], fill=True, cmap='Purples', levels=20)
# Set x-axis and y-axis labels
plt.xlabel('BMI (kg/m^2)')
plt.ylabel('Systolic Blood Pressure (mmHg)')
# Adjust x-axis and y-axis limits
plt.xlim(10, 45)
plt.ylim(75, 225)
# Create the third subplot for 'AGE' vs 'SYSBP'
plt.subplot(1, 3, 3)
# Generate a density-based colored scatter plot using kdeplot
sns.kdeplot(data=df, x=df['AGE'], y=df['SYSBP'], fill=True, cmap='Reds', levels=20)
# Set x-axis and y-axis labels
plt.xlabel('Age')
plt.ylabel('Systolic Blood Pressure (mmHg)')
# Adjust x-axis and y-axis limits
plt.xlim(30, 70)
plt.ylim(75, 225)
# Adjust the layout to prevent overlapping of subplot titles
plt.tight_layout()
# Display the plots
plt.show()
The scatterplots provide a clear suggestion of a linear relationship between age and systolic blood pressure. However, the connection between BMI and cholesterol with systolic blood pressure is less evident. This prompts a curiosity to delve deeper into understanding how these factors impact systolic blood pressure, an important indicator for hypertension. To explore these relationships further, we turn to linear regression modeling.
Handling outliers and missing data is an important step before performing linear regression. Outliers and missing data can disproportionately influence the regression model's parameters and predictions, leading to biased results.
First, we remove rows that have outlier values in the numerical columns. Z-scores are calculated for columns 'TOTCHOL', 'SYSBP', 'DIABP', 'BMI', and 'GLUCOSE'. Rows with z-scores beyond ±3 standard deviations are identified as outliers and their indices are stored. Finally, the rows with these outlier indices are dropped to create a new DataFrame called filtered_df
.
A Z-score, also known as a standard score, is a statistical measure that quantifies how many standard deviations a data point is away from the mean of a dataset. It is useful for understanding the relative position of a data point within the distribution and thus identifying outliers. Typically, Z-scores are calculated using the formula: $$Z = \frac{X - \mu}{\sigma},$$ where $X$ is the data point, $\mu$ is the mean, and $\sigma$ is the standard deviation of the dataset.
# List of columns you want to filter
columns_to_filter = ['TOTCHOL', 'SYSBP', 'DIABP', 'BMI', 'GLUCOSE']
# Calculate z-scores for the specified columns
z_scores = (df[columns_to_filter] - df[columns_to_filter].mean()) / df[columns_to_filter].std()
# Find indices where z-scores are beyond ±3 standard deviations for each column
outlier_indices = z_scores[(z_scores > 3) | (z_scores < -3)].dropna().index
# Drop the rows with outlier indices from the DataFrame
filtered_df = df.drop(outlier_indices)
Next, we check the missing values in each column.
# Print the shape of the filtered DataFrame to see the number of rows and columns
print(filtered_df.shape)
# Check for missing values in the filtered DataFrame and sum them up for each column
filtered_df.isnull().sum()
(4434, 11)
SEX 0 TOTCHOL 52 AGE 0 SYSBP 0 DIABP 0 CURSMOKE 0 BMI 19 DIABETES 0 GLUCOSE 397 HYPERTEN 0 ANYCHD_OR_STROKE 0 dtype: int64
Upon examination, we observe that the number of missing values in the remaining columns is relatively small when compared to the entire dataset. Consequently, we have chosen to proceed by removing the rows containing these missing values.
#Drop rows with missing values from the DataFrame
filtered_df.dropna(inplace=True)
#Display the new shape of the DataFrame
filtered_df.shape
(4013, 11)
Dropping missing values is a simple way to make sure your data is complete. However, it has downsides such as losing valuable information and reducing the sample size. This can affect the analysis and generalizability, especially if missing data is linked to important variables. If the missing data isn't random and is connected to outcomes or predictors, dropping values can introduce bias. So, before using this method, it's crucial to thoroughly understand your data. Another commonly-used approach is imputation, where missing values are estimated using existing data.
Linear regression is a method for predicting one variable's value ($Y$) based on another's ($X$). We usually call $Y$ the dependent variable and $X$ the independent variable. The word "linear" explains the relation between the two variables, in other words, we want to fit a linear regression line $$\hat{Y} = c + \beta X$$ to the observed data $(X,Y)$.
How can we evaluate the fit of a line to the data? One most commly used metric is the sum of squares error $$SSE=\sum_{(X,Y)}(\hat{Y}-Y)^2=\sum_{(X,Y)}(c+\beta X-Y)^2.$$ We see $|Y-\hat{Y}|$ gives the vertical distance from a data point $(X,Y)$ to the regression line, and we sum the squared distance over all observed data. The difference between the predicted value and the actual value, $Y-\hat{Y}$, is also called the residual.
The ordinary least squares (OLS) regression is the method to obtain the best-fit linear regression line through minimizing the sum of squared distances.
When performing linear regressions, we assume (Reference):
# Select the predictor variable (X) and the target variable (y)
X = filtered_df['BMI']
y = filtered_df['SYSBP']
# Add a constant term to the predictor variable
X = sm.add_constant(X)
# Fit the OLS (ordinary least squares) model
model = sm.OLS(y, X).fit()
# Print the summary of the regression
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: SYSBP R-squared: 0.108 Model: OLS Adj. R-squared: 0.108 Method: Least Squares F-statistic: 485.0 Date: Wed, 23 Aug 2023 Prob (F-statistic): 1.44e-101 Time: 04:31:14 Log-Likelihood: -17959. No. Observations: 4013 AIC: 3.592e+04 Df Residuals: 4011 BIC: 3.593e+04 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 86.3001 2.149 40.159 0.000 82.087 90.513 BMI 1.8083 0.082 22.023 0.000 1.647 1.969 ============================================================================== Omnibus: 807.041 Durbin-Watson: 1.985 Prob(Omnibus): 0.000 Jarque-Bera (JB): 1735.591 Skew: 1.162 Prob(JB): 0.00 Kurtosis: 5.232 Cond. No. 168. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
To better interpret the OLS regression results, we explain some statistics and coefficients provided in the chart:
The t-statistics and p-values in our results indicate a significant relationship between BMI and systolic blood pressure.
To gain a deeper understanding of the validity of our linear regression model, we can assess whether the model assumptions are met. To do this, we examine a plot of the residuals to check homoscedasticity and normality.
# Calculate the residuals
residuals = model.resid
# Create residual plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Residuals vs Fitted values plot
ax1.scatter(X['BMI'], residuals, alpha=0.6, s=1)
ax1.axhline(y=0, color='red', linestyle='--')
ax1.set_xlabel('BMI')
ax1.set_ylabel('Residuals')
ax1.set_title('Residuals Plot')
# Q-Q Plot (Quantile-Quantile Plot) of the residuals
sm.qqplot(residuals, line='s', ax=ax2, markersize=1)
ax2.set_title('Q-Q Plot of Residuals')
plt.tight_layout()
plt.show()
The residuals plot on the left shows that the variance of the residuals remains approximately consistent across different BMI levels.
On the right, the Q-Q (Quantile-Quantile) plot checks if the residuals follow a normal distribution. If the points on the plot form a straight line, it suggests the residuals are normally distributed. Deviations from the line indicate departures from normality, which can impact the reliability of the analysis. In our situation, these deviations highlight that the normality assumption isn't fully satisfied.
To further assess our model, we can plot the regression line together with the confidence interval and prediction interval.
Confidence Interval: a range around a sample statistic that likely contains the true population parameter with 95% confidence.
Prediction Interval: a range around a predicted value that likely contains the actual future observation with 95% confidence.
# Plot the regression results along with confidence interval
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111)
ax.scatter(filtered_df['BMI'], y, s=1, alpha=0.6, label="Data")
ax.plot(filtered_df['BMI'].to_numpy(), model.predict(X).to_numpy(), color='red', label="Regression Line")
ax.set_xlabel('BMI')
ax.set_ylabel('SYSBP')
## Plot the prediction interval
prstd, iv_l, iv_u = wls_prediction_std(model)
ax.plot(filtered_df['BMI'].to_numpy(), iv_u.to_numpy(), color='grey', linestyle='--', label='95% Prediction Interval')
ax.plot(filtered_df['BMI'].to_numpy(), iv_l.to_numpy(), color='grey', linestyle='--')
# Plot the confidence intervals
predict_mean_ci = model.get_prediction(X).conf_int()
ax.plot(filtered_df['BMI'].to_numpy(), predict_mean_ci[:, 0], color='purple', linestyle='--', linewidth=0.5, label='95% Confidence Interval')
ax.plot(filtered_df['BMI'].to_numpy(), predict_mean_ci[:, 1], color='purple', linestyle='--', linewidth=0.5)
# Display the legend
ax.legend()
plt.show()
However, the relatively small r-squared value indicates that a single variable might not explain the variation of systolic blood pressure well enough. Hence, we move on to performing multivariable linear regresssion. In order to better compare the importance of each variable, we normalize the independent variables before we conduct linear regresssion.
# Select predictor variables (X) and the target variable (y)
X = filtered_df[['TOTCHOL', 'AGE', 'BMI', 'GLUCOSE']]
y = filtered_df['SYSBP'] # Replace with the actual target variable name
# Initialize the StandardScaler
scaler = StandardScaler()
# Fit the scaler on the predictor variables and transform them
X_scaled_values = scaler.fit_transform(X)
# Create a new DataFrame with scaled values and original column names
X_scaled = pd.DataFrame(X_scaled_values, columns=X.columns)
# Add a constant term to the predictor variables
X_scaled_with_constant = sm.add_constant(X_scaled)
# Reset the index of the target variable y
y_reset = y.reset_index(drop=True)
# Fit the multivariable linear regression model
model = sm.OLS(y_reset, X_scaled_with_constant).fit()
# Print the summary of the regression results
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: SYSBP R-squared: 0.244 Model: OLS Adj. R-squared: 0.243 Method: Least Squares F-statistic: 323.2 Date: Wed, 23 Aug 2023 Prob (F-statistic): 2.39e-241 Time: 04:31:27 Log-Likelihood: -17627. No. Observations: 4013 AIC: 3.526e+04 Df Residuals: 4008 BIC: 3.529e+04 Df Model: 4 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 133.0470 0.309 430.639 0.000 132.441 133.653 TOTCHOL 1.9756 0.320 6.167 0.000 1.348 2.604 AGE 7.4275 0.323 23.024 0.000 6.795 8.060 BMI 6.0429 0.314 19.249 0.000 5.427 6.658 GLUCOSE 1.2864 0.312 4.119 0.000 0.674 1.899 ============================================================================== Omnibus: 637.733 Durbin-Watson: 2.008 Prob(Omnibus): 0.000 Jarque-Bera (JB): 1374.770 Skew: 0.935 Prob(JB): 2.97e-299 Kurtosis: 5.174 Cond. No. 1.38 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Based on the information from the table, the R-squared value has shown a notable enhancement. This improvement suggests that the prior single linear regression lacked sufficient variables to account for the variations in the predicted value. Moreover, all four variables exhibit significance. Among them, age and BMI stand out as having a relatively stronger positive impact on systolic blood pressure when compared to cholesterol and glucose levels.
The preceding section focused on linear regression with the goal of understanding the contributing factors to hypertension, which is often signaled by elevated systolic blood pressure. Linear regression is effective for analyzing relationships involving continuous numerical variables. However, when we shift our attention to the indicator for cardiovascular diseases—the column labeled ANYCHD_OR_STROKE
—we encounter a categorical variable. This transition renders linear regression unsuitable for predicting this categorical value.
Thus, we introduce logistic regression. The logistic function, presented in its standard form as: $$p(x) = \frac{1}{1 + e^{-(\beta_0+\beta x)}}.$$ describes a curve that follows a sigmoid shape and is monotonically increasing. Notably, the values of $p(x)$ lie between $0$ and $1$ and $p(-\beta_0/\beta)=1/2$ gives the midpoint of this function. The constant $\beta$ denotes the logistic growth rate of the curve: the greater $|\beta|$ is the steeper the curve is. In our context, $p(x)$ serves as a predictive tool for estimating the probability of belonging to the positive class, which means the associated dependent variable $y = 1$.
The metric used to calculate the goodness of fit is logistic loss. The logistic loss for a data point $(x,y)$ is $$L(y,\widehat{p}) = -[y \ln(\widehat{p}) + (1 - y) \ln(1 - \widehat{p})],$$ where $y$ is the true binary outcome (0 or 1) and $\widehat{p}=p(x)$ is the predicted probability of the positive class (1). To better understand this loss function, consider the case when $y=1$. We have $L(y,\widehat{p})=-\ln(\widehat{p})$, which is smaller if $\widehat{p}$ is closer to $1$.
In the process for finding the best-fitting logistic function, we aim to minimize the loss $$\sum_{(x,y)}L(y,\widehat{p})=\sum_{y=0}-\ln(1-p(x))+\sum_{y=1}-\ln(p(x)),$$ which is equivalent to maximizing the likelihood function $$\prod_{y=0}(1-p(x))\prod_{y=1}p(x).$$
We proceed with a logistic regression to predict cardiovascular disease based on gender, cholesterol, age, systolic blood pressure, BMI, and glucose. To assess our model's effectiveness, we adopt a straightforward approach by dividing the data into two sets: the training and testing set. We will first construct our model using the training set, and subsequently evaluate its performance on the testing set.
This step of train test split is the simplest form of cross-validation, a machine learning technique that involves dividing the dataset into subsets for repeated model training and evaluation. This method aids in estimating how well the model generalizes, recognizing potential overfitting, and providing a more reliable assessment of its predictive capabilities. More advanced methods like k-fold cross-validation and leave-one-out cross-validation provide more comprehensive ways of assessing a model's performance.
X = filtered_df[['SEX', 'TOTCHOL', 'AGE', 'SYSBP', 'BMI', 'GLUCOSE']]
y = filtered_df['ANYCHD_OR_STROKE']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
# Create and fit the logistic regression model
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = logistic_model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
confusion_mat = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", confusion_mat)
print("Classification Report:\n", classification_rep)
Accuracy: 0.7014925373134329 Confusion Matrix: [[239 25] [ 95 43]] Classification Report: precision recall f1-score support 0 0.72 0.91 0.80 264 1 0.63 0.31 0.42 138 accuracy 0.70 402 macro avg 0.67 0.61 0.61 402 weighted avg 0.69 0.70 0.67 402
Based on the output, we observe that our model achieves an accuracy rate of over 70% when it comes to predicting cardiovascular diseases. Below is a explanation for the confusion matrix.
Some important concepts relavant to the assessment of classification outcomes are:
In our model, the recall is relatively lower compared to precision, primarily due to a larger number of false negatives. This suggests there's potential for enhancing sensitivity in our model, improving its ability to correctly identify more instances of the positive class.
Another key consideration when working with logistic regression is addressing the issue of imbalanced data. Looking at the confusion matrix, we observe 334 positive instances compared to only 68 negative instances. This imbalance can result in a model that appears to perform well on the majority class but struggles with the minority class.
To address this problem, we can assign class weights during training or leverage models like Random Forest and Gradient Boosting that handle imbalanced data effectively. These strategies enhance model accuracy and fairness.