Data Exploration of Childhood Asthma Management Program Study Teaching Dataset¶

Qiong Liu

This teaching dataset was developed using the Childhood Asthma Management Program(CAMP) as the data source. This trial was designed to assess the long-term effects of three treatments(budesonide, nedocromil, or placebo) on pulmonary function. The dataset includes longitudinal data of 695 particpants from CAMP trial. This teaching dataset was permutated and anonymized for teaching and training purposes. This data was not prepared to reproduce the primary outcome results.

In this tutorial, we will demonstrate how to pull the object file of CAMP study from BioData Catalyst data commons into a BRH workspace, and perform data exploration and visualization using Python packages.

Import Python libraries¶

In [1]:
import pandas as pd
import numpy as np
import pyreadstat
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
from scipy import stats

pd.set_option('mode.chained_assignment', None)

Read object file¶

Note: Please authorize BioDataCatalyst RAS login first under user profile page.

In [2]:
!gen3 drs-pull object dg.4503/8d84511c-76f9-4464-8fdf-6dd668ed9c64

camp_df, camp_meta = pyreadstat.read_sav("camp_teach.sav", apply_value_formats=True)
In [3]:
# Display column names and column description
col_names =  camp_meta.column_names_to_labels
pd.DataFrame(col_names.items(), columns=['Label', 'Name'])
Out[3]:
Label Name
0 TX Treatment group: bud,ned,pbud,or pned
1 TG Treatment group: A=bud, B=ned, C=plbo
2 id None
3 age_rz Age in years at Randomization
4 GENDER m=male, f=female
5 ETHNIC w=white,b=black,h=hispanic,o=other
6 hemog Hemoglobin (g/dl)
7 PREFEV PreBD FEV1
8 PREFVC PreBD FVC
9 PREFF PreBD FEV1/FVC ratio %
10 PREPF PreBD peak flow
11 POSFEV PostBD FEV1
12 POSFVC PostBD FVC
13 POSFF PostBD FEV1/FVC ratio %
14 POSPF PostBD peak flow
15 PREFEVPP PreBD FEV1 %pred
16 PREFVCPP PreBD FVC %pred
17 POSFEVPP PostBD FEV1 %pred
18 POSFVCPP PostBD FVC %pred
19 wbc White Blood Cell count (1000 cells/ul)
20 agehome Age of current home (years)
21 anypet Any pets, 1=Yes 2=No
22 woodstove Used wood stove for heating/cooking, 1=Yes 2=No
23 dehumid Use a dehumidifier, 1=Yes 2=No 3=DK
24 parent_smokes Either Parent/partner smokes in home, 1=Yes 2=No
25 any_smokes Anyone (including visitors) smokes in home, 1=...
26 visitc Followup Visit (mos)
27 fdays Days since randomization

FEV1 is the amount of air you can force from your lungs in one second. The normal range of FEV1 varies from person to person. They’re based on standards for an average healthy person of your age, race, height, and gender. Each person has their own predicted FEV1 value. Therefore, FEV1 percentage of predicted (FEVPP) can be used as the key measurement for pulmonary function assessment.

Participants demographic data exploration and visualization¶

In [4]:
# add age group to the dataframe
def age_group(agelist):
    grouplabel1 = "Early Childhood (2-5yr)"
    grouplabel2 = "Middle Childhood (6-11yr)"
    grouplabel3 = "Early Adolescence (12-18yr)"
    grouplist = []
    for i in agelist:
        if i <= 5:
            grouplist.append(grouplabel1)
        elif i <= 11:
            grouplist.append(grouplabel2)
        elif i >= 12:
            grouplist.append(grouplabel3)
        else:
            grouplist.append("NA")
    return grouplist
camp_df['age_group'] = age_group(camp_df['age_rz'])

first_visit = camp_df.loc[(camp_df["visitc"]=="000")]
first_visit.head(3)
Out[4]:
TX TG id age_rz GENDER ETHNIC hemog PREFEV PREFVC PREFF ... wbc agehome anypet woodstove dehumid parent_smokes any_smokes visitc fdays age_group
0 ned B 1.0 5.0 m o 12.5 1.38 1.75 79.0 ... 65.0 50.0 1.0 2.0 2.0 1.0 1.0 000 0.0 Early Childhood (2-5yr)
15 ned B 2.0 11.0 m b 12.5 1.78 2.49 71.0 ... 82.0 34.0 1.0 2.0 1.0 2.0 1.0 000 0.0 Middle Childhood (6-11yr)
31 ned B 4.0 7.0 f w 13.6 2.28 2.62 87.0 ... 54.0 34.0 2.0 2.0 2.0 2.0 2.0 000 0.0 Middle Childhood (6-11yr)

3 rows × 29 columns

In [5]:
# The row number of the df first_visit shows how many participants were enrolled to the study
first_visit.shape
Out[5]:
(695, 29)
In [6]:
# Shows the counts of both genders
first_visit['GENDER'].value_counts()
Out[6]:
m    412
f    283
Name: GENDER, dtype: int64
In [7]:
# Plot the composition of age groups by gender among participants in the CAMP study
count_sex_age = pd.crosstab(index=first_visit['age_group'], columns=first_visit['GENDER'])

labels=['Early Adolescence (12-18yr)', 'Early childhood (2-5yr)', 'Middle childhood (6-11yr)']
pie_age_gender = make_subplots(1, 2, specs=[[{'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=['Female', 'Male'])
pie_age_gender.add_trace(go.Pie(labels=labels, values=count_sex_age['f'], scalegroup='one',
                     name="Female"), 1, 1)
pie_age_gender.add_trace(go.Pie(labels=labels, values=count_sex_age['m'], scalegroup='one',
                     name="Male"), 1, 2)

pie_age_gender.update_layout(title_text='Gender and Age Characteristics of CAMP Study',
                         annotations=[dict(text='Female', x=0.225, y=0.47, font_size=15, showarrow=False),
                                      dict(text='Male', x=0.78, y=0.46, font_size=15, showarrow=False)],
                            width=800, height=400)
pie_age_gender.update_traces(hole=.4, hoverinfo="label+value+percent+name")
pie_age_gender.show()
In [8]:
# Plot the composition of ethnicity groups by gender among participants in the CAMP study
count_sex_ethnic = pd.crosstab(index=first_visit['ETHNIC'], columns=first_visit['GENDER'])

ethnic_labels= ["black","hispanic","other","white"]
pie_ethnic_gender = make_subplots(1, 2, specs=[[{'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=['Female', 'Male'])
pie_ethnic_gender.add_trace(go.Pie(labels=ethnic_labels, values=count_sex_ethnic['f'], scalegroup='one',
                     name="Female"), 1, 1)
pie_ethnic_gender.add_trace(go.Pie(labels=ethnic_labels, values=count_sex_ethnic['m'], scalegroup='one',
                     name="Male"), 1, 2)

pie_ethnic_gender.update_layout(title_text='Gender and Ethnicity Characteristics of CAMP Study',
                         annotations=[dict(text='Female', x=0.225, y=0.47, font_size=15, showarrow=False),
                                      dict(text='Male', x=0.78, y=0.46, font_size=15, showarrow=False)],
                               width=800, height=400)
pie_ethnic_gender.update_traces(hole=.4, hoverinfo="label+value+percent+name")
pie_ethnic_gender.show()
In [9]:
# Counts of participants of different treatment groups
first_visit['TX'].value_counts()
Out[9]:
ned     210
bud     210
pned    141
pbud    134
Name: TX, dtype: int64
  • At the begining of the study, there were a total of 695 participants enrolled to the study.
  • Both treatment groups of budesonide and nedocromil had 210 participants. A total of 141 participants and 134 participants were enrolled to the group of placebo nedocromil and placebo budesonide, respectively.
  • More male participants (412) were enrolled than females (283).
  • Over 3 quarters of the participants at enrollment were from age group middle childhood (6-11 yr). The composition of age groups between male and female participants are similar.
  • Nearly three quarters of the participants were white. The rest of the participants were from black, hispanic, or other ethnicity groups.

Key measurements exploration and visualization¶

PREFEVPP is the FEV1 percentage of predicted before bronchodialators. We chose this variable because it reflects the pulmonary function of participants without the confounding effect of bronchodialator. In this section, we explored the variable of PREFEVPP between different ethnicity groups and differnt follow up visits

In [10]:
first_visit_rmna = first_visit[first_visit['PREFEVPP'].isna()==False]
In [11]:
# Visualize the boxplots of PREFEVPP of different ethnicity groups at first visit
histo_PREFEVPP_ethnic  = px.box(first_visit_rmna, x='ETHNIC', 
                                y="PREFEVPP", color="GENDER", 
                                title="Boxplot of PREFEVPP at First Visit")
histo_PREFEVPP_ethnic.show()
In [12]:
camp_df['visit_month'] = camp_df['visitc'].astype(int)

visit_month_list = [0,12,24,36,48,60,72]
# extrac id that have PREFEVPP value for all of these visits
all_id = camp_df['id'].unique().tolist()
fig_id_list=[]
for i in all_id:
    i_df = camp_df[(camp_df['id']==i) & (camp_df['PREFEVPP'].isna()==False)]
    i_df_visit =  i_df['visit_month'].tolist()
    if set(visit_month_list).issubset(set(i_df_visit)):
        fig_id_list.append(i)
    else: pass
In [13]:
camp_df_subset = camp_df.loc[camp_df['id'].isin(fig_id_list) & (camp_df['visit_month'].isin(visit_month_list))]
line_PREFEVPP_visit =  px.box(camp_df_subset, x='visit_month', y='PREFEVPP', 
                              color="GENDER",
                              facet_row="TX",
                              width=800, height=800)
line_PREFEVPP_visit.show()

Analysis of long-term effect of budesonide and nedocromil on pulmonary function¶

The purpose of this notebook is to assess the long term effect of two drugs. Given the number of observations at different time points of follow up visits, we selected 72 months (6 years since intervention treatment) as the time point to evaluate the long term effect of medicine intervention.

In [14]:
# Define id list that have both records of PREFEVPP at 72 month visit and first visit
def intersection(lst1, lst2):
    return list(set(lst1) & set(lst2))

visit_72_id = camp_df.loc[(camp_df["visitc"]=="072") & (camp_df["PREFEVPP"].isna()==False),]['id'].tolist()
visit_0_id = camp_df.loc[(camp_df["visitc"]=="000") & (camp_df["PREFEVPP"].isna()==False),]['id'].tolist()

id_intersect =intersection(visit_72_id, visit_0_id)

visit_72_df = camp_df.loc[(camp_df['id'].isin(id_intersect)) & (camp_df['visit_month'].isin([72])),['id','PREFEVPP','POSFEVPP']]
visit_0_df = camp_df.loc[(camp_df['id'].isin(id_intersect)) & (camp_df['visit_month'].isin([0])),]

visit_72_df = visit_72_df.rename(columns={"PREFEVPP":"PREFEVPP_72", "POSFEVPP":"POSFEVPP_72"})

# merge two dfs 
fev1_72_df = pd.merge(visit_0_df, visit_72_df, how='inner', on='id')

# create a new variable PREFEVPP_diff that calculates the difference between PREFEVPP value at 72 and 0 month
fev1_72_df['PREFEVPP_diff'] = fev1_72_df['PREFEVPP_72']-fev1_72_df['PREFEVPP']
fev1_72_df.head()
Out[14]:
TX TG id age_rz GENDER ETHNIC hemog PREFEV PREFVC PREFF ... dehumid parent_smokes any_smokes visitc fdays age_group visit_month PREFEVPP_72 POSFEVPP_72 PREFEVPP_diff
0 ned B 1.0 5.0 m o 12.5 1.38 1.75 79.0 ... 2.0 1.0 1.0 000 0.0 Early Childhood (2-5yr) 0 92.0 98.0 11.0
1 ned B 2.0 11.0 m b 12.5 1.78 2.49 71.0 ... 1.0 2.0 1.0 000 0.0 Middle Childhood (6-11yr) 0 94.0 110.0 4.0
2 ned B 4.0 7.0 f w 13.6 2.28 2.62 87.0 ... 2.0 2.0 2.0 000 0.0 Middle Childhood (6-11yr) 0 102.0 111.0 -2.0
3 ned B 5.0 5.0 m h 13.8 1.02 1.17 87.0 ... 2.0 2.0 2.0 000 0.0 Early Childhood (2-5yr) 0 92.0 107.0 -20.0
4 bud A 9.0 12.0 f w 12.6 1.51 1.84 82.0 ... 2.0 2.0 2.0 000 0.0 Early Adolescence (12-18yr) 0 85.0 103.0 -24.0

5 rows × 33 columns

In [15]:
# The number of participants of different treatment groups
fev1_72_df['TX'].value_counts()
Out[15]:
bud     178
ned     168
pned    124
pbud    109
Name: TX, dtype: int64
In [16]:
# Here are the histogram distributions of PREFEVPP diff across 4 treatment groups
# The histogram shows that the PREFEVPP_diff follows normal distribution
fig_his = px.histogram(fev1_72_df, x="PREFEVPP_diff", facet_row="TX",
                       title="Histograms of PREFEVPP Diffs Between 72 and 0 Month", 
                       height=600, width=800)
fig_his.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig_his.update_layout(
    xaxis_title="Diff of PREFEVPP between 72 and 0 month"
)
fig_his.show()
In [17]:
# Extract PREFEVPP_diff from each treatment group
bud_diff = fev1_72_df[fev1_72_df['TX']=='bud']['PREFEVPP_diff']
pbud_diff = fev1_72_df[fev1_72_df['TX']=='pbud']['PREFEVPP_diff']
ned_diff = fev1_72_df[fev1_72_df['TX']=='ned']['PREFEVPP_diff']
pned_diff = fev1_72_df[fev1_72_df['TX']=='pned']['PREFEVPP_diff']
In [18]:
# T test between bud treatment group and bud placebo control group 
stats.ttest_ind(bud_diff, pbud_diff, equal_var=False, nan_policy='raise')
Out[18]:
Ttest_indResult(statistic=1.5158970151693059, pvalue=0.13114722800749987)
In [19]:
# T test between ned treatment group and ned placebo control group 
stats.ttest_ind(ned_diff, pned_diff, equal_var=False, nan_policy='raise')
Out[19]:
Ttest_indResult(statistic=0.20451809019881795, pvalue=0.8380950900486477)
  • The P values of both T-tests were bigger than 0.05. We failed to reject the hypothesis H0 that 2 independent samples have identical average (expected) values.
  • Therefore, there was no significant difference found between the intervention treatment and their placebo group for both medications in this dataset.