Diabetes is a major public health concern in the United State. An estimated 30.3 million people in the U.S., or 9.4% of the population, have diabetes [1]. It also imposes a substantial burden on society, provided that the total estimated cost of diagnosed diabetes in 2017 was $327 billion, including $237 billion in direct medical costs and $90 billion in reduced productivity [2]. As a management of diabetes, people who are diagnosed with diabetes are advised to control their diet and choose foods that are lower in calories for their health [3]. However, it is unclear if people in the U.S. actually follow this guideline.
Hence, the research question we are going to answer in this report is:
Do people diagnosed with diabetes consume fewer calories in the United States?
We found it to be an interesting question and worth studying, given the high prevalence of diabetes and the importance of diet control in diabetes self-management.
We plan to use National Health and Nutrition Examination Survey (NHANES) data, and fit a Linear Mixed Model (LMM) to analyze the effect of diabetes diagnosis on an individual’s daily consumption of total calories, controling for other covariates. The LMM result will give us information on the calorie intake difference among people with and without diabetes.
We used the National Health and Nutrition Examination Survey (NHANES) 2015-2016 data, which is publicly available on Centers for Disease Control and Prevention (CDC) website.
The datasets and variables we used for our research question are summarized below:
Dataset | Dataset Description | Variable name | Variable Description | Variable type |
---|---|---|---|---|
(ALL) | SEQN | Respondent sequence number |
Numeric | |
Demographics data | ||||
DEMO_I | Demographics | RIDAGEYR | Age in years at screening | Numeric |
DEMO_I | Demographics | RIAGENDR | Gender | Numeric |
DEMO_I | Demographics | RIDEXPRG | Pregnancy status at exam | Numeric |
Dietary data | ||||
DR1TOT_I | Total Nutrient Intakes First Day |
DR1TKCAL | Energy (kcal) | Numeric |
DR2TOT_I | Total Nutrient Intakes Second Day |
DR2TKCAL | Energy (kcal) | Numeric |
Examination data | ||||
BMX_I | Body Measurement | BMXBMI | Body Mass Index (kg/m\(^2\)) | Numeric |
Questionnaire data | ||||
DIQ_I | Diabetes | DIQ010 | Doctor told you have diabetes | Numeric |
PAQ_I | Physical Activity | PAD615 | Minutes vigorous-intensity work | Numeric |
PAQ_I | Physical Activity | PAD630 | Minutes moderate-intensity work | Numeric |
PAQ_I | Physical Activity | PAD680 | Minutes sedentary activity | Numeric |
Linear mixed model (LMM) is an extension of simple linear model to include both fixed and random effects. It is particular useful when there is non independence in the data [4]. In NHANES data, participants were interviewed for dietary intake on two different days, thus the total calorie intake data for the same respondent is not independent. Hence, we fitted an LMM to analyze the effect of diabetes diagnosis on consumption of total calories from both interview days, controlling for demographics, BMI and physical activity, and including a random intercept for each respondent.
Note: NHANES has sampling weights associated with each respondent. However, in our analysis, we treate each respondent as i.i.d. data, ignoring the sampling weights.
Below we provide detailed steps of our analysis:
XPT files needed for the analysis were downloaded from the CDC website. Different datasets were merged into one using SEQN (respondent sequence number), and only variables we listed in the above Data section were included in the merged data.
We have one merged dataset for day 1 only, one for day 2 only, and another one with day 1 and day 2 data stacked together.
Respondents with unknown diabetes are excluded from analysis.
Respondents with confirmed pregnancy (RIDEXPRG) were excluded from our analysis.
Ages over 89 are considered protected health infomration (PHI) [5]. Therefore, to avoid identifiability, NHANES data top-coded age variable (RIDAGEYR) at 80 years old, which means respondents aged 80 or older were all recorded as 80 in NHANES demographics data. As a result, We only include adolescent and adult respondents aged 12 to 79 in our analysis.
We created a new binary diabetes
variable, where respondents with diabetes diagnosis or borderline diagnosis are both considered positive (1) in this new variable, and respondents with confirmed non-diabetes diagnosis are considered negative (0).
We recoded the two-level (1 vs 2) gender variable to a new male
variable, where 1 indicates male, and 0 indicates female.
We added a day
variable to indicate the dietary interview day (1 or 2).
Missingness of each variable was checked in our analysis. Variables with large proportion of missing were furthre excluded from our analysis.
Observations with missing in any of the variables we are going to use were deleted from the final working dataset.
We checked the normality of our response variable - total energy (kcal), by interview day, and discussed the necessity of transformation.
The correlation matrix of predictors were examined.
We then fitted a simple linear regression model using day 1 data only, and examined the variance inflation factor (VIF) to assess multicollinearity.
Predictors were the selected based on the multicollinearity inspection results.
We used both days’ data to then fit a linear mixed model, including a random intercept for each respondent.
We provided predicted total calorie intake by diabetes and gender status when continuous covariates are held at population means.
The data preparation, model checking and model fitting were performed parallel in R [6], SAS [7], and Python [8]. The core packages/functions used for LMM are R - lmer4 package [9], SAS - proc mixed [10], and Python statsmodels package [11]. A summary of tools we used is shown in the table below:
Language | Version | LMM | Others |
---|---|---|---|
R | 3.6.0 | lme4 | foreign, tidyr, plyr, dplyr, corrplot, ggplot2, ggpredict |
SAS | 9.4 | proc mixed | proc reg |
Python | 3.6 | statsmodels | pandas, dfply, scipy, matplotlib, patsy |
Question: Do people diagnosed with diabetes consume fewer calories in the US?
DEMO = read.xport("./Data/DEMO_I.XPT")
DIQ = read.xport("./Data/DIQ_I.XPT")
PAQ = read.xport("./Data/PAQ_I.XPT")
DR1 = read.xport("./Data/DR1TOT_I.XPT")
DR2 = read.xport("./Data/DR2TOT_I.XPT")
BMX = read.xport("./Data/BMX_I.XPT")
Variables of interest:
Dataset | Variable Name | Variable Description | Notes on variable use |
---|---|---|---|
(ALL) | SEQN | Respondent sequence number | For merging purpose |
DEMO | RIDAGEYR | Age in years at screening | Top-coded at 80 and above, |
convert to age groups | |||
RIAGENDR | Gender | ||
RIDEXPRG | Pregnancy status at exam | Exclude pregnant participants | |
DIQ | DIQ010 | Doctor told you have diabetes | Binary |
PAQ | PAD615 | Vigorous-intensity work (min) | |
PAD630 | Moderate-intensity work (min) | ||
PAD680 | Sedentary activity (min) | ||
DR1 | DR1TKCAL | Energy (kcal) | Response variable |
DR2 | DR2TKCAL | Energy (kcal) | Response variable |
BMX | BMXBMI | Body Mass Index (kg/m**2) |
# Merge all of the relevant variables
dm = join_all(list(DEMO[,c("SEQN", "RIDAGEYR", "RIAGENDR", "RIDEXPRG")],
DIQ[,c("SEQN", "DIQ010")],
PAQ[,c("SEQN", "PAD615", "PAD630", "PAD680")],
DR1[,c("SEQN", "DR1TKCAL")],
DR2[,c("SEQN", "DR2TKCAL")],
BMX[,c("SEQN", "BMXBMI")]),
by = "SEQN")
names(dm) = c("SEQN","age","gender","pregnancy","DIQ010",
"vig_work","mod_work","sed_act","DR1","DR2","bmi")
# create a "not in" operator
`%notin%` = Negate(`%in%`)
# exclude data without diabetes diagnosis
c = c(1,2,3)
dm = dm[-which(dm$DIQ010 %notin% c), ]
# exclude pregnant participants
dm = dm[-which(dm$pregnancy==1), ]
# create a new variable representing diabetes
# 1 - diabetes or borderline
# 0 - non-diabetes
dm$diabetes = ifelse(dm$DIQ010==2, 0, 1)
# create a new variable representing male
# 1 - male
# 0 - female
dm$male = ifelse(dm$gender==1, 1, 0)
# pivot the data into "long" format
# and create a variable representing dietary interview day (1/2)
dm = dm %>%
pivot_longer(
cols = starts_with("DR"),
names_to = "day",
names_prefix = "DR",
values_to = "kcal",
values_drop_na = FALSE
)%>%
arrange(SEQN, day)
# check for missing values for each variable
colSums(is.na(dm))
## SEQN age gender pregnancy DIQ010 vig_work mod_work
## 0 0 0 16566 0 16312 13884
## sed_act bmi diabetes male day kcal
## 5248 1636 0 0 0 4314
vig_work
and mod_work
have much more missing values than other variables, so it is reasonable to remove them from the data frame.
pregnancy
can also be removed since we only need this variable to filter out pregnant individuals.
DIQ010
and gender
are now redundant due to the new variables.
Note that age
is top-coded at 80, which means participants aged 80 or older were all recorded as age 80. To avoid inaccurate data, we decided to set an age bound from 12 to 79.
dm = dm %>%
select(-pregnancy, -vig_work, -mod_work, -DIQ010, -gender) %>%
filter(age>=12 & age<=79)
dm$day = as.numeric(dm$day)
# remove rows with missing value
dm_final = dm[complete.cases(dm),]
dr1_final = dm_final[dm_final$day==1,]
dr2_final = dm_final[dm_final$day==2,]
# check if the response variable - kcal is normally distributed
# day1
ggplot(dr1_final, aes(x = kcal)) +
geom_histogram(aes(y = ..density..),
bins = 100,
fill = "blue",
alpha = 0.8) +
stat_function(fun = dnorm,
args = list(mean = mean(dr1_final$kcal),
sd = sd(dr1_final$kcal))) +
labs(title = "Histogram of kcal (Day 1)", x = "Total Energy (kcal)", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "transparent"))
# day2
ggplot(dr2_final, aes(x = kcal)) +
geom_histogram(aes(y = ..density..),
bins = 100,
fill = "blue",
alpha = 0.8) +
stat_function(fun = dnorm,
args = list(mean = mean(dr2_final$kcal),
sd = sd(dr2_final$kcal))) +
labs(title = "Histogram of kcal (Day 2)", x = "Total Energy (kcal)", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "transparent"))
We can also use QQ plot to examine the normality.
# qqplots for day 1 and day 2
dm_final %>%
ggplot(aes(sample = kcal)) +
stat_qq(color = "blue") +
stat_qq_line(color = "red") +
facet_grid(rows = vars(day)) +
labs(title = "QQ plot for kcal by day") +
theme(plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "transparent"))
Overall kcal
seems to be approximately normal with a longer right tail for both day1 and day2. There is not much concern about the normality.
In this case, no transformation would be needed.
# fit a linear regression model
model = dr1_final %>%
lm(formula = kcal ~ age + sed_act + bmi + factor(diabetes) + factor(male))
# check collinearity
summary(model)
##
## Call:
## lm(formula = kcal ~ age + sed_act + bmi + factor(diabetes) +
## factor(male), data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2291.7 -604.9 -110.7 452.9 6101.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1845.41325 55.12131 33.479 < 2e-16 ***
## age -3.88702 0.66705 -5.827 5.94e-09 ***
## sed_act -0.04048 0.01407 -2.876 0.00404 **
## bmi 5.45100 1.72730 3.156 0.00161 **
## factor(diabetes)1 -116.09300 37.89569 -3.063 0.00220 **
## factor(male)1 565.29386 24.26871 23.293 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 915.4 on 5746 degrees of freedom
## Multiple R-squared: 0.09506, Adjusted R-squared: 0.09427
## F-statistic: 120.7 on 5 and 5746 DF, p-value: < 2.2e-16
faraway::vif(model)
## age sed_act bmi factor(diabetes)1
## 1.172235 1.001537 1.098070 1.161806
## factor(male)1
## 1.010543
cor1 = cor(dr1_final[, c("age", "sed_act", "bmi", "diabetes", "male")])
cor1
## age sed_act bmi diabetes male
## age 1.000000000 -0.034440453 0.24274212 0.341759578 -0.009280101
## sed_act -0.034440453 1.000000000 -0.01079093 -0.009274661 -0.017471579
## bmi 0.242742118 -0.010790931 1.00000000 0.222468238 -0.088875419
## diabetes 0.341759578 -0.009274661 0.22246824 1.000000000 0.025974190
## male -0.009280101 -0.017471579 -0.08887542 0.025974190 1.000000000
corrplot(cor1,
type = "upper",
title = "Correlation plot of covariates",
mar = c(0, 0, 1, 0))
No collinearity found.
# fit a linear mixed model
model2 = dm_final %>%
lmer(formula = kcal ~ age + sed_act + bmi + factor(diabetes) + factor(male) + (1|SEQN))
summary(model2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: kcal ~ age + sed_act + bmi + factor(diabetes) + factor(male) +
## (1 | SEQN)
## Data: .
##
## REML criterion at convergence: 172309.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.0472 -0.5527 -0.1008 0.4318 6.9880
##
## Random effects:
## Groups Name Variance Std.Dev.
## SEQN (Intercept) 292949 541.2
## Residual 510041 714.2
## Number of obs: 10528, groups: SEQN, 5752
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 1808.46413 45.80575 39.481
## age -2.49956 0.55339 -4.517
## sed_act -0.03034 0.01184 -2.563
## bmi 2.31402 1.43038 1.618
## factor(diabetes)1 -101.15748 31.38359 -3.223
## factor(male)1 538.47663 20.17781 26.687
##
## Correlation of Fixed Effects:
## (Intr) age sed_ct bmi fctr(d)1
## age -0.317
## sed_act -0.143 0.034
## bmi -0.814 -0.179 0.003
## fctr(dbts)1 0.210 -0.305 -0.006 -0.156
## factor(ml)1 -0.300 0.002 0.015 0.095 -0.047
# Predicted total energy intake at population mean
me = ggpredict(model2, c("diabetes","male"))
plot(me, dodge = 0) +
labs(title = "Predicted total calories by diabetes and gender",
x = "Diabetes (0 = No, 1 = Yes)",
y = "Total Energy (kcal)") +
theme(plot.title = element_text(hjust = 0.5))
/******************************************************************************/
* Stats 506, Fall 2019 ;
* Group Project - Group 3 ;
*------------------------------------------------------------------------------;
* This script analyzes the question: ;
* "Do people diagnosed with diabetes consume fewer calories in US?" ;
* NHANES 2015-2016 data are used in this problem. ;
* The script has following parts to complete the task: ;
* A. Merge source data ;
* Merge information from: ;
* Demographics - subject ID, age, gender ;
* Dietary - subject ID, total calories (day 1), total calories (day 2) ;
* Examination - subject ID, BMI ;
* Questionnaire - subject ID, diabetes, pregnancy ;
* - vigorous/moderate/sedentary activity minutes ;
* B. Check missing patterns ;
* Check missing patterns of all variables ;
* Take a closer look at three physical activity variables ;
* C. Handle top-coded age ;
* NHANES data top-coded age at 80 and above. ;
* Choose a method to handle it appropriately. ;
* D. Transformation of response variable ;
* Examine the normality of response variable ;
* Determine whether a transformation is needed ;
* E. Initial model to examine collinearity - use Day 1 only ;
* Use day 1 data only to fit a linear regressio model ;
* Determine if multicollinearity issue exists ;
* If yes, handle if appropriately ;
* F. Linear mixed model to account for both day 1 and day 2 ;
* Fit a linear mixed model to include both days data ;
* Include a random intercept for each subject ;
*------------------------------------------------------------------------------;
* Author: Jie Cao (caojie@umich.edu) ;
* Last updated on: Dec 3, 2019 ;
/******************************************************************************/
* 80: -------------------------------------------------------------------------;
/* Data directory ----------------------------------------------------------- */
* List of XPT data downloaded from 2015 - 2016 NHANES;
libname DEMO xport "M:\506\data\XPT\DEMO_I.XPT";
libname DIQ xport "M:\506\data\XPT\DIQ_I.XPT";
libname PAQ xport "M:\506\data\XPT\PAQ_I.XPT";
libname DR1 xport "M:\506\data\XPT\DR1TOT_I.XPT";
libname DR2 xport "M:\506\data\XPT\DR2TOT_I.XPT";
libname BMX xport "M:\506\data\XPT\BMX_I.XPT";
* Directory to save datasets extracted from XPT files;
libname NH "M:\506\data\SAS";
/* Extract XPT files and save as SAS datasets ------------------------------- */
proc copy in = DEMO out = NH;
run;
proc copy in = DIQ out = NH;
run;
proc copy in = PAQ out = NH;
run;
proc copy in = DR1 out = NH;
run;
proc copy in = DR2 out = NH;
run;
proc copy in = BMX out = NH;
run;
/* Formats ------------------------------------------------------------------- */
proc format;
/* A format to flag numeric varibles with missing status */
value missfmt
. = "Missing"
other = "Non-missing";
/* Gender format */
value male
0 = "Female"
1 = "Male";
run;
/* Prepare working data ------------------------------------------------------ */
***************************;
* A. Merge source data ;
****************************
* Dietary day 1 interview plus needed variables;
proc sql;
create table dr1 as
select dm.SEQN as subject,
1 as day,
(case
/* DIQ010 = 2: doctor confirmed no diabetes */
when dm.DIQ010 = 2 then 0
/* DIQ010 = 1 or 3: doctor confirmed diabetes (1) or borderline (3) */
else 1
end) as diabetes,
demo.RIDAGEYR as ageyr ,
(case
when demo.RIAGENDR = 1 then 1
when demo.RIAGENDR = 2 then 0
else .
end) as male format male.,
bm.BMXBMI as bmi,
pa.PAD615 as vigorous_min,
pa.PAD630 as moderate_min,
pa.PAD680 as sedentary_min,
dr1.DR1TKCAL as tot_cal
from NH.Diq_i dm
left join NH.Demo_i demo on dm.SEQN = demo.SEQN
left join NH.Bmx_i bm on dm.SEQN = bm.SEQN
left join NH.Paq_i pa on dm.SEQN = pa.SEQN
left join NH.Dr1tot_i dr1 on dm.SEQN = dr1.SEQN
/* Include subjects with known answers to diabete diagnosis &
exclude participants confirmed with pregnancy */
where dm.DIQ010 in (1, 2, 3) and demo.RIDEXPRG NE 1;
quit;
NOTE: Table WORK.DR1 created, with 9501 rows and 10 columns.
* Dietary day 2 interview plus needed variables;
proc sql;
create table dr2 as
select dm.SEQN as subject,
2 as day,
(case
/* DIQ010 = 2: doctor confirmed no diabetes */
when dm.DIQ010 = 2 then 0
/* DIQ010 = 1 or 3: doctor confirmed diabetes (1) or borderline (3) */
else 1
end) as diabetes,
demo.RIDAGEYR as ageyr,
(case
when demo.RIAGENDR = 1 then 1
when demo.RIAGENDR = 2 then 0
else .
end) as male format male.,
bm.BMXBMI as bmi,
pa.PAD615 as vigorous_min,
pa.PAD630 as moderate_min,
pa.PAD680 as sedentary_min,
dr2.DR2TKCAL as tot_cal
from NH.Diq_i dm
left join NH.Demo_i demo on dm.SEQN = demo.SEQN
left join NH.Bmx_i bm on dm.SEQN = bm.SEQN
left join NH.Paq_i pa on dm.SEQN = pa.SEQN
left join NH.Dr2tot_i dr2 on dm.SEQN = dr2.SEQN
/* Include subjects with known answers to diabete diagnosis &
exclude participants confirmed with pregnancy */
where dm.DIQ010 in (1, 2, 3) and demo.RIDEXPRG NE 1;
quit;
NOTE: Table WORK.DR2 created, with 9501 rows and 10 columns.
* Combine two days' data;
proc sql;
create table dm_cal as
select *
from dr1
outer union corr
select *
from dr2
order by subject, day;
quit;
NOTE: Table WORK.DM_CAL created, with 19002 rows and 10 columns.
**********************************************;
* B. Check missing patterns ;
**********************************************;
* Add variables to count number of missing;
data check;
set dm_cal;
/* Number of missing among three physical activity minutes
for each participant */
mins_miss = cmiss(of vigorous_min--sedentary_min);
/* Number of missing among all (numeric) variables we may
potentially use for each participant */
nmiss = cmiss(of diabetes--tot_cal);
run;
* Tabulate number of missing;
proc sort data = check;
by day;
run;
proc freq data = check;
tables mins_miss nmiss;
by day;
run;
1064 (11.2%) have complete three physical activity minutes data, both day 1 and day 2;
979 (10.3%) have complete all numeric data for day 1;
801 (8.4%) have complete all numeric data for day 2;
* Check missing patterns for numeric variables we are interested in;
proc freq data = check;
format diabetes--tot_cal missfmt.;
tables diabetes--tot_cal / missing missprint nocum;
by day;
run;
No missing in diabetes (by nature of merging process), age, gender;
BMI: 818 (8.6%) missing in both day 1 and day 2;
Vigorous_min: 8156 (85.8%) missing in both day 1 and day 2;
Moderate_min: 6942 (73.1%) missing in both day 1 and day 2;
Sedentary_min: 2624 (27.6%) missing in both day 1 and day 2;
tot_cal: 1452 (15.3%) missing in day 1, 2862 (30.1%) in day 2;
Conclusion: we decide not to use vigorous activity minutes and moderate activity minutes due to large amount of missing.
********************************;
* C. Handle top-coded age ;
********************************;
/* NHANES top-coded age variable at 80 years, therefore, we consider recode age
into age groups. */
data dm_cal;
set dm_cal;
/* Restrict to age 12+ */
where 12 <= ageyr < 80;
run;
NOTE: There were 13172 observations read from the data set WORK.DM_CAL. WHERE (ageyr>=12 and ageyr<80);
NOTE: The data set WORK.DM_CAL has 13172 observations and 10 variables.
*******************************************
* D. Transformation of response variable ;
******************************************;
/* Examine the distribution of response variable */
proc univariate data = dm_cal;
var tot_cal;
histogram tot_cal / normal;
qqplot tot_cal / normal(mu = est sigma = est color = red l = 2);
class day;
run;
Conclusion: The response variable (total energy) approximately follows normal distribution, thus, no transformation is needed.
/* Modeling ----------------------------------------------------------------- */
************************************************************;
* E. Initial model to examine collinearity - use Day 1 only ;
************************************************************;
data dm_cal_final dr1_final dr2_final;
set dm_cal(drop = vigorous_min moderate_min);
/* Remove observations with missing in any of the variables*/
if cmiss(of _all_) then delete;
/* Output two day's, day 1 only and day 2 only data */
output dm_cal_final;
if day = 1 then output dr1_final;
if day = 2 then output dr2_final;
run;
NOTE: There were 13172 observations read from the data set WORK.DM_CAL.
NOTE: The data set WORK.DM_CAL_FINAL has 10528 observations and 8 variables.
NOTE: The data set WORK.DR1_FINAL has 5752 observations and 8 variables.
NOTE: The data set WORK.DR2_FINAL has 4776 observations and 8 variables.
/* Examine the correlation matrix */
proc corr data = dr1_final;
var diabetes ageyr male bmi sedentary_min;
run;
Above is the correlation matrix for the five predictors we picked. We want to check if any of the variables included have a high correlation - about 0.8 or higher - with any other variable. Upon review of this correlation matrix, there does not appear to be any pair of variable with a particular high correlation.
/* Multicollinearity Investigation of VIF */
proc reg data = dr1_final;
model tot_cal = diabetes ageyr male bmi sedentary_min / vif tol collin;
run;
Next, we fit an ordinary linear regression model. In the above results, we check Variance Inflation Factor (VIF) or Tolerance (1/VIF) for potential multicollinearity. VIF greater than 10 or Tolerance less than 0.1 should be warned if exists. Obviously, there is no treat of multicollinearity indicated in our model.
We further look at the collinearity diagnostics (eigenvalues and condition numbers). If one or more of the eigenvalues are small (close to zero) and the corresponding condition number being large, then we have an indication of multicollinearity. As we can see from the above results, none of our eigenvalues and condition index associations match this description.
Conclusion: No multicollinearity issue found in our model.
************************************************************;
* F. Linear mixed model to account for both day 1 and day 2 ;
************************************************************;
proc mixed data = dm_cal_final;
class diabetes(ref = "0") male;
model tot_cal = diabetes ageyr male bmi sedentary_min / s;
random intercept / subject = subject;
lsmeans diabetes male / at means;
run;
* 80: -------------------------------------------------------------------------;
Author: Wenjing Li
Last Update Date: Dec.11, 2019
This script analyzes the question:
“Do people diagnosed with diabetes consume less calories in US?”
NHANES 2015-2016 data are used in this problem.
import pandas as pd
from dfply import *
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import pylab
import scipy.stats as stats
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor
List of XPT files from NHANES 2015-2016 that we will use in this analysis.
DEMO_I - Demographics
DR1TOT_I - Dietary, Day 1
DR2TOT_I - Dietary, Day 2
BMX_I - Body Measusrement
DIQ_I - Diabetes
PAQ_I - Physical Activity
demo = pd.read_sas('../Data/DEMO_I.XPT')
dr1tot = pd.read_sas('../Data/DR1TOT_I.XPT')
dr2tot = pd.read_sas('../Data/DR2TOT_I.XPT')
bmx = pd.read_sas('../Data/BMX_I.XPT')
diq = pd.read_sas('../Data/DIQ_I.XPT')
pad = pd.read_sas('../Data/PAQ_I.XPT')
demo_new = (demo >>
select(X.SEQN, X.RIDAGEYR, X.RIAGENDR, X.RIDEXPRG))
dr1tot_new = (dr1tot >>
mutate(dr1 = X.DR1TKCAL) >>
select(X.SEQN, X.dr1))
dr2tot_new = (dr2tot >>
mutate(dr2 = X.DR2TKCAL) >>
select(X.SEQN, X.dr2))
dietary = pd.merge(dr1tot_new, dr2tot_new, on='SEQN', how='outer')
bmx_new = (bmx >>
select(X.SEQN, X.BMXBMI))
diq_new = (diq >>
select(X.SEQN, X.DIQ010))
pad_new = (pad >>
select(X.SEQN, X.PAD615, X.PAD630, X.PAD680))
questionnaire = pd.merge(diq_new, pad_new, on='SEQN', how='outer')
merge_2 = pd.merge(bmx_new, questionnaire, on='SEQN', how='outer')
merge_3 = pd.merge(merge_2, demo_new, on='SEQN', how='outer')
Include data with confirmed diabetes diagnosis
# Better to specify the values we want to include, instead of exclusion
diabetes_in = [1, 2, 3]
merge_3 = merge_3[merge_3.DIQ010.isin(diabetes_in)]
Exclude pregnant participants
merge_3 = merge_3[merge_3.RIDEXPRG != 1 ]
Include participants aged 12-79 years old
NHANES top-coded age at 80, which means participants aged 80 or older are all coded as 80.
Hence, we decide to only look at adolescents greater than 12 years old) and adults less than 80 years old.
merge_3 = merge_3[merge_3.RIDAGEYR < 80]
merge_3 = merge_3[merge_3.RIDAGEYR >= 12]
merge_3.shape
(6586, 9)
Binary diabetes variable:
1 - diabetes or borderline
0 - non-diabetes
diabete = []
for i in merge_3.DIQ010:
if i == 2:
diabete.append(0)
else:
diabete.append(1)
merge_3['diabetes'] = diabete
Binary male variable:
1 - male
0 - female
male = []
for i in merge_3.RIAGENDR:
if i == 1:
male.append(1)
else:
male.append(0)
merge_3['male'] = male
Dietary intervew day variable:
1 - interview day 1
2 - interview day 2
merge = pd.merge(merge_3, dietary, on='SEQN', how='left')
# pivot the data into "long" format
merge = merge.melt(id_vars=merge.iloc[:,0:11], var_name='day', value_name='y')
# create a variable representing dietary interview day (1/2)
days = []
for i in merge.day:
if i == 'dr1':
days.append(1)
else:
days.append(2)
merge['day'] = days
# remove rows with missing value
final = merge.drop(["RIDEXPRG", "RIAGENDR", "PAD615", "PAD630", "DIQ010"], axis=1)
# remove rows with any missing value
final = final.dropna()
final_day1 = final[final["day"] == 1]
final_day2 = final[final["day"] == 2]
final.shape
(10528, 8)
final_day1.shape
(5752, 8)
final_day2.shape
(4776, 8)
Day 1 Total Energy (kcal)
plt.hist(final_day1["y"], bins=100, density=1)
plt.title('Density Distribution of Total Energy (day 1)')
plt.xlabel('Energy (kcal)')
plt.ylabel('Frequency')
plt.savefig('./Figs/hist_day1.png')
plt.show()
Day 2 Total Energy (kcal)
plt.hist(final_day2["y"], bins=100)
plt.title('Density Distribution of Total Energy (day 2)')
plt.xlabel('Energy (kcal)')
plt.ylabel('Frequency')
plt.savefig('./Figs/hist_day2.png')
plt.show()
We can also use QQ-Plot to check the normality of the response variable.
plt.subplot(211)
# day 1 total energy
stats.probplot(final_day1["y"], dist="norm", plot=pylab)
plt.xlabel('')
plt.ylabel('Sample quantiles')
plt.title('Normal Q-Q Plot')
plt.tick_params(axis='x', bottom=False, labelbottom=False)
plt.subplot(212)
# day 2 total energy
stats.probplot(final_day2["y"], dist="norm", plot=pylab)
plt.ylabel('Sample quantiles')
plt.title('')
plt.savefig('./Figs/qq_plot.png')
pylab.show()
Conclusion: Approximately normal - no need to transform the response variable.
Correlation matrix of predictors
predictors = final_day1[final_day1.columns[1:6]]
predictors.corr()
BMXBMI | PAD680 | RIDAGEYR | diabetes | male | |
---|---|---|---|---|---|
BMXBMI | 1.000000 | -0.010791 | 0.242742 | 0.222468 | -0.088875 |
PAD680 | -0.010791 | 1.000000 | -0.034440 | -0.009275 | -0.017472 |
RIDAGEYR | 0.242742 | -0.034440 | 1.000000 | 0.341760 | -0.009280 |
diabetes | 0.222468 | -0.009275 | 0.341760 | 1.000000 | 0.025974 |
male | -0.088875 | -0.017472 | -0.009280 | 0.025974 | 1.000000 |
OLS model for day 1
x = sm.add_constant(final_day1[final_day1.columns[1:6]])
model = sm.OLS(final_day1['y'], x)
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.095
Model: OLS Adj. R-squared: 0.094
Method: Least Squares F-statistic: 120.7
Date: Thu, 12 Dec 2019 Prob (F-statistic): 8.08e-122
Time: 00:07:27 Log-Likelihood: -47383.
No. Observations: 5752 AIC: 9.478e+04
Df Residuals: 5746 BIC: 9.482e+04
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1845.4133 55.121 33.479 0.000 1737.355 1953.472
BMXBMI 5.4510 1.727 3.156 0.002 2.065 8.837
PAD680 -0.0405 0.014 -2.876 0.004 -0.068 -0.013
RIDAGEYR -3.8870 0.667 -5.827 0.000 -5.195 -2.579
diabetes -116.0930 37.896 -3.063 0.002 -190.383 -41.803
male 565.2939 24.269 23.293 0.000 517.718 612.870
==============================================================================
Omnibus: 1368.319 Durbin-Watson: 2.034
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4401.805
Skew: 1.199 Prob(JB): 0.00
Kurtosis: 6.552 Cond. No. 4.59e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.59e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
C:\Users\wenji\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\fromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
return ptp(axis=axis, out=out, **kwargs)
Variance Infation Factor (VIF)
# Break into left and right hand side; y and X
y, X = dmatrices("y ~ BMXBMI + PAD680 + RIDAGEYR + diabetes + male", data=final_day1, return_type="dataframe")
# For each Xi, calculate VIF
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif[1:] # No need to look at VIF for intercept
[1.0980699620399632, 1.00153717004665, 1.172234696954212, 1.1618062474485054, 1.010543313776021]
Conclusion: No collinearity issue.
# LMM model
mixed = smf.mixedlm("y ~ BMXBMI + PAD680 + RIDAGEYR + diabetes + male", final, groups = final["SEQN"])
mixed_fit = mixed.fit()
#print the summary
print(mixed_fit.summary())
Mixed Linear Model Regression Results
============================================================
Model: MixedLM Dependent Variable: y
No. Observations: 10528 Method: REML
No. Groups: 5752 Scale: 510040.7789
Min. group size: 1 Likelihood: -86154.8654
Max. group size: 2 Converged: Yes
Mean group size: 1.8
------------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
------------------------------------------------------------
Intercept 1808.464 45.807 39.480 0.000 1718.684 1898.244
BMXBMI 2.314 1.430 1.618 0.106 -0.489 5.118
PAD680 -0.030 0.012 -2.563 0.010 -0.054 -0.007
RIDAGEYR -2.500 0.553 -4.517 0.000 -3.584 -1.415
diabetes -101.157 31.384 -3.223 0.001 -162.669 -39.646
male 538.477 20.178 26.686 0.000 498.928 578.025
Group Var 292948.651 22.660
============================================================
Prediction at population mean, by diabetes and gender
# prediction for person who has diabetes
diabetes = 1
## for man
male = 1
pre_1 = 1808.464+2.314*mean(final['BMXBMI'])-0.030*mean(final['PAD680'])-2.500*mean(final['RIDAGEYR'])- \
101.157*diabetes+538.477*male
## for woman
male = 0
pre_2 = 1808.464+2.314*mean(final['BMXBMI'])-0.030*mean(final['PAD680'])-2.500*mean(final['RIDAGEYR'])- \
101.157*diabetes+538.477*male
# prediction for person who doesn't have diabetes
diabetes = 0
## for man
male = 1
pre_3 = 1808.464+2.314*mean(final['BMXBMI'])-0.030*mean(final['PAD680'])-2.500*mean(final['RIDAGEYR'])- \
101.157*diabetes+538.477*male
## for woman
male = 0
pre_4 = 1808.464+2.314*mean(final['BMXBMI'])-0.030*mean(final['PAD680'])-2.500*mean(final['RIDAGEYR'])- \
101.157*diabetes+538.477*male
print('In the population mean, the estimated calorie intake for men diagnosed with diabetes is :', round(pre_1, 2),'\n' \
'In the population mean, the estimated calorie intake for women diagnosed with diabetes is :', round(pre_2, 2),'\n' \
'In the population mean, the estimated calorie intake for men without diabetes is :', round(pre_3, 2),'\n' \
'In the population mean, the estimated calorie intake for women without diabetes is :', round(pre_4, 2))
In the population mean, the estimated calorie intake for men diagnosed with diabetes is : 2194.44
In the population mean, the estimated calorie intake for women diagnosed with diabetes is : 1655.97
In the population mean, the estimated calorie intake for men without diabetes is : 2295.6
In the population mean, the estimated calorie intake for women without diabetes is : 1757.12
All three analysis tools we used (R-lmer4, SAS-proc mixed, Python-statsmodels) gave the same results. The estimates, 95% confidence intervals (CI) and p-values for the fixed effects are summarized in Table 1.
Covariate | Estimate | 95% CI | p-value |
---|---|---|---|
(Intercept) | 1808.46 | (1718.68, 1898.24) | <0.0001 |
Age (yr) | -2.50 | (-3.58, -1.41) | <0.0001 |
Sedentary Activity (minutes) | -0.03 | (-0.05, -0.01) | 0.0104 |
BMI (kg/m^2) | 2.31 | (-0.49, 5.12) | 0.1058 |
Diabetes | -101.16 | (-162.67, -39.65) | 0.0013 |
Male | 538.48 | (498.93, 578.03) | <0.0001 |
Our LMM showed that the predictor variables of interest are all statistically significant at level of 0.05, except BMI, suggesting that these predictors, including diabetes status, have a significant effect on predicting the response variable - total daily consumption of calories. An individual with diabetes is predicted to consume 101 fewer calories compared to a non-diabetes individual, while other covariates are held the same.
The standard deviation for each covariance component (subject and error) are shown in the Table 2.
Variance Component | Standard Deviation |
---|---|
Subject | 541.25 |
Error | 714.17 |
Marginal effect measures how the response variable changes when a specific predictor changes, while other covariates are assumed to be held constant.
We used a continuous response variable in our LMM model, therefore, the marginal effects are equal to the beta coefficients we obtained from the model summary. Marginal effects for continuous covariates measure the instantaneous rate of change, i.e. an approximation of the amount of change in total calorie intake produced by one-unit change in age/BMI/Sedentary activity in our model. Marginal effects for binary covariates (diabetes and male in our model) measure discreate change, i.e. how many total calorie consumption change as the diabetes status changes no to yes, or gender changes from female to male.
Since marginal effects for discrete variables are more meaningful, and we are interested in the effect of diabetes on total calorie intake, we present predicted total calorie intake by diabetes status and gender.
First, we present the population means of the continuous covariates:Measure | Age (yr) | BMI (kg/m^2) | Sedentary Activity (minute) |
---|---|---|---|
Mean | 41.6 | 28.78 | 464.65 |
We also provide predicted consumption of daily total calories by diabetes and gender in the below table and figure.
Diabetes Status | Female | Male |
---|---|---|
Non-diabetes | 1756.99 (1728.35, 1785.62) | 2295.46 (2265.76, 2325.16) |
Diabetes | 1655.83 (1595.28, 1716.38) | 2194.31 (2135.14, 2253.47) |
Conclusion: People diagnosed with diabetes consume fewer (~101 kcal) calories in the U.S. compared to non-diabetes individuals, if controlled for gender, age, BMI and sedentary activity.
Our conclusion indicates that diabetic individuals in the U.S generally follow the guideline to reduce their calorie consumption, while taking age, gender, BMI and sedentary activity into account. Further analyses could be done by controlling for more potential confounders and bring in sampling weights to better estimate the U.S. population distribution.
We found an interesting warning message in Python analysis section, when we were examining multicollinearity by fitting a simple linear regression on day 1 data only.
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.59e+03. This might indicate that there are strong multicollinearity or other numerical problems.
This is in contrast to the conclusion in R and SAS analysis sections. The correlation matrix, VIFs and eigenvalues/condition numbers inspection all agree with each other that no multicollinearity issue is found. Our Python analysis gave the same correlation matrix and VIFs as those where done in R and SAS. Therefore, we suspect Python package statsmodels
uses a different strategy to obtain condition numbers. Interestingly, we found a similar question raised on StackExchange. There are also two open issues (#553 and #2378) on statsmodels’ Github page. It seems that statsmodels
report condition number as a ratio of largest to smallest eigenvalue, rather than one condition index for one eigenvalue reported by R or SAS. Hence, the high condition number may be due to scaling, not a true indicator of multicollinearity.
Another interesting finding in our report is the difference in prediction of total calorie intake between R ggpredict
function and SAS lsmeans
in proc mixed
. We intended to predict the total calorie intake by diabetes status and gender, while holding other continuous covariates at population means. We did not find a good tool to generate this prediction together with standard error/confidence interval automatically in Python, other than manual calculation. Hence, we only report findings from R and SAS, and used Python manual calculation as a verification. We have summarised our explanation below:
Both methods gave the same population means (as shown in Table 3) for age, BMI and sedentary activity. The difference is that:
R ggpredict
then predicts the total calorie intake for the 4 scenarios by taking diabetes and gender into account together, i.e. predict the total calorie intake when a. non-diabetes (diabetes = 0), female (male = 0), b. non-diabetes (diabetes = 0), male (male = 1), c. diabetes (diabetes = 1), female (male = 0), and d. diabetes (diabetes = 1), male (male = 1). We used the same strategy and manually calculated the prediction in Python to verify the results.
SAS lsmeans
predicts the total calorie intake for the 4 scenarios by counting for diabetes or gender separately over a balanced population, i.e. predict the total calorie intake when a. non-diabetes (diabetes = 0) with half female and half male, b. diabetes (diabetes = 1) with half female and half male, c. female (male = 0) with half diabetes and half non-diabetes, and d. male (male = 1) with half diabetes and half non-diabetes.
While it is close to the truth that the gender distribution is balanced, it is unreasonable to assume balanced diabetes status in the population. Hence, we decide to report R ggpredict
results in our Table 4.
Thoughts: Based on two discussions we have above, in our example, R and SAS appear to be more powerful tools in statistical analysis than Python. It’s likely because Python was developed for general-purpose, while both R and SAS were developed specifically for statistical computing.
[1] NIDDK. Diabetes Statistics. https://www.niddk.nih.gov/health-information/health-statistics/diabetes-statistics. Accessed 12/11/2019.
[2] American Diabetes Association. Economic costs of diabetes in the US in 2017. Diabetes care 41, no. 5 (2018): 917-928.
[3] CDC. 4 Steps to Manage Your Diabetes. https://www.cdc.gov/diabetes/ndep/pdfs/4steps/4Steps-English.pdf. Accessed 12/11/2019.
[4] UCLA Institute for Digital Research & Education. Introduction to Linear Mixed Models. https://stats.idre.ucla.edu/other/mult-pkg/introduction-to-linear-mixed-models/. Accessed 12/11/2019.
[5] HIPPA. Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html. Accessed 12/11/2019.
[6]. R Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org
[7] SAS Institute Inc. SAS® 9.4. Cary, NC: SAS Institute Inc; 2019.
[8] Python Software Foundation. Python Language Reference, version 3.6. http://www.python.org
[9] Douglas Bates and Martin Machler and Ben Bolker and Steve Walker. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software 67(1), 1-48.
[10] SAS Institute Inc. 2017. SAS/STAT® 14.3 User’s Guide. Cary, NC: SAS Institute Inc.
[11] Seabold, Skipper, and Josef Perktold. “Statsmodels: Econometric and statistical modeling with python.” Proceedings of the 9th Python in Science Conference. 2010.