Submission Instructions.
- At this point you should be using your cleaned version of the book
data
- Submit your draft PDF to the Google Drive:
Topic 02: Logistic regression and classification/Draft
folder by the due date.
- Read the question carefully. Some book questions require you to use
appropriate variable selection techniques.
Part I: Logistic Regression
- Playing with odds: PMA6 12.1
- Logistic Regression modeling: PMA6 12.7, 12.8
- Compare the models in part 2 for acute vs chronic illness.
- Are the measures that are important in explaining the outcome
similar?
- For covariates that are the same, compare the effect of that
covariate on each of the outcomes.
- Northridge Earthquake: (PMA6 12.22-12.29 modified). This problem use
the Northridge earthquake data set. [See
here for more info about this event.] We are interested in knowing
if homeowners (
V449
) were more likely than renters to
report emotional injuries (W238
) as a result of the
Northridge earthquake, controlling for age (RAGE
), gender
(RSEX
), and ethnicity (NEWETHN
).
- Download and use my cleaning script as
a starter.
- Fit a logistic regression model on emotional injury as the outcome
using the variables listed above as predictors. Use
tbl_regression
to create a nice table of results that
include odds ratios and 95% confidence intervals. See if you can figure
out how to drop the intercept term from the table.
- Interpret each predictor (except the intercept) in context of the
problem.
- Are the estimated effects of home ownership upon reporting emotional
injuries different for men and women, controlling for age and ethnicity?
That is, is there a significant interaction effect between gender and
home ownership?
Part II: Binary Classification
Playing Hookey (PMA6 12.18-12.19 modified)
- Perform a binary logistic regression analysis using the Parental HIV
data to model the probability of having been absent from school without
a reason (variable
HOOKEY
). Find the variables that best
predict whether an adolescent had been absent without a reason or not.
Use a hefty dose of common sense here, not all
variables are reasonable to use (e.g. using the # of times a student
skips school to predict whether or not they will predict school)
- Use the default value for the
predict()
function to
create a vector of predictions for each student.
- Explore the distribution of predictions against a few variables that
you identified (via the model) as being highly predictive of skipping
school
- Create a confusion matrix for these predictions and interpret:
accuracy, balanced accuracy, sensitivity, specificity, PPV, NPV.