[1] "595 (95% CI 538, 652; p<0.001)"
Characteristic | N | Beta | 95% CI¹ | p-value |
---|---|---|---|---|
age_bir | 4,773 | 595 | 538, 652 | <0.001 |
sex_cat | 10,195 | | | |
    Male | | — | — | |
    Female | | -358 | -844, 128 | 0.15 |
race_eth_cat | 10,195 | | | |
    Hispanic | | — | — | |
    Black | | -1,747 | -2,507, -988 | <0.001 |
    Non-Black, Non-Hispanic | | 3,863 | 3,195, 4,530 | <0.001 |
eyesight_cat | 6,789 | | | |
    Excellent | | — | — | |
    Very good | | -578 | -1,319, 162 | 0.13 |
    Good | | -1,863 | -2,719, -1,006 | <0.001 |
    Fair | | -4,674 | -5,910, -3,439 | <0.001 |
    Poor | | -6,647 | -9,154, -4,140 | <0.001 |

¹ CI = Confidence Interval
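For reference, a table of univariate regressions like this one can be produced with {gtsummary}; this is just a sketch of one possible approach (assuming the nlsy data used throughout these slides), not necessarily the exact code behind the table above:

library(gtsummary)

# Sketch: univariate linear regressions of income on each characteristic
tbl_uvregression(
  nlsy,
  y = income,
  include = c(age_bir, sex_cat, race_eth_cat, eyesight_cat),
  method = lm
)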
We might want to dig in a little more to those regressions. One option in {gtsummary} is to extract data from the table directly (we'll look at this again later!). Behind the scenes, {gtsummary} is using the {broom} package to extract the statistics from the various models. We could look at the model summary:
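Assuming the income ~ sex_cat model object is named mod_sex_cat (as it is later in these slides), the output below comes from:

# base R summary of lm(income ~ sex_cat, data = nlsy)
summary(mod_sex_cat)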
Call:
lm(formula = income ~ sex_cat, data = nlsy)
Residuals:
Min 1Q Median 3Q Max
-14880 -8880 -3943 5477 60478
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14880.3 172.6 86.237 <2e-16 ***
sex_catFemale -357.8 247.8 -1.444 0.149
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12510 on 10193 degrees of freedom
(2491 observations deleted due to missingness)
Multiple R-squared: 0.0002044, Adjusted R-squared: 0.0001064
F-statistic: 2.084 on 1 and 10193 DF, p-value: 0.1488
If we want to do something with the various values, we could extract each statistic individually:
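For a model object named mod_sex_cat, the four pieces of output below would come from calls like these:

# coefficient estimates
coef(mod_sex_cat)

# 95% confidence intervals
confint(mod_sex_cat)

# R-squared
summary(mod_sex_cat)$r.squared

# estimate, standard error, t value, and p-value for each term
summary(mod_sex_cat)$coefficients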
(Intercept) sex_catFemale
14880.3152 -357.8029
2.5 % 97.5 %
(Intercept) 14542.079 15218.5512
sex_catFemale -843.608 128.0022
[1] 0.0002044429
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14880.3152 172.5521 86.236672 0.0000000
sex_catFemale -357.8029 247.8349 -1.443715 0.1488499
{broom} has three main functions: augment(), glance(), and tidy().

augment() adds fitted values, residuals, and other statistics to the original data.
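For example, applied to the income ~ sex_cat model (again assuming the object name mod_sex_cat), a call like this returns the tibble below:

library(broom)

# one row per observation used in the model, with fitted values, residuals,
# leverage (.hat), Cook's distance, and standardized residuals
augment(mod_sex_cat)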
# A tibble: 10,195 × 9
.rownames income sex_cat .fitted .resid .hat .sigma .cooksd .std.resid
<chr> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 30000 Female 14523. 15477. 0.000202 12506. 1.55e-4 1.24
2 2 20000 Female 14523. 5477. 0.000202 12507. 1.94e-5 0.438
3 3 22390 Female 14523. 7867. 0.000202 12507. 4.01e-5 0.629
4 4 22390 Female 14523. 7867. 0.000202 12507. 4.01e-5 0.629
5 5 36000 Male 14880. 21120. 0.000190 12505. 2.72e-4 1.69
6 6 35000 Male 14880. 20120. 0.000190 12505. 2.46e-4 1.61
7 7 8502 Male 14880. -6378. 0.000190 12507. 2.48e-5 -0.510
8 8 7227 Female 14523. -7296. 0.000202 12507. 3.44e-5 -0.583
9 9 17000 Male 14880. 2120. 0.000190 12507. 2.74e-6 0.170
10 10 3548 Female 14523. -10975. 0.000202 12506. 7.79e-5 -0.878
# ℹ 10,185 more rows
glance() creates a table of statistics that pertain to the entire model.
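No glance() output is shown here, but (assuming mod_sex_cat) a call like this would return a one-row tibble with model-level statistics such as R-squared, the F statistic, AIC, and BIC:

# one row of whole-model statistics
glance(mod_sex_cat)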
tidy() is the most useful to me, and probably to you! It extracts coefficients and confidence intervals from models.

tidy() works on over 100 statistical methods in R: ANOVA, ARIMA, Cox, factor analysis, fixed effects, GAM, GEE, IV, kappa, k-means, multinomial, proportional odds, principal components, survey methods, …
For example, we might want exponentiated coefficients:
logistic_model <- glm(glasses ~ eyesight_cat + sex_cat + income,
                      data = nlsy, family = binomial())
tidy(logistic_model, conf.int = TRUE, exponentiate = TRUE)
# A tibble: 7 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.499 5.96e-2 -11.7 1.74e-31 0.444 0.560
2 eyesight_catVery good 0.920 5.96e-2 -1.39 1.64e- 1 0.819 1.03
3 eyesight_catGood 0.916 6.91e-2 -1.27 2.04e- 1 0.800 1.05
4 eyesight_catFair 0.802 1.00e-1 -2.20 2.77e- 2 0.658 0.976
5 eyesight_catPoor 1.03 2.01e-1 0.147 8.83e- 1 0.694 1.53
6 sex_catFemale 2.04 5.00e-2 14.2 5.46e-46 1.85 2.25
7 income 1.00 1.93e-6 7.49 6.95e-14 1.00 1.00
# we already made mod_sex_cat
mod_race_eth_cat <- lm(income ~ race_eth_cat, data = nlsy)
mod_eyesight_cat <- lm(income ~ eyesight_cat, data = nlsy)
mod_age_bir <- lm(income ~ age_bir, data = nlsy)
tidy_sex_cat <- tidy(mod_sex_cat, conf.int = TRUE)
tidy_race_eth_cat <- tidy(mod_race_eth_cat, conf.int = TRUE)
tidy_eyesight_cat <- tidy(mod_eyesight_cat, conf.int = TRUE)
tidy_age_bir <- tidy(mod_age_bir, conf.int = TRUE)
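The stacked 12-row table below, with its model column, can be built by binding the tidied results together; one way (a sketch, not necessarily the original code) is dplyr::bind_rows() with an .id column:

# stack the four tidied outputs, labeling each with its model name
dplyr::bind_rows(
  sex_cat = tidy_sex_cat,
  race_eth_cat = tidy_race_eth_cat,
  eyesight_cat = tidy_eyesight_cat,
  age_bir = tidy_age_bir,
  .id = "model"
)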
# A tibble: 12 × 8
model term estimate std.error statistic p.value conf.low conf.high
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 sex_cat (Inter… 14880. 173. 86.2 0 14542. 15219.
2 sex_cat Female -358. 248. -1.44 1.49e- 1 -844. 128.
3 race_eth_cat (Inter… 12867. 302. 42.7 0 12276. 13459.
4 race_eth_cat Black -1747. 387. -4.51 6.58e- 6 -2507. -988.
5 race_eth_cat Non-Bl… 3863. 341. 11.3 1.20e-29 3195. 4530.
6 eyesight_cat (Inter… 17683. 270. 65.6 0 17155. 18212.
7 eyesight_cat Very g… -578. 378. -1.53 1.26e- 1 -1319. 162.
8 eyesight_cat Good -1863. 437. -4.26 2.05e- 5 -2719. -1006.
9 eyesight_cat Fair -4674. 630. -7.42 1.35e-13 -5910. -3439.
10 eyesight_cat Poor -6647. 1279. -5.20 2.07e- 7 -9154. -4140.
11 age_bir (Inter… 1707. 733. 2.33 1.99e- 2 270. 3143.
12 age_bir age_bir 595. 29.1 20.4 3.71e-89 538. 652.
We could instead clean up the names and add reference rows with the {tidycat} package:
tidy(logistic_model, conf.int = TRUE, exponentiate = TRUE) |>
  tidycat::tidy_categorical(logistic_model, exponentiate = TRUE) |>
  dplyr::select(-c(3:5))
# A tibble: 9 × 8
term estimate conf.low conf.high variable level effect reference
<chr> <dbl> <dbl> <dbl> <chr> <fct> <chr> <chr>
1 (Intercept) 0.499 0.444 0.560 (Interc… (Int… main Non-Base…
2 <NA> 1 1 1 eyesigh… Exce… main Baseline…
3 eyesight_catVery … 0.920 0.819 1.03 eyesigh… Very… main Non-Base…
4 eyesight_catGood 0.916 0.800 1.05 eyesigh… Good main Non-Base…
5 eyesight_catFair 0.802 0.658 0.976 eyesigh… Fair main Non-Base…
6 eyesight_catPoor 1.03 0.694 1.53 eyesigh… Poor main Non-Base…
7 <NA> 1 1 1 sex_cat Male main Baseline…
8 sex_catFemale 2.04 1.85 2.25 sex_cat Fema… main Non-Base…
9 income 1.00 1.00 1.00 income inco… main Non-Base…
library(ggplot2)

tidy(logistic_model, conf.int = TRUE, exponentiate = TRUE) |>
  tidycat::tidy_categorical(logistic_model, exponentiate = TRUE) |>
  dplyr::slice(-1) |> # remove intercept
  ggplot(mapping = aes(x = level, y = estimate,
                       ymin = conf.low, ymax = conf.high)) +
  geom_point() +
  geom_errorbar() +
  facet_grid(cols = vars(variable), scales = "free", space = "free") +
  scale_y_log10()
Use broom::tidy() to extract the results of the Poisson regression with robust standard errors and combine them with the results of the log-binomial regression.
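A minimal sketch of the combining step, assuming the two fits live in objects named mod_poisson_robust and mod_log_binomial (hypothetical names; use whatever objects hold your Poisson-with-robust-SE and log-binomial results, provided {broom} has a tidy() method that accepts these arguments for them):

# hypothetical object names; follows the bind_rows() pattern used earlier
dplyr::bind_rows(
  poisson = broom::tidy(mod_poisson_robust, conf.int = TRUE, exponentiate = TRUE),
  log_binomial = broom::tidy(mod_log_binomial, conf.int = TRUE, exponentiate = TRUE),
  .id = "model"
)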