Introduction: Just tell me “the” effect

Oftentimes stakeholders are interested in whether something has a positive or negative effect and if so, by how much. When we use a logistic regression to model a binary outcome, communicating these results is not so straightforward. To review, the formula for the logistic regression model is the following:

\[ \mbox{logit}(p) = \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1x_1 + \ldots + \beta_nx_n \]

In the equation, p is the probability of the outcome, each of the \(x_i\)s is a variable we are using to predict the outcome, and each \(\beta\) coefficient represents the increase or decrease in the outcome associated with that variable. The equation tells us that each of the \(\beta\)s is on the log odds scale. This is precisely one of the reasons why it is difficult to convey the magnitude of effect for a logistic regression model. If everyone knew what a log odds was and how to convey measurements on the log odds scale, then interpreting logistic regression results would be a walk in the park. But for most of us, the log odds is not a unit of measurement we understand. A scale that we do understand, however, is the probability scale, which ranges from 0 to 1. Fortunately, we can go from the log odds scale to the probability scale by rearranging the equation above and solving for the probability p:

\[ \begin{aligned} p & = \mbox{logit}^{-1}(\beta_0 + \beta_1x_1 + \ldots + \beta_nx_n) \\ & \\ & = \frac{e^{\beta_0 + \beta_1x_1 + \ldots + \beta_nx_n}}{1 + e^{\beta_0 + \beta_1x_1 + \ldots + \beta_nx_n}} \end{aligned} \]

If we graph the equation we get a sigmoid-shaped curve:
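For example, here is a minimal sketch that draws the curve, writing z for the linear predictor \(\beta_0 + \beta_1x_1 + \ldots + \beta_nx_n\) and using base R's plogis() as the inverse logit:

library(ggplot2)

sigmoid.df = data.frame(z = seq(-10, 10, 0.1))
sigmoid.df$p = plogis(sigmoid.df$z)  # p = e^z / (1 + e^z)

ggplot(sigmoid.df, aes(x = z, y = p)) +
  geom_line() +
  labs(x = "z", y = "Probability p")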

The slope of the curve is not constant: a one-unit increase in z (the linear predictor \(\beta_0 + \beta_1x_1 + \ldots + \beta_nx_n\)) can result in vastly different changes in the probability p. Where the slope of the curve is relatively flat, for example from z = -5 to z = -2.5, the difference in probability from a one-unit increase is very small. On the other hand, where the slope of the curve is steep, for example from z = 0 to z = 2.5, the difference in probability from a one-unit increase is large.

The non-linear nature of the curve means that the magnitude of effect in percentage points varies. If the effect varies, then what is “the” effect that we report, particularly to a stakeholder who does not know what a logistic regression is? In this discussion we will explore various visualization options for presenting logistic regression results to non-technical audiences, along with the pros and cons of each option. We will also discuss the situations in which you might choose one visualization over another.

Sample dataset and model

Dataset

Our simulated dataset describes students who took Balloon Animal-Making 201 at University Imaginary. It is available in data/course_outcomes.csv (included in the GitHub repository).

library(tidyverse)
df = read.csv("data/course_outcomes.csv", header = T, stringsAsFactors = F)

The dataset contains the following variables:

| Variable | Possible responses | Variable type |
|---|---|---|
| Mac user | TRUE/FALSE | binary |
| Wear glasses | TRUE/FALSE | binary |
| Pet type | dog, cat, fish, none | categorical |
| Favorite color | blue, red, green, orange | categorical |
| Prior undergraduate GPA | 0.0-4.0 | continuous |
| Height | 54-77 inches | continuous |
| Went to tutoring | TRUE/FALSE | binary |
| Passed | TRUE/FALSE | binary |

We centered and standardized the continuous variables:

df = df %>%
  mutate(cs.prior.gpa = (prior.gpa - mean(prior.gpa)) / sd(prior.gpa),
         cs.height = (height - mean(height)) / sd(height))

We set the reference levels of the categorical variables (pet type and favorite color) to no pet and blue, respectively:

df = df %>%
  mutate(pet.type = fct_relevel(pet.type, "none", "dog", "cat", "fish"),
         favorite.color = fct_relevel(favorite.color, "blue", "red", "green",
                                      "orange"))

Model

We built a logistic regression model to determine what variables are associated with a student passing the Balloon Animal-Making course.

pass.m = glm(passed ~ mac + glasses + pet.type + favorite.color + cs.prior.gpa +
               cs.height + tutoring,
             data = df, family = binomial(link = "logit"))
summary(pass.m)$coefficients
##                         Estimate Std. Error    z value      Pr(>|z|)
## (Intercept)           1.53678432 0.09665694 15.8993682  6.400698e-57
## macTRUE              -0.04070026 0.08251424 -0.4932514  6.218350e-01
## glassesTRUE           0.19330654 0.07787099  2.4823948  1.305026e-02
## pet.typedog          -0.25143138 0.08483778 -2.9636722  3.039919e-03
## pet.typecat           0.09616174 0.11927784  0.8061995  4.201278e-01
## pet.typefish         -1.19359401 0.16656361 -7.1659949  7.722363e-13
## favorite.colorred    -0.03945396 0.09265674 -0.4258078  6.702479e-01
## favorite.colorgreen  -0.38137532 0.10062190 -3.7901819  1.505370e-04
## favorite.colororange -0.24204783 0.13900517 -1.7412865  8.163337e-02
## cs.prior.gpa          1.03092175 0.03887172 26.5211237 5.531945e-155
## cs.height            -0.25908893 0.03833829 -6.7579681  1.399404e-11
## tutoringTRUE          0.22698497 0.07583279  2.9932300  2.760416e-03

We pulled out the model coefficients and created a dataframe. This dataframe will be used throughout our visualization-exploration journey.

coefs.df = summary(pass.m)$coefficients %>%
  data.frame() %>%
  rownames_to_column("parameter") %>%
  mutate(pretty.parameter =
           case_when(parameter == "(Intercept)" ~ "Intercept",
                     grepl("TRUE$", parameter) ~
                       str_to_title(gsub("TRUE", "", parameter)),
                     grepl("pet\\.type", parameter) ~
                       paste("Pet:", str_to_title(gsub("pet\\.type", "", parameter))),
                     grepl("favorite\\.color", parameter) ~
                       paste("Favorite color:",
                             str_to_title(gsub("favorite\\.color", "", parameter))),
                     parameter == "cs.prior.gpa" ~
                       paste("Prior GPA\n(", round(sd(df$prior.gpa), 1),
                             "-pt increase)", sep = ""),
                     parameter == "cs.height" ~
                       paste("Height\n(", round(sd(df$height), 1),
                             "-in increase)", sep = ""))) %>%
  dplyr::select(parameter, pretty.parameter, est = Estimate, se = Std..Error,
                z = z.value, p = Pr...z..)

Colors

The code below uses a few named colors for consistency across graphs:

good.color = "#0571B0"
neutral.color = "gray"
bad.color = "#CA0020"

Disclaimer: causality

Several of the visualizations below strongly imply a causal effect of some predictor on the outcome. However, this discussion is not about causality, and we will not discuss how to robustly evaluate causality. Although the logistic regression model estimates associations between the predictors and the outcome, whether a causal relationship actually exists between a predictor and the outcome variable requires careful consideration and domain knowledge surrounding the research question. You should use your judgment when considering these visualizations: if your data doesn’t justify a causal interpretation, don’t use them!

Presenting model coefficients

As the researcher, perhaps the easiest visualization we can create from the logistic regression analysis is a plot of the raw coefficient estimates produced by the model. In this section we will review visualizations based on the raw coefficients or simple functions thereof.

Log odds

Oftentimes in papers we see a summary of the model’s raw outputs in a table, like this:

| Parameter | Estimate | Standard error | z | p |
|---|---|---|---|---|
| Intercept | 1.5367843 | 0.0966569 | 15.8993682 | 0.0000000 |
| Mac | -0.0407003 | 0.0825142 | -0.4932514 | 0.6218350 |
| Glasses | 0.1933065 | 0.0778710 | 2.4823948 | 0.0130503 |
| Pet: Dog | -0.2514314 | 0.0848378 | -2.9636722 | 0.0030399 |
| Pet: Cat | 0.0961617 | 0.1192778 | 0.8061995 | 0.4201278 |
| Pet: Fish | -1.1935940 | 0.1665636 | -7.1659949 | 0.0000000 |
| Favorite color: Red | -0.0394540 | 0.0926567 | -0.4258078 | 0.6702479 |
| Favorite color: Green | -0.3813753 | 0.1006219 | -3.7901819 | 0.0001505 |
| Favorite color: Orange | -0.2420478 | 0.1390052 | -1.7412865 | 0.0816334 |
| Prior GPA (0.6-pt increase) | 1.0309217 | 0.0388717 | 26.5211237 | 0.0000000 |
| Height (3-in increase) | -0.2590889 | 0.0383383 | -6.7579681 | 0.0000000 |
| Tutoring | 0.2269850 | 0.0758328 | 2.9932300 | 0.0027604 |

Although this table may be appropriate for technical audiences who are familiar with logistic regression, non-technical audiences will not understand the meaning behind each of the columns. Furthermore, the numbers in the table can be taxing on the eyes, making it difficult to absorb insights from your model. We can aid the audience by swapping out the table for a caterpillar plot of the predictors.

log.odds.p = coefs.df %>%
  filter(parameter != "(Intercept)") %>%
  mutate(
    pretty.parameter = fct_reorder(pretty.parameter, est),
    lower.95 = est + (qnorm(0.025) * se),
    lower.50 = est + (qnorm(0.25) * se),
    upper.50 = est + (qnorm(0.75) * se),
    upper.95 = est + (qnorm(0.975) * se),
    signif = case_when(p > 0.05 ~ "Not significant",
                       est > 0 ~ "Positive",
                       est < 0 ~ "Negative"),
    signif = fct_relevel(signif, "Positive",
                         "Not significant",
                         "Negative")
  ) %>%
  ggplot(aes(x = pretty.parameter, color = signif)) +
  geom_linerange(aes(ymin = lower.95,
                     ymax = upper.95),
                 size = 1) +
  geom_linerange(aes(ymin = lower.50,
                     ymax = upper.50),
                 size = 2) +
  geom_point(aes(y = est), size = 3) +
  geom_hline(yintercept = 0) +
  scale_color_manual(
    "Relationship to\nlog odds of passing",
    values = c(good.color, neutral.color, bad.color)
  ) +
  labs(x = "", y = "Change in log odds",
       title = "Estimated relationships between\nstudent characteristics\nand log odds of passing") +
  coord_flip(clip = "off")

This plot clearly presents all the predictors used in the model; catalogs which effects are positive, negative, or of no effect; and contains confidence intervals to convey the uncertainty in our estimates. You could argue that a table could do these things as well: color the cells of the table, add columns for the confidence intervals, and order the cells. The characteristic that makes this plot more visually appealing than a table, then, is that the numbers are isolated in one place: the x-axis. However, using the raw outputs from the logistic regression means that the magnitude of effect is in the log odds scale. This visual is perfect for audiences who know exactly what a log odds is, and problematic for non-technical audiences who do not. The visual can leave your non-technical audience member wondering what a 0.4 change in the log odds even means. Is the difference between a change in the log odds of 0.4 and 0.8 big or small? If the scale doesn’t mean anything to your audience member then it is your duty as the researcher to either provide an explanation or make an adjustment.

Secret log odds

Rather than trying to have your audience understand log odds, the simplest adjustment you could make is to relabel the x-axis. This second visual is exactly the same as the first, the only difference being the x-axis: instead of showing the log odds scale explicitly, it simply indicates whether the chance of passing is higher, the same, or lower.

secret.log.odds.p = coefs.df %>%
  filter(parameter != "(Intercept)") %>%
  mutate(
    pretty.parameter = fct_reorder(pretty.parameter, est),
    lower.95 = est + (qnorm(0.025) * se),
    lower.50 = est + (qnorm(0.25) * se),
    upper.50 = est + (qnorm(0.75) * se),
    upper.95 = est + (qnorm(0.975) * se),
    signif = case_when(p > 0.05 ~ "Not significant",
                       est > 0 ~ "Positive",
                       est < 0 ~ "Negative"),
    signif = fct_relevel(signif, "Positive",
                         "Not significant",
                         "Negative")) %>%
  ggplot(aes(x = pretty.parameter, color = signif)) +
  geom_linerange(aes(ymin = lower.95, ymax = upper.95), size = 1) +
  geom_linerange(aes(ymin = lower.50, ymax = upper.50), size = 2) +
  geom_point(aes(y = est), size = 3) +
  geom_hline(yintercept = 0) +
  scale_y_continuous(
    breaks = c(-1, 0, 1), #<<
    labels = c("← Lower", #<<
               "Same", #<<
               "Higher →") #<<
  ) +
  scale_color_manual(
    "Relationship to\nlog odds of passing",
    values = c(good.color, neutral.color, bad.color)
  ) +
  labs(x = "", y = "Chance of passing",
       title = "Estimated relationships between\nstudent characteristics\nand chance of passing") +
  coord_flip()

This graph is perfect if your audience is only interested in knowing which variables (if any) are related to the outcome, and what the sign of each relationship is. Within limits, it can also show which variables are more strongly related to the outcome than others (depending on how continuous predictors are scaled). However, relabeling the axis this way means that there are bands of different lengths but no unit of measurement to describe their absolute effect size. Therefore, just like the first graph, we run into the same issue: the non-technical audience member still will not get a sense of the overall magnitude of effect. This is problematic. Say, for example, your stakeholder asked you to evaluate whether offering tutoring sessions to students in Balloon Animal-Making 201 is helping students pass the class. The effect of the tutoring intervention is positive, but this graph doesn’t indicate what that means in practical terms: does it increase a student’s chance of passing by 5%? 90%? Your stakeholder may deem that a 5% increase is not enough to justify funding the tutoring program.

Odds ratios

The log odds scale is hard to interpret directly, but your audience may be more familiar with the “odds” part of log odds. The quantity \(\frac{p}{1 - p}\) is another way of expressing ideas like “3-to-1 odds” or “2-to-5 odds”. Can we exponentiate the log odds to get something more interpretable?

Sort of, but there are two non-trivial obstacles. The first is that your audience is likely to be more familiar with odds expressed as integer ratios (“3-to-1” or “2-to-5”) rather than “3” or “0.4”. Your model is unlikely to produce odds that are ratios of small integers, so you’ll need to either explain your unusual-looking odds or convert them to approximations (for example, “roughly 1-to-3 odds” for a value of 0.35).

The second, and more serious, problem is that the coefficients of your model (which are probably what your audience cares most about) don’t represent odds directly; they represent changes in odds. Each coefficient represents an additive change on the log odds scale; when we exponentiate to get odds, each coefficient represents a multiplicative change. That is, for each \(\beta_i\), a one-unit increase in \(x_i\) multiplies the odds of the outcome by \(e^{\beta_i}\). To put it yet another way, each exponentiated coefficient is an odds ratio: the odds after a one-unit increase in \(x_i\), divided by the odds before.

\[ \begin{aligned} e^{\beta_i} & = \frac{e^{\beta_i}e^{\beta_ix_i}}{e^{\beta_ix_i}} \\ \\ & = \frac{e^{\beta_ix_i + \beta_i}}{e^{\beta_ix_i}} \\ \\ & = \frac{e^{\beta_i(x_i + 1)}}{e^{\beta_ix_i}} \\ \\ & = \frac{e^{\beta_i(x_i + 1)} \cdot e^{\beta_0 + \beta_1x_1 + \ldots + \beta_{i - 1}x_{i - 1} + \beta_{i + 1}x_{i + 1} \ldots \beta_nx_n}}{e^{\beta_ix_i} \cdot e^{\beta_0 + \beta_1x_1 + \ldots + \beta_{i - 1}x_{i - 1} + \beta_{i + 1}x_{i + 1} \ldots \beta_nx_n}} \\ \\ & = \frac{e^{\log(\mbox{odds for } x_i + 1)}}{e^{\log(\mbox{odds for } x_i)}} \\ \\ & = \frac{\mbox{odds for } x_i + 1}{\mbox{odds for } x_i} \end{aligned} \]
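As a quick numeric check, using the tutoring coefficient from the coefs.df dataframe above (any coefficient works):

tutoring.est = coefs.df$est[coefs.df$parameter == "tutoringTRUE"]
exp(tutoring.est)  # about 1.25: tutoring multiplies a student's odds of passing by ~1.25

# The multiplier is the same no matter what the rest of the linear predictor is
exp(1 + tutoring.est) / exp(1)  # still ~1.25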

We can plot exponentiated coefficients just like raw coefficients.

odds.ratio.p = coefs.df %>%
  filter(parameter != "(Intercept)") %>%
  mutate(
    pretty.parameter = fct_reorder(pretty.parameter, est),
    lower.95 = est + (qnorm(0.025) * se),
    lower.50 = est + (qnorm(0.25) * se),
    upper.50 = est + (qnorm(0.75) * se),
    upper.95 = est + (qnorm(0.975) * se),
    signif = case_when(p > 0.05 ~ "Not significant",
                       est > 0 ~ "Positive",
                       est < 0 ~ "Negative"),
    signif = fct_relevel(signif, "Positive",
                         "Not significant",
                         "Negative")
  ) %>%
  mutate(across(matches("est|lower|upper"), #<<
                ~ exp(.))) %>% #<<
  ggplot(aes(x = pretty.parameter, color = signif)) +
  geom_linerange(aes(ymin = lower.95,
                     ymax = upper.95),
                 size = 1) +
  geom_linerange(aes(ymin = lower.50,
                     ymax = upper.50),
                 size = 2) +
  geom_point(aes(y = est), size = 3) +
  geom_hline(yintercept = 1) +
  scale_y_continuous(
    labels = scales::percent_format() #<<
  ) +
  scale_color_manual("Relationship to\nlog odds of passing",
                     values = c(good.color, neutral.color, bad.color)) +
  labs(x = "", y = "% change in odds ratio",
       title = "Estimated relationships between\nstudent characteristics\nand odds ratio of passing") +
  coord_flip()

Changes in odds ratios may be a bit easier to describe for your audience than changes in log odds. (For example, tripling the odds is like going from 3-to-1 odds to 9-to-1 odds, or from 1-to-3 odds to an even chance.) But we’re still pretty far removed from the kinds of scales your audience will be most familiar with, like percentages. Worse, there’s a danger that the percent change in odds might be misinterpreted as the absolute probability of the outcome (or the change in its probability), which is not what these plots represent at all. Finally, when we represent the coefficients as changes in odds ratios, we expand the scale for coefficients with a positive effect and compress it for coefficients with a negative effect. (The odds ratio graph suggests that a 0.6-point increase in prior GPA has a much larger effect on passing than having a pet fish, while the graph of log odds suggests that the effects have approximately the same magnitude.)

Presenting probabilities

Probability relative to some baseline

Your audience may not be familiar with log odds or odds ratios, but they are certainly familiar with probabilities. We’ve already discussed the primary obstacle to translating logistic regression coefficients into probabilities: the change in percentage points depends on the baseline starting value. But if we can choose an appropriate baseline probability, we can present our model coefficients on a probability scale after all.

One approach is to use the overall intercept of the model as a baseline. When we take the inverse logit of the intercept, we get the probability of passing for a student with average prior GPA, average height, and the baseline value of each categorical variable – not a Mac user, doesn’t wear glasses, has no pet, favorite color is blue, and didn’t go to tutoring. This baseline probability of passing, as it turns out, is 82%. We can then add each coefficient individually to the intercept to discover the predicted probability of passing for a baseline student with one characteristic changed:
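One housekeeping note: the code below relies on invlogit() and logit() helpers, which this writeup assumes are available (the arm and boot packages export versions of them). Base R equivalents work just as well:

invlogit = plogis  # inverse logit: exp(x) / (1 + exp(x))
logit = qlogis     # log(p / (1 - p))

invlogit(1.537)    # the 82% baseline probability quoted above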

intercept = coefs.df$est[coefs.df$parameter == "(Intercept)"]
prob.baseline.p = coefs.df %>%
  filter(parameter != "(Intercept)") %>%
  mutate(pretty.parameter = fct_reorder(pretty.parameter, est),
         lower.95 = est + (qnorm(0.025) * se),
         lower.50 = est + (qnorm(0.25) * se),
         upper.50 = est + (qnorm(0.75) * se),
         upper.95 = est + (qnorm(0.975) * se),
         signif = case_when(p > 0.05 ~ "Not significant",
                            est > 0 ~ "Positive",
                            est < 0 ~ "Negative"),
         signif = fct_relevel(signif, "Positive", "Not significant", "Negative")) %>%
  mutate(across(
    matches("est|lower|upper"), #<<
    ~ invlogit(. + intercept) #<<
  )) %>%
  ggplot(aes(x = pretty.parameter, color = signif)) +
  geom_linerange(aes(ymin = lower.95, ymax = upper.95), size = 1) +
  geom_linerange(aes(ymin = lower.50, ymax = upper.50), size = 2) +
  geom_point(aes(y = est), size = 3) +
  geom_hline(
    yintercept = invlogit(intercept) #<<
  ) +
  scale_y_continuous(
    limits = c(0, 1), #<<
    labels = scales::percent_format() #<<
  ) +
  scale_color_manual("Relationship to\nprobability of passing",
                     values = c(good.color, neutral.color, bad.color)) +
  labs(x = "", y = "Probability of passing",
       title = "Estimated relationships between\nstudent characteristics\nand probability of passing") +
  coord_flip() +
  theme_bw()

This graph includes confidence intervals around the predicted probabilities, just like the confidence intervals around the coefficients in the earlier graphs. Alternatively, we can use arrows to emphasize that we’re showing predicted differences from a baseline:

prob.baseline.arrows.p = coefs.df %>%
  filter(parameter != "(Intercept)") %>%
  mutate(pretty.parameter = fct_reorder(pretty.parameter, est),
         signif = case_when(p > 0.05 ~ "Not significant",
                            est > 0 ~ "Positive",
                            est < 0 ~ "Negative"),
         signif = fct_relevel(signif, "Positive",
                              "Not significant",
                              "Negative"),
         est = invlogit(est + intercept)) %>%
  ggplot(aes(x = invlogit(intercept), #<<
             xend = est, #<<
             y = pretty.parameter, #<<
             yend = pretty.parameter, #<<
             color = signif)) + #<<
  geom_segment( #<<
    size = 1, #<<
    arrow = arrow(length = unit(0.1, "in"), #<<
                  type = "closed") #<<
  ) + #<<
  geom_vline(xintercept = invlogit(intercept)) +
  scale_x_continuous(
    limits = c(0, 1),
    labels = scales::percent_format()
  ) +
  scale_color_manual("Relationship to\nprobability of passing",
                     values = c(good.color, neutral.color, bad.color)) +
  labs(x = "Probability of passing", y = "",
       title = "Estimated relationships between\nstudent characteristics\nand probability of passing") +
  theme_bw()

The biggest advantage of this approach is that it uses a scale that your audience is already very familiar with: probabilities (expressed as percentages). Moreover, it avoids a common, but misleading, way of presenting changes in probabilities: the “percent change” formulation. If I tell you that tutoring doubles a student’s chances of passing, you don’t know whether it raises the student’s chances from 1% to 2% or from 40% to 80% – and that difference probably matters to you! Instead, it presents the “before” and “after” probabilities simultaneously, which is much more helpful context for the audience.

The biggest difficulty with this approach is choosing an appropriate baseline. The idea of representing an “average” student seems reasonable enough, although the process for doing so depends on how the model is specified. In our model, the intercept represents a student with average values for the continuous predictors because we standardized those predictors; if we hadn’t, the intercept would be quite different and would represent a student with a prior GPA and height of 0. Such a baseline wouldn’t make any sense; in that case, we would want to make adjustments (for example, by adding the mean prior GPA and height back to the intercept, multiplied by their respective coefficients).
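As a hypothetical illustration, if we had fit the model on the raw, unstandardized predictors (call it raw.m; we did not fit this model above), the adjustment would look something like this:

# Hypothetical: fold the predictor means back into the intercept of an
# unstandardized model so the baseline represents an average student
# (raw.m is an illustrative name, not an object defined above)
b = coef(raw.m)
adjusted.intercept = b["(Intercept)"] +
  b["prior.gpa"] * mean(df$prior.gpa) +
  b["height"] * mean(df$height)
invlogit(adjusted.intercept)  # baseline probability for an average student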

More seriously, with respect to the categorical predictors, our baseline represents a student who is not actually all that average: it’s a student who is not a Mac user, doesn’t wear glasses, etc. All of our baseline categories are the most frequent values in the dataset, but it’s nevertheless the case that students with all of the most frequent values are a small minority (only 4% of the dataset). Whether this is an acceptable baseline depends on your particular situation.

Another approach would be to take the raw pass rate from the overall dataset and use that as the baseline. (To get probabilities for each predictor, we would take the logit of the raw pass rate, add the appropriate coefficient, and take inverse logit to get back to a probability.) Or you could choose a probability that is meaningful for your particular application (for example, maybe your stakeholders are especially interested in students who are “on the edge”, and so you might choose 50% as a baseline). What all this discussion shows is that presenting your model as changes in probabilities doesn’t eliminate the need to explain the visualization to your stakeholders. You may not have to explain the probability scale, but you will definitely need to explain how and why you chose the baseline you did.
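For example, a minimal sketch of the raw-pass-rate baseline, shown here for the tutoring coefficient:

baseline.p = mean(df$passed)  # raw pass rate across the whole dataset
tutoring.est = coefs.df$est[coefs.df$parameter == "tutoringTRUE"]
invlogit(logit(baseline.p) + tutoring.est)  # implied pass rate with tutoring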

Multiple baselines by group

If there’s no single appropriate baseline probability for your dataset, you might choose to make multiple plots, one for each of several baselines. In the example below, we show four baselines, one for each type of pet a student might have. The baseline for each group is the overall model intercept plus the coefficient for that type of pet.

prob.group.p = expand.grid(pet = c("None", "Dog", "Cat", "Fish"),
                           other.parameter = coefs.df %>%
                             filter(!grepl("pet\\.type|Intercept", parameter)) %>%
                             pull(parameter)) %>%
  mutate(pet.parameter = paste("pet.type", str_to_lower(pet), sep = "")) %>%
  left_join(coefs.df, by = c("pet.parameter" = "parameter")) %>%
  mutate(pretty.parameter = coalesce(pretty.parameter, "Pet: None"),
         mu = intercept + coalesce(est, 0),
         baseline.mu = mu) %>%
  dplyr::select(pet, other.parameter, mu, baseline.mu) %>%
  left_join(coefs.df, by = c("other.parameter" = "parameter")) %>%
  mutate(pretty.parameter = fct_reorder(pretty.parameter, est),
         mu = mu + est,
         lower.95 = mu + (qnorm(0.025) * se),
         lower.50 = mu + (qnorm(0.25) * se),
         upper.50 = mu + (qnorm(0.75) * se),
         upper.95 = mu + (qnorm(0.975) * se),
         signif = case_when(p > 0.05 ~ "Not significant",
                            est > 0 ~ "Positive",
                            est < 0 ~ "Negative"),
         signif = fct_relevel(signif, "Positive", "Not significant", "Negative")) %>%
  mutate(across(matches("mu|lower|upper"), ~ invlogit(.))) %>%
  ggplot(aes(x = pretty.parameter, color = signif)) +
  geom_linerange(aes(ymin = lower.95, ymax = upper.95), size = 1) +
  geom_linerange(aes(ymin = lower.50, ymax = upper.50), size = 2) +
  geom_point(aes(y = mu), size = 3) +
  geom_hline(aes(yintercept = baseline.mu)) +
  scale_y_continuous(limits = c(0, 1), labels = scales::percent_format()) +
  scale_color_manual("Relationship to\nprobability of passing",
                     values = c(good.color, neutral.color, bad.color)) +
  facet_wrap(~ pet) +
  theme(panel.spacing.x = unit(0.65, "lines")) +
  labs(x = "", y = "Probability of passing", subtitle = "By type of pet",
       title = "Estimated relationships between\nstudent characteristics\nand probability of passing") +
  coord_flip()

As before, we can use arrows instead of confidence intervals:

prob.group.arrows.p = expand.grid(pet = c("None", "Dog", "Cat", "Fish"),
                                  other.parameter = coefs.df %>%
                                    filter(!grepl("pet\\.type|Intercept", parameter)) %>%
                                    pull(parameter)) %>%
  mutate(pet.parameter = paste("pet.type", str_to_lower(pet), sep = "")) %>%
  left_join(coefs.df, by = c("pet.parameter" = "parameter")) %>%
  mutate(pretty.parameter = coalesce(pretty.parameter, "Pet: None"),
         mu = coefs.df$est[coefs.df$parameter == "(Intercept)"] +
           coalesce(est, 0),
         baseline.mu = mu) %>%
  dplyr::select(pet, other.parameter, mu, baseline.mu) %>%
  left_join(coefs.df, by = c("other.parameter" = "parameter")) %>%
  mutate(pretty.parameter = fct_reorder(pretty.parameter, est),
         mu = mu + est,
         signif = case_when(p > 0.05 ~ "Not significant",
                            est > 0 ~ "Positive",
                            est < 0 ~ "Negative"),
         signif = fct_relevel(signif, "Positive", "Not significant",
                              "Negative")) %>%
  mutate(across(matches("mu"), ~ invlogit(.))) %>%
  ggplot(aes(x = baseline.mu, xend = mu, y = pretty.parameter,
             yend = pretty.parameter, color = signif)) +
  geom_segment(size = 1,
               arrow = arrow(length = unit(0.1, "in"), type = "closed")) +
  geom_vline(aes(xintercept = baseline.mu)) +
  scale_x_continuous(limits = c(0, 1), labels = scales::percent_format()) +
  scale_color_manual("Relationship to\nprobability of passing",
                     values = c(good.color, neutral.color, bad.color)) +
  facet_wrap(~ pet) +
  theme(panel.spacing.x = unit(0.65, "lines")) +
  labs(x = "Probability of passing", y = "",
       title = "Estimated relationships between\nstudent characteristics\nand probability of passing",
       subtitle = "By type of pet")

This approach is more cluttered than a single graph. But, if that’s not a deal-breaker for you, it has several advantages. First, it emphasizes that the baseline we show is a choice, and that different students have different baselines.

Second, this approach explicitly shows how the effect of a given predictor varies depending on the baseline. In the example above, prior GPA is associated with a larger percentage point change for students who have a fish than for other types of students, because students with fish start at a different baseline.

Finally, this approach allows you to show different effects by group. In our example, the larger effect for students with fish is purely a function of their lower baseline. But we could easily imagine a model with true interactions between pet type and other predictors. (For example, maybe tutoring is more effective for dog owners than for cat owners.) In that case, we would just add the interaction term when computing the probability for the relevant predictor.
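Here is a hypothetical sketch of that last case: a refit with a pet-by-tutoring interaction (pass.int.m is not a model we fit above), where the probability for a tutored, dog-owning baseline student combines the main effects and the interaction term:

# Hypothetical refit with an interaction between pet type and tutoring
pass.int.m = glm(passed ~ mac + glasses + pet.type * tutoring + favorite.color +
                   cs.prior.gpa + cs.height,
                 data = df, family = binomial(link = "logit"))

b = coef(pass.int.m)
invlogit(b["(Intercept)"] + b["pet.typedog"] +
           b["tutoringTRUE"] + b["pet.typedog:tutoringTRUE"])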

Banana graphs

Even if we present multiple groups, we still have to choose a (possibly arbitrary) baseline for each group. We can overcome this picking-and-choosing problem by iterating across every baseline. For example, we can sweep the predicted probability of passing Balloon Animal-Making 201 for a student who does not own a pet fish from 0% to 100%, compute for each of those baseline values the predicted probability for an otherwise-identical student who does own a pet fish, and plot the pairs on a graph. The x-axis value for each point on the curve is the baseline probability, i.e., the probability of passing for the student who does not own a pet fish. The y-axis value for each point is the corresponding predicted probability of passing for the student who does own a pet fish. Due to the shape of the curve, we call this graph the banana graph. We also include a solid line which goes diagonally across the middle of the graph as a reference line. When the banana-shaped curve is:

  • Above the solid line: the effect gives a higher predicted probability of passing than the baseline
  • On the solid line: the effect gives the same predicted probability of passing as the baseline
  • Below the solid line: the effect gives a lower predicted probability of passing than the baseline

In the figure below, you can see that the banana-shaped curve is below the solid line, meaning that students who own a pet fish are less likely to pass than otherwise-similar students who do not.

est = coefs.df$est[coefs.df$parameter == "pet.typefish"]
se = coefs.df$se[coefs.df$parameter == "pet.typefish"]
effect.color = case_when(coefs.df$p[coefs.df$parameter == "pet.typefish"] > 0.05 ~ neutral.color,
                         est > 0 ~ good.color,
                         T ~ bad.color)
banana.p = data.frame(x = seq(0.01, 0.99, 0.01),
                      upper.95 = 0.975,
                      upper.50 = 0.75,
                      median = 0.5,
                      lower.50 = 0.25,
                      lower.95 = 0.025) %>%
  mutate(across(matches("median|upper|lower"), #<<
                function(q) { #<<
                  current.x = get("x") #<<
                  invlogit(logit(current.x) + #<<
                             est + #<<
                             (qnorm(q) * se)) #<<
                })) %>% #<<
  ggplot(aes(x = x, group = 1)) +
  geom_segment(x = 0, xend = 1, y = 0, yend = 1) +
  geom_ribbon(aes(ymin = lower.95, ymax = upper.95),
              fill = effect.color, alpha = 0.2) +
  geom_ribbon(aes(ymin = lower.50, ymax = upper.50),
              fill = effect.color, alpha = 0.4) +
  geom_line(aes(y = median), color = effect.color) +
  scale_x_continuous(labels = scales::percent_format()) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(x = "Baseline probablity of passing",
       y = "Probability of passing with effect",
       title = gsub("\n", " ",
                    coefs.df$pretty.parameter[coefs.df$parameter == "pet.typefish"]),
       subtitle = "Estimated relationship to probability of passing")

However, these banana graphs can be difficult for audience members to understand and tricky to explain. To overcome this issue, we suggest that you begin by presenting one banana graph enlarged on the page, choosing one example point, and annotating the graph with how that point should be interpreted (as sketched below). If your audience is unfamiliar with these graphs, filling the page with them can be overwhelming, and their eyes might gloss over the figures. Starting with a single, enlarged banana graph keeps your audience’s eyes from wandering, and the annotation provides a concrete example of how to interpret each point on the curve.
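A sketch of that annotated version, building on the banana.p object and the fish coefficient est from above (the example point and its wording are our own choices):

example.x = 0.5                               # a student with a 50% baseline chance
example.y = invlogit(logit(example.x) + est)  # about 23% with a pet fish

banana.annotated.p = banana.p +
  annotate("point", x = example.x, y = example.y, size = 3) +
  annotate("text", x = example.x + 0.02, y = example.y - 0.08, hjust = 0,
           label = paste("A student with a 50% baseline chance of passing",
                         "has roughly a 23% chance if they own a fish",
                         sep = "\n"))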

Once you’ve helped the audience understand one banana graph, you can then go on to present multiple banana graphs, like so:

banana.multiple.p =
  expand.grid(x = seq(0.01, 0.99, 0.01), #<<
              pet = c("fish", "dog", "cat")) %>% #<<
  mutate(pet = paste("pet.type", pet, sep = ""),
         upper.95 = 0.975,
         upper.50 = 0.75,
         median = 0.5,
         lower.50 = 0.25,
         lower.95 = 0.025) %>%
  inner_join(coefs.df, #<<
             by = c("pet" = "parameter")) %>% #<<
  mutate(across(matches("median|upper|lower"),
                function(q) {
                  current.x = get("x")
                  invlogit(logit(current.x) + est + (qnorm(q) * se))
                })) %>%
  mutate(effect.color = case_when(p > 0.05 ~ neutral.color,
                                  est > 0 ~ good.color,
                                  T ~ bad.color)) %>%
  ggplot(aes(x = x, color = effect.color, fill = effect.color, group = 1)) +
  geom_segment(x = 0, xend = 1, y = 0, yend = 1, color = "black") +
  geom_ribbon(aes(ymin = lower.95, ymax = upper.95), color = NA, alpha = 0.2) +
  geom_ribbon(aes(ymin = lower.50, ymax = upper.50), color = NA, alpha = 0.4) +
  geom_line(aes(y = median)) +
  scale_x_continuous(labels = scales::percent_format()) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_color_identity() +
  scale_fill_identity() +
  labs(x = "Baseline probablity of passing",
       y = "Probability of passing with effect",
       title = "Estimated relationship to\nprobability of passing") +
  facet_wrap(~ pretty.parameter, ncol = 1) #<<

These graphs are perfect for showing the whole range of predicted probabilities. The downside is that, if you plan to show the effects of multiple variables, they take up a great deal of real estate in your report: banana graphs require one graph per predictor variable. If you must present the effects of many predictors, you may overwhelm your audience with the sheer volume of graphs. Additionally, the caterpillar plots in the previous examples could show all the predictors and their effect sizes in one graph, so one can easily compare any two predictors within a single plot; with banana graphs, your audience’s eyes have to dart from one graph to the next to make the same comparison. Another limitation is that, if this is your audience’s first time seeing banana graphs, they may be difficult to understand. Although we discussed a design layout to overcome this issue, it is still worth asking whether there is a better visual that does not require going to such lengths to explain what the graph is trying to convey.

Presenting counterfactual counts

Sometimes stakeholders are interested in a measure that’s even more basic than probabilities: the number of times something happens (or doesn’t happen).

For example, suppose your stakeholders want to use your analysis to assess the impact of tutoring on pass rates in Balloon Animal-Making 201. Since the tutoring program costs money, they’re interested not just in whether tutoring helps students, but how much it helps them. In other words, is the program worth it, or could the money be used more effectively elsewhere? A very direct way to assess the real-world impact of the program is to estimate a literal count: the number of extra students who passed because of tutoring.

Extra successes

Here’s one way to approach this idea. In our dataset, 2,571 students received tutoring; of those, 2,023 passed the class. Suppose those students had not received tutoring; in that case, how many students do we think would have passed? In other words, how many “extra” passes did we get because of tutoring?

We can get a simple point estimate by using our model to predict outcomes for the subset of our data where students received tutoring, but with the tutoring predictor set to FALSE instead of TRUE – in other words, by running a set of counterfactual predictions. But a point estimate doesn’t convey the amount of uncertainty around our estimate; we can get confidence intervals by simulating many sets of outcomes and aggregating over them, as follows:

  1. Take all observations in the dataset for students who received tutoring.
  2. Set the value of the tutoring predictor to FALSE for all students.
  3. For each of 5,000 simulations (or some other suitably large number):
    1. Account for uncertainty in parameter estimates: Sample a value for each parameter in the model. We took a sample from the normal distribution with its mean set to the estimated parameter value and its standard deviation set to the standard error. (Ideally, our sampling procedure would also take into account the correlations among the fitted parameters. We haven’t done this here; the correlations in this particular model are all quite small. A sketch of a correlation-aware sampler appears after this list.)
    2. Compute the predicted probability of passing for each student using these parameter values.
    3. Account for uncertainty in outcomes: Randomly assign each prediction to “pass” or “fail”, weighted by the predicted probability of passing.
    4. Sum the total number of students predicted to pass. Subtract this total from the number of students who actually passed. This is the predicted number of “extra” passes.
  4. Summarize the distribution of predicted number of extra passing students over all the simulations.
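If you did want to account for those correlations, one option (a sketch, not what we ran below) is to draw all the coefficients jointly from a multivariate normal using the model’s estimated covariance matrix:

# Draw one joint sample of all coefficients, respecting their covariance.
# Note: MASS masks dplyr::select(), which the code above already calls
# explicitly as dplyr::select()
library(MASS)
betas = mvrnorm(1, mu = coef(pass.m), Sigma = vcov(pass.m))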

The histogram below shows the results of 5,000 simulations like this for our dataset. Our model estimates that tutoring did indeed increase the number of students who passed – probably by somewhere around 50-150 students.

extra.p = with(
  list(
    temp.df = map_dfr( #<<
      1:5000, #<<
      function(d) { #<<
        data.frame( #<<
          draw = d, #<<
          mu = model.matrix( #<<
            pass.m, #<<
            data = df %>% #<<
              filter(tutoring) %>% #<<
              mutate(tutoring = F) #<<
          ) %*% #<<
            rnorm(nrow(coefs.df), #<<
                  mean = coefs.df$est, #<<
                  sd = coefs.df$se) #<<
        ) #<<
      } #<<
    ) #<<
  ),
  {
    temp.df %>%
      mutate(pred = runif(n()) < invlogit(mu)) %>%
      group_by(draw) %>%
      summarise(pred.passed = sum(pred)) %>%
      ungroup() %>%
      mutate(extra.passed = #<<
               sum(df$passed & df$tutoring) #<<
               - pred.passed) %>% #<<
      ggplot(aes(x = extra.passed)) +
      geom_histogram(fill = "gray") +
      geom_vline(xintercept = 0) +
      labs(x = "Number of extra students\nwho passed because of tutoring",
           y = "Number of simulations",
           title = "Estimated number of extra students\nwho passed because of tutoring")
  }
)

If your stakeholders care about the magnitude of an effect in terms of counts, this approach is clear and straightforward; it directly answers the question stakeholders are asking. Its primary drawback is that, even more than other visualizations we’ve explored, it strongly implies a causal relationship between the predictor and the outcome: we’re claiming that the tutoring caused some students to pass who otherwise wouldn’t have. Moreover, this approach assumes that the counterfactual makes sense. This is fine for the tutoring predictor; it’s reasonable to ask what would have happened if a student hadn’t gone to tutoring. But it makes much less sense for other predictors; for example, what does it mean to ask what would have happened if a student’s favorite color were red instead of blue?

Using a histogram to summarize the simulations may not be the best choice if your audience is likely to be distracted by the whole idea of using simulation to estimate uncertainty. The meaning of the y-axis isn’t straightforward and will require some explanation.

Extra successes by group

Just as we did with the probability-based approaches above, we can summarize counterfactuals by group for a more fine-grained view of our model’s predictions. The following graph shows the number of extra passing students by pet type. To save space, we’ve flattened the histograms into a caterpillar plot; this has the additional advantage of avoiding a scale that shows the number of simulations, and instead focusing on the range of predictions (which is what we wanted in the first place).

extra.group.p = with(
  list(
    temp.df = map_dfr(1:5000,
                      function(d) { data.frame(draw = d,
                                               pet.type = df$pet.type[df$tutoring],
                                               mu = model.matrix(pass.m, data = df %>% filter(tutoring) %>% mutate(tutoring = F)) %*%
                                                 rnorm(nrow(coefs.df), mean = coefs.df$est, sd = coefs.df$se)) })
  ),
  { temp.df %>%
      mutate(pred = runif(n()) < invlogit(mu)) %>%
      group_by(pet.type, draw) %>% #<<
      summarise(pred.passed = sum(pred), .groups = "keep") %>%
      ungroup() %>%
      left_join(df %>% #<<
                  filter(tutoring) %>% #<<
                  group_by(pet.type) %>% #<<
                  summarise(actual.passed = #<<
                              sum(passed)) %>% #<<
                  ungroup(), #<<
                by = "pet.type") %>% #<<
      mutate(pet.type = str_to_title(pet.type), extra.passed = actual.passed - pred.passed) %>%
      group_by(pet.type) %>%
      summarise(lower.95 = quantile(extra.passed, 0.025),
                lower.50 = quantile(extra.passed, 0.25),
                median = median(extra.passed),
                upper.50 = quantile(extra.passed, 0.75),
                upper.95 = quantile(extra.passed, 0.975)) %>%
      ungroup() %>%
      mutate(pet.type = fct_reorder(pet.type, median)) %>%
      ggplot(aes(x = pet.type)) +
      geom_linerange(aes(ymin = lower.95, ymax = upper.95), size = 1) +
      geom_linerange(aes(ymin = lower.50, ymax = upper.50), size = 2) +
      geom_point(aes(y = median), size = 3) +
      geom_hline(yintercept = 0) +
      labs(subtitle = "By type of pet", x = "Pet type",
           y = "Number of extra students\nwho passed because of tutoring",
           title = "Estimated number of extra students\nwho passed because of tutoring") +
      coord_flip() }
)

Splitting up the estimates by group has advantages similar to the ones described above for presenting probabilities by group: it shows how the effect varies by group, and if our model had relevant interactions, we would be able to include those.

It’s worth emphasizing that the kinds of differences we see here are different from those we saw in the probability graphs. For example, recall that we saw that tutoring gives a larger percentage point boost to students with fish, because those students started out with a lower baseline probability of passing. But this graph shows that the absolute number of extra passing students with fish is smaller than for other groups; this is because there simply aren’t that many students with fish in the first place. Your context will determine which kind of effect is most relevant. Do your stakeholders want to know absolute numbers, so that they can avoid spending resources that won’t actually benefit many people? Or do they want to understand whether a particular intervention disproportionately benefits (or harms) certain groups, even if those groups are small?

Potential successes compared to group size

There’s an alternative way to present counterfactuals that attempts to show both the effect size for each group and the overall size of that group. For this example, we’ll switch the direction of the counterfactual: instead of predicting how many tutored students passed who otherwise wouldn’t have, we’re going to predict how many untutored students would have passed if they had received tutoring. (Again, we’re making a strong causal claim, which may not be appropriate for your model!)

The graph below shows, for each pet type, the number of untutored students with that kind of pet who actually passed the class (red diamonds). It also shows the estimated number of untutored students who would have passed if they had received tutoring (black dots and lines). We can see clearly that our model predicts a benefit from tutoring for all students except dog owners; it also makes clear that the groups have substantially different sizes.

potential.group.p = with(
  list(
    temp.df = map_dfr(1:5000,
                      function(d) { data.frame(draw = d,
                                               pet.type = df$pet.type[!df$tutoring],
                                               mu = model.matrix(pass.m, data = df %>% filter(!tutoring) %>% mutate(tutoring = T)) %*%
                                                 rnorm(nrow(coefs.df), mean = coefs.df$est, sd = coefs.df$se)) })
  ),
  { temp.df %>%
      mutate(pred = runif(n()) < invlogit(mu)) %>%
      group_by(pet.type, draw) %>%
      summarise(n.passed = sum(pred), #<<
                .groups = "keep") %>% #<<
      ungroup() %>%
      group_by(pet.type) %>%
      summarise(lower.95 = quantile(n.passed, 0.025),
                lower.50 = quantile(n.passed, 0.25),
                upper.50 = quantile(n.passed, 0.75),
                upper.95 = quantile(n.passed, 0.975),
                n.passed = median(n.passed)) %>%
      ungroup() %>%
      mutate(pass.type = "Predicted") %>% #<<
      bind_rows( #<<
        df %>% #<<
          filter(!tutoring) %>% #<<
          group_by(pet.type) %>% #<<
          summarise(n.passed = sum(passed)) %>% #<<
          ungroup() %>% #<<
          mutate(pass.type = "Actual") #<<
      ) %>% #<<
      mutate(pet.type = str_to_title(pet.type), pet.type = fct_reorder(pet.type, n.passed, max)) %>%
      ggplot(aes(x = pet.type, color = pass.type, shape = pass.type)) +
      geom_linerange(aes(ymin = lower.95, ymax = upper.95), size = 1,
                     show.legend = F) +
      geom_linerange(aes(ymin = lower.50, ymax = upper.50), size = 2,
                     show.legend = F) +
      geom_point(aes(y = n.passed), size = 3) +
      scale_color_manual(values = c("red", "black")) + scale_shape_manual(values = c(18, 16)) +
      labs(x = "Pet type", color = "", shape = "", subtitle = "By type of pet",
           y = "Number of untutored students\npredicted to pass with tutoring",
           title = "Estimated number of untutored students\nwho would have passed with tutoring") +
      expand_limits(y = 0) + coord_flip() }
)

This approach is a step towards acknowledging different group sizes, and it also helps put the absolute numbers in context. (50 extra passes is extremely impressive if the baseline was 50, but less so if the baseline was 10,000.) But note that it doesn’t convey the size of the whole group, only the number of passing students in the group. For example, the graph doesn’t explain whether the number of passing students with fish is small because there aren’t many of them in the first place, or because fish owners are more likely to fail. (In this case it’s both, but that won’t always be true.) In addition, if you have groups of vastly unequal sizes, then the effects for smaller groups will be squashed at the bottom of the scale and difficult to see.

Conclusion

For logistic regression models, just as there is no one straightforward effect we can report to our non-technical audience members, there is also no one visualization that can (or should) be presented to them. Some visualizations are more appropriate than others for a given situation. Knowing our stakeholders, as well as the context and purpose of our research, should guide us to the most appropriate visualization. Although it’s easy to get caught up in presenting the model’s results for the sake of presenting the model’s results, it’s important to remember that our results mean something, and they should mean something to our audiences too.

We hope that this guide can serve as a springboard for you to create other visualizations for presenting logistic regression results. There is no right or wrong way, only better and worse ways for a particular project, so get creative! Use colors, layout, and annotations to your advantage, and share your ideas with others. Good luck!