17  Differential Item Functioning

Author

Derek C. Briggs and Claude Code (Opus 4.6 & 4.7)

18 Introduction

Test results are routinely used as the basis for decisions regarding placement, advancement, and licensure. It is crucial that the tests used for these decisions allow for valid interpretations. One potential threat to validity is item bias—when a test item unfairly favors one group over another, it can undermine the fairness and accuracy of the entire test.

This tutorial introduces statistical procedures for detecting differential item functioning (DIF), a necessary but not sufficient condition for item bias. We cover three main approaches:

  • The Mantel-Haenszel (MH) procedure, a widely used non-IRT method based on contingency tables
  • Logistic regression, which provides a flexible model-based bridge between observed-score and IRT approaches
  • The lordif algorithm, an IRT-based approach that combines ordinal logistic regression with iterative purification of the matching criterion

Throughout, we illustrate each method with data from the CDE mathematics test introduced previously. The tutorial assumes familiarity with the IRT models and test equating concepts covered earlier in the course.


19 Key Concepts: Impact, Bias, and DIF

19.1 Item Impact

Item impact occurs when students in one group tend to perform better on an item than students in another group. For example, if 62% of males but only 48% of females answer an item correctly, there is an observed performance difference. But this difference alone does not tell us whether the item is biased—it may simply reflect true differences in the ability the item is designed to measure.

19.2 Item Bias

Item bias is present when an observed group difference on an item reflects something other than the ability the item is intended to measure: a systematic difference between two quantities that should be equal. In the context of testing, possible explanations for observed group differences (impact) include:

  • True differences in the ability being measured
  • Inclusion of test items that measure an unintended dimension that favors one group over the other
  • Construct-irrelevant sources such as differences in test administration conditions

Only the second and third explanations would be considered biases that could potentially be addressed by changing the design of the test.

19.3 Differential Item Functioning

Differential item functioning (DIF) is present when examinees from different groups have differing probabilities or likelihoods of success on an item, after they have been matched on the ability of interest. The last clause is critical: DIF is about differences that persist after conditioning on ability, not just raw performance differences.
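In symbols, an item is free of DIF when the probability of success is identical across groups at every level of the matched ability:

\[P(Y = 1 \mid \theta, G = \text{focal}) = P(Y = 1 \mid \theta, G = \text{reference}) \quad \text{for all } \theta\]

DIF is present when this equality fails over at least some range of \(\theta\).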

DIF is a necessary but not sufficient condition for concluding that an item is biased. An item flagged for DIF may measure a secondary dimension that is relevant to the construct, in which case the item is functioning differently but is not necessarily biased. Conversely, it is possible for a test to be considered unfair even if no individual items show DIF—for example, if there are true group differences attributable to unequal opportunity to learn.

19.4 Uniform vs. Non-Uniform DIF

DIF can be classified into two types based on how the item characteristic curves (ICCs) for two groups differ:

  • Uniform DIF: One group is advantaged across the entire ability scale. In IRT terms, the ICCs for the two groups differ only in their difficulty (location) parameter. The curves are shifted but do not cross.
  • Non-uniform DIF: The relative advantage switches direction at some point on the ability scale. In IRT terms, the ICCs differ in their discrimination (slope) parameter, and the curves cross. This is sometimes called “crossing DIF.”
Code
theta <- seq(-3, 3, 0.01)

par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

# Uniform DIF: different difficulty
p_ref <- 1 / (1 + exp(-1.2 * (theta - 0)))
p_foc <- 1 / (1 + exp(-1.2 * (theta - 0.8)))
plot(theta, p_ref, type = "l", lwd = 2,
     xlab = expression(theta), ylab = "P(Correct)",
     main = "Uniform DIF", ylim = c(0, 1))
lines(theta, p_foc, lwd = 2, lty = 2, col = "red")
legend("bottomright", c("Reference", "Focal"),
       lty = c(1, 2), col = c("black", "red"), lwd = 2, cex = 0.8)

# Non-uniform DIF: different discrimination
p_ref <- 1 / (1 + exp(-1.5 * (theta - 0)))
p_foc <- 1 / (1 + exp(-0.6 * (theta - 0)))
plot(theta, p_ref, type = "l", lwd = 2,
     xlab = expression(theta), ylab = "P(Correct)",
     main = "Non-Uniform DIF", ylim = c(0, 1))
lines(theta, p_foc, lwd = 2, lty = 2, col = "red")
legend("bottomright", c("Reference", "Focal"),
       lty = c(1, 2), col = c("black", "red"), lwd = 2, cex = 0.8)

Code
par(mfrow = c(1, 1))

Non-uniform DIF is harder to detect and interpret because the direction of group advantage changes at different locations on the scale. Most DIF detection methods are primarily sensitive to uniform DIF.

19.5 Evaluating DIF: A Two-Stage Process

Evaluating DIF involves two stages:

  • Stage 1: Is there DIF? If yes, is the DIF both statistically significant and practically significant?
  • Stage 2: What explains the DIF? This requires substantive judgment about item content, often involving expert review.

19.6 Obstacles to Establishing DIF

Several practical obstacles complicate DIF detection:

  • Small sample sizes limit statistical power to detect differences
  • Choice of matching criterion: Should it be internal (total test score) or external? How do we know the matching criterion itself is unbiased?
  • The fundamental problem: If ALL items on a test are biased against a focal group, no internal matching criterion can distinguish bias from impact. The utility of DIF analysis rests on the assumption that at least some items are unbiased.
  • DIF cancellation and amplification: Individual items may show DIF in opposite directions, partially or fully canceling at the test level—or they may reinforce each other.

20 Data

We will use data from the CDE mathematics test, a 31-item multiple-choice assessment. The data file also includes a gender indicator. The focal group is females and the reference group is males. Because mantelhaen.test() in R orders factor levels alphanumerically, the lower value (females, "F") becomes the first level and the higher value (males, "M") the second; this ordering determines the direction of the odds ratios reported below.

Code
fulldata <- read.fwf("CTBMathSci.rwo",
                     widths = c(rep(1, 76), 3, 1, 6), skip = 2)
cde <- fulldata[, c(1:31, 78)]
names(cde)[32] <- "gender"

Let’s check the sample size by gender.

Code
table(cde$gender)

  F   M 
755 745 

There are 755 females and 745 males in the data.
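A quick check of the factor-level ordering that drives this convention (gender_flipped is an illustrative name and is not used later):

Code
levels(factor(cde$gender))  # "F" "M": alphabetical, so F is the first level
# Making "M" the first level would invert the direction of the odds ratios:
gender_flipped <- relevel(factor(cde$gender), ref = "M")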

20.1 Unconditional Item Performance by Gender

Before running any DIF analysis, it is good practice to examine the unconditional proportions correct by gender. Items where one group does substantially better than another are ones to keep an eye on. The question is: could part of this observed difference be caused by bias in how the item was written, rather than true ability differences?

Code
library(kableExtra)

# Compute proportion correct by gender for each item
p_correct <- data.frame(
  Item = 1:31,
  Female = sapply(1:31, function(i) mean(cde[cde$gender == "F", i], na.rm = TRUE)),
  Male = sapply(1:31, function(i) mean(cde[cde$gender == "M", i], na.rm = TRUE))
)
p_correct$Diff <- p_correct$Male - p_correct$Female

p_correct |>
  kable(digits = 3, col.names = c("Item", "P(Female)", "P(Male)", "Diff (M-F)")) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(abs(p_correct$Diff) > 0.05), bold = TRUE)
Item P(Female) P(Male) Diff (M-F)
1 0.713 0.705 -0.008
2 0.705 0.719 0.015
3 0.652 0.635 -0.017
4 0.701 0.744 0.043
5 0.584 0.561 -0.023
6 0.481 0.503 0.023
7 0.575 0.631 0.056
8 0.470 0.546 0.076
9 0.711 0.757 0.046
10 0.640 0.631 -0.009
11 0.783 0.819 0.036
12 0.791 0.760 -0.031
13 0.804 0.787 -0.017
14 0.547 0.529 -0.018
15 0.898 0.922 0.024
16 0.877 0.874 -0.003
17 0.682 0.717 0.035
18 0.588 0.619 0.031
19 0.808 0.827 0.019
20 0.164 0.278 0.114
21 0.574 0.695 0.122
22 0.615 0.636 0.022
23 0.525 0.495 -0.029
24 0.505 0.529 0.024
25 0.709 0.729 0.020
26 0.303 0.362 0.059
27 0.470 0.439 -0.031
28 0.385 0.456 0.071
29 0.166 0.243 0.077
30 0.330 0.350 0.021
31 0.253 0.321 0.068

Items with differences greater than 5 percentage points (shown in bold) are candidates for closer inspection. But remember: these are unconditional differences. An item that appears to favor one group might not show DIF once we condition on ability.

Code
diff_vals <- p_correct$Diff
bar_cols <- ifelse(is.na(diff_vals), "gray",
                   ifelse(diff_vals > 0, "steelblue", "coral"))
barplot(diff_vals, names.arg = 1:31,
        xlab = "Item", ylab = "Difference (Male - Female)",
        main = "Unconditional Proportion Correct: Male - Female",
        col = bar_cols, border = NA, las = 1)
abline(h = 0, col = "gray40")
abline(h = c(-0.05, 0.05), lty = 2, col = "gray60")


21 The Mantel-Haenszel Approach

21.1 Basic Strategy

The Mantel-Haenszel (MH) procedure is the most widely used non-IRT method for detecting DIF. The basic strategy is:

  • Examine 2 x 2 crosstabs of group (focal vs. reference) by item score (correct vs. incorrect)
  • Condition on a matching criterion with a fixed number of levels (typically total test score, possibly collapsed into bins)
  • Estimate the factor by which the odds of getting the item correct change for the focal group relative to the reference group, after controlling for the matching criterion

21.1.1 Terminology

  • Focal group: The group we are concerned about experiencing bias (in our example, females)
  • Reference group: The group that might be advantaged by the way items are written (in our example, males)

21.2 Creating an Ability Grouping Variable

The MH procedure requires a discrete matching criterion. We use total test score, collapsed into bins to ensure adequate sample sizes at each level.

Code
totscore <- apply(cde[, 1:31], 1, sum)

Let’s examine the total score distribution.

Code
hist(totscore, breaks = 0:31, freq = FALSE,
     main = "Total Score Distribution",
     xlab = "Total Score", col = "lightblue", border = "white")

We collapse scores into decile bins to keep sample sizes reasonably large at each level.

Code
x <- quantile(totscore, probs = seq(0, 1, 0.10))
diftot <- numeric(nrow(cde))
# Parentheses matter here: 1:length(x) - 1 would evaluate to 0:10
for (i in 1:(length(x) - 1)) {
  diftot[totscore >= x[i] & totscore < x[i + 1]] <- i
}
diftot[totscore == max(totscore)] <- length(x) - 1  # put the top score in bin 10

table(diftot)
diftot
  1   2   3   4   5   6   7   8   9  10 
119 153 139 160 143 184  99 173 136 194 

21.3 Computing the MH Statistic

The MH statistic estimates a common odds ratio across all levels of the matching criterion. Under the null hypothesis, the odds of getting the item correct for the focal group are the same as for the reference group at every ability level. If the null is true, the odds ratio should equal 1.

  • MH odds ratio > 1: The reference group (males) is more likely to answer correctly than comparable members of the focal group
  • MH odds ratio < 1: The focal group (females) is more likely to answer correctly than comparable members of the reference group
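Formally, let \(A_k\) and \(B_k\) denote the numbers of correct and incorrect responses in the reference group at level \(k\) of the matching criterion, \(C_k\) and \(D_k\) the corresponding counts for the focal group, and \(T_k\) the total number of examinees at level \(k\). The MH estimate of the common odds ratio is

\[\hat{\alpha}_{MH} = \frac{\sum_k A_k D_k / T_k}{\sum_k B_k C_k / T_k}\]

which exceeds 1 when the reference group holds the advantage after matching.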

Let’s compute the MH statistic for a single item first.

Code
mh <- mantelhaen.test(cde$V1, y = cde$gender, z = diftot,
                      alternative = "two.sided", conf.level = 0.95)
mh

    Mantel-Haenszel chi-squared test with continuity correction

data:  cde$V1 and cde$gender and diftot
Mantel-Haenszel X-squared = 2.0366, df = 1, p-value = 0.1535
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
 0.6491584 1.0618187
sample estimates:
common odds ratio 
         0.830234 

Here the MH odds ratio is 0.83. Because the ratio compares the odds of success for males to the odds for females, a value less than 1 means females are more likely to answer this item correctly than comparable males; here, however, the result is not statistically significant (p = 0.15).

Now we compute the MH statistic for all 31 items.

Code
MH <- vector("list", 31)
mh_or <- numeric(31)
mh_pval <- numeric(31)

for (i in 1:31) {
  temp <- mantelhaen.test(cde[, i], y = cde$gender, z = diftot,
                          alternative = "two.sided", conf.level = 0.95)
  MH[[i]] <- temp
  mh_or[i] <- temp$estimate
  mh_pval[i] <- temp$p.value
}
Code
hist(mh_or, breaks = 15, freq = FALSE,
     main = "Distribution of MH Odds Ratios",
     xlab = "MH Odds Ratio", col = "lightblue", border = "white")
abline(v = 1, col = "blue", lwd = 2)

21.4 The ETS Delta Statistic

The MH odds ratio is asymmetric around 1 (values favoring the focal group lie between 0 and 1, while values favoring the reference group range from 1 to infinity). To produce a more interpretable, symmetric measure, Educational Testing Service (ETS) uses the delta transformation:

\[\Delta_{MH} = -2.35 \ln(\hat{\alpha}_{MH})\]

The delta statistic is interpreted as follows:

  • Negative values: The reference group (males) found the item easier than the focal group (females) of comparable ability
  • Positive values: The focal group (females) found the item easier than the reference group (males) of comparable ability
Code
delta <- -2.35 * log(mh_or)

21.5 ETS Classification Rules

ETS uses a three-level classification system for the practical significance of DIF (simplified here to the magnitude thresholds; the operational ETS rules also incorporate tests of statistical significance):

  • Level A: |delta| ≤ 1. Negligible DIF; items are considered acceptable for use.
  • Level B: 1 < |delta| < 1.5. Moderate DIF; items can be used with caution.
  • Level C: |delta| ≥ 1.5. Large DIF; items should not be used unless content experts can justify them.
Code
level_A <- sum(abs(delta) <= 1)
level_B <- sum(abs(delta) > 1 & abs(delta) < 1.5)
level_C <- sum(abs(delta) >= 1.5)

cat("Level A:", level_A, "items\n")
Level A: 28 items
Code
cat("Level B:", level_B, "items\n")
Level B: 3 items
Code
cat("Level C:", level_C, "items\n")
Level C: 0 items

Let’s create a summary table with the MH results and ETS classification for all items.

Code
mh_results <- data.frame(
  Item = 1:31,
  OR = round(mh_or, 3),
  Delta = round(delta, 3),
  p_value = round(mh_pval, 4),
  Sig = ifelse(mh_pval < 0.05, "*", ""),
  Level = ifelse(abs(delta) >= 1.5, "C",
           ifelse(abs(delta) > 1, "B", "A"))
)

mh_results |>
  kable(col.names = c("Item", "MH OR", "Delta", "p-value", "Sig (.05)", "ETS Level")) |>
  kable_styling(full_width = FALSE) |>
  row_spec(which(mh_results$Level != "A"), bold = TRUE)
Item MH OR Delta p-value Sig (.05) ETS Level
1 0.830 0.437 0.1535 A
2 0.957 0.104 0.7653 A
3 0.735 0.724 0.0192 * A
4 1.134 -0.296 0.3688 A
5 0.767 0.624 0.0275 * A
6 0.963 0.087 0.7895 A
7 1.142 -0.312 0.2685 A
8 1.242 -0.509 0.0681 A
9 1.171 -0.370 0.3177 A
10 0.803 0.516 0.0875 A
11 1.159 -0.347 0.3446 A
12 0.646 1.027 0.0049 * B
13 0.757 0.654 0.0588 A
14 0.808 0.501 0.0687 A
15 1.276 -0.573 0.2517 A
16 0.820 0.468 0.3329 A
17 1.069 -0.157 0.6368 A
18 1.016 -0.037 0.9380 A
19 1.035 -0.081 0.8828 A
20 1.855 -1.452 0.0000 * B
21 1.692 -1.236 0.0000 * B
22 0.938 0.150 0.6411 A
23 0.811 0.491 0.0588 A
24 0.985 0.035 0.9432 A
25 0.958 0.101 0.7821 A
26 1.102 -0.229 0.5057 A
27 0.698 0.846 0.0032 * A
28 1.257 -0.538 0.0462 * A
29 1.437 -0.851 0.0126 * A
30 0.930 0.171 0.5910 A
31 1.269 -0.559 0.0544 A

Notice that several items (e.g., 3, 5, 27, 28, 29) are statistically significant but not practically significant (Level A): with roughly 1,500 examinees, even negligible DIF can reach significance. The reverse can also occur in smaller samples, where a substantial delta fails to reach significance. The MH approach therefore emphasizes both criteria for flagging items.

Also pay attention to the direction of DIF across flagged items. If all flagged items disadvantage the same group, the DIF may amplify at the test level. If the direction is mixed, DIF effects may partially cancel out.
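A quick way to check the direction is to list the flagged items with their deltas, using the mh_results table built above:

Code
subset(mh_results, Level != "A", select = c(Item, Delta, Level))

Item 12 has a positive delta (favoring females), while items 20 and 21 have negative deltas (favoring males), so the flagged DIF is mixed in direction and should partially cancel at the test level.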

21.6 Limitations of the MH Approach

The MH procedure has several notable limitations:

  • It uses total test score as the matching criterion, which may not be ideal if the Rasch model does not hold (i.e., items differ in discrimination)
  • It requires collapsing the matching criterion into discrete bins, which may lose information
  • It is primarily designed to detect uniform DIF and has limited power for non-uniform DIF
  • It does not naturally incorporate an iterative purification process to account for DIF in the matching criterion itself

22 Logistic Regression and DIF

Logistic regression provides a flexible bridge between the MH contingency table approach and IRT-based methods. As Swaminathan and Rogers (1990) showed, the MH procedure can be understood as a special case of logistic regression where the matching criterion is treated as a categorical variable. Moving to logistic regression gains us several advantages: the matching criterion can be treated as continuous, and we can explicitly test for both uniform and non-uniform DIF.
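To make the correspondence concrete, here is a minimal sketch for item 1 that reuses the decile bins from the MH analysis as a categorical matching criterion (dat, m0, and m1 are illustrative names; the conditional odds ratio should land close to the MH estimate of 0.83):

Code
dat <- data.frame(y = cde[, 1], bin = factor(diftot), gender = cde$gender)
m0 <- glm(y ~ bin, family = binomial, data = dat)           # matching only
m1 <- glm(y ~ bin + gender, family = binomial, data = dat)  # add group effect
anova(m0, m1, test = "Chisq")   # plays the role of the MH chi-square test
exp(coef(m1)["genderM"])        # conditional odds ratio (M vs. F)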

22.1 Three Nested Models

For each item, we fit three nested logistic regression models:

  • Model 1: \(\text{logit}[P(Y = 1)] = \beta_0 + \beta_1 \cdot \text{ability}\)
  • Model 2: \(\text{logit}[P(Y = 1)] = \beta_0 + \beta_1 \cdot \text{ability} + \beta_2 \cdot \text{group}\)
  • Model 3: \(\text{logit}[P(Y = 1)] = \beta_0 + \beta_1 \cdot \text{ability} + \beta_2 \cdot \text{group} + \beta_3 \cdot \text{ability} \times \text{group}\)

Comparing these models lets us test for:

  • Uniform DIF: Compare Model 1 vs. Model 2 (1 df). A significant \(\beta_2\) means the group variable shifts the intercept after controlling for ability.
  • Non-uniform DIF: Compare Model 2 vs. Model 3 (1 df). A significant \(\beta_3\) means the group-by-ability interaction changes the slope—the DIF effect varies across the ability scale.
  • Overall DIF: Compare Model 1 vs. Model 3 (2 df). Tests for any type of DIF.

22.2 Demonstration

Let’s apply this to one item we flagged with MH (item 20) and one that appeared DIF-free (item 1).

Code
# Use continuous total score as matching criterion
totscore <- apply(cde[, 1:31], 1, sum)
gender <- cde$gender  # "F" = Female, "M" = Male

# --- Item 20 (flagged by MH) ---
m1_20 <- glm(cde$V20 ~ totscore, family = binomial)
m2_20 <- glm(cde$V20 ~ totscore + gender, family = binomial)
m3_20 <- glm(cde$V20 ~ totscore * gender, family = binomial)

cat("=== Item 20: Uniform DIF test (Model 1 vs 2) ===\n")
=== Item 20: Uniform DIF test (Model 1 vs 2) ===
Code
anova(m1_20, m2_20, test = "Chisq")
Analysis of Deviance Table

Model 1: cde$V20 ~ totscore
Model 2: cde$V20 ~ totscore + gender
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1      1498     1269.3                          
2      1497     1250.8  1   18.488 1.709e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
cat("=== Item 20: Non-uniform DIF test (Model 2 vs 3) ===\n")
=== Item 20: Non-uniform DIF test (Model 2 vs 3) ===
Code
anova(m2_20, m3_20, test = "Chisq")
Analysis of Deviance Table

Model 1: cde$V20 ~ totscore + gender
Model 2: cde$V20 ~ totscore * gender
  Resid. Df Resid. Dev Df  Deviance Pr(>Chi)
1      1497     1250.8                      
2      1496     1250.8  1 0.0020085   0.9643

For item 20, we see a significant group effect (uniform DIF), consistent with the MH results.

Code
# --- Item 1 (not flagged by MH) ---
m1_01 <- glm(cde$V1 ~ totscore, family = binomial)
m2_01 <- glm(cde$V1 ~ totscore + gender, family = binomial)

cat("=== Item 1: Uniform DIF test (Model 1 vs 2) ===\n")
=== Item 1: Uniform DIF test (Model 1 vs 2) ===
Code
anova(m1_01, m2_01, test = "Chisq")
Analysis of Deviance Table

Model 1: cde$V1 ~ totscore
Model 2: cde$V1 ~ totscore + gender
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1      1498     1525.6                     
2      1497     1523.7  1   1.9528   0.1623

For item 1, there is no significant group effect, consistent with its Level A classification from the MH analysis.

22.3 Connection to IRT-Based Approaches

The logistic regression framework is powerful because it can be extended in an important way: instead of using the observed total score as the matching criterion, we can substitute an IRT-based trait estimate (\(\hat{\theta}\)). This is exactly what the lordif algorithm does, and it has several advantages:

  • IRT \(\hat{\theta}\) accounts for differences in item discrimination, providing a better matching criterion
  • IRT \(\hat{\theta}\) is on a continuous, interval-like scale
  • The iterative purification of \(\hat{\theta}\) (re-estimating ability after accounting for DIF items) yields a cleaner matching criterion
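As a preview, here is a minimal sketch (assuming the mirt package) that swaps an EAP \(\hat{\theta}\) in for the total score in the uniform DIF test for item 20; lordif layers purification and scale linking on top of this basic idea.

Code
library(mirt)
fit2pl <- mirt(cde[, 1:31], 1, itemtype = "2PL", verbose = FALSE)
theta_hat <- fscores(fit2pl, method = "EAP")[, 1]   # EAP trait estimates

m1_irt <- glm(cde$V20 ~ theta_hat, family = binomial)
m2_irt <- glm(cde$V20 ~ theta_hat + cde$gender, family = binomial)
anova(m1_irt, m2_irt, test = "Chisq")  # uniform DIF test, IRT matching criterion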

23 IRT-Based Approaches to DIF

23.1 The Scale Indeterminacy Problem

Before we can compare ICCs across groups to look for DIF, we need to address a critical issue that connects directly to test equating—a topic we covered last week.

Suppose we calibrate a set of items separately for two groups and find that the ICCs look different. Can we conclude the items have DIF? Not necessarily. Recall that when we calibrate IRT models separately, the resulting scales are indeterminate—each group’s parameters are on an arbitrary metric defined by the constraints imposed during estimation (typically, mean \(\theta\) = 0, SD = 1 within each group).

If the two groups truly differ in their ability distributions (item impact), calibrating separately will absorb this difference into the item parameters, making it look like DIF when there is none.

23.1.1 An Illustration

Imagine two groups take the same set of items. Group 1 has mean \(\theta\) = 0, Group 2 has mean \(\theta\) = 0.5. If we calibrate separately:

  • Group 1's item parameters are estimated on a scale where Group 1's mean \(\theta\) is set to 0
  • Group 2's item parameters are estimated on a scale where Group 2's mean \(\theta\) is set to 0

The true 0.5-unit difference in group means gets absorbed into the item difficulty estimates, making it appear that items are harder for Group 1—but this is just an artifact of scale indeterminacy.
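A small simulation sketch with made-up parameters makes the artifact visible (it assumes the mirt package; b_true and sim_resp are illustrative names):

Code
library(mirt)
set.seed(1)
n <- 1000
b_true <- seq(-1.5, 1.5, length.out = 10)   # true Rasch item difficulties
sim_resp <- function(th) {
  p <- plogis(outer(th, b_true, "-"))       # P(correct) = logistic(theta - b)
  (matrix(runif(length(p)), nrow(p)) < p) * 1
}
fit1 <- mirt(as.data.frame(sim_resp(rnorm(n, 0.0, 1))), 1,
             itemtype = "Rasch", verbose = FALSE)
fit2 <- mirt(as.data.frame(sim_resp(rnorm(n, 0.5, 1))), 1,
             itemtype = "Rasch", verbose = FALSE)
b1 <- coef(fit1, IRTpars = TRUE, simplify = TRUE)$items[, "b"]
b2 <- coef(fit2, IRTpars = TRUE, simplify = TRUE)$items[, "b"]
round(mean(b2 - b1), 2)   # approximately -0.5

Both calibrations recover the within-group ordering of item difficulties, but Group 2's estimates are shifted down by about 0.5: the true difference in mean ability, misattributed to the items.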

23.2 The lordif Algorithm

The lordif package (Choi, Gibbons, & Crane, 2011) implements an iterative hybrid approach that combines ordinal logistic regression with IRT. The algorithm proceeds through 11 steps:

  1. Data preparation: Check for sparse cells in the cross-tabulation of items by group; collapse response categories as needed (controlled by minCell parameter).

  2. IRT calibration: Fit the Graded Response Model (GRM) to the full dataset, ignoring group distinctions, to obtain a single set of item parameters. (The lordif package now uses mirt internally for this step.)

  3. Trait estimation: Obtain \(\hat{\theta}\) for each examinee using the Expected A Posteriori (EAP) estimator.

  4. Logistic regression: For each item, fit the three nested logistic regression models (Models 1, 2, and 3) using the IRT-based \(\hat{\theta}\) as the matching criterion. Compute likelihood ratio \(\chi^2\) statistics, pseudo \(R^2\) measures, and the proportional change in \(\beta_1\) (a small sketch of these effect-size measures follows this list).

  5. DIF detection: Flag items based on the user-specified criterion (typically the \(\chi^2\) test at \(\alpha\) = 0.01).

  6. Sparse matrix construction: For each flagged item, split the response vector into group-specific “pseudo-items” (e.g., responses for females with males set to missing, and vice versa). Non-DIF items remain intact.

  7. IRT recalibration: Refit the GRM to the sparse data matrix, yielding common item parameters for non-DIF items and group-specific parameters for DIF items.

  8. Scale transformation: Use the Stocking-Lord equating procedure (with non-DIF items as anchors) to place the group-specific parameters on a common scale.

  9. Trait re-estimation: Obtain new EAP \(\hat{\theta}\) estimates using the common parameters for non-DIF items and group-specific parameters for DIF items.

  10. Iterative cycle: Repeat steps 4–9 until the same set of items is flagged on consecutive iterations or a maximum number of iterations is reached.

  11. Monte Carlo simulation (optional): Generate datasets under the null hypothesis of no DIF (preserving observed group differences in \(\hat{\theta}\)) to obtain empirical distributions of the test statistics. This provides data-driven thresholds for flagging items rather than relying on fixed cutoffs.
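For concreteness, here is a sketch of the step-4 effect-size measures for a single dichotomous item, using McFadden's pseudo \(R^2\) (lordif offers several pseudo-\(R^2\) variants). pseudoR2_beta is a hypothetical helper, and the total score stands in for \(\hat{\theta}\) purely for illustration:

Code
pseudoR2_beta <- function(y, theta, group) {
  m0 <- glm(y ~ 1, family = binomial)               # null model
  m1 <- glm(y ~ theta, family = binomial)           # Model 1
  m2 <- glm(y ~ theta + group, family = binomial)   # Model 2
  m3 <- glm(y ~ theta * group, family = binomial)   # Model 3
  R2 <- function(m) as.numeric(1 - logLik(m) / logLik(m0))  # McFadden
  c(R2_uniform = R2(m2) - R2(m1),   # magnitude of uniform DIF
    R2_total   = R2(m3) - R2(m1),   # magnitude of total DIF
    beta1_change = unname(abs(coef(m2)["theta"] - coef(m1)["theta"]) /
                            abs(coef(m1)["theta"])))
}
pseudoR2_beta(cde$V20, totscore, cde$gender)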


23.3 Running lordif on the CDE Data

Code
library(lordif)

response <- cde[, 1:31]
gender_num <- as.numeric(as.factor(cde$gender))
# as.factor converts "F"/"M" to factor levels; as.numeric makes Female = 1, Male = 2

Now we run the lordif algorithm using the likelihood ratio \(\chi^2\) test as the detection criterion at \(\alpha\) = 0.01.

Code
dif_result <- lordif(response, gender_num,
                     criterion = "Chisqr", alpha = 0.01, minCell = 5)

23.3.1 Examining the Results

Code
print(dif_result)
Call:
lordif(resp.data = response, group = gender_num, criterion = "Chisqr", 
    alpha = 0.01, minCell = 5)

  Number of DIF groups: 2 

  Number of items flagged for DIF: 5 of 31 

  Items flagged: 12, 20, 21, 26, 29 

  Number of iterations for purification: 4 of 10 

  Detection criterion: Chisqr 

  Threshold: alpha = 0.01 

   item ncat  chi12  chi13  chi23
1     1    2 0.2838 0.5141 0.6694
2     2    2 0.9170 0.4769 0.2253
3     3    2 0.0699 0.0511 0.1026
4     4    2 0.1772 0.1587 0.1726
5     5    2 0.0718 0.1257 0.3412
6     6    2 0.8796 0.6544 0.3637
7     7    2 0.0810 0.1984 0.6618
8     8    2 0.0155 0.0150 0.1104
9     9    2 0.1001 0.2066 0.5026
10   10    2 0.1944 0.2763 0.3458
11   11    2 0.1868 0.0650 0.0536
12   12    2 0.0066 0.0111 0.2016
13   13    2 0.1246 0.2697 0.6086
14   14    2 0.1458 0.3449 0.9085
15   15    2 0.1768 0.2729 0.3793
16   16    2 0.3833 0.5551 0.5185
17   17    2 0.3642 0.2461 0.1594
18   18    2 0.5205 0.7408 0.6653
19   19    2 0.6703 0.7481 0.5274
20   20    2 0.0000 0.0000 0.8034
21   21    2 0.0000 0.0000 0.1280
22   22    2 0.9021 0.2769 0.1101
23   23    2 0.1354 0.2973 0.6573
24   24    2 0.8474 0.3682 0.1614
25   25    2 0.7724 0.8222 0.5790
26   26    2 0.1264 0.0002 0.0001
27   27    2 0.0200 0.0516 0.4699
28   28    2 0.0207 0.0185 0.1053
29   29    2 0.0016 0.0022 0.1365
30   30    2 0.9727 0.9640 0.7881
31   31    2 0.0118 0.0329 0.4832

The print() output shows which items were flagged for DIF, along with the p-values for the three likelihood ratio \(\chi^2\) comparisons for each item (chi12: Model 1 vs. 2, chi13: Model 1 vs. 3, chi23: Model 2 vs. 3). Let's examine the flagged items more carefully.

Code
flagged_items <- dif_result$flag
cat("Items flagged for DIF:", which(flagged_items), "\n")
Items flagged for DIF: 12 20 21 26 29 
Code
cat("Number of iterations:", dif_result$iteration, "\n")
Number of iterations: 4 

23.3.2 Diagnostic Plots

The plot() function in lordif generates several types of diagnostic visualizations. First, it shows the trait distributions for the two groups (based on the final purified \(\hat{\theta}\) estimates). Then, for each flagged item, it produces four diagnostic panels:

  • Top left: Item true score functions by group (based on group-specific item parameters)
  • Top right: Absolute difference between the group true score functions
  • Bottom left: Item response functions by group (showing slope and threshold estimates)
  • Bottom right: Impact weighted by the focal group density (showing where DIF matters most given the actual distribution of examinees)
Code
plot(dif_result, labels = c("Female", "Male"), cex = 0.7)

23.3.3 Interpreting the Diagnostic Plots

When examining the plots for each flagged item, pay attention to:

  • Direction of DIF: Which group is advantaged? Does the advantage hold across all ability levels (uniform) or does it reverse (non-uniform)?
  • Magnitude: How large is the difference between the group-specific ICCs? Small differences may be statistically significant but practically negligible.
  • Impact: The density-weighted plot (bottom right) shows where DIF actually matters given the population of examinees. An item might show a large ICC difference at extreme ability levels where few examinees exist, resulting in minimal practical impact.
  • The \(\chi^2\) statistics: The three tests (uniform, non-uniform, overall) printed on the top-left plot help classify the type of DIF.
  • The pseudo \(R^2\) values: These measure the magnitude of DIF. Values below 0.01 are generally considered negligible.

23.3.4 Extracting Empirical Test- and Individual-Level Impact

In addition to the plots, we can pull a few numerical summaries directly from the lordif object to quantify the impact of DIF on the \(\hat{\theta}\) estimates. Two components are particularly useful:

  • dif_result$calib$theta — the initial trait estimates, computed as if there were no DIF (common parameters for all items across groups).
  • dif_result$calib.sparse$theta — the purified trait estimates, computed with group-specific parameters for the flagged items.

Comparing these two sets of estimates tells us how much the individual (and group-mean) ability estimates change once DIF is accounted for.

Code
theta_init <- dif_result$calib$theta          # ignoring DIF
theta_pur  <- dif_result$calib.sparse$theta    # accounting for DIF
se_pur     <- dif_result$calib.sparse$SE

m_init_F <- mean(theta_init[gender_num == 1])
m_init_M <- mean(theta_init[gender_num == 2])
m_pur_F  <- mean(theta_pur[gender_num == 1])
m_pur_M  <- mean(theta_pur[gender_num == 2])

cat("Initial theta  (no DIF adjustment):  F =", round(m_init_F, 3),
    " M =", round(m_init_M, 3),
    " M - F =", round(m_init_M - m_init_F, 3), "\n")
Initial theta  (no DIF adjustment):  F = -0.068  M = 0.069  M - F = 0.137 
Code
cat("Purified theta (DIF adjusted):       F =", round(m_pur_F, 3),
    " M =", round(m_pur_M, 3),
    " M - F =", round(m_pur_M - m_pur_F, 3), "\n")
Purified theta (DIF adjusted):       F = -0.043  M = 0.043  M - F = 0.087 
Code
# The initial single-group calibration anchors the theta metric to
# population mean 0 and SD 1, so these mean differences are already
# in population-SD units

shift <- theta_pur - theta_init
cat("Shift (purified - initial)\n")
Shift (purified - initial)
Code
cat("  Mean for F:", round(mean(shift[gender_num == 1]), 3),
    "  Mean for M:", round(mean(shift[gender_num == 2]), 3), "\n")
  Mean for F: 0.025   Mean for M: -0.025 
Code
cat("  |shift| quantiles (50/75/95%):",
    round(quantile(abs(shift), c(0.5, 0.75, 0.95)), 3), "\n")
  |shift| quantiles (50/75/95%): 0.022 0.045 0.079 
Code
cat("  Max |shift|:", round(max(abs(shift)), 3), "logits\n")
  Max |shift|: 0.139 logits
Code
cat("  Mean CSEM (purified theta):", round(mean(se_pur), 3), "logits\n")
  Mean CSEM (purified theta): 0.381 logits

23.3.5 Test-Level Impact

The test-level plots show the effect of DIF on the test characteristic curves (TCCs), and the group-mean comparison above quantifies the same thing numerically. The difference between the initial \(\bar{\theta}_M - \bar{\theta}_F\) and the purified \(\bar{\theta}_M - \bar{\theta}_F\) tells us how much of the apparent group gap was driven by DIF items rather than by true ability differences. To evaluate the practical significance of DIF for aggregate group comparisons, we express this difference in population-SD units. Because the initial single-group calibration identifies the latent metric with population mean 0 and SD 1, the raw mean \(\hat{\theta}\) difference is already on this scale—no sample-based rescaling is needed (in fact, a sample-based denominator would be inappropriate, since EAP estimates shrink toward the prior and the pooled sample SD is systematically less than 1).

  • If DIF operates in opposite directions across items, the effects cancel at the test level: the initial and purified group-mean differences are essentially the same and the TCCs are nearly coincident.
  • If DIF is largely one-directional, the effects amplify: the purified difference is noticeably smaller (or larger) than the naive difference, and the TCCs clearly separate.

In the CDE data, the male advantage is about 0.14 SD units under the naive calibration and about 0.09 SD units after purification. A reduction of roughly 0.05 SD units—approximately 37% of the apparent gap—is attributable to DIF items rather than to true ability differences. This is partial amplification: not the full cancellation we see with balanced mixed-direction DIF, but not a wash either.
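The arithmetic behind that claim, using the group means computed above:

Code
gap_init <- m_init_M - m_init_F    # naive gap, about 0.137
gap_pur  <- m_pur_M - m_pur_F      # purified gap, about 0.087
round((gap_init - gap_pur) / gap_init, 2)  # share of the gap that is DIF-induced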

23.3.6 Individual-Level Impact

The individual-level impact plot shows the initial \(\hat{\theta}\) on the x-axis and the shift (purified \(-\) initial) on the y-axis, in logits, with a boxplot summarizing the distribution. To judge the practical significance of DIF for individual scores, the relevant denominator is the conditional standard error of measurement (CSEM). The CSEM already reflects the measurement uncertainty an individual estimate carries, so a DIF-induced shift much smaller than the CSEM is unlikely to change any single person’s classification.

For the CDE data, the median absolute shift is about 0.02 logits, the 95th percentile is 0.08, and the largest single shift is about 0.14 logits. The mean CSEM of the purified \(\hat{\theta}\) is about 0.38 logits. Typical shifts are a small fraction of a CSEM and therefore negligible at the individual level.

There is, however, an important caveat. Even when per-person shifts are tiny, their direction can be systematic by group. In the CDE data, the mean shift is about -0.025 logits for males and +0.025 logits for females: small, but pointing in opposite directions. Shifts that are uncorrelated across people would wash out in a group-mean comparison; shifts that all point the same way within a group do not. This is the mechanism through which small per-person shifts translate into the nontrivial group-level bias we saw above.


23.4 Are the Differences Practically Significant?

Two benchmarks are relevant, and they answer different questions:

  • Population SD of \(\theta\) (= 1 by the identification constraint of the initial calibration) — the right denominator for group-level comparisons. The change in the mean \(\hat{\theta}\) difference from initial to purified estimates, read directly in SD units, expresses how much of the apparent group gap is DIF-induced. This is the benchmark that matters for the validity of inferences about group differences.
  • CSEM — the right denominator for individual-level interpretation. It reminds us that every individual estimate already carries substantial measurement error, so a shift well below the CSEM has little consequence for any one person’s classification.

These two benchmarks can disagree in a way that is important to understand. When DIF cancels, both benchmarks give the same answer: effects are small. When DIF amplifies, the CSEM benchmark can make the effect look trivial (per-person shifts remain well below the SEM) while the population-SD benchmark correctly registers the systematic group-level bias.

In other words, if individual \(|\)shifts\(|\) are small but the group-mean difference changes appreciably from initial to purified, DIF is producing a directional bias at the group level: individuals move only a little, but they move in the same direction within each group, and those small shifts accumulate in group-mean comparisons. That is exactly the scenario in which a test can look measurement-invariant at the individual level while still misleading group-level inferences, and it is the validity concern that typically matters most.


24 Comparing MH and lordif Results

Code
mh_sig_items <- which(mh_pval < 0.05 & abs(delta) > 1)
lordif_items <- which(dif_result$flag)

cat("MH flagged items (sig + Level B/C):", sort(mh_sig_items), "\n")
MH flagged items (sig + Level B/C): 12 20 21 
Code
cat("lordif flagged items:              ", sort(lordif_items), "\n")
lordif flagged items:               12 20 21 26 29 

The two methods may not flag exactly the same items, for several reasons:

  • Different matching criteria: MH uses observed total score; lordif uses IRT-based \(\hat{\theta}\)
  • Iterative purification: lordif iteratively removes the influence of DIF items from the matching criterion; MH does not
  • Sensitivity to DIF type: MH is primarily designed for uniform DIF; lordif can detect both uniform and non-uniform DIF
  • Scale linking: lordif addresses scale indeterminacy through the equating step; MH does not

In general, items flagged by both methods are the strongest candidates for further review. Items flagged by only one method warrant closer examination to understand why.


25 Summary

  • Impact, bias, and DIF are distinct concepts. DIF is a necessary but not sufficient condition for item bias.
  • The Mantel-Haenszel procedure is a widely used, computationally simple method for detecting uniform DIF. It uses an observed score as the matching criterion and produces the ETS delta statistic for practical significance classification.
  • Logistic regression provides a flexible framework that can test for both uniform and non-uniform DIF by comparing nested models. It bridges the gap between contingency table methods and IRT-based approaches.
  • lordif is an IRT-based method that uses ordinal logistic regression with IRT trait scores as the matching criterion. Its iterative purification process and scale transformation steps address key limitations of simpler methods.
  • DIF detection depends on both statistical and practical significance. Items should be flagged only when both criteria are met.
  • Almost all DIF methods use an internal matching criterion, which assumes that at least some items are unbiased.
  • Finding DIF does not automatically mean an item is biased—substantive expert review is essential to determine whether the DIF reflects a genuine construct-irrelevant factor.

26 References

  • Choi, S., Gibbons, L., & Crane, P. (2011). lordif: An R Package for Detecting Differential Item Functioning Using Iterative Hybrid Ordinal Logistic Regression/Item Response Theory and Monte Carlo Simulations. Journal of Statistical Software, 39(8).
  • Clauser, B. & Mazor, K. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31–44.
  • Holland, P. & Thayer, D. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. Braun (Eds.), Test Validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum.
  • Swaminathan, H. & Rogers, H.J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370.
  • Zwick, R. (2025). Fairness in Educational Measurement. In L. Cook & M. Pitoniak (Eds.), Educational Measurement, 5th Edition.