16  IRT Equating using mirt

Author

Derek C. Briggs and Claude Code (Opus 4.6 & 4.7)

16.1 Introduction

Psychometrcians do test equating to ensure that scores from two different forms of a test are comparable, even if one happens to be harder than the other. You might need to do this when administering multiple parallel forms for security reasons (to prevent cheating), or when releasing items each year and replacing them with new ones. In either case, you want scores from one form to carry the same meaning as scores from another form, even though different students took different tests.


16.2 Using IRT to Equate Two Test Forms

The way this is typically done in practice when we can’t assume that two groups of students have equivalent ability distributions is by embedding some common items in the two test forms. If these items show adequate fit to an IRT model, then they have the property of parameter invariance, which means that beyond estimation error they should be the same—up to a linear transformation—regardless of the ability of the students used to calibrate them. We are going to take advantage of this property to use IRT methods to equate two test forms. There are three ways this can be accomplished:

  1. Calibrate each test form separately with an IRT model (“separate calibration”), then use the two sets of estimates for the common item parameters shared by the two forms to compute the linking constants needed to express theta values for Form 2 on the scale as theta values for Form 1.

  2. Calibrate the items on Form 1, then fix the common items on Form 2 to the values estimated for Form 1. This is known as “anchored calibration.”

  3. Calibrate all the items on both test forms concurrently (“concurrent calibration”) by treating the data as coming from two groups in a single administration, with data on form-specific unique items treated as missing at random.


16.3 Simulating a Non-Equivalent Group Anchor Test (NEAT) Design

Let’s simulate data from two 20-item test forms that have been collected using a NEAT design such that the first 10 items on each form are the same (anchor items) but the second 10 are unique. There are 20 unique items across the two forms and 10 common items, for 30 distinct items in total. We assume 500 students take each form, with students taking Form 1 having lower ability on average and more variability than students taking Form 2.

Our goal is to express the item and person parameters for form 2 on the scale of form 1. So the “from” scale is form 2 (right hand side of transformation equations that will follow) and the “to” scale (left hand side of transformation equations) is form 1.

16.3.1 A Disclaimer

I am taking a somewhat “lazy” approach in generating my true item parameters: I am not targeting any known item parameter distributions for an real world test. In practice, item difficulty and discrimination tend to be positively correlated, but I draw values for them independently. This could result in some difficulty and discrimination values for any given item that might be unusual in practice.

Also, the equating literature assumes that for two forms to be candidates for equating, they should have approximately the same reliability. But this is a condition that I’m not enforcing, and as you’ll see, I end up simulating two test forms that differ in reliability.

Finally, as I’ll show in a separate document, when simulating data from a relatively small number of anchor items (10) with a (relatively) small sample of students taking the items (500), we can get some rather noisy estimates our linking parameters for any given data set, so don’t be surprised if we don’t recover truth perfectly.

In other words, take the simulated data with a grain of salt; I’m using them for didactic purposes to demonstrate IRT equating using mirt.

Code
library(mirt)
library(dplyr)
library(kableExtra)

# Fixed seed for reproducibility
set.seed(17)

# Item difficulty parameters (IRT b-parameterization)
b1 <- rnorm(20)                        # Form 1: 20 difficulties
b2 <- c(b1[1:10], rnorm(10))          # Form 2: same anchor difficulties + 10 unique

# Item discrimination parameters (IRT a-parameterization)
a1 <- runif(20, .5, 2)
a2 <- c(a1[1:10], runif(10, .5, 2))  # Form 2: same anchor discriminations + 10 unique

# Convert to mirt slope-intercept parameterization
d1 <- -a1 * b1
d2 <- -a2 * b2

# Person ability parameters
th1 <- rnorm(500, 0, 1)       # Group 1: N(0, 1)
th2 <- rnorm(500, .5, .8)     # Group 2: N(0.5, 0.8) -- higher mean, less spread

Now we simulate two 500-by-20 matrices of dichotomous item responses using simdata.

Code
test1 <- as.data.frame(simdata(a = a1, d = d1, Theta = th1, itemtype = 'dich'))
test2 <- as.data.frame(simdata(a = a2, d = d2, Theta = th2, itemtype = 'dich'))

Let’s look at the descriptive statistics for these two simulated datasets.

Code
it1 <- itemstats(test1)
it2 <- itemstats(test2)
it1$overall
   N mean_total.score sd_total.score ave.r  sd.r alpha SEM.alpha
 500            8.414          4.518 0.183 0.069 0.817     1.931
Code
it2$overall
   N mean_total.score sd_total.score ave.r  sd.r alpha SEM.alpha
 500           11.444          3.616 0.104 0.063   0.7     1.982
Code
it1$itemstats
          N K  mean    sd total.r total.r_if_rm alpha_if_rm
Item_1  500 2 0.644 0.479   0.383         0.287       0.815
Item_2  500 2 0.516 0.500   0.541         0.456       0.806
Item_3  500 2 0.534 0.499   0.406         0.308       0.814
Item_4  500 2 0.654 0.476   0.419         0.326       0.812
Item_5  500 2 0.400 0.490   0.385         0.287       0.815
Item_6  500 2 0.554 0.498   0.551         0.467       0.805
Item_7  500 2 0.442 0.497   0.409         0.311       0.813
Item_8  500 2 0.182 0.386   0.428         0.354       0.811
Item_9  500 2 0.454 0.498   0.425         0.328       0.813
Item_10 500 2 0.400 0.490   0.535         0.450       0.806
Item_11 500 2 0.272 0.445   0.441         0.356       0.811
Item_12 500 2 0.316 0.465   0.610         0.539       0.802
Item_13 500 2 0.286 0.452   0.463         0.378       0.810
Item_14 500 2 0.476 0.500   0.606         0.529       0.801
Item_15 500 2 0.244 0.430   0.412         0.329       0.812
Item_16 500 2 0.494 0.500   0.526         0.439       0.806
Item_17 500 2 0.326 0.469   0.501         0.417       0.808
Item_18 500 2 0.506 0.500   0.363         0.261       0.816
Item_19 500 2 0.370 0.483   0.439         0.347       0.811
Item_20 500 2 0.344 0.476   0.619         0.548       0.801
Code
it2$itemstats
          N K  mean    sd total.r total.r_if_rm alpha_if_rm
Item_1  500 2 0.764 0.425   0.256         0.142       0.700
Item_2  500 2 0.674 0.469   0.436         0.322       0.684
Item_3  500 2 0.670 0.471   0.392         0.274       0.688
Item_4  500 2 0.746 0.436   0.280         0.164       0.698
Item_5  500 2 0.478 0.500   0.397         0.271       0.689
Item_6  500 2 0.670 0.471   0.474         0.363       0.679
Item_7  500 2 0.464 0.499   0.302         0.169       0.699
Item_8  500 2 0.244 0.430   0.335         0.224       0.693
Item_9  500 2 0.516 0.500   0.365         0.237       0.692
Item_10 500 2 0.554 0.498   0.454         0.335       0.682
Item_11 500 2 0.502 0.500   0.495         0.379       0.677
Item_12 500 2 0.616 0.487   0.237         0.105       0.705
Item_13 500 2 0.760 0.428   0.356         0.246       0.691
Item_14 500 2 0.176 0.381   0.436         0.345       0.683
Item_15 500 2 0.760 0.428   0.357         0.248       0.691
Item_16 500 2 0.390 0.488   0.343         0.216       0.694
Item_17 500 2 0.642 0.480   0.497         0.387       0.677
Item_18 500 2 0.584 0.493   0.473         0.357       0.680
Item_19 500 2 0.528 0.500   0.545         0.437       0.671
Item_20 500 2 0.706 0.456   0.273         0.150       0.700

Some things to notice:

  • Form 1 is more reliable than Form 2
  • The mean total-correct score for Form 1 is lower than for Form 2
  • The SD of total-correct scores is higher for Form 1 than Form 2
  • Even though the first 10 items have the same true parameters on both forms, they look “harder” on Form 1 when examining classical item statistics

16.4 Calibrating the Items on Each Form Separately

Code
fit1 <- mirt(test1, 1, itemtype = "2PL", verbose = FALSE)
coef(fit1, IRTpars = TRUE, simplify = TRUE)
$items
            a      b g u
Item_1  0.719 -0.919 0 1
Item_2  1.381 -0.068 0 1
Item_3  0.822 -0.192 0 1
Item_4  0.939 -0.804 0 1
Item_5  0.723  0.625 0 1
Item_6  1.435 -0.215 0 1
Item_7  0.749  0.349 0 1
Item_8  1.137  1.633 0 1
Item_9  0.825  0.255 0 1
Item_10 1.353  0.399 0 1
Item_11 0.954  1.220 0 1
Item_12 1.951  0.632 0 1
Item_13 1.092  1.032 0 1
Item_14 1.833  0.074 0 1
Item_15 0.968  1.384 0 1
Item_16 1.303  0.020 0 1
Item_17 1.246  0.753 0 1
Item_18 0.632 -0.043 0 1
Item_19 0.907  0.687 0 1
Item_20 1.970  0.526 0 1

$means
F1 
 0 

$cov
   F1
F1  1
Code
fit2 <- mirt(test2, 1, itemtype = "2PL", verbose = FALSE)
coef(fit2, IRTpars = TRUE, simplify = TRUE)
$items
            a      b g u
Item_1  0.361 -3.344 0 1
Item_2  0.957 -0.902 0 1
Item_3  0.793 -1.013 0 1
Item_4  0.452 -2.489 0 1
Item_5  0.708  0.139 0 1
Item_6  1.213 -0.750 0 1
Item_7  0.408  0.367 0 1
Item_8  0.739  1.704 0 1
Item_9  0.564 -0.122 0 1
Item_10 0.996 -0.262 0 1
Item_11 1.173 -0.007 0 1
Item_12 0.245 -1.960 0 1
Item_13 0.754 -1.709 0 1
Item_14 1.458  1.438 0 1
Item_15 0.740 -1.735 0 1
Item_16 0.637  0.767 0 1
Item_17 1.258 -0.605 0 1
Item_18 1.019 -0.404 0 1
Item_19 1.434 -0.106 0 1
Item_20 0.359 -2.510 0 1

$means
F1 
 0 

$cov
   F1
F1  1
Code
# Extract item parameter estimates (col 1 = a, col 2 = b)
items1 <- coef(fit1, IRTpars = TRUE, simplify = TRUE)$items
items2 <- coef(fit2, IRTpars = TRUE, simplify = TRUE)$items

We know from the simulation that Group 1 has lower average ability than Group 2. But let’s look at what mirt estimates for the two population distributions.

Code
comp <- data.frame(
  Form    = c("Form 1 (Group 1)", "Form 2 (Group 2)"),
  Mean    = c(coef(fit1)$GroupPars[1], coef(fit2)$GroupPars[1]),
  SD      = c(sqrt(coef(fit1)$GroupPars[2]), sqrt(coef(fit2)$GroupPars[2]))
)
comp |>
  kable("html", digits = 3, col.names = c("Form", "Mean (θ)", "SD (θ)")) |>
  kable_styling(full_width = FALSE)
Form Mean (θ) SD (θ)
Form 1 (Group 1) 0 1
Form 2 (Group 2) 0 1

What happened? When mirt estimates item parameters it anchors the logit scale by fixing the mean and SD of the population distribution to 0 and 1. This is the actual mean and SD for the Group 1 sample—but it is not the mean and SD we generated for Group 2. The two sets of thetas appear to be on the same scale, but they are not. Let’s see this by comparing the estimated difficulty and discrimination parameters for the 10 anchor items.

Code
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

# Anchor item difficulties
plot(items1[1:10, 2], items2[1:10, 2],
     xlab = "b  (Form 1 calibration)", ylab = "b  (Form 2 calibration)",
     main = "Anchor Difficulty:\nSeparate Calibrations",
     pch = 16, col = "purple", xlim = c(-4, 2), ylim = c(-4, 2))
abline(a = 0, b = 1, lty = 2, col = "gray60")

# Anchor item discriminations
plot(items1[1:10, 1], items2[1:10, 1],
     xlab = "a  (Form 1 calibration)", ylab = "a  (Form 2 calibration)",
     main = "Anchor Discrimination:\nSeparate Calibrations",
     pch = 16, col = "purple", xlim = c(.2, 1.6), ylim = c(.2, 1.6))
abline(a = 0, b = 1, lty = 2, col = "gray60")

Code
par(mfrow = c(1, 1))

The two sets of estimates are strongly correlated—supporting parameter invariance—but systematically offset from the identity line. They are not on the same scale. The pattern here (a systematic shift rather than random scatter) is exactly what the IRT linking transformation is designed to correct.


16.5 Equating with Separate Calibration and Linking Constants

Having confirmed the scale mismatch, we can now estimate the linking constants \(A\) and \(B\) that map the Form 2 scale onto the Form 1 scale. Recall from the background reading that if we define \(\theta_1 = A \cdot \theta_2 + B\), then the anchor item parameters must satisfy:

\[b_1 = A \cdot b_2 + B \qquad \text{and} \qquad a_1 = \frac{a_2}{A}\]

In the separate IRT equating approach we can estimate \(A\) and \(B\) from the 10 anchor items using three different methods.

16.5.1 Mean-Mean Method

The mean-mean method estimates \(A\) from the ratio of mean discrimination parameters across the two calibrations, and \(B\) from the adjusted mean difference in difficulty parameters.

Code
# Anchor item parameters from each separate calibration (items 1-10)
a1_anc <- items1[1:10, 1];  b1_anc <- items1[1:10, 2]
a2_anc <- items2[1:10, 1];  b2_anc <- items2[1:10, 2]

A_mm <- mean(a2_anc) / mean(a1_anc)
B_mm <- mean(b1_anc) - A_mm * mean(b2_anc)

cat("Mean-Mean:  A =", round(A_mm, 4), "  B =", round(B_mm, 4), "\n")
Mean-Mean:  A = 0.7134   B = 0.5823 

16.5.2 Mean-Sigma Method

The mean-sigma method estimates \(A\) from the ratio of the standard deviations of the anchor difficulty parameters (which are typically estimated more stably than discriminations), and \(B\) from the mean difference in difficulties.

Code
A_ms <- sd(b1_anc) / sd(b2_anc)
B_ms <- mean(b1_anc) - A_ms * mean(b2_anc)

cat("Mean-Sigma: A =", round(A_ms, 4), "  B =", round(B_ms, 4), "\n")
Mean-Sigma: A = 0.5134   B = 0.4488 

We expect \(A < 1\) and \(B > 0\) here. Recall that Group 2 has higher mean ability (\(\mu = 0.5\)) but less variability (\(\sigma = 0.8\)) than Group 1 (\(\mu = 0\), \(\sigma = 1\)). When mirt separately calibrates Form 2, it forces the \(N(0.5, 0.8)\) ability distribution onto a \(N(0, 1)\) scale. This has two consequences. First, the \(\theta\) scale gets stretched—the true SD of 0.8 is expanded to 1—which inflates the spread of the \(b\)-parameter estimates and compresses the \(a\)-parameter estimates. Second, the origin gets shifted downward—the true mean of 0.5 is pulled to 0. The linking constant \(A < 1\) undoes the stretching, scaling Form 2’s parameters back to the narrower Form 1 metric, while \(B > 0\) undoes the shift, moving the zero-centered Form 2 scale upward to reflect Group 2’s higher mean. Notice, however, that the Mean-Mean and Mean-Sigma methods give noticeably different estimates of \(A\). This discrepancy is a warning sign that moment-based summaries of the anchor item parameters can be unstable.

16.5.3 Stocking-Lord Method

The Stocking-Lord method (Stocking & Lord, 1983) takes a different approach. Rather than summarizing the anchor parameters with means or standard deviations, it finds the \(A\) and \(B\) that minimize the squared difference between the anchor test characteristic curves (TCCs) evaluated on the two scales. In notation from Kolen and Brennan (2014, Eq. 6.15), the criterion is:

\[SL_{\text{crit}} = \sum_{k} \left[ \sum_{j \in V} p_j\!\left(\theta_k;\, \hat{a}_{1j},\, \hat{b}_{1j}\right) \;-\; \sum_{j \in V} p_j\!\left(\theta_k;\, \frac{\hat{a}_{2j}}{A},\, A\hat{b}_{2j} + B\right) \right]^2\]

where \(V\) denotes the set of anchor items and \(\theta_k\) ranges over a grid of ability values. The key advantage is that this criterion considers all item parameters simultaneously, so a single item with an extreme \(\hat{b}\) or \(\hat{a}\) estimate cannot dominate the way it can with moment-based methods. We implement this directly with optim, using the mean-mean constants as starting values.

Code
# 2PL item characteristic curve
icc_2pl <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))

# Test characteristic curve: sum of ICCs across items at each theta point
tcc_2pl <- function(theta, a, b) {
  sapply(theta, function(th) sum(icc_2pl(th, a, b)))
}

# Stocking-Lord criterion function
stocking_lord <- function(a1, b1, a2, b2,
                          theta = seq(-4, 4, by = 0.1),
                          start_A = NULL, start_B = NULL) {
  # Compute target TCC once (Form 1 anchor items)
  tcc_target <- tcc_2pl(theta, a1, b1)

  # Criterion: sum of squared TCC differences
  sl_criterion <- function(par) {
    A <- par[1]; B <- par[2]
    tcc_source <- tcc_2pl(theta, a2 / A, A * b2 + B)
    sum((tcc_target - tcc_source)^2)
  }

  # Starting values from mean-mean if not supplied
  if (is.null(start_A)) start_A <- mean(a2) / mean(a1)
  if (is.null(start_B)) start_B <- mean(b1) - start_A * mean(b2)

  result <- optim(c(start_A, start_B), sl_criterion, method = "Nelder-Mead")
  list(A = result$par[1], B = result$par[2],
       convergence = result$convergence, criterion = result$value)
}

sl <- stocking_lord(a1_anc, b1_anc, a2_anc, b2_anc)
cat("Stocking-Lord: A =", round(sl$A, 4), "  B =", round(sl$B, 4), "\n")
Stocking-Lord: A = 0.6694   B = 0.4933 

The Stocking-Lord \(A\) and \(B\) fall between the mean-mean and mean-sigma estimates. Because the optimization considers the full shape of each anchor item’s characteristic curve, it is less sensitive to any single extreme parameter estimate.

16.5.4 Applying the Transformation

We apply the linking constants to transform Form 2 item parameters onto the Form 1 scale.

Code
# Extract Stocking-Lord constants
A_sl <- sl$A;  B_sl <- sl$B

# Transform Form 2 item parameters to Form 1 scale using each method
b2_mm <- A_mm * items2[, 2] + B_mm;  a2_mm <- items2[, 1] / A_mm
b2_ms <- A_ms * items2[, 2] + B_ms;  a2_ms <- items2[, 1] / A_ms
b2_sl <- A_sl * items2[, 2] + B_sl;  a2_sl <- items2[, 1] / A_sl

# Compare implied Group 2 population parameters to true generating values.
# On Form 2's mirt scale the population is N(0, 1) by construction.
# After the linear transformation theta_1 = A * theta_2 + B, the implied
# population mean = A * 0 + B = B and the implied population SD = A * 1 = A.
comp_sep <- data.frame(
  Version = c("Form 2 -- no transformation",
              "Form 2 -- Mean-Mean",
              "Form 2 -- Mean-Sigma",
              "Form 2 -- Stocking-Lord",
              "True generating value"),
  Mean_th = c(0.000, round(B_mm, 3), round(B_ms, 3), round(B_sl, 3), 0.500),
  SD_th   = c(1.000, round(A_mm, 3), round(A_ms, 3), round(A_sl, 3), 0.800)
)

comp_sep |>
  kable("html", col.names = c("Version", "Mean (θ)", "SD (θ)")) |>
  kable_styling(full_width = FALSE)
Version Mean (θ) SD (θ)
Form 2 -- no transformation 0.000 1.000
Form 2 -- Mean-Mean 0.582 0.713
Form 2 -- Mean-Sigma 0.449 0.513
Form 2 -- Stocking-Lord 0.493 0.669
True generating value 0.500 0.800

Without transformation, mirt’s identification constraint forces the Form 2 population to N(0, 1), masking the true group difference. After applying the linking constants, the implied population mean on the Form 1 scale is \(B\) and the implied population SD is \(A\).

The Stocking-Lord method recovers the population mean most accurately among the three separate calibration approaches. All three underestimate the population SD, but the mean-sigma method is the worst offender—its \(A\) depends on the ratio of standard deviations of the anchor difficulty estimates, and on the Form 2 scale a few easy anchor items have extreme \(\hat{b}\) values that inflate \(\text{sd}(\hat{b}_2)\), pulling \(A\) too low. The mean-mean and Stocking-Lord methods are less sensitive to these outliers, though neither fully recovers the true SD in this instance. This general pattern—moment-based methods being less stable than characteristic curve methods—is well documented (Kolen & Brennan, 2014, Section 6.3.4).

Let’s check how well the anchor item parameters align after each transformation.

Code
par(mfrow = c(3, 2), mar = c(4, 4, 3, 1))

# Mean-Mean
plot(items1[1:10, 2], b2_mm[1:10],
     xlab = "b  (Form 1)", ylab = "b  (Form 2, transformed)",
     main = "Anchor Difficulty:\nMean-Mean",
     pch = 16, col = "darkgreen", xlim = c(-3, 2), ylim = c(-3, 2))
abline(a = 0, b = 1, lty = 2, col = "gray60")

plot(items1[1:10, 1], a2_mm[1:10],
     xlab = "a  (Form 1)", ylab = "a  (Form 2, transformed)",
     main = "Anchor Discrimination:\nMean-Mean",
     pch = 16, col = "darkgreen", xlim = c(0.2, 2.2), ylim = c(0.2, 2.2))
abline(a = 0, b = 1, lty = 2, col = "gray60")

# Mean-Sigma
plot(items1[1:10, 2], b2_ms[1:10],
     xlab = "b  (Form 1)", ylab = "b  (Form 2, transformed)",
     main = "Anchor Difficulty:\nMean-Sigma",
     pch = 16, col = "purple", xlim = c(-3, 2), ylim = c(-3, 2))
abline(a = 0, b = 1, lty = 2, col = "gray60")

plot(items1[1:10, 1], a2_ms[1:10],
     xlab = "a  (Form 1)", ylab = "a  (Form 2, transformed)",
     main = "Anchor Discrimination:\nMean-Sigma",
     pch = 16, col = "purple", xlim = c(0.2, 2.2), ylim = c(0.2, 2.2))
abline(a = 0, b = 1, lty = 2, col = "gray60")

# Stocking-Lord
plot(items1[1:10, 2], b2_sl[1:10],
     xlab = "b  (Form 1)", ylab = "b  (Form 2, transformed)",
     main = "Anchor Difficulty:\nStocking-Lord",
     pch = 16, col = "firebrick", xlim = c(-3, 2), ylim = c(-3, 2))
abline(a = 0, b = 1, lty = 2, col = "gray60")

plot(items1[1:10, 1], a2_sl[1:10],
     xlab = "a  (Form 1)", ylab = "a  (Form 2, transformed)",
     main = "Anchor Discrimination:\nStocking-Lord",
     pch = 16, col = "firebrick", xlim = c(0.2, 2.2), ylim = c(0.2, 2.2))
abline(a = 0, b = 1, lty = 2, col = "gray60")

Code
par(mfrow = c(1, 1))

The mean-mean transformation (top row) and Stocking-Lord transformation (bottom row) both bring the anchor parameters reasonably close to the identity line. The mean-sigma difficulty plot (middle left) shows the transformed parameters compressed toward 0—a consequence of the underestimated \(A\) constant. Items that deviate substantially from the identity line after any transformation may be showing differential item functioning across the two groups. When the groups are defined by time (e.g. 2024 vs 2025), this is known as “item parameter drift.”

16.5.5 On Your Own

  1. Try transforming the Form 2 unique item parameters (items 11–20) using the Stocking-Lord constants and compare them to the true generating values (b2[11:20] and a2[11:20]). How well did the linking recover the truth?
  2. The mean-sigma method ignores discrimination parameters when estimating \(A\). Based on the results here, why is that a problem?

16.6 Equating with Anchored Calibration

Let’s try the second of the three IRT equating approaches. We re-calibrate the items for Form 2, but this time we anchor the first 10 items to have the same difficulty and discrimination values estimated for Form 1.

Code
# Discrimination parameters for anchor items (Form 1 estimates)
anchor_a <- items1[1:10, 1]
# Difficulty parameters converted to mirt's slope-intercept "d" parameterization
anchor_d <- -anchor_a * items1[1:10, 2]

# Get starting values for Form 2 calibration
sv2 <- mirt(test2, 1, itemtype = '2PL', pars = 'values', verbose = FALSE)

# Fix anchor item slope and intercept parameters (parnum < 41 covers items 1-10)
sv2$value[sv2$name == 'a1' & sv2$parnum < 41] <- anchor_a
sv2$value[sv2$name == 'd'  & sv2$parnum < 41] <- anchor_d
sv2$est[sv2$parnum < 41] <- FALSE

# Free the population mean and variance so they can be estimated
sv2$est[sv2$name == 'MEAN_1'] <- TRUE
sv2$est[sv2$name == 'COV_11'] <- TRUE

# Fit the 2PL with anchor constraints
fit2.a    <- mirt(test2, 1, pars = sv2, verbose = FALSE)
items2.a  <- coef(fit2.a, IRTpars = TRUE, simplify = TRUE)$items
coef(fit2.a, IRTpars = TRUE, simplify = TRUE)
$items
            a      b g u
Item_1  0.719 -0.919 0 1
Item_2  1.381 -0.068 0 1
Item_3  0.822 -0.192 0 1
Item_4  0.939 -0.804 0 1
Item_5  0.723  0.625 0 1
Item_6  1.435 -0.215 0 1
Item_7  0.749  0.349 0 1
Item_8  1.137  1.633 0 1
Item_9  0.825  0.255 0 1
Item_10 1.353  0.399 0 1
Item_11 1.634  0.493 0 1
Item_12 0.349 -0.876 0 1
Item_13 1.048 -0.728 0 1
Item_14 2.063  1.518 0 1
Item_15 1.063 -0.714 0 1
Item_16 0.900  1.042 0 1
Item_17 1.734  0.062 0 1
Item_18 1.438  0.212 0 1
Item_19 2.000  0.423 0 1
Item_20 0.520 -1.240 0 1

$means
   F1 
0.498 

$cov
      F1
F1 0.507

Now let’s compare the mean and SD of the two theta distributions after anchoring the Form 2 scale to Form 1.

Code
comp_anc <- data.frame(
  Form = c("Form 1 (Group 1)", "Form 2 Anchored (Group 2)"),
  Mean = c(coef(fit1)$GroupPars[1],  coef(fit2.a)$GroupPars[1]),
  SD   = c(sqrt(coef(fit1)$GroupPars[2]), sqrt(coef(fit2.a)$GroupPars[2]))
)
comp_anc |>
  kable("html", digits = 3, col.names = c("Form", "Mean (θ)", "SD (θ)")) |>
  kable_styling(full_width = FALSE)
Form Mean (θ) SD (θ)
Form 1 (Group 1) 0.000 1.000
Form 2 Anchored (Group 2) 0.498 0.712

We are now close to the true generating values for Group 2 (\(\mu = 0.5\), \(\sigma = 0.8\)). We can also verify that the item parameter estimates for the 10 anchor items are identical across the two calibrations—we forced this to be so.

Code
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

# Anchor difficulties
plot(items1[1:10, 2], items2.a[1:10, 2],
     xlab = "b  (Form 1 calibration)", ylab = "b  (Form 2, anchored)",
     main = "Anchor Difficulty:\nAnchored Calibration",
     pch = 16, col = "steelblue", xlim = c(-4, 2), ylim = c(-4, 2))
abline(a = 0, b = 1, lty = 2, col = "gray60")

# Anchor discriminations
plot(items1[1:10, 1], items2.a[1:10, 1],
     xlab = "a  (Form 1 calibration)", ylab = "a  (Form 2, anchored)",
     main = "Anchor Discrimination:\nAnchored Calibration",
     pch = 16, col = "steelblue", xlim = c(.2, 2.2), ylim = c(.2, 2.2))
abline(a = 0, b = 1, lty = 2, col = "gray60")

Code
par(mfrow = c(1, 1))

The more interesting comparison is of the unique items (11–20) on Form 2, before and after anchoring.

Code
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

# Unique item difficulties
plot(items2[11:20, 2], items2.a[11:20, 2],
     xlab = "b  (Form 2, separate)", ylab = "b  (Form 2, anchored)",
     main = "Unique Item Difficulty:\nBefore vs. After Anchoring",
     pch = 16, col = "steelblue", xlim = c(-4, 2), ylim = c(-4, 2))
abline(a = 0, b = 1, lty = 2, col = "gray60")

# Unique item discriminations
plot(items2[11:20, 1], items2.a[11:20, 1],
     xlab = "a  (Form 2, separate)", ylab = "a  (Form 2, anchored)",
     main = "Unique Item Discrimination:\nBefore vs. After Anchoring",
     pch = 16, col = "steelblue", xlim = c(.2, 2.2), ylim = c(.2, 2.2))
abline(a = 0, b = 1, lty = 2, col = "gray60")

Code
par(mfrow = c(1, 1))

With both forms now on the same scale, we can obtain EAP theta estimates from each and compare them directly.

Code
th1.eap   <- fscores(fit1)[, 1]
th2.a.eap <- fscores(fit2.a)[, 1]

par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))
hist(th1.eap,   xlim = c(-3, 3), main = "Form 1 (Group 1)",
     xlab = expression(hat(theta)), col = "lightblue",  freq = FALSE)
hist(th2.a.eap, xlim = c(-3, 3), main = "Form 2 Anchored (Group 2)",
     xlab = expression(hat(theta)), col = "lightgreen", freq = FALSE)

Code
par(mfrow = c(1, 1))

16.6.1 On Your Own

  1. Compare the estimated item parameters to their generating values—how did we do?
  2. Should we be concerned that we are underestimating the true SD of the theta distribution for Group 2?
  3. Would we do better if we had more students? More anchor items? Better-discriminating anchor items?

16.7 Concurrent Calibration

The first thing we need to do is restructure the data from the two forms into a single dataframe.

Code
# We reuse the test1 and test2 matrices simulated at the top of this document.
# (Re-simulating here without resetting the seed would produce different data.)

# Rename the last 10 columns of test2 to distinguish them from Form 1 unique items
new_labels <- paste0("Item_", (ncol(test2) + 1):(ncol(test2) + 10))
colnames(test2)[(ncol(test2) - 9):ncol(test2)] <- new_labels

# Group membership vector (used by multipleGroup)
group <- c(rep("Test1", 500), rep("Test2", 500))

# Add group indicator column for merging
test1 <- mutate(test1, group = 1)
test2 <- mutate(test2, group = 2)

# Merge on common anchor items (Item_1 through Item_10)
common_items <- paste0("Item_", 1:10)
full_dat <- merge(test1, test2, by = c("group", common_items),
                  all.x = TRUE, all.y = TRUE)
dat <- full_dat[, -1]

Notice the structure of full_dat. For the first 500 cases (students taking Form 1), we have observations for items 1–20; items 21–30 are missing by design. For the next 500 cases (students taking Form 2), we have observations for items 1–10 (the anchor items) and items 21–30 (Form 2 unique items); items 11–20 are missing by design. The diagram below contrasts this stacked data layout with the separate-forms layout used in the anchored calibration.

Now we are ready to do a concurrent multigroup calibration using mirt.

Code
cc <- multipleGroup(dat, model = 1, group = group, itemtype = '2PL',
                    invariance = c('intercepts', 'slopes', 'free_var', 'free_means'),
                    verbose = FALSE)

out.cc <- coef(cc, simplify = TRUE, IRTpars = TRUE)
out.cc
$Test1
$items
            a      b g u
Item_1  0.704 -1.084 0 1
Item_2  1.391 -0.097 0 1
Item_3  0.953 -0.244 0 1
Item_4  0.831 -0.886 0 1
Item_5  0.805  0.598 0 1
Item_6  1.467 -0.150 0 1
Item_7  0.635  0.575 0 1
Item_8  1.084  1.681 0 1
Item_9  0.785  0.343 0 1
Item_10 1.393  0.351 0 1
Item_11 0.952  1.221 0 1
Item_12 1.961  0.630 0 1
Item_13 1.084  1.037 0 1
Item_14 1.836  0.074 0 1
Item_15 0.973  1.378 0 1
Item_16 1.307  0.019 0 1
Item_17 1.243  0.754 0 1
Item_18 0.634 -0.043 0 1
Item_19 0.912  0.684 0 1
Item_20 1.979  0.524 0 1
Item_21 1.623  0.498 0 1
Item_22 0.345 -0.888 0 1
Item_23 1.038 -0.735 0 1
Item_24 2.027  1.538 0 1
Item_25 1.041 -0.732 0 1
Item_26 0.889  1.053 0 1
Item_27 1.729  0.065 0 1
Item_28 1.424  0.214 0 1
Item_29 1.993  0.427 0 1
Item_30 0.508 -1.274 0 1

$means
F1 
 0 

$cov
   F1
F1  1


$Test2
$items
            a      b g u
Item_1  0.704 -1.084 0 1
Item_2  1.391 -0.097 0 1
Item_3  0.953 -0.244 0 1
Item_4  0.831 -0.886 0 1
Item_5  0.805  0.598 0 1
Item_6  1.467 -0.150 0 1
Item_7  0.635  0.575 0 1
Item_8  1.084  1.681 0 1
Item_9  0.785  0.343 0 1
Item_10 1.393  0.351 0 1
Item_11 0.952  1.221 0 1
Item_12 1.961  0.630 0 1
Item_13 1.084  1.037 0 1
Item_14 1.836  0.074 0 1
Item_15 0.973  1.378 0 1
Item_16 1.307  0.019 0 1
Item_17 1.243  0.754 0 1
Item_18 0.634 -0.043 0 1
Item_19 0.912  0.684 0 1
Item_20 1.979  0.524 0 1
Item_21 1.623  0.498 0 1
Item_22 0.345 -0.888 0 1
Item_23 1.038 -0.735 0 1
Item_24 2.027  1.538 0 1
Item_25 1.041 -0.732 0 1
Item_26 0.889  1.053 0 1
Item_27 1.729  0.065 0 1
Item_28 1.424  0.214 0 1
Item_29 1.993  0.427 0 1
Item_30 0.508 -1.274 0 1

$means
   F1 
0.503 

$cov
      F1
F1 0.517
Code
# To access directly:
# out.cc$Test1$items  -- item parameters for Form 1 reference group
# out.cc$Test2$items  -- item parameters for Form 2 focus group

In the multigroup concurrent calibration, the mean and variance of the reference group (Test1) are fixed at 0 and 1, while those of the focus group are freely estimated. You can access them via out.cc$Test2$means (population mean) and out.cc$Test2$cov (population variance). Let’s compare the item parameters from the concurrent calibration to those from the two separate calibrations.

Code
par(mfrow = c(2, 4), mar = c(4, 4, 3, 1))

# Common item difficulty -- Form 1 separate vs. concurrent
plot(items1[1:10, 2], out.cc$Test1$items[1:10, 2],
     xlab = "Separate", ylab = "Concurrent",
     main = "Anchor Difficulty:\nForm 1",
     pch = 16, col = "steelblue", xlim = c(-3, 1.5), ylim = c(-3, 1.5))
abline(a = 0, b = 1, lty = 2, col = "gray60")

# Common item difficulty -- Form 2 separate vs. concurrent
plot(items2[1:10, 2], out.cc$Test1$items[1:10, 2],
     xlab = "Separate", ylab = "Concurrent",
     main = "Anchor Difficulty:\nForm 2",
     pch = 16, col = "firebrick", xlim = c(-3, 1.5), ylim = c(-3, 1.5))
abline(a = 0, b = 1, lty = 2, col = "gray60")

# Common item discrimination -- Form 1 separate vs. concurrent
plot(items1[1:10, 1], out.cc$Test1$items[1:10, 1],
     xlab = "Separate", ylab = "Concurrent",
     main = "Anchor Discrimination:\nForm 1",
     pch = 16, col = "steelblue", xlim = c(.4, 1.8), ylim = c(.4, 1.8))
abline(a = 0, b = 1, lty = 2, col = "gray60")

# Common item discrimination -- Form 2 separate vs. concurrent
plot(items2[1:10, 1], out.cc$Test1$items[1:10, 1],
     xlab = "Separate", ylab = "Concurrent",
     main = "Anchor Discrimination:\nForm 2",
     pch = 16, col = "firebrick", xlim = c(.4, 1.8), ylim = c(.4, 1.8))
abline(a = 0, b = 1, lty = 2, col = "gray60")

# Unique item difficulty -- Form 1
plot(items1[, 2], out.cc$Test1$items[1:20, 2],
     xlab = "Separate", ylab = "Concurrent",
     main = "Unique Difficulty:\nForm 1",
     pch = 16, col = "steelblue", xlim = c(-3, 2), ylim = c(-3, 2))
abline(a = 0, b = 1, lty = 2, col = "gray60")

# Unique item difficulty -- Form 2
plot(items2[, 2], out.cc$Test2$items[c(1:10, 21:30), 2],
     xlab = "Separate", ylab = "Concurrent",
     main = "Unique Difficulty:\nForm 2",
     pch = 16, col = "firebrick", xlim = c(-3, 2), ylim = c(-3, 2))
abline(a = 0, b = 1, lty = 2, col = "gray60")

# Unique item discrimination -- Form 1
plot(items1[, 1], out.cc$Test1$items[1:20, 1],
     xlab = "Separate", ylab = "Concurrent",
     main = "Unique Discrimination:\nForm 1",
     pch = 16, col = "steelblue", xlim = c(.2, 2), ylim = c(.2, 2))
abline(a = 0, b = 1, lty = 2, col = "gray60")

# Unique item discrimination -- Form 2
plot(items2[, 1], out.cc$Test2$items[c(1:10, 21:30), 1],
     xlab = "Separate", ylab = "Concurrent",
     main = "Unique Discrimination:\nForm 2",
     pch = 16, col = "firebrick", xlim = c(.2, 2), ylim = c(.2, 2))
abline(a = 0, b = 1, lty = 2, col = "gray60")

Code
par(mfrow = c(1, 1))

The concurrent calibration places all 30 items on a single common scale in one step. Form 1 item parameters (blue) are estimated from data on the reference scale and so fall close to the identity line. Form 2 item parameters (red) come from a separate scale in the independent calibrations but are brought onto the Form 1 scale in the concurrent calibration, producing a systematic shift relative to the identity line.


16.8 Comparing All Approaches

We have now implemented separate calibration (with three different methods for estimating linking constants), anchored calibration, and concurrent calibration. Let’s compare the key result from each: the estimated mean and standard deviation of the Group 2 ability distribution on the Form 1 scale, compared to the true generating values (\(\mu = 0.5\), \(\sigma = 0.8\)).

Code
# Population estimates from concurrent calibration
cc_mean <- out.cc$Test2$means[1]
cc_sd   <- sqrt(out.cc$Test2$cov[1])

# Population estimates from anchored calibration
anc_mean <- coef(fit2.a)$GroupPars[1]
anc_sd   <- sqrt(coef(fit2.a)$GroupPars[2])

# For separate calibration, the implied population mean = B and SD = A
# (because the Form 2 population is N(0,1) by mirt's identification constraint)
comparison <- data.frame(
  Method = c("True (generating)",
             "Separate + Mean-Mean",
             "Separate + Mean-Sigma",
             "Separate + Stocking-Lord",
             "Anchored",
             "Concurrent"),
  Group2_Mean = c(0.500,
                  round(B_mm,     3),
                  round(B_ms,     3),
                  round(B_sl,     3),
                  round(anc_mean, 3),
                  round(cc_mean,  3)),
  Group2_SD   = c(0.800,
                  round(A_mm,     3),
                  round(A_ms,     3),
                  round(A_sl,     3),
                  round(anc_sd,   3),
                  round(cc_sd,    3))
)

comparison |>
  kable("html", col.names = c("Method", "Group 2 Mean (θ)", "Group 2 SD (θ)")) |>
  kable_styling(full_width = FALSE)
Method Group 2 Mean (θ) Group 2 SD (θ)
True (generating) 0.500 0.800
Separate + Mean-Mean 0.582 0.713
Separate + Mean-Sigma 0.449 0.513
Separate + Stocking-Lord 0.493 0.669
Anchored 0.498 0.712
Concurrent 0.503 0.719

Several patterns stand out. For recovering the Group 2 mean, the Stocking-Lord, anchored, and concurrent approaches are all very close to the true value; the moment-based methods (mean-mean, mean-sigma) are less accurate. For the SD, all methods underestimate the true value of 0.8 to some degree. The anchored, concurrent, and mean-mean methods cluster together, the Stocking-Lord method is somewhat lower, and the mean-sigma method is clearly the worst. The mean-sigma \(A\) is dragged too low by a few anchor items with extreme \(\hat{b}\) values on the Form 2 scale.

In practice, Kolen and Brennan (2014, Section 6.3.4) recommend using the Stocking-Lord or Haebara characteristic curve methods when separate calibration is used, primarily because they are more stable across replications than the moment-based methods, even if any single realization may not show dramatic differences. When feasible, anchored or concurrent calibration avoids the linking constant estimation problem entirely.


16.9 On Your Own

  1. Does the overall flow of results across the approaches make sense to you? Give it some thought.
  2. Compare the estimated item parameters from each approach to the true generating values (b1, b2, a1, a2). Which approach recovers the truth most accurately for the unique items?
  3. All methods underestimate the true SD (0.8) of the Group 2 theta distribution to some degree. What might cause this?
  4. Look at the anchor item \(b\)-parameter estimates on the two scales. Which items have the most extreme Form 2 estimates? Why might that be, and how does it affect the mean-sigma linking constant?
  5. Try running the simulation with different seeds (e.g., set.seed(42), set.seed(100)). How stable are the linking constants across replications? Which method shows the most variability?
  6. The anchor test here is 10 out of 20 items (50% of the test). What would you expect to happen to equating accuracy if the anchor were only 5 items? Try modifying the simulation to find out.