In Item Response Theory (IRT), once we have estimated item parameters, we need a method to estimate the person parameter (\(\theta\)). This process is often called “scoring.” This document covers three primary approaches: maximum likelihood (MLE), maximum a posteriori (MAP), and expected a posteriori (EAP) estimation.
We’ll use data from 31 multiple-choice mathematics items administered as part of a Colorado state achievement test.
## [1] 1500 31
We have 1500 students and 31 dichotomously scored items.
The likelihood expresses the probability of observing the data, conditional on a particular model and its parameters. For IRT:
For a test with \(J\) items, where the response to item \(j\) is \(u_j\) (either 0 or 1), the likelihood of a response vector is:
\[L(\theta | \mathbf{u}) = \prod_{j=1}^{J} P_j(\theta)^{u_j} Q_j(\theta)^{1-u_j}\]
Where \(P_j(\theta)\) is the probability of a correct response to item \(j\) given ability \(\theta\), and \(Q_j(\theta) = 1 - P_j(\theta)\) is the probability of an incorrect response.
The log-likelihood is:
\[\ell(\theta | \mathbf{u}) = \sum_{j=1}^{J} \left[ u_j \ln P_j(\theta) + (1-u_j) \ln Q_j(\theta) \right]\]
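As a concrete sketch, this formula translates directly into a few lines of R (`loglik_2pl` is an illustrative helper written for this document, not part of any package):

```r
# Log-likelihood of a response vector u under the 2PL (illustrative helper)
# a, b: item discrimination and difficulty vectors; u: 0/1 responses
loglik_2pl <- function(theta, a, b, u) {
  p <- 1 / (1 + exp(-a * (theta - b)))    # P_j(theta)
  sum(u * log(p) + (1 - u) * log(1 - p))  # sum of item log-likelihood terms
}

# Example: two items (a = 1, b = 0 and a = 2, b = 2), pattern (1, 0), theta = 0
loglik_2pl(theta = 0, a = c(1, 2), b = c(0, 2), u = c(1, 0))
```

Working on the log scale replaces the product in \(L(\theta|\mathbf{u})\) with a sum, which is numerically far more stable for long tests.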
Let’s illustrate with a simple two-item test using the 2PL model. Suppose:
# 2PL probability function
p_2pl <- function(theta, a, b) {
1 / (1 + exp(-a * (theta - b)))
}
# Item parameters
a1 <- 1; b1 <- 0
a2 <- 2; b2 <- 2
# Create theta range
theta_range <- seq(-4, 4, by = 0.01)
# Calculate probabilities for each item
p1 <- p_2pl(theta_range, a1, b1)
p2 <- p_2pl(theta_range, a2, b2)
q1 <- 1 - p1
q2 <- 1 - p2

Thor gets the first item correct and the second item incorrect.
# Likelihood for pattern (1,0)
L_thor <- p1 * q2
# MLE estimate
mle_thor <- theta_range[which.max(L_thor)]
# Plot
par(mfrow = c(2, 1), mar = c(4, 4, 3, 1))
# Top panel: Both probability curves on same plot
plot(theta_range, p1, type = "l", col = "blue", lwd = 2,
xlab = expression(theta), ylab = "Probability",
main = "Thor (1-0): Probability Functions for Each Item",
ylim = c(0, 1), las = 1)
lines(theta_range, q2, col = "red", lwd = 2)
legend("right", legend = c(expression(P[1](theta)~"= P(Item 1 Correct)"),
expression(Q[2](theta)~"= P(Item 2 Incorrect)")),
col = c("blue", "red"), lwd = 2, cex = 0.9, bty = "n")
# Bottom panel: Likelihood function
plot(theta_range, L_thor, type = "l", col = "purple", lwd = 2,
xlab = expression(theta), ylab = "Likelihood",
main = expression("Likelihood: "*L(theta)*" = "*P[1](theta) %*% Q[2](theta)),
las = 1)
abline(v = mle_thor, lty = 2, col = "gray40")
text(mle_thor + 0.5, max(L_thor) * 0.9,
paste("MLE =", round(mle_thor, 2)), cex = 0.9)

The MLE for Thor is approximately \(\hat{\theta} = 1.05\).
Storm gets the first item incorrect and the second item correct.
# Likelihood for pattern (0,1)
L_storm <- q1 * p2
# MLE estimate
mle_storm <- theta_range[which.max(L_storm)]
# Plot
par(mfrow = c(2, 1), mar = c(4, 4, 3, 1))
# Top panel: Both probability curves on same plot
plot(theta_range, q1, type = "l", col = "blue", lwd = 2,
xlab = expression(theta), ylab = "Probability",
main = "Storm (0-1): Probability Functions for Each Item",
ylim = c(0, 1), las = 1)
lines(theta_range, p2, col = "red", lwd = 2)
legend("right", legend = c(expression(Q[1](theta)~"= P(Item 1 Incorrect)"),
expression(P[2](theta)~"= P(Item 2 Correct)")),
col = c("blue", "red"), lwd = 2, cex = 0.9, bty = "n")
# Bottom panel: Likelihood function
plot(theta_range, L_storm, type = "l", col = "purple", lwd = 2,
xlab = expression(theta), ylab = "Likelihood",
main = expression("Likelihood: "*L(theta)*" = "*Q[1](theta) %*% P[2](theta)),
las = 1)
abline(v = mle_storm, lty = 2, col = "gray40")
text(mle_storm + 0.5, max(L_storm) * 0.9,
paste("MLE =", round(mle_storm, 2)), cex = 0.9)

Key insight: Under the 2PL, Thor and Storm have the same raw score (1 out of 2) but different theta estimates! Storm receives more “credit” for getting the more discriminating item correct. This is a fundamental difference between IRT scoring and classical test theory scoring.
Wonder Woman gets both items correct.
# Likelihood for pattern (1,1)
L_ww <- p1 * p2
# Plot
par(mfrow = c(2, 1), mar = c(4, 4, 3, 1))
# Top panel: Both probability curves on same plot
plot(theta_range, p1, type = "l", col = "blue", lwd = 2,
xlab = expression(theta), ylab = "Probability",
main = "Wonder Woman (1-1): Probability Functions for Each Item",
ylim = c(0, 1), las = 1)
lines(theta_range, p2, col = "red", lwd = 2)
legend("right", legend = c(expression(P[1](theta)~"= P(Item 1 Correct)"),
expression(P[2](theta)~"= P(Item 2 Correct)")),
col = c("blue", "red"), lwd = 2, cex = 0.9, bty = "n")
# Bottom panel: Likelihood function
plot(theta_range, L_ww, type = "l", col = "purple", lwd = 2,
xlab = expression(theta), ylab = "Likelihood",
main = expression("Likelihood: "*L(theta)*" = "*P[1](theta) %*% P[2](theta)),
las = 1)
text(2, max(L_ww) * 0.7, "No maximum!\nLikelihood increases\nas θ → ∞", cex = 0.9)

Problem: No maximum likelihood estimate exists! The likelihood keeps increasing as theta approaches positive infinity. This is a fundamental limitation of MLE scoring for perfect response patterns.
Hulk gets both items incorrect.
# Likelihood for pattern (0,0)
L_hulk <- q1 * q2
# Plot
par(mfrow = c(2, 1), mar = c(4, 4, 3, 1))
# Top panel: Both probability curves on same plot
plot(theta_range, q1, type = "l", col = "blue", lwd = 2,
xlab = expression(theta), ylab = "Probability",
main = "Hulk (0-0): Probability Functions for Each Item",
ylim = c(0, 1), las = 1)
lines(theta_range, q2, col = "red", lwd = 2)
legend("topright", legend = c(expression(Q[1](theta)~"= P(Item 1 Incorrect)"),
expression(Q[2](theta)~"= P(Item 2 Incorrect)")),
col = c("blue", "red"), lwd = 2, cex = 0.9, bty = "n")
# Bottom panel: Likelihood function
plot(theta_range, L_hulk, type = "l", col = "purple", lwd = 2,
xlab = expression(theta), ylab = "Likelihood",
main = expression("Likelihood: "*L(theta)*" = "*Q[1](theta) %*% Q[2](theta)),
las = 1)
text(-2, max(L_hulk) * 0.7, "No maximum!\nLikelihood increases\nas θ → -∞", cex = 0.9)

Same problem: No maximum exists. The likelihood keeps increasing as theta approaches negative infinity.
In practice, we don’t find the MLE graphically. Instead, we use iterative numerical methods. The Newton-Raphson algorithm is commonly used:
\[\theta^{(t+1)} = \theta^{(t)} - \frac{\ell'(\theta^{(t)})}{\ell''(\theta^{(t)})}\]
Where \(\theta^{(t)}\) is the estimate at iteration \(t\), \(\ell'(\theta)\) is the first derivative of the log-likelihood, and \(\ell''(\theta)\) is the second derivative.
For the 2PL model:
\[\ell'(\theta) = \sum_{j=1}^{J} a_j (u_j - P_j(\theta))\]
\[\ell''(\theta) = -\sum_{j=1}^{J} a_j^2 P_j(\theta) Q_j(\theta)\]
The standard error of the MLE is derived from the Fisher information:
\[SE(\hat{\theta}) = \frac{1}{\sqrt{I(\hat{\theta})}}\]
Where the information function is:
\[I(\theta) = \sum_{j=1}^{J} a_j^2 P_j(\theta) Q_j(\theta)\]
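A minimal sketch of this update rule using the 2PL derivatives above (`mle_2pl` is a hypothetical helper; in practice you would use `fscores()` from mirt, as shown later):

```r
# Newton-Raphson MLE of theta under the 2PL (illustrative sketch)
# Note: diverges for all-correct or all-incorrect patterns, where no MLE exists
mle_2pl <- function(a, b, u, theta0 = 0, tol = 1e-6, max_iter = 50) {
  theta <- theta0
  for (t in 1:max_iter) {
    p     <- 1 / (1 + exp(-a * (theta - b)))
    score <- sum(a * (u - p))         # l'(theta)
    info  <- sum(a^2 * p * (1 - p))   # I(theta) = -l''(theta)
    step  <- score / info             # Newton step: -l'(theta)/l''(theta)
    theta <- theta + step
    if (abs(step) < tol) break
  }
  # Standard error from the Fisher information at the converged estimate
  p <- 1 / (1 + exp(-a * (theta - b)))
  list(theta = theta, se = 1 / sqrt(sum(a^2 * p * (1 - p))))
}

# Thor's pattern (1, 0) with the two example items
mle_2pl(a = c(1, 2), b = c(0, 2), u = c(1, 0))
```

Because \(\ell''(\theta)\) is strictly negative whenever a maximum exists, the iteration converges quickly for mixed response patterns like Thor's and Storm's.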
Before observing any responses, what is our best guess about a test-taker’s theta? Under MLE, we have no way to answer this. However, if we assume test-takers are sampled from a population with a known distribution (typically standard normal), we can incorporate this information.
The posterior distribution combines the prior and the likelihood:
\[g(\theta | \mathbf{u}) = \frac{L(\mathbf{u} | \theta) \cdot f(\theta)}{\int L(\mathbf{u} | \theta) \cdot f(\theta) \, d\theta}\]
Where \(f(\theta)\) is the prior density (typically standard normal) and \(L(\mathbf{u} | \theta)\) is the likelihood. The EAP estimate is the mean of this posterior distribution:
\[\hat{\theta}_{EAP} = \frac{\int \theta \cdot L(\mathbf{u} | \theta) \cdot f(\theta) \, d\theta}{\int L(\mathbf{u} | \theta) \cdot f(\theta) \, d\theta}\]
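The integrals are typically approximated numerically over a grid of quadrature points. A minimal sketch with a standard normal prior (`eap_2pl` is a hypothetical helper for illustration):

```r
# EAP estimate of theta by numerical quadrature (illustrative sketch)
eap_2pl <- function(a, b, u, grid = seq(-4, 4, by = 0.01)) {
  # Likelihood of the response pattern at each grid point
  lik <- sapply(grid, function(th) {
    p <- 1 / (1 + exp(-a * (th - b)))
    prod(p^u * (1 - p)^(1 - u))
  })
  post <- lik * dnorm(grid)      # unnormalized posterior
  sum(grid * post) / sum(post)   # posterior mean = EAP estimate
}

# Finite even for Wonder Woman's perfect pattern (1, 1)
eap_2pl(a = c(1, 2), b = c(0, 2), u = c(1, 1))
```

Unlike the MLE, this estimate is finite for every response pattern, because the normal prior tapers the posterior in both tails.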
# Prior: standard normal
prior <- dnorm(theta_range, mean = 0, sd = 1)
# Posterior (unnormalized) for Wonder Woman
posterior_ww <- L_ww * prior
# MAP estimate
map_ww <- theta_range[which.max(posterior_ww)]
# Three vertically stacked plots
par(mfrow = c(3, 1), mar = c(4, 4, 3, 1))
# Panel 1: Both ICCs on same plot (the likelihood components)
plot(theta_range, p1, type = "l", col = "blue", lwd = 2,
xlab = expression(theta), ylab = "Probability",
main = "Wonder Woman (1-1): Probability Functions for Each Item",
ylim = c(0, 1), las = 1)
lines(theta_range, p2, col = "red", lwd = 2)
legend("right", legend = c(expression(P[1](theta)~"= P(Item 1 Correct)"),
expression(P[2](theta)~"= P(Item 2 Correct)")),
col = c("blue", "red"), lwd = 2, cex = 0.9, bty = "n")
# Panel 2: Prior distribution (population ability distribution)
plot(theta_range, prior, type = "l", col = "forestgreen", lwd = 2,
xlab = expression(theta), ylab = "Density",
main = expression("Prior Distribution: "*f(theta)*" = N(0, 1)"),
las = 1)
# Panel 3: Posterior distribution (product of likelihood and prior)
plot(theta_range, posterior_ww, type = "l", col = "purple", lwd = 2,
xlab = expression(theta), ylab = "Posterior (unnormalized)",
main = expression("Posterior: "*g(theta) %prop% P[1](theta) %*% P[2](theta) %*% f(theta)),
las = 1)
abline(v = map_ww, lty = 2, col = "gray40")
text(map_ww + 0.5, max(posterior_ww) * 0.9,
paste("MAP =", round(map_ww, 2)), cex = 0.9)

Key insight: By incorporating prior information, we now have a finite estimate for Wonder Woman! The prior distribution “pulls” the estimate back toward the population mean. Compare this to Case 3 above, where the likelihood alone had no maximum. The posterior is the product of the likelihood (from the ICCs) and the prior distribution—this multiplication creates a curve with a well-defined peak.
# Fit 2PL model
m2 <- mirt(cde, 1, itemtype = "2PL", verbose = FALSE)
# View item parameters
coef(m2, simplify = TRUE, IRTpars = TRUE)$items

## a b g u
## V1 1.0211422 -1.05990398 0 1
## V2 0.7825516 -1.31109802 0 1
## V3 1.4241184 -0.58480336 0 1
## V4 1.2041302 -1.02047790 0 1
## V5 0.9336941 -0.38150283 0 1
## V6 0.8671079 0.03393734 0 1
## V7 0.7142823 -0.65593915 0 1
## V8 0.9259754 -0.05083560 0 1
## V9 1.9660439 -0.84714913 0 1
## V10 1.1842197 -0.61070835 0 1
## V11 1.1815007 -1.47967286 0 1
## V12 1.9352373 -1.02575238 0 1
## V13 1.1259578 -1.49222661 0 1
## V14 0.7613313 -0.23334423 0 1
## V15 1.2847495 -2.25061241 0 1
## V16 2.0923349 -1.49677594 0 1
## V17 1.0318098 -1.00092441 0 1
## V18 0.7838470 -0.61447743 0 1
## V19 1.5143618 -1.37192331 0 1
## V20 1.2181657 1.31635095 0 1
## V21 1.2647873 -0.58202952 0 1
## V22 1.1222375 -0.58356842 0 1
## V23 0.3470280 -0.12049909 0 1
## V24 0.9412101 -0.09441439 0 1
## V25 0.9776519 -1.15062054 0 1
## V26 1.5042345 0.63980674 0 1
## V27 1.0028072 0.20895607 0 1
## V28 0.6360303 0.54449773 0 1
## V29 0.9980535 1.62306946 0 1
## V30 0.9011089 0.85502285 0 1
## V31 0.5065158 1.90097806 0 1
## Total Score Distribution:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 14.00 19.00 18.41 23.00 31.00
##
## Number with score = 0: 1
##
## Number with perfect score (31): 2
hist(tot, breaks = 30, col = "steelblue", border = "white",
main = "Distribution of Raw Scores",
xlab = "Total Score (out of 31)", las = 1)# MLE estimates
theta_mle <- fscores(m2, method = "ML", full.scores.SE = TRUE)
# EAP estimates
theta_eap <- fscores(m2, method = "EAP", full.scores.SE = TRUE)
# MAP estimates
theta_map <- fscores(m2, method = "MAP", full.scores.SE = TRUE)
# Combine into data frame
theta_df <- data.frame(
raw_score = tot,
mle = theta_mle[, 1],
mle_se = theta_mle[, 2],
eap = theta_eap[, 1],
eap_se = theta_eap[, 2],
map = theta_map[, 1],
map_se = theta_map[, 2]
)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -Inf -0.76441 -0.01835 NaN 0.70545 Inf
Notice the Inf and -Inf values for students with perfect and zero scores. This is a fundamental limitation of MLE.
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
# Raw scores
hist(tot, breaks = 30, col = "steelblue", border = "white",
main = "Raw Scores", xlab = "Total Score", las = 1)
# MLE (excluding infinite values)
mle_finite <- theta_df$mle[is.finite(theta_df$mle)]
hist(mle_finite, breaks = 30, col = "coral", border = "white",
main = "MLE Theta Estimates", xlab = expression(theta), las = 1)
# EAP
hist(theta_df$eap, breaks = 30, col = "forestgreen", border = "white",
main = "EAP Theta Estimates", xlab = expression(theta), las = 1)
# MAP
hist(theta_df$map, breaks = 30, col = "purple", border = "white",
main = "MAP Theta Estimates", xlab = expression(theta), las = 1)# Filter to finite MLE values
finite_idx <- is.finite(theta_df$mle)
# Standard deviations
cat("SD of MLE estimates:", round(sd(theta_df$mle[finite_idx]), 4), "\n")

## SD of MLE estimates: 1.1605
## SD of EAP estimates: 0.9166
The EAP estimates have less variability because they are “shrunk” toward the prior mean of 0.
# Correlation
cat("Correlation between MLE and EAP:",
round(cor(theta_df$mle[finite_idx], theta_df$eap[finite_idx]), 4))

## Correlation between MLE and EAP: 0.9875
ggplot(theta_df[finite_idx, ], aes(x = mle, y = eap)) +
geom_point(alpha = 0.5, color = "steelblue") +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
labs(x = "MLE Theta", y = "EAP Theta",
title = "MLE vs. EAP Theta Estimates",
subtitle = "Dashed line shows y = x") +
theme_minimal() +
coord_equal()

Note the non-linearity at the tails: EAP estimates are pulled toward the mean relative to MLE, especially for extreme scores.
ggplot(theta_df[finite_idx, ], aes(x = mle, y = mle_se)) +
geom_point(alpha = 0.5, color = "red") +
geom_point(aes(x = eap, y = eap_se), alpha = 0.5, color = "blue") +
labs(x = expression(theta), y = "Standard Error",
title = "Standard Errors: MLE (red) vs. EAP (blue)") +
theme_minimal() +
ylim(0.3, 0.8)

EAP standard errors tend to be slightly smaller than MLE standard errors.
For the 2PL model, the item information function is:
\[I_j(\theta) = a_j^2 P_j(\theta) Q_j(\theta)\]
Key properties: information peaks at \(\theta = b_j\) (where \(P_j(\theta) = Q_j(\theta) = 0.5\)), the peak height equals \(a_j^2 / 4\), and information falls toward zero as \(\theta\) moves away from \(b_j\). More discriminating items therefore provide more information, concentrated near their difficulty.
# Extract item parameters
item_pars <- coef(m2, simplify = TRUE, IRTpars = TRUE)$items
# Calculate information for each item
theta_seq <- seq(-4, 4, by = 0.1)
info_matrix <- matrix(NA, nrow = length(theta_seq), ncol = nrow(item_pars))
for (j in 1:nrow(item_pars)) {
a <- item_pars[j, "a"]
b <- item_pars[j, "b"]
p <- p_2pl(theta_seq, a, b)
q <- 1 - p
info_matrix[, j] <- a^2 * p * q
}
# Plot a few items
plot(theta_seq, info_matrix[, 1], type = "l", col = 1, lwd = 2,
xlab = expression(theta), ylab = "Information",
main = "Item Information Functions (Selected Items)",
ylim = c(0, max(info_matrix)), las = 1)
for (j in c(5, 10, 15, 20, 25, 31)) {
lines(theta_seq, info_matrix[, j], col = j, lwd = 2)
}
legend("topright", legend = paste("Item", c(1, 5, 10, 15, 20, 25, 31)),
col = c(1, 5, 10, 15, 20, 25, 31), lwd = 2, cex = 0.8)The test information function is simply the sum of item information functions:
\[I(\theta) = \sum_{j=1}^{J} I_j(\theta)\]
# Test information
test_info <- rowSums(info_matrix)
# Plot
plot(theta_seq, test_info, type = "l", col = "darkblue", lwd = 3,
xlab = expression(theta), ylab = "Information",
main = "Test Information Function (2PL)", las = 1)
abline(v = theta_seq[which.max(test_info)], lty = 2, col = "red")
text(theta_seq[which.max(test_info)] + 0.5, max(test_info) * 0.95,
paste("Max at θ =", round(theta_seq[which.max(test_info)], 2)))

This test provides the most information for individuals scoring slightly below average.
The standard error of measurement (SEM) is inversely related to information:
\[SEM(\theta) = \frac{1}{\sqrt{I(\theta)}}\]
# Theoretical SEM
sem_theoretical <- 1 / sqrt(test_info)
par(mfrow = c(1, 2))
# Information
plot(theta_seq, test_info, type = "l", col = "darkblue", lwd = 2,
xlab = expression(theta), ylab = "Information",
main = "Test Information", las = 1)
# SEM
plot(theta_seq, sem_theoretical, type = "l", col = "darkred", lwd = 2,
xlab = expression(theta), ylab = "SEM",
main = "Standard Error of Measurement", las = 1)par(mfrow = c(1, 1))
# Plot theoretical SEM
plot(theta_seq, sem_theoretical, type = "l", col = "blue", lwd = 2,
xlab = expression(theta), ylab = "SEM",
main = "Theoretical vs. Empirical SEM",
ylim = c(0.2, 0.8), las = 1)
# Add empirical EAP SEs
points(theta_df$eap, theta_df$eap_se, col = "red", pch = 16, cex = 0.5)
legend("topright",
legend = c("Theoretical (1/√I)", "Empirical (EAP posterior SD)"),
col = c("blue", "red"), lwd = c(2, NA), pch = c(NA, 16))

| Method | Advantages | Disadvantages |
|---|---|---|
| MLE | Unbiased; no prior assumptions required | Undefined for perfect/zero scores; larger variance |
| MAP | Handles extreme scores; mode of posterior | Requires prior specification |
| EAP | Handles extreme scores; minimum MSE; most commonly used | Requires prior specification; shrinkage toward mean |