Estimation bias correction - warnings


Winsteps uses JMLE (= Joint Maximum-Likelihood Estimation), implemented with iterative curve-fitting rather than Newton-Raphson estimation, because iterative curve-fitting is more robust against awkward data patterns.
 
Every estimation method has strengths and weaknesses. The primary weakness of JMLE is that its estimates have statistical bias. This is most obvious in a test of two dichotomous items (Andersen, 1973): the difference between the two item difficulties will be estimated to be twice its true value. In practical situations, the statistical bias is usually less than the standard errors of the estimates. However, the advantages of JMLE far outweigh its disadvantages. JMLE is estimable under almost all conditions, including arbitrary and accidental patterns of missing data, arbitrary anchoring (fixing) of parameter estimates, unobserved intermediate categories in rating scales, and multiple different Rasch models in the same analysis.

Andersen, E.B. (1973). Conditional inference for multiple-choice questionnaires. British Journal of Mathematical and Statistical Psychology, 26, 31-44.


 

In the psychometric literature, the terms "bias" and "inconsistency" are usually used in the context of estimating the difficulties of dichotomous items on a fixed-length test administered to a sample of persons. Each measure-parameter is imagined to have a true value, and the purpose of the estimation procedure is to estimate that value from the available data. We can never be sure that we have exactly estimated the true value.

 

If, as the sample size becomes infinite, the item estimates converge to their true values, then the estimates are consistent.

If, for a finite sample size, the expectations of the possible item estimates equal their true values, then the estimates are unbiased.

 

Winsteps implements JMLE. JMLE estimates are inconsistent and biased. They are less central than the true values. For instance, if the test consists of two dichotomous items, then, with an infinite sample, the JMLE estimate of the difference between the two item difficulties will be twice the true value. In this situation, an immediate solution is PAIRED=Yes.

 

Ben Wright and Graham Douglas discovered that the multiplier (L-1)/L is an approximate correction for JMLE item bias, where L is the test length. For a two-item test this correction would be (2-1)/2 = 0.5. It is implemented in Winsteps with STBIAS=YES.
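
As a rough illustration of how quickly this multiplier approaches 1 as tests lengthen (not Winsteps' internal computation), a few lines of Python:

# Python sketch: approximate Wright-Douglas correction factor (L-1)/L
# (illustrative only; Winsteps' STBIAS= computation is more elaborate)
for L in (2, 5, 10, 20, 50):
    factor = (L - 1) / L
    print(f"test length L = {L:2d}: multiply logit estimates by {factor:.3f}")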

 

Winsteps uses the raw scores as sufficient statistics for its estimates. The parameter estimates reported by Winsteps are the values for which "the observed raw score = the model-expected raw score".
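
For dichotomous items this condition can be sketched directly. The Python fragment below is an illustration only (the function names are invented here, and interval halving stands in for Winsteps' actual iterative curve-fitting): it solves "observed raw score = model-expected raw score" for one person measure, assuming the item difficulties are already known.

# Python sketch: find the person measure b for which the model-expected
# raw score equals the observed raw score r, given item difficulties d.
# Valid only for non-extreme scores (0 < r < number of items).
import math

def expected_score(b, difficulties):
    return sum(1.0 / (1.0 + math.exp(d - b)) for d in difficulties)

def estimate_person(r, difficulties, lo=-10.0, hi=10.0, tol=1e-6):
    # bisection on the monotone function expected_score(b) - r
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < r:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(estimate_person(3, [-1.0, -0.5, 0.0, 0.5, 1.0]))  # about 0.45 logits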

 

"Statistical Consistency" (as usually conceptualized in the psychometric literature) relates to an infinite sample size with a finite test length. Under these conditions, Winsteps estimates are statistically inconsistent (i.e., are not the "true" parameter values even with an infinite amount of data) because inestimable extreme scores are included in the estimation space. "Conditional" estimation methods, CMLE, remove ("condition out") extreme scores from the estimation space,

 

The practical concern is "estimation bias": the departure of estimates from their true values with a finite amount of data. Winsteps estimates do have estimation bias; they are less central than they should be. But, as the likelihood of observing extreme scores decreases, the bias in the Winsteps estimates also decreases. Published studies indicate that when the test is longer than 20 dichotomous items and the sample is larger than 20 cases, the Winsteps estimation bias is inconsequentially small. Estimation bias is usually only of concern if exact probabilistic inferences are to be made from logit measures obtained from small samples or short tests. But such inferences are imprecise irrespective of the size of the estimation bias.

 

There are techniques which correct for estimation bias under specific conditions. One such condition is when the data correspond to pairwise observations (such as a basketball league or chess competition). Winsteps has the PAIRED=YES option for this situation.

 

At least two sources of estimation error are reported in the literature.

 

An "estimation bias" error. This is usually negligibly small after the administration of 10 dichotomous items (and fewer rating scale items). Its size depends on the probability of observing extreme score vectors. For a two item test, the item measure differences are twice their theoretical values, reducing as test length increases. This can be corrected. STBIAS= does this approximately, but is only required if exact probability inferences are to be made from logit measure differences.

 

A "statistical inflation" error. Since error variance always adds to observed variance, individual measures are always reported to be further apart (on average) than they really are. This cannot be corrected, in general, at an individual- measure level, because, for any particular measurement it cannot be known to what extent that measurement is biased by measurement error. However, if it is hypothesized that the persons, for instance, follow a normal distribution of known mean and standard deviation, this can be imposed on the estimates (as in MMLE) and the global effects of the estimate dispersion inflation removed. This is done in some other Rasch estimation software.

 

Estimation Bias

 

All Rasch estimation methods have some amount of estimation bias (which has no relationship to demographic bias). The estimation algorithm used by Winsteps, JMLE, has a slight bias in the measures estimated from most datasets. The effect of the bias is to spread out the measures more widely than the data indicate. In practice, a test of more than 20 dichotomous items administered to a reasonably large sample will produce measures with inconsequential estimation bias. Estimation bias is only of concern when exact probabilistic inferences are to be made from short tests or small samples. Ben Wright opted for JMLE in the late 1960s because users were rarely concerned about such exact inferences, but they did need speedy, robust, verifiable results from messy data sets with unknown latent parameter distributions. Both of the identifiable sources of error are reduced by giving longer tests to bigger samples. With short tests or small samples, other threats to validity tend to be of greater concern than the inflationary ones.

 

If estimation bias would be observed even with an infinitely large sample (as it would be with JMLE), then the estimation method is labeled "statistically inconsistent" (even though the estimates are predictable and logical). This sounds alarming, but the inconsistency is usually inconsequential, or can be easily corrected in the unlikely event that it does have substantive consequences.

 

The JMLE algorithm produces estimates with a usually small statistical bias. This bias increases the spread of measures and calibrations, but usually by less than the standard error of measurement. The bias quickly becomes insignificantly small as the number of persons and items increases. The reason that JMLE is statistically inconsistent under some conditions, and noticeably biased for short tests or small samples, is that it includes the possibility of extreme scores in the estimation space, but cannot actually estimate them. Inconsistency rarely matters in practice, because it asks, "with infinite data, will the estimation method produce the correct answer?" Estimation bias, also called statistical bias, is more important because it asks, "how near to correct are the estimates with finite data?" In practice, JMLE bias is smaller than the other sources of noise in the data. See Ben Wright's comments at www.rasch.org/memo45.htm

 

For paired comparisons and very short tests, estimation bias can double the apparent spread of the measures, artificially inflating test reliability. This can be eliminated by specifying PAIRED=YES.

 

Correcting for bias may be helpful when it is desired to draw exact probabilistic inferences for small, complete datasets without anchoring.

 

Correcting for bias may be misleading, or may be suppressed by Winsteps, in the presence of missing data or anchored persons or items.

 

Bias correction can produce apparently inconsistent measures if bias-corrected measures, estimated from an unanchored analysis, are then used to anchor that same dataset.

 

Estimation correction methods:

 

STBIAS=YES implements a variant of the simple bias correction proposed in Wright, B.D. and Douglas, G.A. (1977). Best procedures for sample-free item analysis. Applied Psychological Measurement, 1, 281-294. With large samples, a useful correction for bias is to multiply the estimated measures by (L-1)/L, where L is the smaller of the average person or item response count, so, for paired comparisons, multiply by 0.5. This is done automatically when PAIRED=YES.
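
A sketch of the rule as stated above, applied to a response matrix with missing data (illustrative Python with an invented stbias_multiplier function and made-up data values, not Winsteps' internal code):

# Python sketch: L = smaller of the average person response count and the
# average item response count; multiplier = (L-1)/L. Illustrative only.
def stbias_multiplier(responses):
    # responses: list of person response vectors; None marks missing data
    person_counts = [sum(x is not None for x in row) for row in responses]
    item_counts = [sum(row[i] is not None for row in responses)
                   for i in range(len(responses[0]))]
    L = min(sum(person_counts) / len(person_counts),
            sum(item_counts) / len(item_counts))
    return (L - 1) / L

# paired-comparison-like structure: each row has only 2 observed responses,
# so L = 2 and the multiplier is 0.5
matrix = [[1, 0, None], [None, 1, 0], [1, None, 1]]   # made-up 3x3 example
print(stbias_multiplier(matrix))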

 

Other Rasch programs may or may not attempt to correct for estimation bias. When comparing results from other programs, try both STBIAS=Y and STBIAS=N to find the closest match.

 

Estimation methods with less bias under some circumstances include CMLE and MMLE, but these have other limitations or restrictions which are deemed to outweigh their benefits for most uses.

 

Technical information:

 

Statistical estimation-bias correction with JMLE is relevant when you wish to make exact probabilistic statements about differences between measures for short tests or small samples. The (L-1)/L correction applies to items on short dichotomous tests with large samples, where L is the number of non-extreme items on the test. For long dichotomous tests with small samples, the correction to person measures would be (N-1)/N, where N is the number of non-extreme persons. Consequently, on dichotomous tests, Winsteps uses a bias correction of (L-1)/L for items and (N-1)/N for persons.

 

The reason for this correction is that the sample space does not match the estimation space: the difference is the extreme score vectors. Estimation bias manifests itself as estimated measures that are more dispersed than the unbiased measures. The less likely an extreme score vector, the smaller the correction needed to eliminate bias. Extreme score vectors are less likely with polytomies than with dichotomies, so the bias correction is smaller. For example, if an instrument uses a rating scale with m categories, then Winsteps corrects the item measures by (m-1)(L-1)/((m-1)(L-1)+1) and the person measures by (m-1)(N-1)/((m-1)(N-1)+1) - but these are rough approximations.
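
Taking these stated approximations at face value, the correction factors can be tabulated, for example:

# Python sketch of the approximate correction factors quoted above
# (rough approximations, per the text; not Winsteps' exact computation)
def item_correction(m, L):
    """m = number of rating-scale categories, L = number of non-extreme items."""
    return (m - 1) * (L - 1) / ((m - 1) * (L - 1) + 1)

def person_correction(m, N):
    """N = number of non-extreme persons."""
    return (m - 1) * (N - 1) / ((m - 1) * (N - 1) + 1)

print(item_correction(2, 20))   # dichotomous, 20 items: 0.95 = (L-1)/L
print(item_correction(5, 20))   # 5-category rating scale: about 0.987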

 

With most Rasch software using CMLE, PMLE or MMLE, bias correction of item measures is not done because the estimation bias in the item difficulties is generally very small. Bias correction of person abilities is not done even though estimation bias exists.

 

Interaction terms are computed in an artificial situation in which the ability and difficulty estimates are treated as known. Estimation bias is a minor effect in the interaction estimates: it would tend to increase very slightly the probability that differences between interaction estimates are reported as significant. So this is another reason to interpret DIF tests conservatively. If the number of relevant observations for an interaction term is big enough for the DIF effect to be regarded as real, and not a sampling accident, then the estimation bias will be very small. In the worst case, the multiplier would be of the order of (C-1)/C, where C is the number of relevant observations.

 

Comparing Estimates

 

Bigsteps and Winsteps should produce the same estimates when

 

(a) they are run with very tight convergence criteria, e.g.,

RCONV=.00001

LCONV=.00001

MJMLE=0

 

(b) they have the same statistical bias adjustment

STBIAS=YES ; estimates will be wider spread

or

STBIAS=NO ; estimates will be narrower

 

(c) they have the same extreme score adjustment

EXTRSC=0.5

 

The item estimates in BTD (Best Test Design) were produced with statistical bias adjustment, but with convergence criteria that would be considered loose today. Tighter convergence produces a wider logit spread, so the BTD item estimates are slightly more central than those from Winsteps or Bigsteps.

 

Winsteps and Bigsteps are designed to be symmetric: transpose persons and items, and the only change is the sign of the estimates and an adjustment for the local origin. The output reported in BTD (and by most modern Rasch programs) is not symmetric, so the person measure estimates in BTD are somewhat different.

 

Do-it-yourself estimation-bias correction

 

Correcting for estimation bias in Winsteps estimates has both advantages and disadvantages. Corrected estimates are usually slightly more central than uncorrected estimates. The only conspicuous advantage of bias correction is for making inferences based on the exact logit distance between the Rasch estimates. Since, with small data sets, the bias correction is usually smaller than the standard error of the Rasch estimates, bias correction may be of doubtful statistical utility.

 

STBIAS= will not correct for bias accurately with missing data, IWEIGHT=, or PWEIGHT=. It may over- or under-correct for estimation bias.

 

If you do need estimation-bias correction that is as accurate as possible with your data set, you will need to discover the amount of bias in the estimates, and then use USCALE= to perform your own estimation-bias correction.

 

Here is a procedure using simulated data sets:

 

1. In your control file, STBIAS=No and USCALE=1

2. Obtain the Winsteps estimates for your data

3. Simulate many datasets using those estimates. (SIFILE= on the Winsteps Output Files menu).

4. Obtain the Winsteps estimates from the simulated data sets

5. Regress the simulated estimates on your initial estimates. The regression will give a slope near 1.0.

6. Obtain the Winsteps estimates for your data with USCALE = 1/slope

 

The set of estimates from step 6 is effectively unbiased.
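
A minimal Python sketch of steps 5 and 6, assuming the initial and simulated measures have already been exported from Winsteps into lists (the variable names and the numeric values below are placeholders, not real output):

# Python sketch: regress the simulated estimates on the initial estimates,
# then set USCALE= to 1/slope. Use your own measures from steps 2 and 4.
def regression_slope(initial, simulated):
    n = len(initial)
    mean_x = sum(initial) / n
    mean_y = sum(simulated) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(initial, simulated))
    sxx = sum((x - mean_x) ** 2 for x in initial)
    return sxy / sxx               # expected to be slightly greater than 1.0

initial_measures   = [-1.2, -0.4, 0.0, 0.7, 1.5]        # placeholder, step 2
simulated_measures = [-1.3, -0.45, 0.05, 0.75, 1.65]    # placeholder, step 4
slope = regression_slope(initial_measures, simulated_measures)
print(f"USCALE = {1.0 / slope:.4f}")   # use this value in step 6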