Average Measures, Distractors and Rating Scale Structures
The "average measure" for a category is the average ability of the people who respond in that category or to that distractor (or distracter. The term "distractor" has been in use since at least 1934). This is an empirical value. It is not a Rasch-model parameter.
The "step difficulty" (Rasch-Andrich threshold, step calibration, etc.) is an expression of the log-odds of being observed in one or other of the adjacent categories. This is a model-based value. It is a Rasch-model parameter.
Our theory is that people who respond in higher categories (or to the correct MCQ option) should have higher average measures. This is verified by inspecting the "average measures".
Often there is also a theory about the rating scale, such as "each category in turn should be the most probable one to be observed as one advances along the latent variable." If this is your theory, then the "step difficulties" should also advance. But alternative theories can be employed. For instance, in order to increase item discrimination, one may deliberately over-categorize a rating scale; visual analog scales are an example of this. A typical visual analog scale has 101 categories. If these functioned operationally according to the "most probable" theory, it would take something like 100 logits to get from one end of the scale to the other.
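To see how advancing thresholds make each category modal in turn, here is a minimal sketch under the Andrich rating-scale model with the item difficulty fixed at zero; the thresholds below are hypothetical:

```python
import numpy as np

def category_probabilities(theta, thresholds):
    """Rasch-Andrich category probabilities for one item (difficulty 0).
    Category k's log-numerator is k*theta minus the sum of the first
    k thresholds, with category 0 fixed at 0."""
    thresholds = np.asarray(thresholds, dtype=float)
    cum = np.concatenate(([0.0], np.cumsum(thresholds)))
    k = np.arange(len(cum))
    logits = k * theta - cum
    p = np.exp(logits - logits.max())   # subtract max for stability
    return p / p.sum()

# With advancing thresholds, each category is in turn the most probable.
thresholds = [-2.0, 0.0, 2.0]           # hypothetical, advancing
for theta in (-3.0, -1.0, 1.0, 3.0):
    p = category_probabilities(theta, thresholds)
    print(theta, p.round(2), "modal category:", p.argmax())
```

With three thresholds spaced 2 logits apart, each of the four categories is the most probable one over roughly a 2-logit stretch of the latent variable; by the same logic, 101 categories each modal in turn would span on the order of 100 logits.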
The relationship between "average measure" and "step difficulties" or "item difficulties" is complex. It is something like:
step difficulty = log((count in lower category) / (count in higher category)) + (average of the person measures across both categories) - normalizer
normalized so that: sum(step calibrations) = 0
So:
the higher the frequency of the higher category relative to the lower category, the lower (more negative) the step calibration (and/or item difficulty);
and the higher the average of the person measures across both categories, the higher (more positive) the step calibration (and/or item difficulty).
But the step calibrations are estimated as a set, so the numerical relationship between a pair of categories is influenced by their relationships with every other category. This has the useful consequence that even if a category is not observed, it is still possible to construct a set of step calibrations for the rating scale as a whole.
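As a rough illustration of the approximation above (not the joint estimation that an actual Rasch program performs), here is a minimal sketch; it reads "average of the person measures across both categories" as the mean of the two categories' average measures, and all data are hypothetical:

```python
import numpy as np

def approximate_step_calibrations(counts, avg_measures):
    """Illustrative approximation of step calibrations from
    category counts and per-category average person measures.
    counts[k]: observation count in category k (k = 0..m).
    avg_measures[k]: average person measure in category k."""
    counts = np.asarray(counts, dtype=float)
    avg = np.asarray(avg_measures, dtype=float)
    # For each step k (between categories k-1 and k):
    # log(count in lower / count in higher) + mean of the two average measures
    raw = np.log(counts[:-1] / counts[1:]) + (avg[:-1] + avg[1:]) / 2.0
    return raw - raw.mean()   # normalizer: step calibrations sum to zero

# Hypothetical counts and average measures for a 4-category scale:
counts = [20, 40, 35, 15]
avg_measures = [-1.2, -0.3, 0.5, 1.4]
print(approximate_step_calibrations(counts, avg_measures).round(2))
```

With these inputs the sketch yields step calibrations of about -1.64, 0.04 and 1.60: they sum to zero and advance, as the "most probable" theory would want.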
Suggestions based on researcher experience:
In general, this is what we like to see:
(1) More than 10 observations per category (or the findings may be unstable, i.e., non-replicable)
(2) A smooth distribution of category frequencies: the frequency distribution is not jagged. Jaggedness can indicate categories which are very narrow, perhaps because category transitions have been defined as categories themselves. But this is sample-distribution-dependent.
(3) Clearly advancing average measures.
(4) Average measures near their expected values.
(5) Good fit of the observations to their categories: Outfit mean-squares near 1.0. Values much above 1.0 are much more problematic than values much below 1.0.
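Finally, a minimal sketch of how some of the guidelines above might be checked mechanically; the input arrays and the 2.0 outfit cutoff (a common rule of thumb, not a fixed standard) are illustrative assumptions:

```python
import numpy as np

def check_category_diagnostics(counts, avg_measures, outfits):
    """Flags violations of guidelines (1), (3) and (5) above.
    Inputs are per-category arrays, lowest category first."""
    counts = np.asarray(counts)
    avg = np.asarray(avg_measures, dtype=float)
    outfits = np.asarray(outfits, dtype=float)
    problems = []
    if (counts < 10).any():
        problems.append("fewer than 10 observations in some category")
    if not (np.diff(avg) > 0).all():
        problems.append("average measures do not strictly advance")
    if (outfits > 2.0).any():
        problems.append("outfit mean-square much above 1.0 in some category")
    return problems or ["no problems flagged"]

print(check_category_diagnostics(
    counts=[20, 40, 35, 15],
    avg_measures=[-1.2, -0.3, 0.5, 1.4],
    outfits=[1.1, 0.9, 1.0, 1.2]))
```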