Table 30.1 Differential item functioning DIF pairwise
Table 30 supports the investigation of item bias, Differential Item Functioning (DIF), i.e., interactions between individual items and types of persons.
Table
30.2 DIF report (measure list: person class within item)
30.3 DIF report (measure list: item within person class)
30.4 DIF report (item-by-person class chi-squares)
30.5 Within-class fit report (person class within item)
30.6 Within-class fit report (item within person class)
DIF Plots
In Table 30.1 - the hypothesis is "this item has the same difficulty for two groups"
In Table 30.2, 30.3 - the hypothesis is "this item has the same difficulty as its average difficulty for all groups"
In Table 30.4 - the hypothesis is "this item has no overall DIF across all groups"
Table 30.1 reports a probability and a size for DIF statistics. Usually we want:
1. probability so small that it is unlikely that the DIF effect is merely a random accident
2. size so large that the DIF effect has a substantive impact on scores/measures on the test
A general thought: Significance tests, such as DIF tests, are always of doubtful value in a Rasch context, because differences can be statistically significant, but far too small to have any impact on the meaning, or practical use, of the measures. So we need both statistical significance and substantive difference before we take action regarding bias, etc.
Table 30.1 is a pairwise DIF (bias) analysis: this is testing "item difficulty for Group A vs. item difficulty for Group B". Table 30.1 makes sense if there are only two groups, or there is one majority reference group.
Tables 30.2 and 30.3 are a global DIF (bias) analysis: this is testing "item difficulty for Group A vs. item difficulty for all groups combined." Tables 30.2 and 30.3 make sense when there are many small groups, e.g., age-groups in 5 year increments from 0 to 100.
DIF results are considerably influenced by sample size, so if you have only two person-groups, go to Table 30.1. If you have lots of person-groups, go to Table 30.2.
Specify DIF= for person classifying indicators in person labels. Item bias and DIF are the same thing. The widespread use of "item bias" dates to the 1960s, "DIF" to the 1980s. The reported DIF is corrected for test impact, i.e., differential average performance on the whole test. Use ability stratification to look for non-uniform DIF using the selection rules. Tables 30.1 and 30.2 present the same information from different perspectives.
From the Output Tables menu, the DIF/DPF dialog is displayed.
Table 31 supports person bias, Differential Person Functioning (DPF), i.e., interactions between individual persons and classifications of items.
Table 33 reports bias or interactions between classifications of items and classifications of persons.
In these analyses, persons with extreme scores are excluded, because they do not exhibit differential ability across items. For background discussion, see DIF and DPF considerations.
Example output:
You want to examine item bias (DIF) between Females and Males. You need a column in your Winsteps person label that has two (or more) demographic codes, say "F" for female and "M" for male (or "0" and "1" if you like dummy variables) in column 9.
Table 30.1 is best for pairwise comparisons, e.g., Females vs. Males.
DIF class specification is: DIF=@GENDER
--------------------------------------------------------------------------------------------------------------------
| KID       DIF   DIF  KID       DIF   DIF      DIF JOINT Welch            Mantel-Haenszel   Size TAP              |
| CLASS MEASURE  S.E.  CLASS MEASURE  S.E. CONTRAST  S.E.     t d.f. Prob.  Chi-squ  Prob. CUMLOR Number Name      |
|------------------------------------------------------------------------------------------------------------------|
| F       -6.40  1.94  M       -6.33  1.73     -.07  2.60  -.03   32 .9785                             1 1-4       |
| F        -.79   .74  M         .55   .67    -1.34  1.00 -1.35   31 .1883    1.846  .1743  -1.25     10 2-4-3-1   |
| F        4.21   .74  M         .59   .63     3.62   .97  3.72   31 .0008   13.345  .0003     -.     11 1-3-1-2-4 |
| F        2.71   .66  M        3.84   .73    -1.14   .98 -1.16   31 .2569    3.462  .0628     +.     13 1-4-3-2-4 |
|------------------------------------------------------------------------------------------------------------------|
Size of Mantel-Haenszel slice: MHSLICE = .010 logits
The most important numbers in Table 30.1: The DIF CONTRAST is the difference in difficulty of the item between the two groups. This should be at least 0.5 logits for DIF to be noticeable. "Prob." shows the probability of observing this amount of contrast by chance, when there is no systematic item bias effect. For statistically significant DIF on an item, Prob. < .05.
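The two criteria (noticeable size and statistical significance) can be combined into a simple screening rule. The following is a minimal sketch, not part of Winsteps: the function name `flag_dif` and its default thresholds are illustrative assumptions, and the rows are copied by hand from the DIF CONTRAST and Welch Prob. columns of the example output.

```python
def flag_dif(contrast, prob, min_size=0.5, alpha=0.05):
    """True when DIF is both noticeable in size and statistically significant.

    min_size and alpha are the rule-of-thumb values quoted in the text;
    adjust them to suit your own criteria.
    """
    return abs(contrast) >= min_size and prob < alpha


# (item name, DIF CONTRAST, Welch Prob.) from the example output
rows = [
    ("1-4", -0.07, 0.9785),
    ("2-4-3-1", -1.34, 0.1883),
    ("1-3-1-2-4", 3.62, 0.0008),
    ("1-4-3-2-4", -1.14, 0.2569),
]

flagged = [name for name, contrast, prob in rows if flag_dif(contrast, prob)]
print(flagged)  # ['1-3-1-2-4']
```

Items 10 and 13 have contrasts larger than 0.5 logits but are not flagged, because their probabilities exceed .05.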
DIF class specification defines the columns used to identify DIF classifications, using DIF= and the selection rules.
For summary statistics on each class, use Table 28.
Reading across the Table 30.1 columns:
PERSON CLASS identifies the CLASS of persons. PERSON is specified with PERSON=, e.g., the first CLASS here is "A".
DIF estimates with the iterative-logit method:
DIF MEASURE is the difficulty of this item for this class, with all else held constant, e.g., -.40 is the local difficulty for Class A of Item 1. The more difficult, the higher the DIF measure. The measures are conveniently listed in the Excel file for the DIF plots.
For the raw scores corresponding to these measures, see Table 30.2.
-.52> reports that this measure corresponds to an extreme maximum person-class score. EXTRSCORE= controls extreme score estimate.
1.97< reports that this measure corresponds to an extreme minimum person-class score. EXTRSCORE= controls extreme score estimate.
-6.91E reports that this measure corresponds to an item with an extreme score, which cannot exhibit DIF.
DIF S.E. is the standard error of the DIF MEASURE. A value of ".00" indicates that DIF cannot be observed in these data.
PERSON CLASS identifies the CLASS of persons, e.g., the second CLASS is "D".
DIF MEASURE is the difficulty of this item for this class, with all else held constant, e.g., -.52 is the local difficulty for Class D of Item 1. > means "extreme maximum score".
DIF S.E. is the standard error of the second DIF MEASURE.
DIF CONTRAST is the "effect size" in logits (or USCALE= units), the difference between the DIF MEASUREs, i.e., size of the DIF across the two classifications of persons, e.g., -.40 - -.52 = .11 (usually in logits). A positive DIF contrast indicates that the item is more difficult for the first, left-hand-listed CLASS.
If you want a sample-based effect size, then
effect size = DIF CONTRAST / (person sample measure S.D.)
JOINT S.E. is the standard error of the DIF CONTRAST = sqrt(first DIF S.E.² + second DIF S.E.²), e.g., 2.50 = sqrt(.11² + 2.49²)
Welch t gives the DIF significance as a Welch's (Student's) t-statistic ≈ DIF CONTRAST / JOINT S.E. The t-test is a two-sided test for the difference between two means (i.e., the estimates) based on the standard error of the means (i.e., the standard error of the estimates). The null hypothesis is that the two estimates are the same, except for measurement error.
d.f. is the joint degrees of freedom, computed according to Welch-Satterthwaite. When the d.f. are large, the t statistic can be interpreted as a unit-normal deviate, i.e., z-score.
INF means "the degrees of freedom are so large they can be treated as infinite", i.e., the reported t-value is a unit normal deviate.
Prob. is the probability of the reported t with the reported d.f., but interpret this conservatively, i.e., "barely significant is probably not significant". If you wish to make a Bonferroni multiple-comparison correction, compare this Prob. with your chosen significance level, e.g., .05, divided by the number of DIF tests in this Table.
Mantel-Haenszel reports the Mantel-Haenszel (1959) DIF test for dichotomies or the Mantel (1963) test for polytomies, using MHSLICE=. Statistics are reported when computable.
Chi-squ. is the Mantel-Haenszel for dichotomies or Mantel for polytomies chi-square with 1 degree of freedom.
Prob. is the probability of observing these data (or worse) when there is no DIF based on a chi-square value with 1 d.f.
Size CUMLOR (cumulative log-odds ratio in logits) is an estimate of the DIF size (scaled by USCALE=). When the size is not estimable, +. and -. indicate direction.
MH is sensitive to score frequencies. If you have missing data, or only small or zero counts for some raw scores, the MH statistic can go wild. Please try different values of MHSLICE= (thin and thick slicing) to see how robust the MH estimates are.
ITEM Number is the item entry number. ITEM is specified by ITEM=.
Name is the item label.
Each line in the Table is repeated with the CLASSes reversed.
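The arithmetic behind the DIF CONTRAST, JOINT S.E., Welch t and d.f. columns can be sketched as follows. This is an illustration under stated assumptions, not Winsteps' own code: the function names are invented here, the class counts `n1` and `n2` are hypothetical (they are not shown in Table 30.1), and the inputs are the rounded values printed for item 10, so the results agree with the Table only to within rounding.

```python
import math


def welch_dif(measure1, se1, n1, measure2, se2, n2):
    """Contrast, joint S.E., Welch t and Welch-Satterthwaite d.f.

    n1 and n2 are the numbers of observations behind each DIF measure
    (hypothetical here: Table 30.1 does not list them).
    """
    contrast = measure1 - measure2
    joint_se = math.sqrt(se1**2 + se2**2)
    t = contrast / joint_se
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = joint_se**4 / (se1**4 / (n1 - 1) + se2**4 / (n2 - 1))
    return contrast, joint_se, t, df


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests


# Item 10 (2-4-3-1): F = -.79 (S.E. .74), M = .55 (S.E. .67)
contrast, joint_se, t, df = welch_dif(-0.79, 0.74, 18, 0.55, 0.67, 17)
print(round(contrast, 2), round(joint_se, 2), round(t, 2))  # -1.34 1.0 -1.34

# With a hypothetical 18 DIF tests, compare each Prob. with .05 / 18
threshold = bonferroni_alpha(0.05, 18)
print(0.0008 < threshold)  # True: item 11 stays significant
```

The Table reports t = -1.35 for this item rather than -1.34, presumably because it is computed from unrounded measures and standard errors.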
ETS DIF Category                   DIF Contrast (Logits)        DIF Statistical Significance
C = moderate to large              |DIF| >= 1.5/2.35 = 0.64     p( |DIF| ≤ 1/2.35 = 0.43 ) < .05
B = slight to moderate             |DIF| >= 1/2.35 = 0.43       p( |DIF| = 0 ) < .05
A = negligible
C-, B- = DIF against focal group
ETS (Educational Testing Service) uses Delta units. 1 logit = 2.35 Delta units. 1 δ = (4/1.7) ln(α), where α is the odds-ratio.
Zwick, R., Thayer, D.T., Lewis, C. (1999). An Empirical Bayes Approach to Mantel-Haenszel DIF Analysis. Journal of Educational Measurement, 36(1), 1-28.
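The ETS categories above can be applied mechanically once the two significance probabilities are available. The sketch below is an assumption-laden illustration: `ets_category` is a hypothetical helper, and the caller must supply both p-values (one for the test that the DIF size is no larger than 0.43 logits, one for the test against no DIF), because Table 30.1 reports only the latter. The sign suffix (C-, B-) is omitted, since it depends on which class is treated as the focal group.

```python
C_SIZE = 1.5 / 2.35  # about 0.64 logits
B_SIZE = 1.0 / 2.35  # about 0.43 logits


def ets_category(contrast, p_exceeds_b_size, p_no_dif, alpha=0.05):
    """Classify a DIF contrast (in logits) as ETS category 'A', 'B' or 'C'.

    p_exceeds_b_size: probability for the hypothesis |DIF| <= 0.43 logits
    p_no_dif:         probability for the hypothesis of no DIF
    Both p-values are inputs; this sketch does not compute them.
    """
    size = abs(contrast)
    if size >= C_SIZE and p_exceeds_b_size < alpha:
        return "C"
    if size >= B_SIZE and p_no_dif < alpha:
        return "B"
    return "A"


# The p_exceeds_b_size values below are hypothetical, for illustration only
print(ets_category(3.62, 0.001, 0.0008))   # C
print(ets_category(-0.50, 0.40, 0.01))     # B
print(ets_category(-0.07, 0.90, 0.9785))   # A
```

An item with a large contrast but non-significant probabilities, such as item 10 in the example output, falls into category A under this rule.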
For meta-analysis, the DIF Effect Size = DIF Contrast / S.D. of the "control" CLASS (or the pooled CLASSes). The S.D. for each CLASS is shown in Table 28.
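As a minimal sketch of this effect-size calculation: the S.D. values and class counts below are hypothetical stand-ins for the figures you would read from Table 28, and the pooled S.D. uses the usual degrees-of-freedom-weighted formula, which is an assumption here because the text does not specify how to pool.

```python
import math


def pooled_sd(n1, sd1, n2, sd2):
    """Pooled standard deviation of two classes (assumed d.f.-weighted form)."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def dif_effect_size(contrast, sd):
    """DIF effect size = DIF CONTRAST / S.D. of the control or pooled classes."""
    if sd <= 0:
        raise ValueError("S.D. must be positive")
    return contrast / sd


# Hypothetical Table 28 values: 18 persons with S.D. 2.0, 17 with S.D. 2.2
sd = pooled_sd(18, 2.0, 17, 2.2)
effect = dif_effect_size(-1.34, sd)  # -1.34 is the contrast for item 10
print(round(sd, 2), round(effect, 2))
```

Using the S.D. of a single "control" class instead only requires passing that S.D. directly to `dif_effect_size`.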
Example: The estimated item difficulty for Females, the DIF MEASURE, is 2.85 logits, and for males the DIF MEASURE is 1.24 logits. So the DIF CONTRAST, the apparent bias of the item against Females, is 1.61 logits. An alternative interpretation is that the Females are 1.61 logits less able on the item than the males.
                             Males          Females
Item 13: +---------+---------+-+-------+-------+-+>> difficulty increases
        -1         0          1.24    +2     2.85  DIF measure
                               +---------------> = 1.61 DIF contrast