This document describes the general guidelines by which Prometric's internal psychometricians evaluate and flag items for additional review. These guidelines apply to programs that utilize classical test theory.
Elements of Form Assembly and Statistical Review
1. Range of item difficulties
P-values = .30 - .89 (optimal)*
2. Target value(s) for item discrimination indices
rpBis > .20
3. Target ranges for estimates of internal consistency reliability
Alpha > .80
Intended Range of Item Difficulties
Prometric staff are trained to recognize that individual p-values are neither absolute, repeatable values nor statistics that warrant a concrete interpretation on their own. Rather, Prometric psychometricians review all available item analysis information to evaluate trends. Note: p-values alone are insufficient for most item interpretations; all basic item reviews incorporate both p-values and rpBis before item disposition decisions are made.
1.00 to 0.96: Unacceptable items with minimal measurement value that must be flagged for removal or revision by SMEs.
0.95 to 0.90: Very easy (possibly unacceptable) items; review rpBis for adequate discrimination. May need review by SMEs.
0.89 to 0.80: Fairly easy (acceptable) items; review rpBis to confirm discrimination.
0.79 to 0.40: Hard to moderately easy (acceptable) items; use if rpBis values are within specifications.
When an item is found to be marginal, developers look at the item's rpBis. If the rpBis is high, more tolerance is given to keep that item on the exam.
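The p-value and band review described above can be sketched in code. This is a minimal illustration, not Prometric's implementation; the function names are hypothetical, and the difficulty bands follow the table above.

```python
# Classical p-value: the proportion of candidates answering an item correctly.
def p_value(responses):
    """responses: list of 0/1 item scores across candidates."""
    return sum(responses) / len(responses)

def difficulty_band(p):
    """Map a p-value to the review bands listed above (hypothetical helper)."""
    if p >= 0.96:
        return "unacceptable (flag for removal/revision)"
    if p >= 0.90:
        return "very easy (review rpBis)"
    if p >= 0.80:
        return "fairly easy (confirm rpBis)"
    if p >= 0.40:
        return "acceptable (use if rpBis in spec)"
    return "below tabled ranges (review)"
```

For example, an item answered correctly by 3 of 4 candidates has a p-value of 0.75 and falls in the acceptable 0.79–0.40 band.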
Target Value(s) for Item Discrimination Indices
The point-biserial correlation (rpBis) is used by Prometric psychometricians to determine the discrimination power of each item. Like other classical statistics, the use of rpBis is not an exact science. In some cases, low rpBis values can result from particularly high or low p-values, low item variance due to implausible distractors, low score variance due to homogeneity of candidates, or extremely skewed score distributions. Therefore, Prometric psychometricians are required to take several statistics into account as they review item analyses. Table 3 summarizes the guidelines developers use when reviewing item discrimination. Note that these guidelines assume the item is keyed properly and that the sample of candidates is sufficiently large.
1.00 to 0.50: Very strong (acceptable)
0.49 to 0.30: Strong (acceptable)
0.29 to 0.20: Acceptable (but may need review)
0.19 to 0.10: Marginal (possibly unacceptable) items; review text and distractors closely.
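The rpBis is the Pearson correlation between a dichotomous (0/1) item score and candidates' total scores. A minimal sketch follows; it is illustrative only, and note that operational item analyses often use the corrected form, which removes the item from the total before correlating.

```python
from math import sqrt

def point_biserial(item, totals):
    """Pearson correlation between 0/1 item scores and total test scores.
    item:   list of 0/1 scores on one item, one per candidate
    totals: list of total test scores for the same candidates
    """
    n = len(item)
    mx = sum(item) / n
    my = sum(totals) / n
    # Population covariance and standard deviations
    cov = sum((x - mx) * (y - my) for x, y in zip(item, totals)) / n
    sx = sqrt(sum((x - mx) ** 2 for x in item) / n)
    sy = sqrt(sum((y - my) ** 2 for y in totals) / n)
    return cov / (sx * sy)
```

An item answered correctly only by the highest scorers yields a high rpBis (near 1.0), while an item that the weakest candidates answer correctly as often as the strongest yields an rpBis near zero.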
Target Ranges for Estimates of Internal Consistency Reliability
Table 4 lists the interpretations used by the psychometric team for various ranges of alpha coefficients.
Less than 0.60: Unacceptable coefficients that require new forms
0.60 to 0.69: Poor coefficients that require form revision or removal
0.70 to 0.79: Marginal coefficients that may require form review/revision
0.80 to 0.89: Acceptable coefficients
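The alpha coefficient referenced above is Cronbach's alpha: k/(k-1) times one minus the ratio of summed item variances to total-score variance. A minimal sketch, using population variances and hypothetical function names, is:

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix.
    scores: list of candidates, each a list of k item scores.
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(scores[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [var([cand[i] for cand in scores]) for i in range(k)]
    total_var = var([sum(cand) for cand in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

When items covary strongly (candidates who get one item right tend to get the others right), alpha approaches 1.0; when items are unrelated, alpha falls toward zero or below.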
The Target Range for Estimates of Classification Consistency or Reliability of the Pass/Fail Decision
Prometric selected Livingston's squared-error loss method for computing decision consistency reliability. This method was selected because it can be interpreted like the other reliability measures discussed above, is far less complex than threshold loss methods, and can be run for all single-administration forms. The use of this statistic is consistent with Standard 2.3 in the Standards for Educational and Psychological Testing, p. 20.
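Livingston's coefficient can be computed from a conventional reliability estimate plus the distance between the mean score and the cut score. A hedged sketch of the usual formulation follows; the function name and inputs are illustrative, not Prometric's implementation.

```python
def livingston_k2(alpha, mean, variance, cut_score):
    """Livingston's squared-error loss coefficient:
    K^2 = (alpha * S^2 + (M - C)^2) / (S^2 + (M - C)^2)
    where alpha is a single-administration reliability estimate,
    M the mean score, S^2 the score variance, C the cut score.
    """
    dev2 = (mean - cut_score) ** 2
    return (alpha * variance + dev2) / (variance + dev2)
```

Note that K^2 is never lower than alpha and equals alpha only when the mean score sits exactly at the cut score; the farther the cut score lies from the mean, the more consistent pass/fail classifications become.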
Prometric Recommendations - Item Bank Ratios
Prometric's internal standards and client recommendations for item banks are noted in Table 1 below.
1. Minimal Target Range
1.5 to 2 times number of items per form
2. Acceptable Target Range
2 to 3 times number of items per form
3. Optimum Target Range
3 to 5 times number of items per form
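The bank-size targets in Table 1 are simple multiples of form length. A small illustrative helper (hypothetical name, not a Prometric tool):

```python
def bank_size_targets(form_length):
    """Item-bank size ranges per Table 1: minimal 1.5-2x,
    acceptable 2-3x, and optimum 3-5x the number of items per form."""
    return {
        "minimal": (1.5 * form_length, 2 * form_length),
        "acceptable": (2 * form_length, 3 * form_length),
        "optimum": (3 * form_length, 5 * form_length),
    }
```

For a 100-item form, the optimum bank would hold 300 to 500 items.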