Internal Psychometric Guidelines For Classical Test Theory

This document describes the general guidelines by which Prometric internal psychometricians evaluate and flag internal items for additional review. These guidelines apply to those programs that utilize classical test theory.

Table 1: Summary of Statistical Specifications
Elements of Form Assembly and Statistical Review	Specifications/Standards
1. Range of item difficulties	P-values = .30 - .89 (optimal)*
2. Target value(s) for item discrimination indices	rpBis ≥ .20
3. Target ranges for estimates of internal consistency reliability	Alpha ≥ .80
4. Target ranges for estimates of classification consistency or reliability	Livingston ≥ .80
*Acceptable ranges are larger than optimal ranges and are explained below.

Intended Range of Item Difficulties

P-value = 0.30 to 0.89

Prometric staff are trained to recognize that an individual p-value neither represents an absolute, repeatable value nor warrants a concrete interpretation. Rather, Prometric psychometricians review all available item analysis information to evaluate trends. Note: p-values alone are insufficient for most item interpretations. All basic item reviews consider both p-values and rpBis before item disposition decisions are made.

Table 2: p-value Guidelines
p-value (easy to hard)	Item Interpretation
1.00 to 0.96	Unacceptable items with minimal measurement value that must be flagged for removal or revision by SMEs.
0.95 to 0.90	Very easy (possibly unacceptable) items: review rpBis for adequate discrimination. May need review by SMEs.
0.89 to 0.80	Fairly easy (acceptable) items: review rpBis to confirm discrimination.
0.79 to 0.40	Moderately easy to hard (acceptable) items: use if rpBis values are within specifications.
0.39 to 0.30	Difficult (acceptable) items: review rpBis closely, use if rpBis values are within specifications.
0.29 to 0.20	Very difficult (possibly unacceptable) items: review rpBis for adequate discrimination. May need review by SMEs.
0.19 to 0.00	Unacceptable items: Inappropriately difficult or otherwise flawed. Must be flagged for removal or revision by SMEs.

When an item's p-value is marginal, developers look at the item's rpBis. If the rpBis is high, there is more tolerance for keeping the item on the exam.
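As an illustration of how the Table 2 bands could be applied, the following Python sketch computes p-values from a matrix of 0/1 item scores and maps each one to a band. The function names, the short band labels, and the choice to round p-values to two decimals before banding are assumptions made for this example; they are not taken from Prometric's tooling.

```python
def item_p_values(responses):
    """Return the proportion of candidates answering each item correctly.

    `responses` is a list of candidate rows; each row is a list of 0/1 item scores.
    """
    n_candidates = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_candidates for i in range(n_items)]


# Lower bound of each Table 2 band, paired with a short illustrative label.
P_VALUE_BANDS = [
    (0.96, "unacceptable (too easy)"),
    (0.90, "very easy (possibly unacceptable)"),
    (0.80, "fairly easy (acceptable)"),
    (0.40, "moderately easy to hard (acceptable)"),
    (0.30, "difficult (acceptable)"),
    (0.20, "very difficult (possibly unacceptable)"),
    (0.00, "unacceptable (too difficult)"),
]


def classify_p_value(p):
    """Map a p-value to a Table 2 band after rounding to two decimals (an assumption)."""
    p = round(p, 2)
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must be between 0 and 1")
    for lower, label in P_VALUE_BANDS:
        if p >= lower:
            return label


# Four candidates, three items.
scores = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
for p in item_p_values(scores):
    print(p, classify_p_value(p))
```

A band label of this kind is only a first screen. As noted above, the item's rpBis is reviewed before any disposition decision.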

Target Value(s) for Item Discrimination Indices

rpBis = 0.20 to 1.00

The point-biserial correlation (rpBis) is used by Prometric psychometricians to determine the discriminating power of each item. Like other classical statistics, the use of rpBis is not an exact science. In some cases, low rpBis values can result from particularly high or low p-values, low item variance due to implausible distractors, low score variance due to homogeneity of candidates, or extremely skewed score distributions. Therefore, Prometric psychometricians are required to take several statistics into account as they review item analyses. Table 3 summarizes the guidelines developers use when reviewing item discrimination. Note that these guidelines assume the item is keyed properly and the sample of candidates is sufficiently large.

Table 3: rpBis Guidelines
rpBis (strong to weak)	Item Interpretation
1.00 to 0.50	Very strong (acceptable)
0.49 to 0.30	Strong (acceptable)
0.29 to 0.20	Acceptable (but may need review)
0.19 to 0.10	Marginal (possibly unacceptable) items: review text and distractors closely.
0.09 to 0.00	Weak (unacceptable) items: p-values are probably very high. Flag for removal or revision by SMEs.
-0.01 to -0.20	Unacceptable items: inappropriately difficult or otherwise flawed. Must be flagged for removal or revision by SMEs.
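The rpBis for an item is the Pearson correlation between the 0/1 item score and the candidates' total scores. The sketch below shows one way it could be computed and mapped to the Table 3 bands. Whether the item's own score is removed from the total (the "corrected" variant) is not stated in these guidelines, so the `corrected` parameter is an assumption, as are the function names, the rounding to two decimals, and the treatment of every negative value as unacceptable.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable has no variance
    return cov / sqrt(var_x * var_y)


def item_rpbis(responses, item, corrected=True):
    """Point-biserial of one item against the total score.

    With corrected=True the item's own score is excluded from the total
    (an assumption; the guidelines do not say which variant is used).
    """
    item_scores = [row[item] for row in responses]
    totals = [sum(row) - (row[item] if corrected else 0) for row in responses]
    return pearson(item_scores, totals)


# Lower bound of each non-negative Table 3 band, with a short illustrative label.
RPBIS_BANDS = [
    (0.50, "very strong (acceptable)"),
    (0.30, "strong (acceptable)"),
    (0.20, "acceptable (may need review)"),
    (0.10, "marginal (possibly unacceptable)"),
    (0.00, "weak (unacceptable)"),
]


def classify_rpbis(r):
    """Map an rpBis value to a Table 3 band after rounding to two decimals."""
    if r != r:  # NaN check
        return "undefined (no score variance)"
    r = round(r, 2)
    for lower, label in RPBIS_BANDS:
        if r >= lower:
            return label
    return "unacceptable (negative discrimination)"


scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
r = item_rpbis(scores, 0)
print(round(r, 2), classify_rpbis(r))
```

With only four candidates this example is far too small for a real item analysis; it is meant only to make the calculation concrete.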

Target Ranges for Estimates of Internal Consistency Reliability

Alpha = 0.80 or higher

Table 4 lists the interpretations used by the psychometric team for various ranges of alpha coefficients.

Table 4: Alpha Guidelines
Alpha	Internal Consistency Reliability Interpretation
Less than 0.60	Unacceptable coefficients that require new forms
0.60 to 0.69	Poor coefficients that require form revision or removal
0.70 to 0.79	Marginal coefficients that may require form review/revision
0.80 to 0.89	Good coefficients
0.90 or above	Excellent coefficients
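For reference, coefficient alpha can be computed from a scored response matrix as k/(k-1) times one minus the ratio of summed item variances to total score variance. The sketch below follows that textbook formula and then applies the Table 4 bands; the function names and short labels are assumptions for illustration. For dichotomously scored items this calculation is equivalent to KR-20.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Coefficient alpha for a list of candidate rows of item scores."""
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("alpha requires at least two items")
    item_vars = [pvariance([row[i] for row in responses]) for i in range(n_items)]
    total_var = pvariance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("alpha is undefined when all total scores are equal")
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


def classify_alpha(alpha):
    """Map an alpha coefficient to a Table 4 band (labels shortened)."""
    if alpha >= 0.90:
        return "excellent"
    if alpha >= 0.80:
        return "good"
    if alpha >= 0.70:
        return "marginal"
    if alpha >= 0.60:
        return "poor"
    return "unacceptable"


scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
alpha = cronbach_alpha(scores)
print(alpha, classify_alpha(alpha))
```

The same variance estimator (population or sample) must be used for both the item variances and the total variance; the ratio, and therefore alpha, is the same either way.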

The Target Range for Estimates of Classification Consistency or Reliability of the Pass/Fail Decision

r = 0.80 or higher

Prometric selected Livingston's squared-error loss method for computing decision consistency reliability. This method was selected because it can be interpreted like other reliability measures (discussed above). It is far less complex than threshold loss methods, and it can be run for all single-administration forms. The use of this statistic is consistent with Standard 2.3 in the Standards for Educational and Psychological Testing, p. 20.
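The guidelines do not reproduce the formula, but Livingston's coefficient is commonly written as K² = (r·σ² + (μ − C)²) / (σ² + (μ − C)²), where r is a reliability estimate such as alpha, σ² and μ are the variance and mean of total scores, and C is the cut score. The following sketch implements that textbook form; the function and parameter names are assumptions for this example.

```python
from statistics import mean, pvariance


def livingston_k2(total_scores, reliability, cut_score):
    """Livingston's squared-error loss coefficient for a pass/fail cut score.

    total_scores: candidates' total test scores
    reliability:  an internal consistency estimate (for example, alpha)
    cut_score:    the passing score, on the same scale as total_scores
    """
    mu = mean(total_scores)
    var = pvariance(total_scores)
    offset_sq = (mu - cut_score) ** 2
    if var + offset_sq == 0:
        raise ValueError("coefficient is undefined for constant scores at the cut")
    return (reliability * var + offset_sq) / (var + offset_sq)


totals = [3, 2, 1, 0]
print(livingston_k2(totals, reliability=0.75, cut_score=2))
```

When the cut score equals the mean score, the coefficient reduces to the reliability estimate itself; as the cut score moves away from the mean, the coefficient increases. This is why it can be read on the same scale as alpha.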

Prometric Recommendations - Item Bank Ratios

Prometric's internal standards and client recommendations for item banks are noted in Table 5 below.

Table 5: Recommendations for Item Banks for Standard Form-Based Delivery
Recommendation Level	Range
1. Minimal Target Range	1.5 to 2 times number of items per form
2. Acceptable Target Range	2 to 3 times number of items per form
3. Optimum Target Range	3 to 5 times number of items per form
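The ratios in the table above translate directly into item counts for a given form length. The sketch below is a small worked example; the function name, the dictionary keys, and the choice to round fractional counts up are assumptions.

```python
from math import ceil

# (low multiplier, high multiplier) of items per form for each recommendation level.
BANK_MULTIPLIERS = {
    "minimal": (1.5, 2),
    "acceptable": (2, 3),
    "optimum": (3, 5),
}


def bank_size_targets(items_per_form):
    """Return the (low, high) item bank size for each recommendation level."""
    return {
        level: (ceil(low * items_per_form), ceil(high * items_per_form))
        for level, (low, high) in BANK_MULTIPLIERS.items()
    }


print(bank_size_targets(100))
```

For a 100-item form, for example, the optimum target range corresponds to a bank of 300 to 500 items.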
