Reasons for Pretesting
Every testing program must develop a process for incorporating new content into its examinations. Pretesting items before they are used as scored items on a live exam serves two key purposes:
- Statistical Evaluation of Items: Pretesting allows the program to gather statistics on candidate performance for each new item. Regardless of how sound the test development process is, even well-constructed items can perform unexpectedly within the candidate population. Evaluating pretest statistics confirms that newly developed items perform within acceptable statistical parameters before an item affects a candidate's exam score.
- Collecting Statistics for Equating: To ensure that every candidate receives an exam of equitable difficulty, pre-equating of examination forms is a desirable test development method. Assembling exams from an overall bank to a specified difficulty level requires that the live items in the bank have statistics associated with them. A standardized, ongoing pretesting process continually feeds the item bank and ensures that pre-equating can be performed.
Item evaluation and pre-equating are both designed to create a valid testing process that is fair to all candidates. The combination of these processes within an overall development plan ensures that each live item presented to candidates is performing well and that each candidate receives an exam of equitable difficulty. This creates the foundation for a defensible testing program.
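The pre-equating idea above can be made concrete with a toy sketch under classical test theory: once live items carry banked p-values, a form's expected difficulty can be compared across forms before administration. The p-values and forms below are invented for illustration.

```python
# Toy illustration (made-up p-values) of the pre-equating idea under
# classical test theory: with banked item statistics, a form's expected
# difficulty can be compared across forms before administration.
def expected_raw_score(item_p_values):
    """Expected number-correct score on a form, given banked p-values."""
    return sum(item_p_values)

form_a = [0.62, 0.55, 0.71, 0.48]  # hypothetical banked p-values
form_b = [0.60, 0.58, 0.69, 0.50]
difference = abs(expected_raw_score(form_a) - expected_raw_score(form_b))
# A small difference indicates the two forms are of equitable difficulty.
```

Real pre-equating involves considerably more (content constraints, standard errors), but the banked statistics are what make even this simple comparison possible.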
The following information covers the main considerations for any program incorporating a pretesting process.
Mode of Delivery
There are different methodologies available for pretesting – the two main methodologies are (1) separate pretest forms and (2) pretesting embedded within an existing form.
Separate Pretest Forms
Some programs prefer to separate the pretesting process entirely from live exam administration. Accomplishing this requires creating separate pretest examinations that can be administered to the candidate population. Entire pretest exams are created with the same content proportions as the live exam form. Separate pretest forms are typically administered to volunteer candidates during special pretesting administrations. Volunteer candidates should represent, as closely as possible, the same type of candidate pool that would typically take the live examination.
The chief benefit of this approach is that the live testing experience is not affected in any way. Candidates who participate in the pretesting sessions do so voluntarily and with full awareness of the process. The drawbacks include (1) an extended timeframe for data collection and (2) potential skewing of the candidate pool and, consequently, of the pretest data. When a pretest process relies on volunteers, it generally takes longer to gather a sample of candidates large enough for analysis of the pretest data. In addition, a process that relies on volunteers inherently changes the composition of the candidate pool: it is typically the motivated, high-achieving candidates who volunteer for a pretest examination, so the pool is no longer representative of the full range of individuals who take the live exam. A pool weighted toward high performers can skew the resulting pretest data.
Pretest Items Embedded within Existing Form
A second pretesting methodology embeds a small percentage of pretest items within existing examination forms, allowing items to be pretested gradually during regular exam administrations. A benefit of this approach is that the candidates responding to the pretest items are the same candidates taking the live exam, which largely eliminates the potential for contamination of the candidate pool. Because the process does not rely on volunteers, it also allows pretest data to be collected efficiently, avoiding the recruitment delays described above.
The drawbacks of this approach stem from lengthening the examination. Increasing the number of items on an exam can increase candidate anxiety and fatigue. Second, fewer pretest items can be tested within an existing form than on a separate pretest form, so a protocol must be established to rotate pretest items through the form on a reasonable timeframe.
Disclosure to Candidates
Most test development professionals recommend that the pretesting process be disclosed to candidates prior to an exam administration. There are options, however, regarding how much information is disclosed to the candidate population.
- Knowledge of number of pretest items: Typically, candidates are told prior to the examination how many pretest items will appear on the exam. Candidates are also informed that the pretest items will not affect their overall score.
- Knowledge of exact pretest items: Typically, candidates are not told which items are the pretest items. This ensures that candidates answer the pretest items the same way they answer the live exam items (with the same effort to answer correctly).
Method of Presentation
If pretest items are embedded within an existing form, there are various ways of presenting them. Three methodologies are described below.
- Beginning of the Exam: All pretest items can be presented in a section at the beginning of the exam.
- End of the Exam: All pretest items can be presented in a section at the end of the exam.
- Distributed Throughout the Exam: Items can be distributed within the appropriate content sections within the exam.
To ensure that candidates answer the pretest items as they would a live item, Prometric recommends distributing the pretest items throughout the exam form. This helps ensure that candidates cannot identify a pretest section and modify their effort on those items.
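The distributed approach can be sketched as a simple form-assembly step. The function below is illustrative only (not Prometric's assembly logic) and ignores the content-section placement mentioned above; it randomly interleaves pretest items among live items so that no identifiable pretest block exists.

```python
import random

def assemble_form(live_items, pretest_items, seed=None):
    """Randomly interleave pretest items among live items.

    Illustrative only: a real assembly process would also place each
    pretest item within its appropriate content section.
    """
    rng = random.Random(seed)
    total = len(live_items) + len(pretest_items)
    # Choose which positions the pretest items will occupy.
    pretest_positions = set(rng.sample(range(total), len(pretest_items)))
    live_iter, pretest_iter = iter(live_items), iter(pretest_items)
    return [next(pretest_iter) if i in pretest_positions else next(live_iter)
            for i in range(total)]
```

With a 40-item live form and 4 pretest items, the pretest positions vary from form to form, so candidates cannot infer a fixed pretest section.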
Percentage of Pretest Items in an Existing Form
It is typically recommended that pretest items not exceed 10% of the total items on the exam (e.g., a 40-item exam should contain no more than 4 pretest items). Limiting the number of pretest items reduces the possibility of candidate fatigue and typically eliminates the need to extend testing time.
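As arithmetic, the 10% guideline is simply a cap rounded down to a whole item; a tiny hypothetical helper:

```python
import math

def max_pretest_items(total_items, cap=0.10):
    """Maximum number of pretest items under a percentage cap (floor)."""
    return math.floor(total_items * cap)

# A 40-item exam allows at most 4 pretest items under a 10% cap.
```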
Number of Candidate Exposures Prior to Analysis
For classical test theory, Prometric recommends a minimum of 100 candidate exposures per pretest item in order to evaluate statistical viability. Additional candidate exposures (above the minimum of 100) increase the stability of the candidate data and increase the generalizability of the pretest results.
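The two classical statistics referenced throughout this document can be computed directly from scored response data. Below is a minimal sketch in plain Python (not Prometric's software) of an item's p-value and its point-biserial correlation with total score; in practice these would be computed only after the 100-exposure minimum is met.

```python
def p_value(item_responses):
    """Proportion of candidates answering the item correctly (0/1 scored)."""
    return sum(item_responses) / len(item_responses)

def point_biserial(item_responses, total_scores):
    """Correlation (rpBis) between a 0/1-scored item and total scores."""
    n = len(item_responses)
    mean_x = sum(item_responses) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_responses, total_scores)) / n
    sd_x = (sum((x - mean_x) ** 2 for x in item_responses) / n) ** 0.5
    sd_y = (sum((y - mean_y) ** 2 for y in total_scores) / n) ** 0.5
    return cov / (sd_x * sd_y)
```

A positive point-biserial means higher-scoring candidates tend to answer the item correctly, which is the discrimination behavior the guidelines below look for.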
Optimal Parameters for Transition of Pretest to Live Item
The following section describes the general guidelines by which Prometric internal psychometricians evaluate pretest items. Although individual programs may differ, these guidelines are helpful for overall evaluation purposes. Please note that these guidelines apply only to those programs that utilize classical test theory.
Table 1: Summary of Statistical Specifications
| Elements of Form Assembly and Statistical Review | Specifications / Standards |
| --- | --- |
| 1. Range of item difficulties | p-values = .30 to .89 (optimal)* |
| 2. Target value(s) for item discrimination indices | rpBis > .20 |
| 3. Target ranges for estimates of internal consistency reliability | Alpha > .80 |
| 4. Target ranges for estimates of classification consistency or reliability | Livingston > .80 |

*Acceptable ranges are larger than optimal ranges and are explained below.
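For element 3 in Table 1, internal consistency can be estimated with Cronbach's alpha (equivalent to KR-20 for 0/1-scored items). A minimal sketch, using a made-up candidates-by-items response matrix for illustration:

```python
def cronbach_alpha(response_matrix):
    """Cronbach's alpha for a candidates-by-items matrix of 0/1 scores."""
    n_items = len(response_matrix[0])

    def var(values):
        # Population variance of a list of numbers.
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    item_vars = [var([row[j] for row in response_matrix])
                 for j in range(n_items)]
    total_var = var([sum(row) for row in response_matrix])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

matrix = [[1, 1, 1],  # toy data: 4 candidates, 3 items
          [1, 1, 0],
          [1, 0, 0],
          [0, 0, 0]]
# cronbach_alpha(matrix) evaluates to 0.75 for this toy matrix.
```

Operational forms would of course use far more items and candidates; the > .80 target in Table 1 applies to full-length live forms.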
Intended Range of Item Difficulties
p-value = 0.30 to 0.89
Prometric staff are trained to recognize that an individual p-value is neither an absolute, repeatable value nor one that warrants a concrete interpretation on its own. Rather, Prometric psychometricians review all available item analysis information to evaluate trends. Note: p-values alone are insufficient for most item interpretations; all basic item reviews incorporate both p-values and rpBis before item disposition decisions are made.
Table 2: p-value Guidelines
| p-value (easy to hard) | Item Interpretation |
| --- | --- |
| 1.00 to 0.96 | Unacceptable items with minimal measurement value; must be flagged for removal or revision by SMEs. |
| 0.95 to 0.90 | Very easy (possibly unacceptable) items: review rpBis for adequate discrimination. May need review by SMEs. |
| 0.89 to 0.80 | Fairly easy (acceptable) items: review rpBis to confirm discrimination. |
| 0.79 to 0.40 | Moderately easy to hard (acceptable) items: use if rpBis is within specifications. |
| 0.39 to 0.30 | Difficult (acceptable) items: review rpBis closely; use if rpBis is within specifications. |
| 0.29 to 0.20 | Very difficult (possibly unacceptable) items: review rpBis for adequate discrimination. May need review by SMEs. |
| 0.19 to 0.00 | Unacceptable items: inappropriately difficult or otherwise flawed. Must be flagged for removal or revision by SMEs. |
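The bands in Table 2 translate directly into a first-pass screening rule. The function below is a hypothetical encoding of that table (labels abbreviated), not Prometric's actual review software:

```python
def p_value_band(p):
    """Classify an item's p-value into the bands of Table 2."""
    if p >= 0.96:
        return "unacceptable: too easy"
    if p >= 0.90:
        return "very easy: review rpBis"
    if p >= 0.80:
        return "fairly easy: confirm rpBis"
    if p >= 0.40:
        return "acceptable: check rpBis"
    if p >= 0.30:
        return "difficult: review rpBis closely"
    if p >= 0.20:
        return "very difficult: review rpBis"
    return "unacceptable: too hard"
```

As the surrounding text stresses, a band label is only a starting point; the rpBis must be consulted before any disposition decision.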
When an item is found to be marginal, developers look at the item's rpBis. If the rpBis is high, more tolerance is given to keep that item on the exam.
Target Value(s) for Item Discrimination Indices
rpBis = 0.20 to 1.00
Table 3: rpBis Guidelines
| rpBis (strong to weak) | Item Interpretation |
| --- | --- |
| 1.00 to 0.50 | Very strong (acceptable). |
| 0.49 to 0.30 | Strong (acceptable). |
| 0.29 to 0.20 | Acceptable (but may need review). |
| 0.19 to 0.10 | Marginal (possibly unacceptable) items: review text and distractors closely. |
| 0.09 to 0.00 | Weak (unacceptable) items: p-values are probably very high. Flag for removal or revision by SMEs. |
| -0.01 to -0.20 | Unacceptable items: negative discrimination suggests a miskeyed or otherwise flawed item. Must be flagged for removal or revision by SMEs. |
After evaluation of item level statistics, decisions are made on each individual item. Items can be (1) accepted as is and placed in the live exam pool, (2) accepted with modifications and re-entered into the pretest pool, or (3) rejected from further use.
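The three dispositions above can be sketched as a rule combining the p-value and rpBis guidelines. The thresholds come from Tables 2 and 3, but the way they are combined here is illustrative only, not Prometric's actual procedure:

```python
def disposition(p, rpbis):
    """Illustrative pretest-item disposition from p-value and rpBis."""
    # Within the optimal ranges of Tables 2 and 3: accept outright.
    if 0.30 <= p <= 0.89 and rpbis >= 0.20:
        return "accept as live"
    # Clearly flawed per the guideline tables: reject.
    if p < 0.20 or p > 0.95 or rpbis < 0.0:
        return "reject"
    # Marginal otherwise: revise and re-enter the pretest pool.
    return "revise and re-pretest"
```

In practice, as noted for marginal items, psychometricians weigh the two statistics together (a high rpBis buys tolerance on p-value) and involve SMEs before rejecting an item.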