How To Build In Security And Flexibility
Some clients prefer exams that are not fixed forms, but that can be automatically and randomly generated from a bank of items at the time the candidate sits down to test. Prometric has the capability to develop examination banks that support several types of bank-based testing.
Linear on-the-Fly Testing (LOFT).
LOFT is the assembly of pre-equated forms at the testing center just before or during the administration of the test. LOFT (Figure 2) is used to generate unique comparable fixed forms for each test taker. LOFT is possible when all items are pre-tested and placed on a common scale. To be practical, LOFT must be administered using computer-based testing (CBT).
The construction of the test form directly affects the construction of the item pool for LOFT. Most LOFT item pools contain at least 10 times the number of items needed for any one form. Item pools are assembled using statistical and content specifications with as much attention to detail as if a single test were being assembled (Ariel, van der Linden, & Veldkamp, 2006). Each item pool is constructed from an item vat that contains many tried items with item statistics and content specifications (Way, 1998), as well as indicators for cueing and overlapping content. Item vats are the basis for assembling item pools for CBT architectures that require many items, such as LOFT.
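The content-sampling side of LOFT assembly can be sketched in a few lines of Python. This is a minimal illustration and not Prometric's assembly engine: the blueprint, the item records, and the function name assemble_loft_form are all hypothetical, and the sketch deliberately ignores the statistical targets and the cueing and overlap indicators that an operational pool would also enforce.

```python
import random
from collections import defaultdict

def assemble_loft_form(pool, blueprint, rng):
    """Draw one form by sampling the required number of items per content area."""
    by_area = defaultdict(list)
    for item in pool:
        by_area[item["area"]].append(item)
    form = []
    for area, n_needed in blueprint.items():
        candidates = by_area[area]
        if len(candidates) < n_needed:
            raise ValueError(f"pool has too few items for area {area!r}")
        # sample() draws without replacement, so no item repeats on a form
        form.extend(rng.sample(candidates, n_needed))
    return form

# Hypothetical blueprint: a 10-item form over two content areas.
blueprint = {"A": 6, "B": 4}
# Hypothetical pool holding 10 times the form length in each area.
pool = [
    {"id": f"{area}{i}", "area": area}
    for area, n in blueprint.items()
    for i in range(10 * n)
]
rng = random.Random(0)  # seeded only so the sketch is repeatable
form = assemble_loft_form(pool, blueprint, rng)
```

Each call with a fresh random state yields a different form that still satisfies the content blueprint, which is the property that makes every candidate's form unique yet comparable in coverage.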
LOFT with Testlets.
Testlet-level LOFT uses pre-assembled testlets rather than individual items to build individualized forms at the testing center. Each item belongs to only one testlet. Testlets are constructed either to represent the entire test specification (Figure 3) or to focus on different sections of the test blueprint (Figure 4). In the former case, a randomly chosen set of parallel testlets is combined to create the final form. In the latter case, one testlet is randomly chosen for each content area, and the chosen testlets are combined to create the final form. Most testlets contain 15 to 25 items, depending on the test specifications.
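The two ways of building a form from testlets can be sketched as follows. The function names, pool sizes, and item labels are hypothetical and chosen only for illustration; the testlets are assumed to have been assembled and equated beforehand.

```python
import random

def form_from_parallel_testlets(testlets, n_testlets, rng):
    """Whole-blueprint case: pick several parallel testlets at random."""
    chosen = rng.sample(testlets, n_testlets)
    return [item for testlet in chosen for item in testlet]

def form_from_section_testlets(testlets_by_section, rng):
    """Section case: pick one testlet at random for each content area."""
    form = []
    for section, testlets in testlets_by_section.items():
        form.extend(rng.choice(testlets))
    return form

rng = random.Random(1)  # seeded only so the sketch is repeatable

# Hypothetical pool: 10 parallel testlets of 20 items, 4 testlets per form.
parallel_pool = [[f"T{t}-I{i}" for i in range(20)] for t in range(10)]
whole_form = form_from_parallel_testlets(parallel_pool, 4, rng)

# Hypothetical pool: 3 sections, 5 testlets each, with section-specific sizes.
sizes = {"S1": 15, "S2": 20, "S3": 25}
section_pool = {
    s: [[f"{s}-T{t}-I{i}" for i in range(n)] for t in range(5)]
    for s, n in sizes.items()
}
section_form = form_from_section_testlets(section_pool, rng)
```

Because every item belongs to exactly one testlet, neither function needs a separate check for duplicated items on a form.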
Testlets may be constructed using classical, Rasch, or item response theory models. LOFT with testlets is appropriate when items are pre-tested and when (a) the test blueprint is simple enough to be sampled with a single testlet and/or (b) the pool is big enough to create multiple parallel testlets. LOFT with testlets must be administered using CBT.
The item volume requirement for LOFT with testlets, where each testlet is equivalent in content and statistical characteristics to every other testlet in the pool, is about five full-length test forms. Of course, more items translate into more possible combinations of unique test forms, with the same testlet possibly appearing on many different but unique test forms. For LOFT with testlets that are assembled within different sections of the test blueprint, the item requirement increases to about ten full-length test forms because of the differences in the number of questions required in each section of the blueprint.
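A short calculation shows how quickly the number of unique forms grows with pool size. The figures are assumptions made for the arithmetic only: a form built from 4 parallel testlets, a pool worth five full-length forms, and, for the section design, 4 sections with ten forms' worth of testlets (10 per section).

```python
from math import comb, prod

# Whole-blueprint design (hypothetical sizes).
testlets_per_form = 4
pool_testlets = 5 * testlets_per_form            # five full-length forms = 20 testlets
distinct_forms = comb(pool_testlets, testlets_per_form)
exposure_rate = testlets_per_form / pool_testlets  # share of forms showing a given testlet

# Section design (hypothetical sizes): one testlet drawn per section.
testlets_per_section = [10, 10, 10, 10]
distinct_section_forms = prod(testlets_per_section)

print(distinct_forms)          # 4845
print(exposure_rate)           # 0.2
print(distinct_section_forms)  # 10000
```

Under these assumptions, thousands of distinct forms exist even though any single testlet still appears on one form in five, which is why pool rotation remains relevant for exposure control.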
Item vats are large collections of tried questions (Way, 1998) that are used to construct the LOFT item pools that are subsequently released into the field for administration. Pools are often rotated in and out of different administration windows to help with exposure control and as a measure intended to maintain test security and the integrity of the scores (Ariel, Veldkamp, & van der Linden, 2004). However, if there is a concerted effort on the part of some test-takers to breach the security of the test content, these rotation measures are not invulnerable.
Figure 3. LOFT with Testlets Across the Whole Blueprint
Figure 4. LOFT with Testlets by Sections
Computerized Adaptive Testing (CAT-FL, CAT-VL)
A computerized adaptive test administers items that are near the individual test taker's level of ability (see Figure 5). This yields more efficient measurement than is possible with non-adaptive forms, but it also creates the perception among test takers that adaptive tests are more difficult than fixed forms. The perception arises because the items selected for any one examinee are targeted to that individual's proficiency, as estimated from the items already administered in the testing session. The measurement efficiency can be leveraged to create a fixed-length test (CAT-FL) that yields more precise scores than a non-adaptive form, or a variable-length test (CAT-VL) that is shorter than a non-adaptive form of comparable precision. CAT is most appropriate when precise measurement is needed all along the ability scale. Number-correct or summed scoring will not work with adaptive testing; Rasch or IRT scoring methods must be used. These methods take into account the invariant Rasch or item response theory parameters of each item that is answered correctly or incorrectly. CAT must be administered using CBT.
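The core loop of an adaptive test under the Rasch model can be sketched as below. This is one common textbook approach (maximum-information item selection with a clamped Newton-Raphson ability estimate) and is not claimed to be the algorithm used operationally; all function names and the difficulty values are hypothetical.

```python
import math

def rasch_prob(theta, b):
    """Probability of a correct answer for ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Rasch item information; largest when b is closest to theta."""
    p = rasch_prob(theta, b)
    return p * (1.0 - p)

def next_item(theta, difficulties, used):
    """Index of the unused item that is most informative at the current theta."""
    available = [i for i in range(len(difficulties)) if i not in used]
    return max(available, key=lambda i: item_information(theta, difficulties[i]))

def estimate_theta(responses, theta=0.0, bound=4.0, iterations=25):
    """Maximum-likelihood ability from (difficulty, score) pairs, score 0 or 1.

    The estimate is clamped to [-bound, bound] because the likelihood has no
    finite maximum when every answer is right or every answer is wrong.
    """
    if not responses:
        return theta
    for _ in range(iterations):
        probs = [rasch_prob(theta, b) for b, _ in responses]
        gradient = sum(x - p for (_, x), p in zip(responses, probs))
        information = sum(p * (1.0 - p) for p in probs)
        theta = max(-bound, min(bound, theta + gradient / information))
    return theta

# Two candidates with the same number correct (1 of 2) on different items.
hard_items = estimate_theta([(1.0, 1), (1.0, 0)])
easy_items = estimate_theta([(-1.0, 1), (-1.0, 0)])
```

The last two lines illustrate why number-correct scoring fails here: both candidates answered one of two items correctly, yet the candidate who saw the harder items receives the higher ability estimate.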
Figure 5. Computerized Adaptive Testing
Computerized Mastery Testing (CMT)
A problem for credentialing boards that employ linear or CAT methods of administration is that some pass-fail decisions are made incorrectly, with no method to determine or limit that decision error. Classification errors, which reflect these incorrect pass-fail decisions, are of two types: (a) false positives, which involve passing individuals who should fail, and (b) false negatives, which entail failing individuals who should pass.
These incorrect decisions occur because tests are almost never perfect measures of the knowledge and skills of interest. Test questions or problem situations are only a sample of all those relevant to the job that could have been asked, and those that were asked may give a misleading picture of the capabilities of some candidates. Typical non-computer-based solutions to avoiding incorrect decisions about a candidate's pass-fail status involve raising or lowering the cutoff score for a fixed-length test. Moving the cutoff reduces the more important classification error, but it increases the other classification error at the same time. Computerized mastery testing was designed to take advantage of the computer and solve this incorrect-decision problem for clients while not requiring the large resources that CAT requires.
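The trade-off created by moving a fixed cutoff can be made concrete with a simple binomial model. The model is a simplifying assumption introduced here for illustration (each item answered correctly with the same probability, independently), and the test length, cutoffs, and candidate proficiencies are hypothetical.

```python
from math import comb

def pass_probability(true_p, n_items, cutoff):
    """P(score >= cutoff) when each item is answered correctly with probability true_p."""
    return sum(
        comb(n_items, k) * true_p**k * (1 - true_p) ** (n_items - k)
        for k in range(cutoff, n_items + 1)
    )

n_items = 100
should_fail = 0.65  # hypothetical candidate just below the standard
should_pass = 0.75  # hypothetical candidate just above the standard

for cutoff in (68, 70, 72):
    false_positive = pass_probability(should_fail, n_items, cutoff)
    false_negative = 1 - pass_probability(should_pass, n_items, cutoff)
    print(cutoff, round(false_positive, 3), round(false_negative, 3))
```

As the cutoff rises, the false-positive probability for the weaker candidate falls while the false-negative probability for the stronger candidate rises, so a fixed-length test can only trade one error for the other.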
In a computerized mastery test (CMT), some candidates are administered more questions than other candidates. The questions in a CMT examination are subdivided into smaller fixed-length groups of equal numbers of nonoverlapping questions covering all the content defined in the test specifications. These are the same test specifications that resulted from a standard job analysis. We call these small groups of questions testlets. The testlet size used in any CMT examination is directly related to the smallest number of questions that can be asked and still proportionately cover the entire test plan. (We have found that anywhere from 15 to 25 questions per testlet fit most examinations' test specifications tables.) In a CMT examination, each testlet would be constructed to be identical (equal) to every other testlet in average difficulty and spread of scores and each would be designed to cover the entire test content plan in the same way.
In a CMT examination, all candidates are first administered a base test. (We can think of the base test as the first stage of a multistage testing process.) The base test is composed of multiple testlets selected at random from a pool of nonoverlapping, equal testlets. Candidates performing at extreme levels (high or low) on this base test are passed or failed immediately following completion. Candidates with intermediate performance, for whom an incorrect decision is most probable, are administered additional questions in the form of single testlets, giving them additional opportunity to demonstrate that they have met the established standard. This process continues until the full-length test is reached, at which point a final pass-fail decision is made in the same way as in a full-length linear examination. The final full-length cutoff score is determined just as a linear test cutoff score is determined: a cut score study is conducted and the client decides on the cutoff score.
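The staged decision rule described above can be sketched as a short function. Everything numeric here is hypothetical: the 20-item testlets, the two-testlet base test, the eight-testlet full length, and especially the fail and pass bounds at each stage, which in practice would follow from the client's cutoff score and error tolerances.

```python
def run_cmt(testlet_scores, base_testlets, stage_bounds, final_cutoff):
    """Return (decision, testlets_administered) for one candidate.

    testlet_scores: number correct on each testlet, in administration order,
        listed out to the full-length test.
    stage_bounds: {testlets_administered: (fail_at_or_below, pass_at_or_above)}
        on the cumulative score, for every stage before the full-length test.
    """
    total = sum(testlet_scores[:base_testlets])
    administered = base_testlets
    while administered < len(testlet_scores):
        fail_at_or_below, pass_at_or_above = stage_bounds[administered]
        if total <= fail_at_or_below:
            return "fail", administered
        if total >= pass_at_or_above:
            return "pass", administered
        # Still in the continue region: administer one more testlet.
        total += testlet_scores[administered]
        administered += 1
    # Full length reached: apply the single linear-test cutoff.
    return ("pass" if total >= final_cutoff else "fail"), administered

# Hypothetical bounds for 20-item testlets and a 70% full-length cutoff.
stage_bounds = {2: (22, 34), 3: (36, 48), 4: (50, 62),
                5: (64, 76), 6: (78, 90), 7: (93, 103)}
final_cutoff = 112  # 70% of 160 items

strong = run_cmt([18] * 8, 2, stage_bounds, final_cutoff)
weak = run_cmt([14, 13, 12, 10, 14, 14, 14, 14], 2, stage_bounds, final_cutoff)
borderline = run_cmt([14] * 8, 2, stage_bounds, final_cutoff)
```

With these assumed bounds, the strong candidate passes on the base test alone, the weak candidate continues twice and then fails after four testlets, and the borderline candidate is carried to the full-length test before a decision is made.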
The accompanying figure shows how one examinee might proceed through the CMT. Notice that there are seven stages of testing and that, after the first stage, the candidate is still in the "continue" region and so receives an additional testlet. In this example, testing continues until the third stage, when the examinee falls in the fail region and testing stops.
One advantage of CMT over linear testing is that it permits the client to specify their relative tolerance for making either decision error. The shape of the pass-continue-fail regions shown in Figure 1 will change based on these client decisions. In addition to setting the cutoff score, the client decides which decision error is more serious or if they are equally serious. Our preliminary research shows that we can classify most candidates using the CMT model well within those tolerances (losses) expressed by the client.
A second advantage of CMT over CAT is that fewer questions are required to create a testlet pool than are required to create a CAT (calibrated) item pool. We have found that anywhere from three to five linear test forms with a few overlapping (common) items are all that are necessary to form an adequate testlet pool. Also, large samples of candidates are not necessary. We have developed CMT methods that do not use item response theory (IRT), but still take advantage of the computer. (Some of our CMT models do use IRT, while others do not. Those CMT models that do not use IRT are very easy to explain to candidates, since they use number of questions correct in the computation of scores.) In fact, some of our CMT models do not require items to be conditionally independent from each other nor is it required that test content be unidimensional. These are typical requirements of CAT item pools that use IRT.
An Example of How One Candidate Might Proceed through a CMT Examination
(see Kim & Cohen, 1998)
Prometric generates a forms assembly report that captures: (a) test form descriptive statistics in the raw and report score scales, (b) item difficulty, discrimination, and response time statistics by item, (c) conditional standard errors of measurement for each possible score (if appropriate), (d) test information and test characteristic functions (if appropriate), (e) compliance of each form with the test blueprint, (f) test time histograms, and (g) total test score distributions (if appropriate).
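A few of the report elements listed above can be illustrated with a small calculation: classical item difficulty (proportion correct), raw-score descriptive statistics, and a blueprint compliance check. This is a sketch on made-up data, not the format or content of the actual report; the function name form_report and the dictionary keys are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def form_report(responses, item_areas, blueprint):
    """Summarize one form.

    responses: one list of 0/1 item scores per candidate.
    item_areas: content area of each item on the form, in item order.
    blueprint: {content_area: required_item_count}.
    """
    n_items = len(item_areas)
    totals = [sum(row) for row in responses]
    return {
        # Classical difficulty: proportion of candidates answering correctly.
        "item_p_values": [mean(row[i] for row in responses) for i in range(n_items)],
        "raw_score_mean": mean(totals),
        "raw_score_sd": pstdev(totals),
        # The form complies if its item counts per area match the blueprint.
        "blueprint_ok": Counter(item_areas) == Counter(blueprint),
    }

# Made-up data: four candidates, three items.
responses = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]
report = form_report(responses, ["A", "A", "B"], {"A": 2, "B": 1})
```

Discrimination, response time, conditional standard errors, and test information functions would need additional data and models and are omitted from this sketch.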