What are the advantages of CAT vs non-CAT in terms of time saving and measurement accuracy?

Sean Keeley, Joanne Parkes

Research output: Contribution to conferencePaperpeer-review

Abstract

Computerised-adaptive testing (or CAT) systems are often credited with major savings in test time over static tests, as well as increased measurement accuracy. Few studies have been conducted to show the extent of these savings, and this study is intended to provide real-life estimates of the actual savings (or not) in high-stakes selection. We will examine the time differences for CAT versus non-CAT (static, fixed-time) versions of several cognitive ability tests, as well as comparing the reliability (internal consistency) of the tests taken in CAT and non-CAT forms.

CAT is a sophisticated test delivery method that uses a computer to assemble and deliver a customised test for each candidate, aiming to measure psychological constructs such as ability, achievement, attitudes and personality traits as efficiently and effectively as possible. CAT successively selects questions so as to maximise the precision of the test, based on what is already known about the candidate from their previous answers. From the candidate's perspective, the difficulty of the test appears to tailor itself to his or her level of ability.
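To make the selection mechanism concrete, here is a minimal, hypothetical sketch of a CAT loop under a two-parameter logistic (2PL) model, not the actual system used in this study: each item is chosen to maximise Fisher information at the current ability estimate, the estimate is updated by expected a posteriori (EAP) scoring, and testing stops once the posterior standard error falls below a target. The item bank, parameters and stopping values are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibrated 2PL item bank: discriminations a, difficulties b.
a = rng.uniform(0.8, 2.0, size=200)
b = rng.normal(0.0, 1.0, size=200)

grid = np.linspace(-4, 4, 161)      # quadrature grid for EAP scoring
prior = np.exp(-0.5 * grid**2)      # standard-normal prior (unnormalised)

def prob(theta, j):
    """2PL probability of a correct response to item j at ability theta."""
    return 1.0 / (1.0 + np.exp(-a[j] * (theta - b[j])))

def info(theta, j):
    """Fisher information of item j at ability theta."""
    p = prob(theta, j)
    return a[j]**2 * p * (1.0 - p)

true_theta = 1.2                    # simulated candidate
posterior = prior.copy()
administered, theta_hat, sem = [], 0.0, np.inf

while sem > 0.45 and len(administered) < 40:   # illustrative SEM target and item cap
    # Select the unused item with maximum information at the current estimate.
    candidates = [j for j in range(len(a)) if j not in administered]
    j = max(candidates, key=lambda j: info(theta_hat, j))
    administered.append(j)
    # Simulate the candidate's response and update the posterior over theta.
    x = rng.random() < prob(true_theta, j)
    posterior *= prob(grid, j) if x else (1.0 - prob(grid, j))
    w = posterior / posterior.sum()
    theta_hat = float(np.sum(w * grid))                       # EAP ability estimate
    sem = float(np.sqrt(np.sum(w * (grid - theta_hat)**2)))   # posterior SD as SEM

print(len(administered), round(theta_hat, 2), round(sem, 2))
```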

As early as 1984, Weiss and Kingsbury showed that CAT required far fewer test items to arrive at equally accurate scores when compared with static tests. It was not evident whether there were major savings in completion time; candidates may have taken just as long to complete the tests, although there may have been advantages in other areas. Nevertheless, savings in time should be expected, as the number of questions administered is generally reduced by more than 50%. This saving is almost always reported as a reduction in the number of items administered and answered rather than as actual time saved. Two tests of equal length may have different inherent time requirements due to differences in item difficulty, as more difficult items may require more time to answer (Bridgeman, Laitusis & Cline, 2007).

A characteristic of CAT assessment is its ability to provide uniformly accurate test scores for a wide range of candidates across the whole of the latent trait (i.e. whatever characteristic is being measured) (Thissen & Mislevy, 2000). Static, fixed-form tests tend to provide highly precise measurement for candidates scoring in the middle of the distribution, but increasingly poor precision for candidates scoring at the extremes. This difference in measurement accuracy is rarely discussed, but the increase in error is likely to have a major impact when very high cut-off scores are set or where inappropriate norm groups have been applied.
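The precision claim follows from a standard IRT relationship (a textbook result, stated here under a 2PL assumption rather than as the model used by these particular tests): the conditional standard error of measurement is the inverse square root of the test information function,

```latex
SE(\theta) = \frac{1}{\sqrt{I(\theta)}}, \qquad
I(\theta) = \sum_{j=1}^{n} a_j^2\, P_j(\theta)\bigl(1 - P_j(\theta)\bigr), \qquad
P_j(\theta) = \frac{1}{1 + e^{-a_j(\theta - b_j)}}
```

A fixed form concentrates its item difficulties around the average candidate, so I(\theta) peaks mid-range and SE(\theta) grows toward the extremes; a CAT keeps selecting items with b_j near the current ability estimate, holding SE(\theta) roughly constant across the trait range.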

Questions
There are two questions that are being addressed in this study:
• What are the differences in completion time between CAT versions and non-CAT versions of the same ability tests?
• What are the differences in measurement accuracy between CAT versions and non-CAT versions of the same ability tests?

Methods
The sample sizes are likely to be in the tens of thousands for both groups. The candidates are real-life applicants for high-stakes roles, and will have taken either CAT or non-CAT versions of the same ability tests: a test of abstract reasoning, and up to two tests of deductive reasoning.

The CAT versions differ for each candidate, as they dynamically adapt to the answers given. The non-CAT versions present items in a static, fixed form: all candidates receive the same questions in the same order. All of the items in both the CAT and non-CAT versions have been calibrated on the same scale, and IRT (Item Response Theory) scoring is used for both versions (Lord, 1980). It is therefore possible to calculate the SEM for each candidate for each test instance (i.e. the actual test that a candidate takes), regardless of version, and it is this calculation that will allow us to compare the measurement accuracy of the two versions.
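One standard way to compute a per-instance SEM, sketched here under the same illustrative 2PL assumption (the operational scoring may differ), is to evaluate the information contributed by the items a candidate actually saw; the same calculation covers both a CAT session and a fixed form, since only the administered items differ.

```python
import numpy as np

def instance_sem(theta_hat, a, b):
    """SEM for a single test instance under a 2PL model: the inverse square
    root of the test information summed over the items actually administered
    (a = discriminations, b = difficulties), evaluated at the candidate's
    ability estimate theta_hat."""
    p = 1.0 / (1.0 + np.exp(-a * (theta_hat - b)))
    return 1.0 / np.sqrt(np.sum(a**2 * p * (1.0 - p)))

# Illustrative fixed form: 30 items with difficulties spread from -2 to 2.
sem_static = instance_sem(0.3, a=np.full(30, 1.2), b=np.linspace(-2.0, 2.0, 30))
```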

Very accurate timing calculations are also possible for both versions of the tests. For candidates taking the CAT version, the test terminates automatically once a threshold level of measurement accuracy is met; for the ability tests used in this study, this is a standard error of measurement of 0.45, equating to an internal consistency level of approximately 0.815. There are methodological problems relating to the completion times for the static tests which will be addressed in the analysis phase. For the non-CAT versions, candidates may not complete all of the items before the test terminates (likely in up to 50% of the test instances for one of the tests being used), whilst others may wait for the test to time out rather than press the 'finish' button on answering their final question. Strategies to deal with this will be drawn up once the actual nature of the time data is known. The administration and scoring system being used can produce item-level data with specific item latencies (i.e. the time taken to answer each question). Average actual completion times can then be compared for both versions, for each of the tests.
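As an aside on how an SEM threshold maps to internal consistency: under the common IRT convention that the trait theta has unit variance in the reference population, reliability can be approximated as

```latex
\rho \approx 1 - \frac{SEM^2}{\sigma_\theta^2}
\quad\Rightarrow\quad
1 - \frac{0.45^2}{1} \approx 0.80
```

The reported figure of approximately 0.815 is consistent with this approximation if the trait variance in the norm group is slightly above one; the exact conversion depends on how the tests' score scale is defined.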

References
Bridgeman, B., Laitusis, C. C., & Cline, F. (2007). Time requirements for the different item types proposed for use in the revised SAT® (ETS Research Report No. RR-07-35). Princeton, NJ: ETS.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Lawrence Erlbaum Associates.
Thissen, D., & Mislevy, R. J. (2000). Testing algorithms. In H. Wainer (Ed.), Computerized adaptive testing: A primer. Mahwah, NJ: Lawrence Erlbaum Associates.
Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21, 361-375.
Original language: English
Pages: 38
Number of pages: 40
Publication status: Published - Jan 2015
Externally published: Yes
Event: The British Psychological Society Division of Occupational Psychology Conference, United Kingdom
Duration: 7 Jan 2015 - 9 Jan 2015

Conference

Conference: The British Psychological Society Division of Occupational Psychology Conference
Abbreviated title: BPS DOP
Country/Territory: United Kingdom
Period: 7/01/15 - 9/01/15
