CUE (Comparative Usability Evaluation) is a series of ten studies that investigate the reproducibility of usability evaluations and explore common practices among usability professionals.

In a CUE-study, a considerable number of professional usability teams independently and simultaneously evaluate the same website, web application, or Windows program. Afterwards the results are compared and discussed.

The two most important goals of the CUE-studies are:

  • Study the reproducibility of usability evaluations. In other words, if two professional teams independently carry out a usability evaluation of the same product, will they report similar results? The answer turns out to be mostly negative.
  • Learn about common practices among usability professionals when they do a usability evaluation.

Purpose of the CUE-studies

The main purpose of the CUE-studies is to answer a series of questions about professional usability evaluation, including:

  • What is common practice?
    What usability evaluation methods and techniques do professionals actually use? Are there any popular methods or techniques that experienced professionals avoid, even though they receive a lot of coverage?
    This question is addressed in all CUE-studies.
  • Are usability evaluation results reproducible?
    This question is addressed in CUE-1 to CUE-6.
  • How many usability problems are there in a product?
    What’s the order of magnitude of the total number of usability problems that you can expect to find on a typical, nontrivial website?
    CUE-1 to CUE-7 showed that the number is huge. No CUE-study came close to finding all usability problems in the product that was evaluated, even though many CUE-studies found more than 300 usability problems.
  • How many test participants are needed?
    How many test participants are required to find most of the critical problems?
    CUE-1 to CUE-7 showed that the number is huge. A large number of test participants (>>100) and a large number of moderators (>>30) will be required to find most of the critical problems.
  • Quality differences
    Are there important quality differences between the results the teams obtained?
    All CUE-studies addressed this question.
  • What’s the return on investment?
    If you invest more time in a usability evaluation – for example, 100 hours instead of 25 – will you get substantially better results?
    CUE-4 analyzed this question.
  • Usability test versus usability inspection
    How do professional usability testing and usability inspection compare?
    CUE-4, CUE-5, and CUE-6 analyzed this question.

Usability experts, including me, are using the results from the CUE-studies to advise the usability community on quality approaches to usability evaluation, in particular usability testing.

Five key takeaways from the CUE-studies

  • Five users are not enough.
    It is a widespread myth that five users are enough to find 85 percent of the usability problems in a product. The CUE-studies have consistently shown that even 15 or more professional teams report only a fraction of the usability problems. Five users are enough to drive a useful iterative cycle, but never claim that you have found all usability problems, or even half of them, in an interactive system.
  • Huge number of issues.
    The total number of usability issues for the state-of-the-art websites that we have tested is huge, more than 300 and counting. It is much larger than you can hope to find in one usability test.
  • Usability inspections are useful.
    The CUE-4 study indicated that usability inspections produce results of a quality comparable to usability tests, at least when carried out by experts.
  • Designing good usability test tasks is challenging.
    In CUE-2, nine teams created 51 different tasks for the same user interface. Each task was well designed and valid, but there was scant agreement on which tasks were critical. If the teams had followed the same best practices, they would have derived similar tasks from the test scenario. That is not what happened: there was virtually no overlap, as if each team thought the interface served a completely different purpose.
  • Quality problems in some usability test reports.
    The quality of the usability test reports varied dramatically. In CUE-2, the size of the nine reports varied from 5 to 52 pages, a tenfold difference. Some reports lacked positive findings, executive summaries, and screenshots. Others were complete, with detailed descriptions of the team's methods and definitions of terminology.
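
The "85 percent" figure that the first takeaway disputes is usually traced to a simple problem-discovery model (Nielsen and Landauer). A minimal sketch, assuming the commonly quoted average per-user detection probability p = 0.31, which is a literature value, not a CUE result:

```python
def discovery_rate(p: float, n: int) -> float:
    """Expected share of usability problems found by n test users,
    assuming each problem is detected by any single user with
    independent probability p (the classic 1 - (1 - p)^n model)."""
    return 1 - (1 - p) ** n

for n in (1, 5, 15):
    # n = 5 with p = 0.31 gives about 0.84, the source of the
    # "five users find 85 percent" rule of thumb
    print(n, round(discovery_rate(0.31, n), 3))
```

The CUE results suggest the model's assumption of a fixed, modest problem count is what fails in practice: with hundreds of problems and low detection probabilities for many of them, five users cover far less than the formula implies.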

Read more in the retrospective article and on the pages dedicated to the specific CUE-studies.

The main result from the CUE-studies is: 
Five users will only find a small fraction
of the usability problems in a product
(but five users are great to drive an iterative process anyway).

The ten CUE-studies

This website contains one page for each CUE-study with detailed information about the study and links to related articles and downloads.

  • CUE-1 – Are usability tests reproducible?
    Four teams usability tested the same Windows program, Task Timer for Windows.
  • CUE-2 – Confirm the results of CUE-1
    Nine teams independently usability tested the same product.
  • CUE-3 – Usability inspection
    Twelve Danish teams evaluated the same product using usability inspections.
  • CUE-4 – Usability test vs. usability inspection
    Seventeen professional teams evaluated the same product. Nine teams used usability testing and eight teams used usability inspections.
  • CUE-5 – Usability test vs. usability inspection
    Thirteen professional teams evaluated the IKEA PAX Wardrobe planning tool. Six teams used usability testing and seven teams used usability inspection.
  • CUE-6 – Usability test vs. usability inspection
    Thirteen professional teams evaluated the Enterprise car rental website. Ten teams used usability testing, six teams used usability inspection, and three teams used both methods.
  • CUE-7 – Recommendations
    Nine professional teams provided recommendations for six nontrivial usability problems identified in CUE-5
  • CUE-8 – Task measurement
    Seventeen professional teams measured key usability parameters for the Budget car rental website.
  • CUE-9 – The evaluator effect
    Nineteen experienced usability professionals independently observed the same five videos from usability test sessions, reported their observations, and then discussed similarities and differences in their observations.
  • CUE-10 – Moderation
    Sixteen usability professionals independently moderated three usability test sessions using the same test script. Videos from the usability test sessions were analyzed to determine good and poor moderation practices.

I conceived and managed the CUE-studies. The first study, CUE-1, took place in 1998. The most recent study, CUE-10, took place in 2018.

At this time, there are no plans for further CUE-studies.

A four-page summary of all CUE-studies

I have written a retrospective about the CUE-studies:

Are Usability Evaluations Reproducible? – A retrospective of all CUE-studies
Rolf Molich
Interactions, October 2018

Additional papers about CUE are listed under the respective CUE-studies and on the Publications page.