Usability tests and expert reviews are staple methods of the field of human-computer interaction, but how effective are they? This has been the question behind the series of Comparative Usability Evaluation (CUE) studies, and Chris Rourke reports on the most recent one.
Comparative Usability Evaluations
Earlier this year I was selected to be one of the participants in CUE-4, the most recent comparative study into the effectiveness of common usability assessment methods, and the process and results have been very intriguing. The goals of the CUE studies, conceived and led by Rolf Molich of Dialog Design, are to understand the strengths and weaknesses of popular methods such as expert review and usability testing, and to see how consistent usability professionals are in finding potential problems using these methods. In an ideal world, different usability professionals reviewing a given site would find the same problems and follow a similar methodology. In case you have not noticed, the world is not perfect, and there can be a great deal of variation between the findings and the methods used by usability experts. This is why the information gathered from the CUE studies has been some of the most interesting, and controversial, research about our profession as a whole. A quick summary of the earlier CUE studies shows why some of the results have caused waves:
CUE-1 was a comparative usability test of a Windows calendar application in March 1998 by four teams working independently. Collectively the four teams found many usability problems, but comparing their work made it apparent that there was very little overlap between their findings. This set the stage for a more extensive follow-up study.
CUE-2 was a usability test of www.hotmail.com carried out in late 1998 by nine professional international teams. Each team was given the same interface, test scenario and objectives of the site, but they were allowed to follow their own practices for testing method and reporting. Again the study revealed many usability problems, and it showed that many usability professionals make serious errors when conducting and reporting a usability test. Collectively the teams found more than 300 problems (good news for the profession, since it shows we can find many problems even in high-profile, state-of-the-art sites like Hotmail), but there seemed to be very little overlap in the conclusions. In fact, there wasn’t a single problem that every team reported. Eight of the nine teams missed 75% of the usability problems, and only one team reported more than 25% of the collective total. Some of the main factors contributing to this were the variation in tasks developed for the test and the level of detail in the reports. The study clearly showed that the assumption that all usability professionals, using the same methods, would get the same results was wrong.
Moving away from the usability testing method, CUE-3 was a comparative test of expert evaluations conducted by 12 Danish usability professionals. This was a pilot test; no conclusive results were drawn and it was eventually abandoned.
CUE-4: Usability testing and expert evaluations
The most recent comparative study, CUE-4, took place as part of a workshop at the CHI conference this year in Florida. It aimed to show best practice within expert reviews and usability testing and to compare the results of these two methods. By analysing the differences in the expert findings in detail, the organisers intended to propose changes or important caveats to the methods used and to set a benchmark against which other usability professionals can measure their skills.
Seventeen evaluators were selected following a call for participation in CUE-4. Fourteen were from the US and three from Europe, including myself as the only UK participant. Of these evaluators, nine were asked to conduct a usability test and eight to conduct expert evaluations. Everyone evaluated a US hotel web site, and especially its reservation system developed by iHotelier (www.ihotelier.com). This comprised a single Flash page which showed the room types available, a calendar for selecting dates, and form fields for address and payment details. These sections are interactive, so one can see which rooms are available for selected dates or, conversely, which dates are available for a selected room. This system was developed to overcome some drawbacks of linear HTML systems, such as selecting dates for a stay at the hotel only to find the room unavailable.
Each team was told about the target audience (adult travellers with web access) and some key areas to be explored, such as finding the cost to rent a room for a specific period, making and cancelling a reservation, and making specific requests such as no-smoking rooms. Beyond that, each team was allowed to conduct the usability test or expert evaluation using their usual methodology. Usability test teams chose their own tasks and the number of subjects, although a common reporting format and severity rating scale were used for consistency and ease of comparing the results.
Comparing the results across evaluators
At the CHI 2003 Conference, all of the evaluators met and discussed the findings. We soon built some consensus as to the range of problems identified, and fortunately these were quite useful for the end client from iHotelier, who was unaware of about two thirds of the problems reported. Whatever the results of the study, they should be happy with their software being scrutinised by some of the world’s leading usability professionals. Some of the main results were:
- Approximately 800 problems were identified in total by the groups, but when they were de-duplicated there were approximately 300 different problems. Although there was a strong level of agreement in the range of problems found, there was far less agreement in prioritisation in terms of the top 5 positive and negative findings.
- In comparing the results of expert evaluations to the empirical results of the testing, there were almost no ‘false alarms’ of problems predicted by the expert evaluation that did not actually occur in any of the usability tests. This bodes well for the expert evaluation method as a discount predictor of actual problems, so long as the evaluators have sufficient expertise.
- However, there was wide variation in the methodology used, the time invested (between 6 and 68 hours for expert reviews, 18 and 200 hours for usability tests) and number of test subjects used (5 to 15). Because we were able to report a maximum of 50 problems, it was not really possible to judge whether the extra time spent was useful in discovering more problems, or where the ideal ‘point of diminishing returns’ was reached for either the testing or evaluation.
- The number of problems found varied from 20 to 50 but this was largely due to differences in the level of granularity of usability issues reported. Very often one team reported a single problem which contained 2 or 3 ‘micro’ problems reported individually by others.
- The other interesting finding was that despite the fact that we had all been given the same scale for rating problems in terms of their severity, there was considerable variation in which ones were critical, serious and minor, and indeed which ones deserved to be part of the ‘top 5 problems’ list we were asked to generate.
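The de-duplication and overlap figures above can be made concrete with a little set arithmetic. The following is only a sketch under stated assumptions: the team names and problem IDs are invented for illustration, and the "any-two agreement" measure (the average overlap between every pair of evaluators) is one measure used in the evaluator-effect literature, not necessarily the statistic the CUE-4 organisers will publish.

```python
from itertools import combinations

# Hypothetical data: each team maps to the set of de-duplicated
# problem IDs it reported. None of these values come from CUE-4.
reports = {
    "team_a": {"P1", "P2", "P3"},
    "team_b": {"P2", "P4"},
    "team_c": {"P2", "P3", "P5", "P6"},
}

def unique_problems(reports):
    """Collective set of distinct problems reported by any team."""
    return set().union(*reports.values())

def coverage(reports):
    """Share of the collective problem set that each team found."""
    total = unique_problems(reports)
    return {team: len(found) / len(total) for team, found in reports.items()}

def any_two_agreement(reports):
    """Average of |A & B| / |A | B| over all pairs of teams."""
    ratios = [
        len(a & b) / len(a | b)
        for a, b in combinations(reports.values(), 2)
        if a | b  # skip pairs where neither team reported anything
    ]
    return sum(ratios) / len(ratios)

def found_by_all(reports):
    """Problems that every single team reported."""
    return set.intersection(*reports.values())

if __name__ == "__main__":
    print(len(unique_problems(reports)), "distinct problems")
    for team, share in coverage(reports).items():
        print(f"{team}: {share:.0%} of the collective total")
    print(f"any-two agreement: {any_two_agreement(reports):.2f}")
    print("reported by every team:", found_by_all(reports))
```

Note that any such figure depends heavily on how problems are matched before counting: if one team reports a single problem that another splits into two or three ‘micro’ problems, the overlap will change depending on the granularity chosen.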
Questions about Usability Best Practice
The session also raised several practical issues for the wider usability community regarding methodology and best practice, such as:
- What is the most cost-effective yet valid and accurate way to recruit subjects? One person conducted the tests in a local Starbucks coffee shop that had a wireless connection, and informally recruited customers for short 20 or 30 minute tests. Others put a great deal of effort into recruiting, ensuring that the sample was balanced for gender, age range, travelling experience etc.
- What is the form of severity / priority ratings that is most useful to the end client? Clearly even with common definitions of the severities, subjectivity of the test facilitator inevitably creeps in, especially for the expert evaluations. An informal poll of the severity scales normally used by the participants indicated a great diversity on this aspect. Some use no priority rating scale at all (because they report only important problems) whereas others break the problem down into two or three dimensions such as the likelihood of encountering the problem, potential impact, etc. Although it would be ideal if the usability field aimed for a common priority rating scheme, there are different requirements between in-house usability teams integrating with a complex bug-tracking system, and a consultancy where the client normally appreciates (and has time to read) a much simpler system.
- To what degree should we let subjects explore ‘off task’ during usability testing? Another point of difference between the teams was whether the user was allowed to explore the site on their own, following their own curiosity and loosely defined tasks, or whether they should follow tasks that have a clear goal and an ideal path for reaching it. Most agreed that giving the user the opportunity to explore on their own helps to unearth some interesting problems, and that some degree of free browsing should be included. However, the nature of the test, whether it is exploratory and diagnostic or benchmarking against previous tests, also needs to be considered.
- How much are the original 10 usability heuristic definitions referred to during expert reviews? Some expert evaluation teams referred back to the original heuristic evaluation proposal by Nielsen and Molich, and gave each finding the official ‘heuristic’ title such as ‘Flexibility and efficiency of use’, then explained the specific instance in more depth. Most others considered the actual heuristics much more loosely and did not use them as categories for reporting their results, especially as they can be somewhat alienating to readers of the reports who are not already familiar with the 10 official heuristics.
- What best characterises a quality usability report and usability testing methodology? Is it the number of solutions recommended or the percentage that can actually be acted upon? Should users be encouraged to give their own details, even their own credit card numbers (as long as they are not charged), in parts of the test site where registration or purchasing is required? This led to extensive discussion which tended to merge into a ‘tips and tricks for usability testing’ discussion.
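As mentioned in the question on severity ratings, some participants break a problem's priority down into dimensions such as the likelihood of encountering it and its potential impact. A minimal sketch of how such a two-dimensional rating might be combined is shown here. Everything in it is an assumption made for illustration: the 1 to 3 scales, the multiplication rule, the thresholds and the example findings are hypothetical and are not the scale used in CUE-4. Only the labels critical, serious and minor are taken from the study.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    description: str
    likelihood: int  # hypothetical scale: 1 = rarely met, 3 = most users meet it
    impact: int      # hypothetical scale: 1 = irritation, 3 = task cannot be completed

def score(finding: Finding) -> int:
    """Combine the two dimensions into one number from 1 to 9."""
    return finding.likelihood * finding.impact

def priority(finding: Finding) -> str:
    """Map the combined score to a label; the thresholds are arbitrary."""
    if score(finding) >= 6:
        return "critical"
    if score(finding) >= 3:
        return "serious"
    return "minor"

def top_n(findings, n=5):
    """Return the n highest-scoring findings, e.g. for a 'top 5' list."""
    return sorted(findings, key=score, reverse=True)[:n]

if __name__ == "__main__":
    # Invented example findings, not taken from any CUE-4 report.
    findings = [
        Finding("Date selection in the calendar is unclear", 3, 3),
        Finding("Cancellation option is hard to find", 1, 3),
        Finding("Label wording is inconsistent", 1, 2),
    ]
    for f in top_n(findings):
        print(f"{priority(f):8} {f.description}")
```

Even with an explicit rule like this, the two input ratings remain judgement calls by the evaluator, which is consistent with the variation in severity ratings seen across the teams.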
Lessons for the future
The intention is that the results of CUE-4, including all seventeen test reports and supporting analysis, will be published in full by the session organisers, Rolf Molich and Robin Jeffries of Sun Microsystems. The location has not yet been determined, but a likely place will be Rolf Molich’s own web site. Statistics are still being worked out on the raw results, and these will eventually be published to show the actual degree of consensus in the findings of the usability testing and expert evaluations. We also plan to use the session as a learning source for the field, and there are plans to ‘freeze’ the hotel reservation system as it was during the evaluation so that other professionals or HCI students can perform their own evaluation and compare their results to those of the participants in CUE-4. Considering that the study has again highlighted the variability in the methods applied and the results reported, this should be of great use to the profession as a whole.
Many thanks to the CUE-4 participants
Avram Baskin & Chauncey Wilson (Bentley College, USA), Carol Barnum (Southern Polytechnic State University, USA), Carolyn Snyder (Snyder Consulting, USA), Chip Alexander (Sun Microsystems, USA), Chris Rourke (User Vision Ltd, UK), Don Williams (Microsoft, USA), Eric Pressman (Macromedia, USA), Hannu Koskela (Datex Ohmeda, Finland), Joe Dumas (Oracle Corp., USA), Joshua Seiden (36 Partners, USA), Ron Perkins (DesignPerspectives, USA), Sharon Laskowski (NIST, USA), Steve Krug (Advanced Common Sense, USA), Susan Campbell (ZAAZ, USA), Tim Marsh (Eindhoven University of Technology, The Netherlands), Tom Tullis (Fidelity Investments, USA)
Want this article on your website?
If you liked this article, feel free to republish it on your own website. All that we ask is that you include the citation below, including links, at the end of the article.
This article was written by Chris Rourke. Chris is the Managing Director of User Vision, a usability and accessibility consultancy that helps clients gain a competitive advantage through improved ease of use.