Designing Formal Usability Studies

Last month, I presented on designing formal usability studies at UX Pittsburgh (which is also on Twitter). By request, I’m sharing my slides and lightly adapted slide notes here.

You can Google many of these things and find good advice on any of them, but what’s hard, especially when you’re starting out, is figuring out how to think through the best practices as they apply to your own project. I’m going to present one loose framework for doing that. You should consider this as much a point-of-view piece as a how-to.

I also want to encourage you to ask questions. Just shoot up your hand and I’ll call on you once I’ve finished a thought or sequence.

Why am I here?

That is: Why am I doing a formal usability study? For most people doing a formal study, the question amounts to, why not guerrilla testing? We in this room probably all understand the value of some kind of usability testing.

It’s a good question. There’s too much money on the table not to ask about it. I should say I come at this largely from a consulting perspective, but I think the considerations are largely the same for in-house UXers. Also, “formal” vs. “guerrilla” is in many ways a spectrum, so much of what follows will just help you figure out where on that spectrum your next round of testing lies.

Good Reasons

This is the best reason: The project really needs it, for some combination of reasons like these:

  • Less bias…
    • …especially when the tester isn’t on the product team
    • …especially when the tester is a practiced facilitator
  • Qualitative rigor: a thorough analysis process, a comprehensive report with recommendations and theoretical/“best-practical” underpinnings.
    • Some useful quantitative measures possible with more participants all running through the same study
  • More direct observers, like…
    • designers
    • business owners
    • engineers

These things typically come into play on really big projects and in shorter-term consulting relationships, where the usability researcher isn’t likely to be paid to stick around through the remainder of the design and development process.

Other Reasons

Clients & bosses: They sometimes mistakenly think they need formal testing, and they won’t take no for an answer.

Stakeholders: Sometimes they don’t understand qualitative research and the value it brings, so they demand a quantitative component that just isn’t worth trying to shoehorn into a leaner guerrilla process.

Consultants: There’s more money in bigger projects. That’s enough reason for some people to push for formal testing. It’s usually self-deception rather than evil.

Where am I going?

As you make your planning decisions, you ought to have a very strong sense of direction, informed by a few things.

(This is that framework I mentioned.)

The five factors: Goal of the study, broader project process, artifact fidelity, budget, and timeline.

Let’s call these the five factors of study design, and let’s nail them down before we start planning.

(1) Goal: Why are we conducting the study? Is it to prove there’s a UX problem? To validate a design solution? To align the team?

(2) Broader process: Are we part of a long, waterfall design project? Or are we doing standalone usability testing, akin to an annual physical?

(3) Artifact fidelity: Are we testing a live website, a set of wireframes, or something in-between? (Don’t formally test low-low-fidelity designs. It’s not worth it.)

(4) Money and (5) timeline: How much of each do we have for things like recruiting, testing, data processing, analysis, and reporting?

We’ll come back to these several times, which is why I’m showing you these horrible emoji.

How do I get there?

Now you know why you’re doing a formal usability test, whether you feel good about those reasons, and where the project needs to go.

In other words, the easy part is over.

Making your plan: Basic study configuration, location and tools, participants, task design, and artifact prep

Time to make the plan. Here’s what we need to think through.

Basic study configuration matrix: moderated or unmoderated against in-person or remote

Note: “Remote” here means “using an online platform like usertesting.com” or “sitting in the next room, watching through glass or CCTV.”

  1. In-person, moderated
    • Classic (in part because the technology wasn’t there for the others when the method was being developed)
    • Gives you great insight into not just task completion but physicality and demeanor.
    • Lets you probe (with care) into behaviors and desires.
    • Relies tremendously on the skill of the facilitator.
  2. Remote, moderated
    • Saves costs on travel, space, or both.
    • Lose some—but not all—of the benefits of in-person (moderated) testing.
    • Mostly, a little harder to “read” a stranger from a distance.
    • But, gain some context—what’s the user’s computing environment like?
    • Also relies tremendously on the skill of the facilitator.
  3. In-person, unmoderated
    • That would just be creepy, sitting there ignoring them like that.
  4. Remote, unmoderated
    • Difficult to position this as truly formal usability testing unless your tasks are very well-organized and straightforward, and you have a platform capable of tracking task completion at a fairly granular level of detail.
    • Can be very valuable for those sorts of tasks, however.

Don’t forget the five factors. Each one should shape how you make this decision. For example:

  • If your study’s objective includes testing emotive responses to a product, you should avoid unmoderated testing, because getting deep into the subjectivities of a session usually takes more active probing by the facilitator.
  • If for some reason you have only a week to get your test done, remote unmoderated testing can be a lifesaver.
  • If you know you can’t do any more user testing before launch, in-person, moderated testing might be best, as it often yields a more comprehensive results set (again, depending on what kinds of things you’re hoping to test).

Location and tools by quadrant of the matrix, as described in the following text.

This is a non-comprehensive list. There’s a *lot* out there, especially in terms of tools, and the list grows quickly; you’ll have some research to do when you get to this stage of planning.

  • In-person, moderated
    • Morae: Heavy-duty, expensive, feature-rich Windows software
    • Silverback: Great, cheap Mac software with a history of steady improvement
    • Neutral space: Avoid having them see the product company’s logo in the environment
    • Inviting space: Be cognizant of accessibility, perceived safety, physical comforts. Also: Men, try not to be there alone with a woman. Find a woman to join you even if she just does unrelated work all day.
    • In-home: Great for better understanding context (and saving money); a hard sell for strictly formal studies, however.
  • Remote, moderated
    • A lot here. Get creative and test the Dickens out of both your solution and your instructions.
  • In-person, unmoderated
    • (Again, don’t.)
  • Remote, unmoderated
    • I’ve only used usertesting.com, and it’s great for this kind of study. Attendees tonight may be interested in trying Loop11. Nielsen Norman Group has a nice rundown from this summer that you could read.

Participants: How many, what kinds, compensation, and recruiting

How many?:

  • Goal: How credible do you need your quantitative findings to be? (They will not be statistically significant under most circumstances.) Do you have skeptics to convince who don’t understand the value or purpose of discount usability engineering? (Google it yourself, lazy.)
  • Process: How many more test cycles will be run before the process is complete?
  • Fidelity: Is your artifact complete/complex enough that you stand to learn more beyond the first five or seven users? (See the sketch after this list.)
  • Budget and timeline: How many users can you pay? How many hours can you spend? How many weeks?
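
If you want rough numbers behind that “five or seven users” intuition, the classic discount-usability model from Nielsen and Landauer estimates the share of problems found by n participants as 1 − (1 − λ)^n. Here’s a minimal sketch in Python; the per-user discovery rate of 0.31 is their oft-cited average, and you should treat it as an assumption to calibrate against your own past studies:

```python
def problems_found(n_users, discovery_rate=0.31):
    """Expected share of usability problems uncovered by n participants,
    per the classic 1 - (1 - lambda)^n curve (Nielsen & Landauer).
    The 0.31 default is an oft-cited average discovery rate; treat it as
    an assumption and calibrate it against your own past studies."""
    return 1 - (1 - discovery_rate) ** n_users

for n in (1, 3, 5, 7, 10, 15):
    print(f"{n:>2} users: ~{problems_found(n):.0%} of problems found")
```

Under those assumptions, five users surface roughly 84% of the problems, and the curve flattens quickly afterward, which is why extra participants stop paying off unless the artifact is complex enough to keep teaching you something.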

What kinds?:

  • Goal: Are you testing things that require domain expertise? Do you need to cover certain demographics to make your business case more compelling? Do you have personas that you’ve mapped to a specific product / feature set / task set? How important is a diverse participant set to your project? (And, diverse *how,* exactly?)
    • Whatever it is you want to test, your study may not be well-served by people in (or near) the industry. You’ll have to decide exactly what that means based on the nature of the product and project.
  • Process: Again, how many more test cycles will be run before the project is complete?
  • Fidelity: I can’t think of a case where fidelity of the artifact should influence whom you recruit.
  • Budget and timeline: See “How much to pay them?”—and also, if you’re short on time, you likely won’t be able to recruit 18 employees of small Pittsburgh startups who make over $85,000 per year and prefer decaf.

How much to pay them?:

  • The going rate changes. I’ve paid between $50 and $100 recently. You may have to pay a bit more to get people of higher socioeconomic status, but you should pay all participants in a given study the same amount. Value their time equally, even if they don’t.

How to find them?:

  • Carpet-bomb your friends and family (though you shouldn’t conduct any sessions with people you know for a formal study), any professional contacts (again, likely from outside the industry), and your social-network connections.
  • Or, for an even more formal approach: Use a recruiter. Plan to spend between $75 and $150 per participant (as of late 2014), depending on the complexity of your participant set and whether you want just a preliminary recruiting effort from their database or end-to-end recruiting and scheduling.
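
Either way, it’s worth multiplying those per-head figures out before you commit. A minimal back-of-the-envelope sketch, again in Python; the participant count, rates, and no-show buffer are all placeholder assumptions drawn from the ranges above, to be swapped for your own:

```python
def participant_budget(n_participants, incentive=75, recruiter_fee=100,
                       no_show_buffer=0.15):
    """Rough participant budget: incentive plus recruiter fee per head,
    padded for no-shows (over-recruiting slightly is cheaper than an
    empty session slot). All figures are placeholders drawn from the
    ranges in this post; adjust for your market and the going rate."""
    heads = round(n_participants * (1 + no_show_buffer))
    return heads * (incentive + recruiter_fee)

print(participant_budget(8))  # 9 recruited heads at $175 each: $1,575
```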

Task design graph: intensity against complexity

Your overall approach should account for the five factors first:

  • Study goal: all tasks in support of your objectives; your most critical objectives prioritized.
  • Broader process: a suite of tasks that neither exceeds the moment nor fails to make use of it.
  • Artifact fidelity: tasks that the artifacts can support.
  • Money and timeline: tasks that lead to a dataset compatible with your resources for analysis.

But also consider this chart, showing a hyperbolic view of how participants tend to experience tasks with great emotional or psychological intensity and great procedural complexity (or ambiguity). The moral: You just won’t get good results if you go full-sadist on your participants. You have to keep the overall test experience at least somewhat pleasant, or else you can create a falsely negative impression of the product.

More on task design: clarity, verisimilitude, utility

Clarity: They’ve got to understand what they’re supposed to do—or answer. So don’t be vague (“What do you think of this page?” or “Try to figure out what you’re supposed to do on this site.”) Be clear (“Do you see anything that you don’t understand?” or “This site helps you find facial-hair inspiration, and it works best if it knows what kind of facial hair you’ve had in the past. Let’s try to figure out how to upload pictures of your own facial hair.”)

Verisimilitude: Sometimes, the task you’re testing is simply huge—and sometimes, it can’t be broken up into discrete tasks that a user might perform across several sessions. (You might see about changing that, but sometimes you can’t.) So sometimes, you’re just going to have a really long, painful task. But in most cases, aim for something that will reflect what you anticipate real-world task-completion habits to be. Unless your study aims to demonstrate how bad the software is—as many do, in fairness—you don’t want to hear, “That took way too long” over and over when, in real-world use, the task wouldn’t be an all-or-nothing proposition.

Utility: What will running participants through the task really tell you? It’s too easy to waste your time (and your participant’s) chasing data on a minor feature you don’t like or a font choice you fought your team on. Any of those things that are problems will reveal themselves anyway, especially with a good facilitator, who will continually encourage thinking aloud and who will notice small issues and probe accordingly if the participant doesn’t speak to them. (Going after your grudges is also a good way to bias the data set, especially if you’re both designing and facilitating the study.) Instead, every task should help you answer a question you need answered, whether that’s “Will people enjoy using a site like this?” or “Will they successfully upload facial-hair pictures?”

Artifact prep: be lazy early, work hard late, and remember your paperwork

Be lazy early: Sometimes, your choice of artifact—whiteboard sketches, paper prototypes, wireframe PDFs, detailed designs—will be determined by the moment in the broader process. To maximize the ROI of the testing, minimize the “I” by keeping fidelity as low as it can be while preserving your ability to test the specific qualities and quantities you’re setting out to test.
Consider not only the present study, but the fact that you’ll have to revise your artifacts (or possibly advance them to a higher-fidelity deliverable) as a result of the test. Using the least-complex possible artifacts for your study will keep overhead to a minimum.

Work hard late: Once you’ve chosen your artifact and prepped your testing flow, however, bust your ass to make sure it all works. Don’t let your first participant be the first or even the third person to run through your test. Catch all the bugs / inconsistencies / flaws you can. Many of these will not be flaws in your proposed UI, but flaws in your artifacts or your task design. (“Oh, right, I forgot to replace the greeking in that callout.” or “Oh, right, that question is prohibitively unclear.”) These issues have a way of getting magnified in the actual study and noisily clouding out more important results. Catch the low-hanging fruit on your (plural) own and protect your study’s ROI.

In fact, if you have time, first secure yourself an expert review or run a heuristic evaluation; see my own “Beyond Usability Testing”—and don’t miss some important clarifications in the comments.

Paperwork:

  • discussion guide
  • quantitative sheets (e.g., SUS; see the scoring sketch after this list)
  • consent form / anonymity and privacy statement
  • receipt for compensation
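
One note on the SUS sheet, since it yields one of the few quantitative measures in a study like this: the scoring is mechanical and easy to botch by hand. Here’s a minimal sketch of the standard computation (ten items on a 1–5 scale; odd items contribute the response minus 1, even items contribute 5 minus the response; the 0–40 sum is scaled by 2.5 to land on a 0–100 range):

```python
def sus_score(responses):
    """Score one participant's System Usability Scale questionnaire.

    responses: the ten answers in questionnaire order, each from 1
    (strongly disagree) to 5 (strongly agree). Standard SUS scoring:
    odd-numbered items are positively worded and contribute (r - 1),
    even-numbered items are negatively worded and contribute (5 - r);
    the resulting 0-40 sum is scaled to a 0-100 score."""
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses, each from 1 to 5")
    total = sum(r - 1 if i % 2 == 0 else 5 - r
                for i, r in enumerate(responses))
    return total * 2.5

print(sus_score([5, 1, 5, 2, 4, 1, 5, 1, 4, 2]))  # -> 90.0
```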

What does my future hold?

This stuff may not be part of study design per se, but it’s worth touching on, because if you don’t think it through, you can easily render your study worthless or, more often, detrimental to the ultimate product.

After-testing activities: analysis, reporting, more testing

Analysis:

  • The goal is to let the data speak as directly as possible, with as little interpretation as possible from any particular subjectivity. Rigorous qualitative data analysis with several reviewers is time-consuming but it goes a long way towards removing bias. It’s very easy to think you know what your data are telling you just by having run some or all of the test sessions, but you really can’t. Outside reviewers (not involved with product design) are even better, and are often worth paying for even as an independent consultant.
  • What data to analyze? Could be videos, transcripts, or notes. The closer to the beginning of that list, the more time it will take—but the more objective and comprehensive the results, potentially.

Reporting:

  • You could…
    • …report out the top five issues in a single slide
    • …write a 150-page report with charts, screenshots, links to video clips, participant quotations in callouts, opinionated footnotes, and Dilbert comics used by permission
    • …do anything in-between, or some combination
  • Generally speaking, level of investment in the study will help determine what your clients, bosses, or stakeholders expect in terms of final deliverables, but you should obviously be as clear as possible about that with them up-front.
    • Also, one level or another may be justified by others of the five factors. For example:
      • As a consultant looking for repeat business, you may want to do as much work here as you can reasonably do in order to prove your value (without setting up unreasonable expectations for the future).
      • You may be testing such informal artifacts that a long report would be a waste of time (as compared to moving on to the next round of artifacts).
      • You may know that this is the only user testing that a product will see for some years, and you may want to be sure your stakeholders can use your report to build out a medium- to long-term roadmap of improvements.

More testing:

  • Depending on where you are in the process, fix the issues you found and test again. It’s rare that it’s worth conducting two formal usability studies back-to-back, but it’s equally rare that a formal study is the best last step in a product design, redesign, or optimization process. So you’ll probably look to guerrilla testing or other discount methods next (depending on your reasons for conducting a formal study in the first place). Go get ‘em!

WTF??!

[In the presentations, I asked for questions and we had a lengthy, lively, and productive discussion. Feel free to do the same in the comments.]
