2 Evaluation
Intro
This lecture describes the different dimensions along which a visualisation product can add value, and how we assess it against its value proposition.
2.1 Topics
First, I am going to introduce the value proposition of visualisation products. Then we address how evaluation fits into the development process.
Munzner’s nested model gives us a way to think about what can go wrong at different levels of a visualisation design, and what kind of evidence is needed at each level. And third, very practically: what methods exist, how do they differ, and which metrics do we use to quantitatively assess usability?
2.2 The Value Proposition
Here we adapt the value proposition canvas introduced by Alexander Osterwalder to visualisation products. It distinguishes between a value map and a user profile.
We have the user profile on the right side. It describes what users aim to do and what they hope to achieve. User tasks include understanding data and relationships, interpreting findings, exploring datasets, monitoring developments, and communicating insights. These tasks appear across many domains. User gains describe the outcomes of these activities. These include clear and shared understanding of data, guided sense-making, insight generation, improved decision-making, and engagement and trust in data. Pains refer to obstacles such as cognitive overload, misinterpretation, lack of transparency or trust, difficulty navigating complex datasets, and limited accessibility.
On the value map side, visualization product refers to the deployed solution that makes data visually accessible to its intended users. Such products take shape through a series of design decisions, including the choice of visualization format (e.g., infographics, data stories, dashboards, or interactive tools), as well as visual encodings and chart types, text and annotations, interaction and dynamic elements, composition and layout, and aesthetic design. Visualization products create value through three gain creators: explanatory clarity, narrative structuring, and analytical support. Explanatory clarity helps users grasp information quickly and accurately. Narrative structuring guides interpretation and organizes information into meaningful sequences. Analytical support enables users to monitor and explore data, generate insights, and support decisions. Pain relievers describe how visualization addresses common difficulties. These include reducing cognitive load, lowering the risk of misinterpretation, improving accessibility, increasing transparency, and supporting navigation.
The alignment of the value map to the user profile forms the fit, defined here as effective visual communication. Fit describes the extent to which the visualization product supports users in accomplishing their tasks, reducing their pains, and achieving desired outcomes (gains).
The value proposition canvas gives us a target — fit between the product and the user profile. But building a visualisation product is an iterative process….
2.3 Visualisation Product Development
… The diagram on this slide adapts the data product development process to visualisation products. It starts with project understanding, where we establish the value proposition: who are the users, what are their tasks and pains, and what does the product need to achieve? From there we move into data acquisition and exploration – making the data available for design and ensuring that its quality and quantity support the intended visualisation.
Interface design corresponds to the modelling phase in the data product development. It consists of selecting appropriate visualization genres, visual encodings, chart types, text, annotations, interaction, and composition to support the gain creators and pain relievers identified in the value proposition canvas.
It typically produces a prototype which can be evaluated.
Evaluation assesses whether the current design achieves the desired fit – effective visual communication – between the value map and the user profile. Ideally, this involves testing with real users because a visualisation product can look polished and test well with its designers, but fail completely when real users try to extract information from it.
Evaluation is the systematic attempt to measure the gap between design intent and user experience before it fails in production.
A negative evaluation triggers iteration through earlier design phases; a positive evaluation transitions the work into deployment. Deployment transforms the evaluated prototype into an accessible product by embedding it in technical infrastructure, organizational processes, and user communities. To manage the inherent complexity of digital products, software engineering practices from DevOps give us a useful framework here: planning, implementation, delivery, and operations.
Here, visualization products come with their own distinctive challenges.
For example, visualizations depend on precise perceptual properties – color, resolution, aspect ratios. A color mapping that works perfectly on a calibrated desktop monitor might fail completely on a mobile screen or under different lighting. So cross-platform testing is essential.
The second challenge is visualization literacy. Users don’t just need to operate the interface – they need to understand the visual grammar behind it. Because literacy varies widely across audiences, deployment must account for different levels of competence, and this connects directly to accessibility and inclusive design: encoding choices, interaction complexity and alternative representations determine who can actually engage with the product meaningfully.
And visualization products are interventionist. A successful dashboard does not just display information – it changes how people make decisions. Therefore, organizations need to anticipate change management, because workflows and decision processes may need to adapt.
In the operational phase, monitoring and interaction data reveal which features deliver value and expose usage patterns: what works, what does not, and what should feed back into future design and deployment iterations.
Then, there is a distinction based on the point in the development process at which evaluation happens…
2.4 Formative vs. Summative Evaluation
Formative evaluation happens during the development process. The goal is discovery: finding problems while they are still cheap to fix. The methods are qualitative — think-aloud sessions, expert walkthroughs — and the output is actionable redesign recommendations. Sample sizes are small: Work by Jakob Nielsen showed that five users from a homogeneous group reveal approximately 85% of usability problems. Testing with a rough prototype is perfectly valid; the earlier you catch a problem, the less expensive it is.
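Nielsen’s five-user figure follows from a simple model: if each participant from a homogeneous group independently uncovers a given problem with probability λ (Nielsen and Landauer estimated λ ≈ 0.31), the expected share of problems found after n participants is 1 - (1 - λ)^n. A minimal sketch:

```python
# Expected share of usability problems uncovered after n test users,
# assuming each user independently finds a given problem with
# probability lam (Nielsen/Landauer estimate: lam ~ 0.31).
def problems_found(n: int, lam: float = 0.31) -> float:
    return 1 - (1 - lam) ** n

for n in (1, 3, 5, 10):
    print(f"{n:2d} users -> {problems_found(n):.0%}")
```

At n = 5 this yields about 84%, the source of the "approximately 85%" rule; the curve flattens quickly, which is why several small rounds of testing tend to beat one large round.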
Summative evaluation happens at or near the end of development. Here the goal shifts from discovery to measurement: does the product meet a defined quality level? We use quantitative metrics like task completion rates, time on task, System Usability Scale scores and larger samples to produce statistically reliable numbers. Summative evaluation can also compare two versions or products against each other.
So: formative is for improving, summative is for summing up.
2.5 ISO 9241
Before we look at specific methods and metrics, we need a shared definition of what we are actually trying to measure. There is an ISO standard, ISO 9241-11, that defines usability as: the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use.
Effectiveness asks: can users actually achieve their goal? We measure this with task completion rate.
Efficiency asks: at what cost in time and effort? We measure this with time on task, number of clicks, or error rate.
Satisfaction asks: how do users subjectively experience the product? Which can be approached by standardised questionnaires.
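To make the first two dimensions concrete, here is a minimal sketch that computes effectiveness (task completion rate) and efficiency (mean time on task for successful attempts) from a session log; the field names and data are invented for illustration.

```python
# Illustrative session log: one record per user and task attempt.
# Field names and values are made up for demonstration.
sessions = [
    {"user": "A", "completed": True,  "seconds": 48, "errors": 1},
    {"user": "B", "completed": True,  "seconds": 61, "errors": 0},
    {"user": "C", "completed": False, "seconds": 90, "errors": 3},
    {"user": "D", "completed": True,  "seconds": 55, "errors": 2},
]

def effectiveness(log):
    """Task completion rate (the ISO 9241 effectiveness measure)."""
    return sum(s["completed"] for s in log) / len(log)

def efficiency(log):
    """Mean time on task, counting successful attempts only."""
    times = [s["seconds"] for s in log if s["completed"]]
    return sum(times) / len(times)
```

Satisfaction, the third dimension, comes from standardised questionnaires rather than from interaction logs.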
2.6 The Evaluation Problem
A visualisation product can fail at very different levels, and the failure mode determines the evaluation strategy you need.
Consider four examples. A dashboard might be technically perfect — correct encodings, clean layout, fast rendering — but it answers the wrong business question entirely. The chart type might be appropriate for the data, but wrong for the task the user actually needs to perform. A visualisation might be statistically rigorous but unreadable by its intended audience. Or an interaction — a filter, a tooltip, a drill-down — might exist but never be discovered by users.
Each of these is a different kind of failure, located at a different level of the design. A usability test will catch the third and fourth. It will not catch the first or second — because those are not interface problems, they are problems of problem definition and abstraction.
2.7 Munzner’s Nested Model
The four failure examples from the previous section are each located at a different level of the product.
Tamara Munzner’s nested model describes four levels, each building on the one above it.
The outermost level is the domain situation: what overall goal do the users want to achieve? The second level is data and task abstraction – translating the domain problem into the concrete tasks that the users need to perform with the data. The third level is the visual encoding and interaction level, which determines the specific visualisation genre, chart type, interaction technique, etc. And the fourth and innermost level is the algorithm, which performs the computations that deliver the visualisation product.
Outer levels constrain inner levels. A perfect encoding at Level 3 cannot fix a wrong domain model at Level 1. If you have misunderstood who the users are or what question they need to answer, every subsequent design decision is built on a flawed foundation. Munzner calls these downstream errors — mistakes at an outer level that corrupt all inner levels.
2.8 Threats and Validation per Level
Each level has its own characteristic failure mode — what Munzner calls a threat — and its own appropriate validation approach.
At Level 1, the threat is that we are developing for the wrong users or the wrong context. The validation can involve interviews, and contextual research. At Level 2, the threat is a wrong abstraction — the user profile is correct but the data is unsuitable or the user profile is mapped to the wrong abstract tasks. Validation here involves analysis of the tasks users actually perform. At Level 3, the threat is an ineffective encoding — the right abstraction but the wrong visualisation for the task. This is what controlled lab studies and perception experiments address. And at Level 4, the threat is simply that the algorithm is too slow or produces incorrect output — validated through complexity analysis and benchmarking.
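For quick reference, the threats and validation approaches above can be kept as a simple lookup table; the wording below paraphrases the text rather than quoting Munzner:

```python
# Munzner's four levels, each with its characteristic threat and the
# kind of evidence that validates it (paraphrased, not official wording).
NESTED_MODEL = {
    1: {"level": "domain situation",
        "threat": "wrong users or wrong context",
        "validation": "interviews, contextual research"},
    2: {"level": "data/task abstraction",
        "threat": "wrong abstraction of data or tasks",
        "validation": "analysis of tasks users actually perform"},
    3: {"level": "visual encoding and interaction",
        "threat": "ineffective encoding for the task",
        "validation": "lab studies, perception experiments"},
    4: {"level": "algorithm",
        "threat": "too slow or incorrect output",
        "validation": "complexity analysis, benchmarking"},
}

def evidence_for(level: int) -> str:
    """Which kind of validation addresses a threat at this level."""
    return NESTED_MODEL[level]["validation"]
```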
The key practical implication: a usability test, which operates at Level 3, does not necessarily inform whether Level 1 decisions were correct. Different levels need different evidence.
2.9 Munzner ↔︎ Product Framework
The nested model maps directly onto the product development framework we started with. The domain situation — Level 1 — corresponds to the project understanding phase: the user profile in the value proposition canvas. Getting this right is precisely what field research, interviews, and stakeholder workshops are for.
Task abstraction is, in our model, also part of project understanding.
Data abstraction maps to data acquisition and exploration: do we have the right data for the right tasks?
The visual encoding and interaction — Level 3 — is the interface design phase proper: chart type, layout, interaction. And the algorithm — Level 4 — corresponds to deployment: performance, correctness, scalability.
This mapping is useful because it helps structure and plan iterative improvements in the design cycle.
2.10 Expert Evaluation
The first evaluation method we look at is expert evaluation — specifically heuristic evaluation. A small group of trained usability specialists, typically three to five, independently evaluate a product against a set of established guidelines. The most widely used set is Nielsen’s ten usability heuristics, which cover principles such as visibility of system status, consistency, error prevention, and recognition rather than recall.
Expert evaluation is fast and inexpensive — it requires no participant recruitment, no lab setup, no scheduling. Its limitation is that it operates primarily at Level 3: it catches interface-level problems well, but is not reliable for Level 1 or Level 2 issues. Experts tend to find different problems than real users, and vice versa.
Nielsen’s rule — that five users reveal approximately 85% of usability problems — actually applies specifically to this formative, qualitative context with homogeneous user groups. It is not a universal law, but a practical heuristic for resource allocation.
2.11 Focus Group
A focus group is a moderated group discussion — typically six to ten participants — designed to bring up opinions, expectations, attitudes, and mental models. It is a research technique, not an evaluation technique in the strict sense.
Focus groups are useful for understanding the terminology users use in a particular domain, exploring attitudes toward an existing product before redesigning it, or generating hypotheses before building a prototype.
But there is a critical limitation: a focus group reveals what users say they would do — not what they actually do. The gap between stated preferences and observed behaviour can be substantial. Products that receive positive reviews in focus groups can fail when real users attempt to use them for real tasks. This is why the focus group is categorically distinct from the usability test — which measures behaviour, not opinion.
Usability testing encompasses different formats…
2.12 User Testing
that differ in required resources and in how well they capture the influence of the real-world user environment.
The hallway test — Flurtest in German — is informal and opportunistic: you stop a colleague or a passerby and ask them to try a task. The participants are not from the target group, the protocol is minimal, and the findings are not representative. But it is essentially free and can catch major design errors very early.
Remote moderated testing is the video-call equivalent of a lab test. A moderator observes and prompts via screen share. Real-world relevance is slightly higher because the participant is in their own environment. Remote unmoderated testing uses a platform — UserTesting, Lookback, Maze — where participants work independently, with no moderator present. This scales better but loses the ability to probe unexpected behaviours.
The controlled lab test provides the most richly observable environment: recording of screen and the user performing the tasks, and the option to add eye-tracking. It is the gold standard for understanding not only what users do, but why.
2.13 Usability Lab Tests
The classic usability lab setup separates the participant from the observers using a one-way mirror. The participant works in the test room, where they interact with the product and think aloud — narrating their reasoning as they work. The moderator and observers sit in the observation room, watching without being visible.
This setup allows naturalistic behaviour: users are less likely to perform for the observers if they cannot see them. And it allows the full team — designers, product managers, developers — to observe without disrupting the session.
In practice, watching a real user encounter a problem you thought was solved is one of the most effective ways to build shared understanding of what needs to change.
2.14 Eye-Tracking
Eye-tracking is a specialised measurement instrument that records gaze position and fixation patterns while a participant interacts with a visualisation. It provides a direct window into visual attention — we can see which elements the eye visits first, how long it dwells on each element, and which areas are systematically ignored.
For visualisation products, this is particularly informative. Does the eye go to the most important element first — or does the legend draw attention away from the data? Are annotations actually read, or skipped? Where does the user get stuck before finding an interaction affordance?
Eye-tracking data is typically visualised as heatmaps, which aggregate fixation density across participants, or as scanpaths, which show the sequence of individual fixations. Both the simulated heatmap here and the Wikipedia example illustrate a common finding: attention concentrates heavily on the top of a view, and diminishes rapidly for elements further down or in peripheral positions.
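A fixation heatmap is essentially a weighted two-dimensional histogram of gaze coordinates, with dwell time as the weight. A minimal sketch with NumPy, using invented fixation data (x, y, dwell time in milliseconds):

```python
import numpy as np

# Invented fixations for illustration: (x, y, dwell_ms) in screen pixels.
fixations = [(120, 80, 300), (130, 90, 250), (620, 400, 120),
             (125, 85, 400), (900, 60, 180)]

def fixation_heatmap(fix, width=960, height=540, bins=(12, 12)):
    """Aggregate fixations into a coarse grid, weighted by dwell time."""
    xs = [f[0] for f in fix]
    ys = [f[1] for f in fix]
    dwell = [f[2] for f in fix]
    grid, _, _ = np.histogram2d(xs, ys, bins=bins,
                                range=[[0, width], [0, height]],
                                weights=dwell)
    return grid

grid = fixation_heatmap(fixations)
```

The cell containing the cluster around (120, 85) accumulates the most dwell time; aggregated over many participants, such grids yield the familiar attention heatmaps.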
2.15 Graphical Perception Studies
Graphical perception studies measure how accurately users extract quantitative information from different visual encodings. This field was pioneered by William Cleveland and Robert McGill in 1984, who asked participants to estimate values from charts using different visual channels.
They found that position along a common scale — the x-axis of a bar chart, for example — is read most accurately. Position on a non-aligned scale comes second. Then length, then angle and slope, then area. Colour saturation and density are least accurate for quantitative estimation.
This hierarchy is one of the strongest empirical arguments for preferring bar charts over pie charts when the task is to compare quantities: the bar chart uses position and length; the pie chart uses angle and area.
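The hierarchy can be written down as an ordered list and consulted mechanically when choosing between candidate encodings; the channel labels here are shorthand, not a standard vocabulary:

```python
# Cleveland & McGill's accuracy ranking, most accurate first.
ACCURACY_RANK = [
    "position on common scale",
    "position on non-aligned scale",
    "length",
    "angle/slope",
    "area",
    "colour saturation/density",
]

def most_accurate(channels):
    """Pick the perceptually most accurate of the candidate channels."""
    return min(channels, key=ACCURACY_RANK.index)

# Bar chart (length/position) vs pie chart (angle/area):
print(most_accurate(["length", "angle/slope"]))  # -> length
```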
2.16 When to Use What
This slide summarises the methods. The choice of evaluation method follows from the question we are trying to answer.
Focus groups answer: what do users think or expect?
Expert review answers: does this design violate established principles?
The usability test answers: can real users achieve their goals and perform their tasks with the product?
Graphical perception studies answer: how accurately can users extract a specific value or comparison from a specific chart type?
And field studies answer: how is this product actually used in its real operational context?
Notice that these methods target different levels of Munzner’s model. Field studies address Level 1. Expert review and usability tests address Level 3. Perception studies address the encoding layer within Level 3 with particular precision. Knowing which level your current uncertainty is at determines which method is appropriate.
2.17 Metrics
Metrics are used to quantify evaluation observations so they can be tracked, compared, and reported. They can be organised into three groups, each corresponding to a different ISO 9241 dimension.
Task performance metrics measure effectiveness and efficiency: task completion rate measures whether users succeed at all; time on task and error rate measure the cost of that success. Path deviation measures how far users divert from the ideal interaction sequence which can indicate navigational confusion.
Perceptual accuracy metrics measure not whether users can operate the interface, but whether they can correctly read the information it encodes: value estimation error, rank-order correctness, and pattern detection accuracy.
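Two of these metrics are straightforward to operationalise: the mean absolute estimation error, and rank-order correctness as the share of item pairs whose ordering the participant preserved. A sketch with invented true values and participant estimates:

```python
# Invented perception-task data: true values and a participant's estimates.
true_vals = [40, 25, 15, 20]
estimates = [38, 30, 22, 20]

def mean_abs_error(truth, est):
    """Average absolute deviation of estimates from the true values."""
    return sum(abs(t - e) for t, e in zip(truth, est)) / len(truth)

def rank_order_correct(truth, est):
    """Share of item pairs whose order the estimates preserve."""
    ok, total = 0, 0
    for i in range(len(truth)):
        for j in range(i + 1, len(truth)):
            total += 1
            if (truth[i] - truth[j]) * (est[i] - est[j]) > 0:
                ok += 1
    return ok / total

print(mean_abs_error(true_vals, estimates))      # -> 3.5
print(rank_order_correct(true_vals, estimates))  # 5 of 6 pairs correct
```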
Subjective metrics capture the satisfaction dimension. The System Usability Scale — SUS — is a ten-item Likert questionnaire that yields a score from zero to one hundred. It is fast, validated across many product types, and language-agnostic.
The User Experience Questionnaire (UEQ) adds pleasure-oriented dimensions to the goal-oriented ones: whether the product is exciting, motivating, and enjoyable to use, how innovative and creative it is, and whether it captures the user’s attention.
Finally, there are general post-task ratings and the Net Promoter Score.
2.18 SUS Score Scale
The SUS scale deserves a closer look. Scores below 51 are considered unacceptable — users find the product fundamentally difficult. Scores in the 51 to 68 range are marginal. Above 68 is considered acceptable, which is also approximately the industry average across many product types. Above 80 the product is rated good to excellent — this is the target range for a well-designed data product.
The SUS is particularly practical because it is robust to small sample sizes — it produces stable scores even with the 5 to 8 participants typical of formative evaluation. It also serves as a useful conversation tool: a SUS score of 62 on a dashboard gives designers, product managers, and stakeholders a shared, concrete reference point for where the product stands and what improvement looks like.
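The SUS scoring procedure itself is standard: each of the ten items is rated from 1 to 5; odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response); the sum is multiplied by 2.5 to give a score from 0 to 100. A sketch, with the interpretation bands from this section and an invented set of responses:

```python
def sus_score(responses):
    """Standard SUS scoring for ten 1-5 Likert responses."""
    assert len(responses) == 10
    contributions = [(r - 1) if i % 2 == 0 else (5 - r)  # index 0 = item 1 (odd)
                     for i, r in enumerate(responses)]
    return sum(contributions) * 2.5

def sus_band(score):
    """Interpretation bands as described above."""
    if score < 51:
        return "unacceptable"
    if score <= 68:
        return "marginal"
    if score <= 80:
        return "acceptable"
    return "good to excellent"

responses = [4, 2, 5, 1, 4, 2, 4, 2, 5, 1]  # invented example
print(sus_score(responses))            # -> 85.0
print(sus_band(sus_score(responses)))  # -> good to excellent
```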
2.19 Summary
To summarise: evaluation informs design and deployment in a continuous feedback mechanism at multiple levels. Different levels require different evidence. Different phases of development call for different methods. And all efforts serve the same purpose: to test whether the visualisation product achieves the fit between value map and user profile.