Cross-cultural User Evaluation of Epistemic Closure in AI Responses
A research design examining whether Chinese and American college students prefer AI responses that preserve ambiguity or push toward interpretive closure when discussing morally ambiguous literary material.
This project treats answer style as a design variable rather than a neutral byproduct of language-model behavior. It asks whether interpretive closure is always desirable, and whether users from different cultural contexts evaluate ambiguity-preserving responses differently.
Project Overview
Large language models often present interpretive conclusions with more confidence and finality than the source material justifies. This research design focuses on that tendency as a problem of epistemic closure: the conversion of moral or literary ambiguity into answers that feel complete, stable, and settled.
The project compares how Chinese and American college students evaluate three kinds of AI responses to morally ambiguous literary prompts: a baseline response, a lightly constrained response, and a heavily constrained response. The goal is not to prove that ambiguity is always better, but to test whether the preference for closure varies across cultures, and whether ambiguity-preserving responses can sometimes be perceived as more appropriate.
Research Questions and Hypotheses
Research Questions
- How do users evaluate AI-generated responses that preserve moral or interpretive ambiguity, compared to responses that offer more definitive interpretations?
- Are there differences between Chinese and American college students in their preferences for ambiguity-preserving versus resolving AI responses?
- Does familiarity with the source material influence how users perceive the appropriateness or usefulness of ambiguity-preserving interpretations?
Hypotheses
- Users across cultures will, on average, prefer resolving interpretations over ambiguity-preserving ones.
- Chinese students will be more likely than American students to prefer ambiguity-preserving responses.
- Participants familiar with the source material will be more likely than unfamiliar participants to prefer ambiguity-preserving answers.
Background
The project sits at the intersection of cross-cultural HCI, LLM evaluation, and research on ambiguity tolerance. It draws on several strands of literature: work on need for closure and trust, work on cross-cultural differences in desired AI interaction styles, and work on sycophancy and epistemic alignment in language models.
Hofstede’s uncertainty avoidance (UAI) and individualism (IDV) dimensions are used as background interpretive frameworks rather than primary predictive engines, while work by Nisbett and others provides a more plausible cognitive account of why interpretive tolerance may vary across contexts. The result is a paper that treats culture cautiously, but not superficially.
Why this framing mattered
- It shifts the question from factual correctness to answer style.
- It treats ambiguity as a possible design choice rather than a system failure.
- It connects Project Ariel’s anti-closure prompting logic to a comparative user-study design.
Study Design
The design uses two morally ambiguous literary prompts and three response conditions generated with Qwen 3. One response is a baseline prompt-only output, one is lightly constrained, and one is heavily constrained using an Ariel-derived anti-closure framework. The lightly constrained condition is shared between participant groups, allowing comparison without forcing every participant to read all three responses.
| Condition | Description | Role in Study |
|---|---|---|
| Baseline | Generated from the prompt alone, with no additional Ariel filtering. | Reference condition |
| Light Constraint | Minimal ambiguity-preserving constraint intended to reduce premature interpretive closure. | Shared comparison condition |
| Heavy Constraint | Stricter anti-closure constraint intended to maximize interpretive openness. | Upper-bound ambiguity condition |
The constraint pairing is assigned between subjects: one group reads the baseline and lightly constrained responses, while the other reads the lightly constrained and heavily constrained responses. This reduces repetition and cognitive fatigue while still permitting between-group comparison through the shared middle condition.
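The pairing scheme above can be sketched in a few lines. This is an illustrative assignment function, not part of the study protocol; the alternating assignment rule and the per-participant randomization seed are assumptions made for the sketch.

```python
import random

# Response conditions; "light" is the shared comparison condition.
CONDITIONS = ["baseline", "light", "heavy"]

def assign_pairing(participant_id: int) -> list[str]:
    """Assign a participant to one of the two condition pairings.

    One group reads baseline + light; the other reads light + heavy.
    Presentation order within the pairing is shuffled to reduce
    order effects, deterministically per participant.
    """
    if participant_id % 2 == 0:
        pairing = ["baseline", "light"]
    else:
        pairing = ["light", "heavy"]
    rng = random.Random(participant_id)  # reproducible per participant
    rng.shuffle(pairing)
    return pairing

# Every participant sees the shared "light" condition exactly once,
# which is what makes the between-group comparison possible.
assert all(assign_pairing(i).count("light") == 1 for i in range(100))
```

Because every participant rates the shared condition, the two groups can be compared on it directly even though no one reads all three responses.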
Texts and Materials
Selected literary prompts
- Yu Hua, To Live: a man loses his family estate through gambling, and later reflects that this ruin accidentally saved his life during revolutionary violence.
- Dostoevsky, The Brothers Karamazov: the Grand Inquisitor episode, centered on freedom, authority, bread, mystery, and silence.
These texts were chosen because they are morally serious, interpretively open, and culturally asymmetrical in plausible familiarity. The prompts abstract away direct citation of the source so that responses are driven by the scenario, not by the model’s overt literary recognition.
Measures
Participants evaluate each response across several dimensions:
- Appropriateness — whether the response fits the question
- Usefulness — whether the response is helpful
- Perceived accuracy — whether the response seems correct or reliable
- Preference — which of the two responses shown is preferred, and how strongly
- Familiarity — self-reported familiarity with each source text
- Trust — both general trust in AI responses and trust in specific answers
- Demographics — including cultural background
Pilot and Analysis Plan
A pilot study helped refine the structure of the evaluation, especially the decision not to show all three responses to every participant. That pilot suggested that reading all conditions for both texts would create fatigue and repetitiveness, especially given the length and density of the responses.
The proposed analysis uses a mixed-design ANOVA with constraint pairing and cultural group as between-subjects factors and text as a within-subjects factor. Key comparisons are then checked with a more assumption-light analysis to ensure the results are robust. Any group differences are interpreted using Hofstede’s UAI/IDV as background frameworks rather than as strict predictive rules.
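The assumption-light robustness check could, for instance, be a rank-based test such as Mann–Whitney U, which avoids the normality and equal-variance assumptions of ANOVA. The sketch below runs it on synthetic placeholder ratings; the sample sizes and rating distributions are invented for illustration, not study data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Hypothetical 1-7 appropriateness ratings for the shared "light"
# condition, one synthetic sample per cultural group.
cn_ratings = rng.integers(3, 8, size=40)  # values in 3..7
us_ratings = rng.integers(2, 7, size=40)  # values in 2..6

# Rank-based comparison of the two groups on the shared condition.
stat, p = mannwhitneyu(cn_ratings, us_ratings, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")
```

If the rank-based result agrees in direction and significance with the ANOVA contrast on the same comparison, the parametric finding can be reported with more confidence.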
Reflection
This project began as a course assignment but became useful as a bridge between Project Ariel and a more conventional HCI research design. What made it valuable was not just the topic, but the pressure it placed on several open questions at once: how to evaluate answer style, how to compare cultural preferences without flattening them into stereotypes, and how to translate a private design probe into a public research instrument.
The portfolio version is also a reminder that some of the most interesting AI design questions are not about whether a model can produce an answer, but about whether it knows when resolution is the wrong move.
What I would improve next
- Tighten the measures, especially the distinction between trust and perceived accuracy.
- Test the design with multiple base models rather than a single selected model.
- Refine the public-facing writeup to make the cultural-theory claims more conservative than the deadline version of the class paper.
- Turn the instrument into a cleaner downloadable appendix or companion PDF.