Cross-cultural User Evaluation of Epistemic Closure in AI Responses
A research design examining whether Chinese and American college students prefer AI responses that preserve ambiguity or push toward interpretive closure when discussing morally ambiguous literary material.
This project treats answer style as a design variable rather than a neutral byproduct of language-model behavior. It asks whether interpretive closure is always desirable, and whether users from different cultural contexts evaluate ambiguity-preserving responses differently.
Project Overview
Large language models often present interpretive conclusions with more confidence and finality than the source material justifies. This research design focuses on that tendency as a problem of epistemic closure: the conversion of moral or literary ambiguity into answers that feel complete, stable, and settled.
The project compares how Chinese and American college students evaluate three kinds of AI responses to morally ambiguous literary prompts: a baseline response, a lightly constrained response, and a heavily constrained response. The goal is not to prove that ambiguity is always better, but to test whether the preference for closure varies across cultures, and whether ambiguity-preserving responses can sometimes be perceived as more appropriate.
Research Questions and Hypotheses
Research Questions
- How do users evaluate AI-generated responses that preserve moral or interpretive ambiguity, compared to responses that offer more definitive interpretations?
- Are there differences between Chinese and American college students in their preferences for ambiguity-preserving versus resolving AI responses?
- Does familiarity with the source material influence how users perceive the appropriateness or usefulness of ambiguity-preserving interpretations?
Hypotheses
- Users across cultures will, on average, prefer resolving interpretations over ambiguity-preserving ones.
- Chinese students will be more likely than American students to prefer ambiguity-preserving responses.
- Participants familiar with the source material will be more likely than unfamiliar participants to prefer ambiguity-preserving answers.
Background
The project sits at the intersection of cross-cultural HCI, LLM evaluation, and research on ambiguity tolerance. It draws on several strands of literature: work on need for closure and trust, work on cross-cultural differences in desired AI interaction styles, and work on sycophancy and epistemic alignment in language models.
Hofstede’s uncertainty avoidance (UAI) and individualism (IDV) dimensions are used as background interpretive frameworks rather than primary predictive engines, while work by Nisbett and others provides a more plausible cognitive account of why interpretive tolerance may vary across contexts. The result is a paper that treats culture cautiously, but not superficially.
Why this framing mattered
- It shifts the question from factual correctness to answer style.
- It treats ambiguity as a possible design choice rather than a system failure.
- It connects Project Ariel’s anti-closure prompting logic to a comparative user-study design.
Study Design
The design uses two morally ambiguous literary prompts and three response conditions generated with Qwen 3. One response is a baseline prompt-only output, one is lightly constrained, and one is heavily constrained using an Ariel-derived anti-closure framework. The lightly constrained condition is shared between participant groups, allowing comparison without forcing every participant to read all three responses.
| Condition | Description | Role in Study |
|---|---|---|
| Baseline | Generated from the prompt alone, with no additional Ariel filtering. | Reference condition |
| Light Constraint | Minimal ambiguity-preserving constraint intended to reduce premature interpretive closure. | Shared comparison condition |
| Heavy Constraint | Stricter anti-closure constraint intended to maximize interpretive openness. | Upper-bound ambiguity condition |
The constraint pairing is assigned between subjects: one group reads the baseline and lightly constrained responses, while the other reads the lightly constrained and heavily constrained responses. This reduces repetition and cognitive fatigue while still permitting between-group comparison through the shared middle condition.
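The pairing scheme above can be sketched in a few lines. This is an illustrative assignment function, not part of the study protocol; the alternating assignment rule and the per-participant randomization seed are assumptions made for the sketch.

```python
import random

# Response conditions; "light" is the shared comparison condition.
CONDITIONS = ["baseline", "light", "heavy"]

def assign_pairing(participant_id: int) -> list[str]:
    """Assign a participant to one of the two condition pairings.

    One group reads baseline + light; the other reads light + heavy.
    Presentation order within the pairing is shuffled to reduce
    order effects, deterministically per participant.
    """
    if participant_id % 2 == 0:
        pairing = ["baseline", "light"]
    else:
        pairing = ["light", "heavy"]
    rng = random.Random(participant_id)  # reproducible per participant
    rng.shuffle(pairing)
    return pairing

# Every participant sees the shared "light" condition exactly once,
# which is what makes the between-group comparison possible.
assert all(assign_pairing(i).count("light") == 1 for i in range(100))
```

Because every participant rates the shared condition, the two groups can be compared on it directly even though no one reads all three responses.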
Texts and Materials
Selected literary prompts
- Yu Hua, To Live: a man loses his family estate through gambling, and later reflects that this ruin accidentally saved his life during revolutionary violence.
- Dostoevsky, The Brothers Karamazov: the Grand Inquisitor episode, centered on freedom, authority, bread, mystery, and silence.
These texts were chosen because they are morally serious, interpretively open, and culturally asymmetrical in plausible familiarity. The prompts abstract away direct citation of the source so that responses are driven by the scenario, not by the model’s overt literary recognition.
Measures
Participants evaluate each response across several dimensions:
- Appropriateness — whether the response fits the question
- Usefulness — whether the response is helpful
- Perceived accuracy — whether the response seems correct or reliable
- Preference — which of the two responses shown is preferred, and how strongly
- Familiarity — self-reported familiarity with each source text
- Trust — both general trust in AI responses and trust in specific answers
- Demographics — including cultural background
Pilot and Analysis Plan
A pilot study helped refine the structure of the evaluation, especially the decision not to show all three responses to every participant. That pilot suggested that reading all conditions for both texts would create fatigue and repetitiveness, especially given the length and density of the responses.
The proposed analysis uses a mixed-design ANOVA with constraint pairing and cultural group as between-subjects factors and text as a within-subjects factor. Key comparisons are then checked with a more assumption-light analysis to ensure the results are robust. Any group differences are interpreted using Hofstede’s UAI/IDV as background frameworks rather than as strict predictive rules.
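The assumption-light robustness check could, for instance, be a rank-based test such as Mann–Whitney U, which avoids the normality and equal-variance assumptions of ANOVA. The sketch below runs it on synthetic placeholder ratings; the sample sizes and rating distributions are invented for illustration, not study data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Hypothetical 1-7 appropriateness ratings for the shared "light"
# condition, one synthetic sample per cultural group.
cn_ratings = rng.integers(3, 8, size=40)  # values in 3..7
us_ratings = rng.integers(2, 7, size=40)  # values in 2..6

# Rank-based comparison of the two groups on the shared condition.
stat, p = mannwhitneyu(cn_ratings, us_ratings, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")
```

If the rank-based result agrees in direction and significance with the ANOVA contrast on the same comparison, the parametric finding can be reported with more confidence.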
Reflection
This project began as a course assignment but became useful as a bridge between Project Ariel and a more conventional HCI research design. What made it valuable was not just the topic, but the pressure it placed on several open questions at once: how to evaluate answer style, how to compare cultural preferences without flattening them into stereotypes, and how to translate a private design probe into a public research instrument.
The portfolio version is also a reminder that some of the most interesting AI design questions are not about whether a model can produce an answer, but about whether it knows when resolution is the wrong move.
What I would improve next
- Tighten the measures, especially the distinction between trust and perceived accuracy.
- Test the design with multiple base models rather than a single selected model.
- Refine the public-facing writeup to make the cultural-theory claims more conservative than the deadline version of the class paper.
- Turn the instrument into a cleaner downloadable appendix or companion PDF.