Intervention Study Synthesis

Intervention Study Link

 

  • Intervention Study Overview and Usage Log 
    • JACE: a Chrome extension deployed as a user script via the Tampermonkey browser extension.
    • JACE's main goal is to help users use LLMs more thoughtfully by engaging them in critical-thinking exercises tied to the prompt they are about to send. The intervention study ran from Wednesday, February 18th to Tuesday, February 24th across 8 users, 6 of whom participated on a consistent basis.
      • Flow: consent + participant ID → intercept → generate questions → user answers → validity gate → optional second loop → append planning context to prompt → send.
      • Each intercept presents 2 reflection questions and runs up to 2 loops: 1 if the extension's evaluation score judges the responses "satisfactory", 2 if not. This method is highly subject to change before our final product is released for the class. (A sketch of this loop appears after this section.)
    • We collected data through Google Sheets.
      • 249 records were collected from 6 different users of JACE, covering each initial prompt, the feedback from our intervention, and the users' answers to the questions we generated.
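
To make the flow above concrete, here is a minimal TypeScript sketch of the intercept loop. The helper names (generateQuestions, collectAnswers, scoreAnswers), the 0.7 pass threshold, and the stub bodies are illustrative assumptions, not the exact logic shipped in the userscript:

```typescript
// Minimal sketch of JACE's intercept loop. Helper names, the pass threshold,
// and the stub implementations are illustrative, not the shipped userscript.

interface Round {
  questions: string[];
  answers: string[];
  score: number; // evaluation score from the validity gate
}

const MAX_ROUNDS = 2;   // per the flow above: up to 2 loops
const PASS_SCORE = 0.7; // assumed threshold; the real value was tuned in the study

// Stubs standing in for the extension UI and an LLM call.
async function generateQuestions(prompt: string, prior?: Round): Promise<string[]> {
  return ["What do you already know about this?", "What constraints matter most?"];
}
async function collectAnswers(questions: string[]): Promise<string[]> {
  return questions.map(() => "user-typed answer"); // really gathered via the UI
}
async function scoreAnswers(prompt: string, answers: string[]): Promise<number> {
  return 0.8; // really an LLM-based evaluation score
}

async function interceptSubmission(prompt: string): Promise<string> {
  const rounds: Round[] = [];
  for (let i = 0; i < MAX_ROUNDS; i++) {
    const questions = await generateQuestions(prompt, rounds[i - 1]);
    const answers = await collectAnswers(questions);
    const score = await scoreAnswers(prompt, answers);
    rounds.push({ questions, answers, score });
    if (score >= PASS_SCORE) break; // satisfactory answers skip the second loop
  }
  // Append the planning context unconditionally, so a failed second round
  // still contributes context (the bug noted under Key Insights below).
  const planning = rounds
    .flatMap(r => r.questions.map((q, j) => `Q: ${q}\nA: ${r.answers[j]}`))
    .join("\n");
  return `${prompt}\n\n[Planning context]\n${planning}`;
}
```
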
  • Post-Interview
    • The post-study interviews were conducted between February 23rd and February 27th and were our predominant way of collecting feedback. These were 20-30 minute semi-structured interviews that gave us insight into users' thought processes and into whether JACE was actually useful or just a hindrance.
      • Emphasis on the usefulness / reliability of JACE itself and how the extension could be improved
      • Looking at how users' opinions of the intervention shifted over time and whether they would recommend the tool to others (NPS)
  • Recruitment Strategy
    • We reached out directly to people we thought fit our ideal participant characteristics and the personas we had identified, recruiting returning participants from the baseline study and mixing in some new participants.
      • Participant characteristics: heavy LLM user, long-context workflows, uses LLMs for school tasks / work tasks / personal life.
      • Wide range of potential LLM uses 
  • Key Insights from Intervention Study
    • Overall, the intervention reduced vague, underspecified prompts and increased prompt specificity and clarity.
    • Participants sometimes avoided LLMs more than we intended, largely because of the multiple rounds.
    • For most users, JACE felt roughly net-neutral once they learned the flow, though some felt it became net-negative over time.
    • Users often over-assume how much context the model has; the questions help externalize missing constraints and known information.
    • Users also felt the generated questions could be more specific.
    • JACE was generally unhelpful, even a hindrance, for small day-to-day questions (e.g., "What color is the sky?"), so adding an intent parser that lets trivial prompts through should address this (see the sketch after this list).
    • Bug: the planning context was not appended to the prompt if the second round was judged a failure; this absolutely has to be fixed if JACE becomes a long-term solution.
    • When users prompted about a topic or concept they were unfamiliar with, the follow-up questions made less sense.
    • Round 2 questions are repetitive. Here are some fixes (sketched after this list):
      • Generate two completely new questions if both weren’t answered satisfactorily
      • If one of the two answers was satisfactory, ask two follow-up questions on the satisfactory answer.
    • Participants generally felt that the evaluation score they needed to reach to skip the second round was slightly too high.
    • It should be clearer to the user WHY their input wasn't accepted when it doesn't pass a given round.
    • The purple feedback box attempted to do this, but it was generally unapproachable and could be presented in a better way.
      • Perhaps a response score?
    • Knowing that a reflection is coming inherently changes behavior.
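
A minimal sketch of what the intent parser mentioned above might look like. The heuristics (word count, factual question stems, task verbs) are placeholder assumptions; a production version would more likely use a cheap LLM classification call:

```typescript
// Hypothetical intent parser for skipping trivial prompts. The heuristics
// below are placeholders, not a committed design.

function shouldIntercept(prompt: string): boolean {
  const trimmed = prompt.trim();
  const wordCount = trimmed.split(/\s+/).filter(Boolean).length;

  // Let very short, day-to-day lookups through untouched.
  if (wordCount < 8) return false;

  // Simple factual lookups pass through; task-like prompts get intercepted.
  const factualLead = /^(what|who|when|where|how (many|much|long|far))\b/i;
  const taskVerb = /\b(write|plan|design|analyze|compare|summarize|brainstorm|explain)\b/i;
  if (factualLead.test(trimmed) && !taskVerb.test(trimmed)) return false;

  return true;
}

// shouldIntercept("What color is the sky?") -> false (5 words)
// shouldIntercept("Help me plan a study comparing two tutoring interventions for my class") -> true
```
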
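The round-2 fixes listed above could be encoded as a simple policy. This sketch assumes round-1 answers have already been marked satisfactory or not by the validity gate; the question-generation helpers are hypothetical stubs:

```typescript
// Sketch of the proposed round-2 question policy.

interface AnsweredQuestion {
  question: string;
  answer: string;
  satisfactory: boolean;
}

// Hypothetical stubs; the real versions would call an LLM.
async function freshQuestions(prompt: string, n: number): Promise<string[]> {
  return Array.from({ length: n }, (_, i) => `New question ${i + 1} about: ${prompt}`);
}
async function followUps(on: AnsweredQuestion, n: number): Promise<string[]> {
  return Array.from({ length: n }, (_, i) => `Follow-up ${i + 1} on: ${on.answer}`);
}

async function roundTwoQuestions(
  prompt: string,
  round1: AnsweredQuestion[],
): Promise<string[]> {
  const good = round1.filter(q => q.satisfactory);
  if (good.length === 0) return freshQuestions(prompt, 2); // both failed: two completely new questions
  if (good.length === 1) return followUps(good[0], 2);     // one passed: follow up on the satisfactory answer
  return []; // both satisfactory: round 2 is skipped entirely
}
```
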
  • Solution Design Changes
    • JACE should be more customizable: no fixed number of questions or rounds, and the intervention could be made to trigger after more than just the first prompt if a user wants that.
    • JACE should be more adaptable. Instead of generating the same kind of questions every round, the questions should adapt to the user's previous answers, just as we adapt to the initial prompt. We could also implement a toggle that lets users adjust the intensity of the friction, though doing so might undermine the purpose of adding friction in the first place, since users could easily switch it off. (A configuration sketch follows this list.)
    • JACE should be more communicative. It should give the user both positive and negative feedback, especially negative feedback, so the user knows how to improve their responses to pass round 1.
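
One way to express the customization and adaptability changes is a per-user settings object. The field names and defaults below are illustrative assumptions rather than a committed design:

```typescript
// Hypothetical per-user settings covering the customization, adaptability,
// and friction-toggle changes above.

type FrictionIntensity = "low" | "medium" | "high";

interface JaceConfig {
  enabled: boolean;             // the global toggle tested in Assumption 3 below
  maxRounds: number;            // no longer fixed at 2
  questionsPerRound: number;    // no longer fixed at 2
  interceptAfterPrompt: number; // 1 = first prompt only; higher delays the trigger
  frictionIntensity: FrictionIntensity;
  adaptiveQuestions: boolean;   // round 2 adapts to the user's round-1 answers
}

const defaultConfig: JaceConfig = {
  enabled: true,
  maxRounds: 2,
  questionsPerRound: 2,
  interceptAfterPrompt: 1,
  frictionIntensity: "medium",
  adaptiveQuestions: true,
};
```

Keeping the defaults identical to the study configuration would let us change behavior only for users who opt in, which preserves comparability with the data we already collected.
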
  • Research questions
    • Does just-in-time reflection increase prompt specificity and independent thinking while keeping friction acceptable?
    • What is an effective balance between thoughtful LLM usage and time savings?
    • What is a healthy amount of friction to include within JACE in terms of rounds / questions? Should it be customizable?

 

System Paths

Miro link: https://miro.com/app/board/uXjVG-_TkvY=/?share_link_id=569168709764 

Image link: https://drive.google.com/file/d/1_njydyJqTIsPAsLkqUdX7MKVx4PRW0cP/view?usp=drive_link 

(1) Deadline-Driven College Student (Planning Gate)

(2) Efficiency-Driven Student Researcher (Ownership Reflection)

(3) Frequent Heavy LLM User (Pre-Use Awareness)

(4) Skilled but Low Confidence Expert (Prediction + Critique)

 

Story Maps

Miro link: https://miro.com/app/board/uXjVG–Z7pY=/?share_link_id=218156373932 

Image link: https://drive.google.com/file/d/1U7-8uks6NqQZpmIGJGmyUUfATjP8InCH/view?usp=drive_link 

 

Image link: https://drive.google.com/file/d/13y6A3TnEkLplKJj4yAVhNneZNhzNRzp6/view?usp=drive_link 

 

MVP Features

  • Detect first submission attempt
  • Intercept submission before LLM generates answer
  • Choose which LLM to use
  • Is the answer that LLM provided accurate?
  • What task is the LLM being used for?
  • Am I asking for advice or answer?
  • Go through the answer and compare responses
  • Is the LLM going to be saving me time?
  • Can I get high-quality answers and responses?
  • How specific do I want its help to be?
  • What do I want the vibe to be? (If it’s a writing assignment)
  • What sources is the LLM basing its information off of?
  • Can I do this on my own?
  • What should the design of the LLM look like? Should it look special/different for first prompt?

 

  • Block prompt submission until both answered
  • Enforce minimum length threshold (see the sketch after this list)
  • If invalid answer: Force rewrite and loop
  • How many times do you block the LLM before letting the user through?
  • What should the “minimum length” be in order to define an effective answer?
  • What determines a “valid” answer?
  • Is the user going to be annoyed to the point of not wanting to use the extension?
  • Does this add meaningful friction to using an LLM while not being too intrusive?
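
A sketch of one possible validity check for a single answer, combining a minimum length threshold with a crude effort heuristic. The thresholds are exactly the open questions above, so the values here are placeholders:

```typescript
// Hypothetical validity gate for one planning answer.

const MIN_WORDS = 10; // placeholder answer to the "minimum length" question

interface Validity {
  valid: boolean;
  reason?: string; // shown to the user so rejections are explained
}

function checkAnswer(answer: string): Validity {
  const words = answer.trim().split(/\s+/).filter(Boolean);
  if (words.length < MIN_WORDS) {
    return { valid: false, reason: `Answer is under ${MIN_WORDS} words; add more detail.` };
  }
  // Catch low-effort filler: mostly-repeated words suggest padding.
  const unique = new Set(words.map(w => w.toLowerCase())).size;
  if (unique < words.length / 2) {
    return { valid: false, reason: "Answer looks repetitive; try rephrasing it." };
  }
  return { valid: true };
}
```

Returning a reason alongside the verdict also addresses the communicativeness insight: the user always learns why an answer was rejected.
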

 

  • Allow submission only after validity check
  • Attach planning answers to original prompt automatically
  • User engages in planning/critical thinking
  • How has my prompt changed after using the tool?
  • Am I being more mindful while using LLMs?
  • Users' metacognition increases with a better understanding of their tasks
  • Did the reflection actually help in my critical thinking?
  • Is the LLM output going to improve with the new prompt?
  • Is there a way to quantitatively measure what constitutes effective critical thinking?
  • How did this refined prompt help my understanding/learning?
  • Did reflection actually improve the LLM’s responses?
  • How can we make it context-aware?

 

Bubble Maps

Miro Link: https://miro.com/app/board/uXjVG73qV6g=/?share_link_id=508753260315

Image Link: https://drive.google.com/file/d/1CP423P1Y54orHNHTTRP480l3xbNVTvq6/view?usp=drive_link 

 

Assumption Map

Image link: https://drive.google.com/file/d/1nxnPe_5hSquXK-yOgZBzbHwtnMC2bqYq/view?usp=drive_link

 

Assumption Testing

 

Assumption 1: Structured reflection questions activate independent reasoning before LLM use, rather than merely slowing the user down.

  • Method: Interview + reasoning artifact analysis
    • Interview Questions
      • “Did you generate your own reasoning first?”
      • “Can you recall a specific moment?”
      • “Did you realize your impulse to prompt was premature?”
    • Interview Coding for
      • Problem decomposition before prompting
      • Metacognitive awareness
    • Prompt Analysis
      • Do they include partial reasoning before prompting?
  • Validation Signal
    • Users can recall concrete moments of self-reasoning
    • Prompts contain structured thinking prior to LLM query
    • LLM output becomes higher quality when augmented with the user's own thinking and critical review
  • Falsification Signal
    • No structural change in prompts
    • Users report something like “just clicking through”
    • LLM output remains identical to pre-intervention levels and users do not feel like their learning has improved

 

Assumption 2: Users will tolerate cognitive friction if they perceive it as coaching rather than obstruction.

  • Method: Emotional framing + efficiency tradeoff analysis
    • Interview Questions
      • “Did it feel like coaching or friction?”
      • “Did you feel tempted to bypass it?”
    • Interview Coding for
      • Coaching framing
      • Friction frustration
      • Situational tolerance (e.g., complex vs trivial tasks)
  • Validation Signal
    • Friction accepted for complex tasks
    • Perceived improvement in critical / independent thinking
    • Opinions towards intervention are positive despite added friction to completing normal tasks
  • Falsification Signal
    • Efficiency complaints
    • Frequent bypass attempts
    • Negative opinions or feedback on the intervention

Assumption 3: If JACE is made completely togglable from within the LLM's browser page, people will still use the main planner to a comparable extent.

  • Method: this can be tested both qualitatively and quantitatively:
  • Qualitatively: Talking to users about this new option
  • Quantitatively: Comparing usage before (data we already have) and after making JACE completely togglable.
    • Interview Questions:
      • "Did you use JACE more or less knowing that you can toggle it off?"
      • "Was there anything from JACE that made you want to toggle it off?"
      • "Once you toggled it off, did you turn it back on?"
  • Validation Signal: the percentage of intercepted conversations pre-change vs. post-change is not drastically different, say a difference of less than 20%. This would mean that even with the rigid usage barrier removed, user behavior remains similar (a positive sign that the intervention helps users think more).
  • Falsification Signal: a greater than 20% dropoff in usage with the togglable option compared to without it, meaning users found the intervention's disruption less bearable and worked to remove it. (A sketch of this comparison follows.)
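
A small sketch of the pre/post comparison. The 20% cutoff comes from the signals above; the conversation counts are hypothetical stand-ins for what the Google Sheets log would provide:

```typescript
// Sketch of the Assumption 3 usage comparison.

function interceptRate(intercepted: number, totalConversations: number): number {
  return intercepted / totalConversations;
}

function relativeDropoff(preRate: number, postRate: number): number {
  return (preRate - postRate) / preRate;
}

const pre = interceptRate(200, 250);     // hypothetical pre-toggle counts -> 0.80
const post = interceptRate(150, 240);    // hypothetical post-toggle counts -> 0.625
const drop = relativeDropoff(pre, post); // ~0.22

// Falsified if usage dropped by more than the 20% threshold.
console.log(drop > 0.2 ? "assumption falsified" : "assumption holds", drop.toFixed(2));
```
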

 
