Team Alpaca Assumption Tests Report – A Higher Common Sense

Assumption 1: AI & Privacy

Researchers: Katherine, Tyler

Test Card:

We believe that: People don’t mind having their texts read by AI so long as there are significant privacy constraints.
To verify this we will: Ask people to share texts with their preferred AI platform. See how they react. Ask them to explain their reaction.
And measure: Participants’ reactions (qualitatively).
We are right if: People openly share their texts, with minimal hesitation.

Experiment details:

Participants: We recruited college students who had an account with a major LLM provider (ChatGPT, Gemini, Claude, etc.). We chose this population for experimental feasibility (they needed to be able to upload screenshots to the platform), and because our solution requires some degree of willingness to interact with AI. However, we recruited participants with a variety of AI usage habits to robustly evaluate privacy concerns and comfort levels across different types of AI users.

For our first assumption test, we recruited 6 participants across campus (2 juniors and 4 seniors). Of these, three regularly used their preferred models (Gemini and ChatGPT), one used AI moderately, and two did not regularly use AI. This gave us a range of perspectives across platforms and usage patterns. We asked participants to pull up their preferred AI platform, and we qualitatively observed how they reacted to uploading images of their personal conversations to an LLM.

We did not want participants to be aware of why we were observing them, since we wanted to observe their default reactions. Thus, we gave them this decoy prompt to make the experience seem meaningful: “Ask the AI to critique how you could have solidified your social plans more to prevent flaking. How could you have adjusted the content of your planning messages to prevent flaking?” We gave participants the opportunity to opt out of this task without penalty. We initially observed without prompting our participants, taking note of their body language and speed of completion.

To gather more robust reflections, we asked the following reflection questions:

How did you feel about your experience uploading your texts to an AI platform?
What privacy concerns, if any, came up throughout this process?
When applicable: How did it feel to see the names of people you know in the AI response?

Notes + Photos and Artifacts

Assumption 1 Learning Card:

We believed that: People don’t mind having their texts read by AI so long as there are significant privacy constraints.

We observed: Overall, none of the users hesitated to upload their text conversations about social events to their LLM provider. They did so almost immediately after being prompted, even though we suggested they upload only “if they felt comfortable”. Many of the participants had no privacy concerns at all, especially those who were LLM power users.

The qualitative insights from the interviews, however, were more revealing. Users gauged privacy risk by how sensitive they perceived the information in their messages to be, rather than by the level of privacy afforded by the LLM provider. All three participants were indifferent to uploading their social plans to AI models, especially since the plans were being shared “after the fact.” However, one participant expressed concern that if the plans were being made in the moment, they could be used to trace her location. This same participant admitted that companies “have her information already,” so it might not make much of a difference. Her true non-negotiable would be someone gaining access to information about work-related or private plans.

Another participant noted that she would not be willing to share with AI any plans that she wouldn’t want to “come out” to the public, and that she wouldn’t want AI to learn about her social habits, such as to whom she is closest. Overall, several participants noted that the experience was fun and that they enjoyed seeing the AI’s feedback on their messages, suggesting that for them the upsides of the experience outweighed the privacy downsides.

From that we learned that: People are comfortable sharing their personal text messages with AI tools. While people might feel concerned in the long run about AI learning about their social habits, they have no hesitation granting LLMs access to their texts, including identifying information such as names.

Therefore, we will: Continue to design our intervention around an LLM “bot” with access to text messages. We will consider ways to scrub processed texts of identifying information such as names and other proper nouns; we will communicate these mitigation attempts to users to reduce concerns about AI learning their social habits.
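One way the name-scrubbing step could work is sketched below. This is a minimal illustration, not our implementation: it assumes the bot already has a list of names to redact (for example, from the user's contacts), and the function name scrub_names and the [PERSON_n] placeholder format are hypothetical. It does not detect arbitrary proper nouns, which would need a separate approach.

```python
import re


def scrub_names(text, names):
    """Replace each known name with a stable placeholder such as [PERSON_1].

    Assumption: `names` is a list of names we already know should be hidden.
    Returns the scrubbed text and the name-to-placeholder mapping.
    """
    placeholders = {}
    # Try longer names first so "Mary Ann" is matched before "Mary".
    ordered = sorted(set(names), key=len, reverse=True)
    if not ordered:
        return text, placeholders

    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in ordered) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        # The same person always maps to the same placeholder, so the
        # model can still follow who said what without seeing the name.
        key = match.group(0).lower()
        if key not in placeholders:
            placeholders[key] = f"[PERSON_{len(placeholders) + 1}]"
        return placeholders[key]

    return pattern.sub(replace, text), placeholders
```

For example, with the made-up names Maya and Sam, the text "Dinner with Maya and Sam at 7? Maya is driving." becomes "Dinner with [PERSON_1] and [PERSON_2] at 7? [PERSON_1] is driving." Keeping the mapping on the user's device would let the app restore names in the AI response before displaying it.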

Assumption 2: Screenshot Friction

Researcher: Vardhan

Test Card:

We believe that: The manual friction of taking screenshots and submitting daily forms is the primary deterrent to consistent plan logging and effective real-time intervention.
To verify this we will: Conduct a comparative test between two groups: one continuing the manual “screenshot-to-bot” reporting method and another using a functional embedded iMessage app prototype that automatically detects potential plans within the chat interface.
And measure: The total percentage of social plans successfully logged (verified against end-of-day interviews), the average time delay between a plan being made and its first bot interaction, and the rate of “forgotten” or “late” entries.
We are right if: The group using the embedded prototype logs a significantly higher percentage of their total plans in real time, avoids the “slacking” or “end-of-day” logging dumps seen in the manual study, and reports that easy access was the critical factor in their ability to maintain the habit.
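The three measures in the test card can be computed from a simple per-plan record. The sketch below is illustrative only: the Plan record, the logging_metrics function, and the one-hour cutoff for a “late” entry are assumptions made for this example, not definitions from our study.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Plan:
    made_at: datetime              # when the plan was made (per the end-of-day interview)
    logged_at: Optional[datetime]  # first bot interaction, or None if never logged


def logging_metrics(plans, late_after=timedelta(hours=1)):
    """Return (percent_logged, mean_delay_minutes, late_or_forgotten_rate).

    Assumption: an entry counts as "late" if it was logged more than
    `late_after` after the plan was made, and "forgotten" if never logged.
    """
    if not plans:
        raise ValueError("need at least one plan")

    delays = [p.logged_at - p.made_at for p in plans if p.logged_at is not None]
    percent_logged = 100 * len(delays) / len(plans)

    # Mean delay is only defined over plans that were actually logged.
    mean_delay_minutes = (
        sum(d.total_seconds() for d in delays) / len(delays) / 60 if delays else None
    )

    late = sum(1 for d in delays if d > late_after)
    forgotten = len(plans) - len(delays)
    late_or_forgotten_rate = (late + forgotten) / len(plans)

    return percent_logged, mean_delay_minutes, late_or_forgotten_rate
```

Running the same function on each group's records would give directly comparable numbers for the manual and embedded conditions.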

Experiment details:

For our second assumption test, we recruited three participants across campus, all seniors, who regularly coordinate social plans through iMessage group chats. We selected participants who frequently make plans with friends via text because they would encounter the scenario our system targets: planning events and potentially forgetting to log them.

The goal of this experiment was to evaluate whether manual friction—specifically the need to take screenshots and submit daily forms—was the main reason users fail to consistently log plans in real time.

To test this, we conducted a comparative workflow experiment between two logging methods:

Condition A: Manual Screenshot Logging (Baseline)

Participants were asked to simulate our original workflow. Whenever they made a social plan in their messages, they were instructed to:

Take a screenshot of the conversation
Submit it through a simple reporting form or bot interface

We observed how participants interacted with this workflow and measured how long it took them to complete the reporting process.

Condition B: Embedded iMessage App Prototype (Low-Friction Workflow)

Participants were then shown a low-fidelity prototype of an embedded iMessage assistant that automatically detects when a potential social plan appears in a chat. Instead of taking screenshots, users could simply tap a suggested action within the conversation interface to confirm or log the plan.

We walked participants through a simulated chat scenario where a plan was made and observed how they interacted with the embedded logging option.

Throughout both conditions we recorded:

Time required to log the plan
Participant hesitation or friction during the logging process
Participant preference between the two workflows

After completing both tasks, participants were asked several follow-up questions:

How natural did each logging workflow feel?
Which system would you realistically use in your daily messaging?
What parts of the manual workflow felt most inconvenient?
Would the embedded version make you more likely to log plans in real time?

This allowed us to collect both behavioral observations and qualitative feedback about the usability differences between the two approaches.

Learning Card:

We believed that:

The manual friction of taking screenshots and submitting daily forms is the primary deterrent to consistent plan logging and effective real-time intervention.

We observed:

Participants consistently found the manual screenshot workflow inconvenient. In multiple cases, they commented that they would likely forget to take the screenshot in the moment or delay logging until later in the day. Even in the controlled experiment environment—where participants were explicitly told to log a plan—the screenshot workflow introduced noticeable hesitation and required several steps.

By contrast, participants responded positively to the embedded logging prototype. Because the logging feature appeared directly within the conversation interface, users felt it was easier to act on immediately without interrupting their messaging flow. Participants also mentioned that the embedded version felt “automatic” and reduced the mental burden of remembering to log the plan.

From that we learned that:

The largest barrier to plan logging appears to be interaction friction rather than user willingness. Even small amounts of effort, such as switching apps, taking screenshots, and filling out forms, are enough to break the habit of real-time logging. Participants’ responses suggest that systems integrated directly into the messaging interface lower this barrier.

Therefore, we will:

Focus future iterations on integrating the logging mechanism directly into messaging platforms, minimizing user effort and enabling real-time plan detection and intervention without requiring manual reporting steps.

Notes

Assumption 3: Content of Planning Messages

Researchers: Tyler, Sunny

Test Card:

We believe that: The content of planning-related messages is indicative of whether or not plans fall through
To verify this we will: Read text messages related to a social plan.
And measure: What details of the plan (location, time, punctuality, outcome, etc.) we can discern from the texts alone.
We are right if: We can successfully determine plan outcomes from the messages.

Experiment Details:

Participants: We recruited college students because they were the most accessible group for this test. Because the task was broad and not highly user-specific, we did not have specific requirements for participation. We ran the experiment with 3 female seniors.

In this test, we wanted to learn how much social planning information is contained in text messages. We know that text messaging is a crucial tool for social planning, particularly for smartphone users who do not live together. At the same time, some social planning takes place via other communication mediums, such as in-person conversations or phone calls. Thus, we decided to compare text message content with participants’ personal knowledge to evaluate how much social planning information is communicated over text.

We asked participants to find a text message conversation related to social planning that they were willing to share with the experimenter. We then asked them to write down all information they could remember relating to the plan, including any information in the text and any information not contained in the text. The experimenter wrote down all details they could infer from the texts alone. We then compared the information produced by the participants and the experimenter.

Photos

Learning Card

We believed that: The content of planning-related messages is indicative of whether or not plans fall through

We observed: Participants’ text conversations were fairly robust, containing extensive details of plan logistics. One plan was made entirely over text, while another plan was made in person and reiterated over text. It was unclear whether the third plan was made over text, though the logistics were communicated over text.

Out of the three tests run, one plan had been scheduled and executed, one plan had been scheduled and was upcoming, and one plan had been proposed but not scheduled. This diversity in plan status gave us a variety of scenarios to evaluate. The content extracted by the participants and by the researcher was fairly similar, though in the case of the executed plan, details about the timing of the event and the punctuality of the attendees were difficult to discern from the texts alone. In this case, the researcher also incorrectly inferred that at least one attendee was slightly late (since they had texted “5 minutes away!”), whereas the participant remembered that this person was on time.

From that we learned that: Text messages contain extensive information regarding plan logistics, scheduling, and outcomes. Even in cases where plans are made in live conversation, they are often reiterated or codified over text for clarity and documentation. Some messages can have ambiguous meaning, which could be difficult for an AI bot to understand. For example, it is unclear from the text “5 minutes away” whether someone is running late or not, although earlier texts can clarify context and pragmatic meaning.

Therefore, we will: Continue relying on texts as the primary source from which to extract scheduling information. In ambiguous cases, we will offer users a chance to clarify or correct any incorrectly extracted information.

