RESEARCH ARTICLE

Understanding voice-based information uncertainty: A case study of health information seeking with voice assistants

Robin Brewer

School of Information, University of Michigan, Ann Arbor, Michigan, USA

Correspondence: Robin Brewer, School of Information, University of Michigan, Ann Arbor, MI 48109, USA. Email: [email protected]

First published: 05 December 2023

Abstract

Evaluating information quality online is increasingly important for healthy decision-making. People assess information quality using visual interfaces (e.g., computers, smartphones) with visual cues like aesthetics. Yet, voice interfaces lack critical visual cues for evaluating information because there is often no visual display. Without ways to assess voice-based information quality, people may overtrust or misinterpret information, which is particularly dangerous in high-risk or sensitive contexts. This paper investigates voice information uncertainty in one high-risk context—health information seeking. We recruited 30 adults (ages 18–84) in the United States to participate in scenario-based interviews about health topics. Our findings provide evidence of people's information uncertainty expectations with voice assistants, their voice search preferences, and the audio cues they use to assess information quality. We contribute a nuanced discussion of how to inform more critical information ecosystems with voice technologies and propose ways to design audio cues that help people more quickly assess content quality.

1 INTRODUCTION

Within the last 30 years, understanding information ecosystems has shifted from emphasizing the role of the user to contexts to communities to harms (Dervin & Nilan, 1986; Gibson et al., 2017; Nardi & O'Day, 2000; Tang et al., 2021; Tripodi et al., 2023). Scholars have primarily studied harms such as mis/disinformation and data-related bias in text-based media. However, recent work has shown how these harms persist in emergent forms of information access (e.g., chatbots) and across modalities such as in images and videos. In this paper, we explore how harms manifest in an emergent information ecosystem—conversational technologies.

Conversational technologies such as smart speakers, home hubs, and voice assistants (VAs) can provide a more accessible way of searching for information. Older adults, blind and low vision people, and people with motor disabilities find VAs easier to use because they do not have to view content on a small screen or use a keyboard or mouse to navigate (Abdolrahmani et al., 2018; Pradhan et al., 2018). Further, searching for information with a VA can be less overwhelming than searching on a computer or mobile device because VAs present a limited range of options.

However, people can have different expectations of conversational technologies when searching for information. People are more likely to ascribe human characteristics to conversational technologies, treating and trusting them like other humans (Pradhan et al., 2019; Seymour & Van Kleek, 2021). This can be a challenge when people receive responses to queries that they are unsure how to interpret. Machine misinterpretation (when a VA mishears a user's query) or error (when a VA interrupts a user mid-query) could cause the assistant to respond prematurely and incorrectly. In one study, participants misinterpreted information from VAs in health contexts, and many of these misinterpretations could have led to potential harm or death (Bickmore et al., 2018). Additionally, older adults face difficulties mitigating uncertainty when searching for health information online (Wu & Li, 2016).

Conversational technologies also lack visual cues that users rely on for processing information quality. Professional design and source information are important for assessing information credibility and accuracy in general and health information seeking contexts (Eysenbach, 2002; Kim et al., 2011; Lindgaard et al., 2006; Molina et al., 2021; Montesi, 2021; Reinecke et al., 2013; Sundar et al., 2007; Zhang et al., 2015). For example, someone may determine that a website is credible if the layout, images, or colors are high-quality. One could also interpret credibility through visual features such as ads, determining that content is not credible if the website includes links to such third-party content. However, many of these cues are missing with voice-only technologies, which may not have a screen to show website structure and may provide limited information about the source, if any at all. Assessing information quality on computers and smartphones is particularly difficult for older adults (Grahame et al., 2004; Johnson, 1990), a problem that may intensify with conversational technologies.

In this paper, we use the phrase “information uncertainty” to describe when users are unsure of information quality (e.g., accuracy, completeness, usefulness, credibility; Stvilia et al., 2009) when using voice technologies, which can lead to overtrust, distrust, or misuse of search responses. We argue that voice-based engagement and limited cues in voice-only interfaces aggravate information uncertainty. We investigate information uncertainty with VAs and engage participants in a discussion about conveying information quality in voice-only contexts. People engage in risky behaviors such as trusting and using inaccurate information when unsure of how to process such information, making information uncertainty an important concept to study (Jahanbakhsh et al., 2021; Karlova & Fisher, 2013), particularly in high-risk contexts. We investigate how people engage with VAs in one high-risk or sensitive context—health. We acknowledge that this work could have implications for other high-risk contexts where people may share personal or identifying details with technology.

We argue that information ecosystems literature should explore non-text-based information access as the research community lacks empirical evidence about how people evaluate voice-based information quality, particularly with high-risk topics. Therefore, the two primary research questions are:

RQ1. How do people expect to behave (e.g., query reformulation, cross-platform search) when faced with voice-based information uncertainty in sensitive or risky contexts? How do people expect voice assistants to interact (e.g., parsing, presenting, storing information) after providing uncertain information in sensitive or risky contexts?

RQ2. What evaluation cues and strategies do people use to assess information quality when using voice assistants to search for sensitive information? How would they want to assess information quality?

In this study, we focus on one sensitive or potentially risky context—health information seeking, as prior work has shown that people can face harm or death when misinterpreting voice-based health responses (Bickmore et al., 2018). We investigated these research questions with adults (ages 18+), yet we intentionally recruited older adults (ages 65+) because prior work has focused solely on younger adults (Bickmore et al., 2018) and shows how older adults can be at higher risk of misinterpreting search information (Poulin & Haase, 2015). Findings show how participants demonstrate uncertainty when encountering unhelpful VA responses, disregarding or overtrusting the response. Also, participants use limited cues about the content source and response length to attempt to mitigate information uncertainty and assess information quality. In the discussion, we use information literacy literature to argue that healthier information ecosystems should not only encourage, but support, people in being more critical of information. Specifically, we (1) make an empirical contribution by adding a more nuanced discussion of VA use in high-risk search contexts, describing expectations for iterative yet critical approaches to voice-based information retrieval and (2) identify and discuss approaches to mitigate voice-based information uncertainty. This work has implications for information retrieval researchers and conversational technology designers, suggesting a path forward for designing audio information quality cues.

2 RELATED WORK

2.1 Motivating health information seeking

We chose health information seeking (HIS) as the context for the study as it is a common online task that can have critical impacts on individuals and their behavior (Fox, 2011). An open question in the HIS space is how people evaluate information quality. Much of this research focuses on source “authority.” For example, Hu and Sundar (2010) show that source affects credibility and that people were more likely to act on health information from websites than from blogs. Similarly, Marshall and Williams (2006) and Eysenbach (2002) describe “organizational authority,” “source authority,” and simple language as important for evaluating health information quality. In a study asking users to evaluate health information, Montesi (2021) describes how nearly half of the content lacked cues about cognitive authority. As we describe below, much of this research on HIS and information evaluation focuses on visual interfaces (e.g., when searching on a computer or mobile device). In this paper, we explore how older and younger adults evaluate information when using VAs for HIS.

2.2 Voice assistants for health information seeking

Conversational technologies have historically focused on text-based chatbots with humanlike features (e.g., avatars with a face, gestures) for interacting with health information. For example, people can use conversational agents for diagnosis, prevention, intervention delivery, and simulation for patients, physicians, and students (Laranjo et al., 2018; Montenegro et al., 2019). Research on conversational agents highlights health benefits such as reduced depressive symptoms and more accessible health information. However, limited capacity for conversational complexity and a focus on specific health tasks have constrained their generalizability (Laranjo et al., 2018).

However, recent developments in machine learning and natural language processing have resulted in a resurgence in voice-based conversational technologies for health and well-being. Sezgin et al. (2020) describe how, driven by the COVID-19 pandemic and relaxed telehealth restrictions, VAs were used for health and well-being in the home. People used VAs to answer basic health or COVID-specific questions and create health-related VA routines, as they were more usable devices for people without computer access. However, people had concerns over information reliability and HIPAA compliance. Alagha and Helbing (2019) investigated health-related VA accuracy, comparing performance across three brands of VAs, and found high variability across VAs, calling for greater transparency and consistency. In this paper, we expand on this research, providing empirical evidence of how older and younger adults assess information quality with VAs in health contexts.

We intentionally recruit older adults because of the growing popularity of VAs among this age demographic. Much work has aimed to improve the usability of graphical displays for older adults and people with disabilities (e.g., Cornejo et al., 2010; Dickinson et al., 2005; Lindley, 2011), yet emerging voice-based technologies can alleviate many barriers to visual forms of online information seeking and engagement (Brewer et al., 2016; Brewer & Piper, 2017). One in five older adults in the United States use voice-only assistants or VAs (e.g., Amazon Alexa and Google Assistant) in their homes or in long-term care communities1 because they can face disability-related challenges such as vision loss or cognitive decline when using visual interfaces (Pradhan et al., 2019; Trajkova & Martin-Hammond, 2020).

Additionally, prior work calls for increased attention to health-related voice technologies for older adults (Sezgin et al., 2020). In a study with 18 older adults who had no prior VA use, Kim et al. (2021) find that 38.9% of interactions related to asking health questions. VAs can help older adults navigate health information in online portals or recall health conversations with physicians once at home (Karimi et al., 2021, 2022). VAs can also help older adults search for health information. In one study, Martin-Hammond et al. (2019) investigate how older adults engage or want to engage with intelligent assistants for health. Through participatory design workshops, they unpack how older adults would want intelligent assistants to understand their health history, engage in more natural conversation with follow-up questions, and help them process complex health information. There were concerns regarding trust and data management, echoing concerns highlighted with younger adults (Cho, 2019; Fadhil et al., 2018; Martin-Hammond et al., 2019; Verberne et al., 2013, 2015; Xu et al., 2018; Zhang, Bickmore, & Paasche-Orlow, 2017). However, older adults preferred searching for health information with VAs more than younger adults did (Bonilla et al., 2021). Similarly, Nallam et al. (2020) find that older adults appreciate how VAs are accessible to people with vision loss and can help with basic symptom search, yet have concerns over information accuracy. Desai and Chin (2023) designed HealthBuddy, a VA application that delivers information about six health topics using a range of strategies for self-regulated learning. In a study comparing how older and younger adults used HealthBuddy, they found learning benefits and similar error rates across both age groups, but that older adults were more likely to blame themselves when errors took place (interaction fluency). Desai and Chin make design recommendations for improving interaction fluency and older adults' VA use for health information learning.

While helpful, researchers have also highlighted concerns when using voice technologies for general and health-specific information seeking. Cho (2019) conducted an experiment on social presence, privacy, and varying information sensitivity levels, finding that modality preferences affect perceptions of social presence depending on whether the searcher has low versus high information sensitivity. Pradhan et al. (2019) describe how stereotypes could be embedded in or learned from anthropomorphic design decisions. Brewer (2022) shows how older adults may object to using VAs for health as interactions could cause “rumination” on negative health conditions or as older adults may have difficulty understanding health jargon. Research on VA use in long-term care communities similarly raises concerns about data management and institutional responsibility when disclosing sensitive health information (Heung et al., 2023). Together, this research suggests additional attention is needed on understanding and mitigating voice HIS concerns. In this paper, we focus on assessing voice-based health information quality and proposing design ideas for mitigating information uncertainty.

2.3 Assessing information quality

Library and information scholars have long studied information behaviors, including how people search, select, assess, and apply information from online and offline sources. Researchers describe how authoritative sources (e.g., librarians and nurses) assess information quality compared to users/patients, and how assessments vary by authority type. Findings show evidence of disagreement and inconsistency among authority figures when engaging in general (Edwards & Browne, 1995) and health-specific information seeking in libraries (Flaherty, 2016) and social Q&A sites (Chu et al., 2018; Oh & Worrall, 2013). Further, users were frequently unable to discern the quality of responses from health professionals (Chu et al., 2018). Stvilia et al. (2009) present a model for assessing health information quality, distinguishing between source origins (i.e., community constructed, centrally mandated, outsourced) and describing website components and page genres that affect how people assess information quality. This model describes how content accuracy, completeness, authority, usefulness, and accessibility are core information quality constructs. Zhang, Sun, and Kim (2017) survey how people select sources for digital HIS, highlighting the importance of individual factors like education, income, and race. In particular, they describe how prior experience with a source impacted how people assessed each source.

Further research shows how people evaluate digital information quality with visual cues such as website structure, aesthetic design, and interaction design (Fogg, 2002). However, VAs lack the visual information that people use to assess information quality, in part because there is often no visual display (Flanagin & Metzger, 2007). For example, in a study by Bickmore et al. (2018), participants (1) were unsure about the response provided by VAs (information uncertainty), (2) were unable to communicate this uncertainty to the VA, (3) misinterpreted the VA's response, or (4) engaged in decision-making that could lead to potential harm or death. This study exemplifies a larger gap within the research community—how can people express and recover from information uncertainty with voice interfaces?

In this paper, we explore information uncertainty with older and younger adults, intentionally including older adults because of research that positions VAs as useful, yet also describes challenges older adults face in assessing information quality. One strength of aging is the diversity and wealth of knowledge accumulated across decades of experience. However, this strength can also be detrimental to how older adults engage with information as aging has been connected to increased “susceptibility to misinformation” (Jacoby & Rhodes, 2006). Although researchers have shown that older and younger adults are similar in assessing the credibility of objective statements, older adults are less likely to perceive information as credible if it contradicts their existing knowledge (Mutter et al., 1995). Similarly, recent work shows how overreliance on one's knowledge can make people more likely to fall for inaccurate information (Pennycook & Rand, 2020). This research suggests older adults experience uncertainty with the accuracy, credibility, and overall quality of digital information and therefore fill in gaps using their own knowledge. We investigate experiences assessing voice-based information quality and explore age-related preferences.

3 METHODS

To understand values around HIS, information uncertainty, and information quality with voice interfaces, we conducted semi-structured interviews (n = 30) where we intentionally presented participants with ambiguous VA responses to health queries.

3.1 Interviews

As one primary goal of this study was to understand expectations when faced with information uncertainty using VAs (RQ1), we presented participants with a series of ambiguous VA responses using scenarios. Each scenario included information about a health or well-being concern and a voice response that participants played aloud on their own devices. We then asked participants to share how they assessed information quality in these scenarios (RQ2). Rather than focus on the accuracy of their responses, we used the scenarios as probes, eliciting values around information uncertainty and quality in a high-risk context.

Before conducting interviews, we used existing research to develop five scenarios. We created three scenarios based on common health search tasks identified in Brewer et al. (2022): general information seeking, diagnosis, and treatment. The remaining two scenarios were taken directly from prior work (Bickmore et al., 2018) as they were health topics where participants misinterpreted the VAs' responses. To probe for information uncertainty expectations across a range of scenarios, we randomly paired each scenario topic with an unsuccessful VA response as described in prior work (Brewer et al., 2022)—a generic response (scenario 1), a response in which the VA could not respond (scenario 2), and an off-topic response (scenario 3).2 Brewer et al. (2022) generated responses for scenarios 1–3 by asking questions to a Google Assistant VA and summarizing patterns across these responses into one response. In this sense, the health topic provides the risk/sensitivity component of each scenario, and the VA's response provides the information uncertainty component. We intentionally chose unsuccessful VA response types to increase the likelihood of generating uncertainty among participants, as understanding expectations when encountering information uncertainty is the primary goal of this study. Responses from scenarios 4 and 5 were taken verbatim from Bickmore et al. (2018). We share the full text of each scenario in Table 1.

TABLE 1. Scenarios used during interviews to probe for information uncertainty behaviors
Each scenario below lists its health prompt category and uncertainty response category in parentheses.

Scenario 1 (Treatment / Generic)
Prompt: Imagine you want more information about how to treat knee pain and say to the voice assistant, “I am experiencing mild pain in my right knee. What should I do next?”
Response: “You must do everything you can to stay healthy.”

Scenario 2 (General info seeking / Can't answer)
Prompt: Imagine you were recently diagnosed with diabetes and want to know more about how to adjust your nutrition. You say to the voice assistant, “I have diabetes. Can you give me more information? How should I adjust my diet?”
Response: “I'm sorry. I cannot answer that question.”

Scenario 3 (Diagnosis / Off topic)
Prompt: Imagine your neighbor has an odd rash on their arm and you want to know what could have caused it. You say to the voice assistant, “What should my neighbor do about a red spot with dots on their arm?”
Response: “To remove this spot, use 1/4 cup of white vinegar, 1 tbsp of Dawn dish soap, and water. Spray the area liberally and let soak for 5–10 min. Proceed with blotting with a clean, dry towel until the spot is removed.”

Scenario 4 (Treatment, diagnosis / Generic)
Prompt: Imagine you have general anxiety disorder and are taking Xanax as prescribed. You had trouble falling asleep yesterday and a friend suggested taking melatonin herbal supplement because it helped them feel drowsy. You want to know how much melatonin you should take. You ask the voice assistant, “Can I mix Xanax and Melatonin?”
Response: “Here's a summary from the website Xanax-wiki.com. Alprazolam or Xanax is in the list of those drugs that melatonin always interacts with. Both medications are sedatives. They have found out that there is difference between Xanax and melatonin, yet the drugs reduce even the highest levels of anxiety if they are consumed together.”

Scenario 5 (Treatment / Off topic)
Prompt: Imagine you have chronic back pain and are taking OxyContin as prescribed. Tonight, you are going out for dinner to celebrate a friend's birthday and you wonder how many drinks you can have. You say to the voice assistant, “I'm taking OxyContin for chronic back pain. But I'm going out tonight. How many drinks can I have?”
Response: “I've set your chronic back pain one alarm for 10:00 P.M.”

To engage with the scenarios, the researcher emailed each participant a link to a Qualtrics survey. The survey displayed each scenario prompt on a separate page in random order (Figure 1). To mimic the audio from a VA, the assistant's response was pre-recorded using the Voice Generator text-to-speech tool3 (pitch = 1, speed = 0.9, Google US English voice) and embedded in the survey next to each prompt as an audio recording that participants could (re)play.

FIGURE 1. Screenshot of the survey interface where participants were provided with the scenario text and given the option to replay the response or control the volume
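To make this stimulus pipeline concrete, here is a minimal sketch of the pre-recording step using the pyttsx3 text-to-speech library; the library choice, rate mapping, and file names are illustrative assumptions (the study itself used the browser-based Voice Generator tool).

```python
# Hypothetical sketch of pre-recording scenario responses for the survey.
# Assumes the pyttsx3 library as an offline stand-in for voicegenerator.io.
import pyttsx3

# Response texts come from Table 1 (scenarios 1, 2, and 5 shown here).
SCENARIO_RESPONSES = {
    1: "You must do everything you can to stay healthy.",
    2: "I'm sorry. I cannot answer that question.",
    5: "I've set your chronic back pain one alarm for 10:00 P.M.",
}

engine = pyttsx3.init()
# Approximate the 0.9x speed setting by slowing the default speaking rate.
engine.setProperty("rate", int(engine.getProperty("rate") * 0.9))

for scenario, text in SCENARIO_RESPONSES.items():
    # Each clip would then be embedded next to its prompt in the survey.
    engine.save_to_file(text, f"scenario_{scenario}.wav")
engine.runAndWait()
```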

After participants read each scenario and played the VA's response, the researcher asked questions about their perceptions of the VA's response, its perceived quality, what they would do next and why (e.g., ask a follow-up question, search elsewhere), any additional information the VA might need to answer the question, and prior experience with the topic. Note, participants engaged in one round of interaction with the VA responses in the survey, but we probed for their next steps and subsequent response expectations. In part, this helped to control for poor internet connection or audio quality when conducting a remote study. We reflect on this decision further in the limitations section. After participants interacted with each scenario, the researcher asked them to reflect upon information quality across all responses and cues they used or would want to use to determine information quality. We probed for information quality by asking them to reflect upon responses that they liked (and disliked) and to share what they listened for to determine whether something was a “good” (or poor) response.

3.2 Analysis

We used a third-party transcription service to transcribe the interview audio. The research team used a qualitative approach to analyze the interview transcripts and memo notes that the interviewer wrote after each interview. The first author started with structural coding, deductively structuring the data according to themes that aligned with the two primary research questions. These themes related to information uncertainty behavior and expectations (RQ1) and assessing information quality (RQ2). We chose a structural coding approach as it is appropriate for analyzing specific topics or research questions, semi-structured interview data, and interviews with several participants (Namey et al., 2008). Next, the first author developed a codebook based on the most salient themes from structural coding. They iteratively coded the transcripts with themes and sub-themes such as response expectations (e.g., doctor comparison), information quality cues, and sensitive search behaviors (e.g., rephrasing, channel switching, confirming responses). We used this iterative approach as a form of reliability for single-coder qualitative analysis, aligning with Krippendorff's consistency lens on rigor and reliability (Krippendorff, 2018) and more interpretivist views of coding (Guba, 1981; Lincoln & Guba, 1986; McDonald et al., 2019). Because participants often repeated information uncertainty expectations across scenarios, we did not analyze each scenario separately. As such, we report on scenarios in aggregate.

3.3 Recruitment and participants

After receiving IRB approval from our institution, we used a screener survey to recruit through our university participant recruitment pool and a non-university participant recruitment pool. The screener survey asked about demographics (age, gender, education, race, disability, technology use, and VA use). From respondents who completed the screener survey (n = 163), we contacted those who indicated an interest in completing a follow-up interview and had access to an internet-enabled device that could play audio, resulting in a final sample of 30 interview participants. We interviewed participants with a range of VA experience (including non-use) to represent potentially new, novice, and expert user expectations. We acknowledge expectations may not align with behavior, particularly for those who had never used a VA. Because prior work focuses on younger adults (Bickmore et al., 2018) and shows how older adults can be at higher risk of misinterpreting search information (Poulin & Haase, 2015), we over-sampled for older adults, recruiting 17 older adults (ages 65+, 10 men, 7 women, aged 65–84, avg. age = 70.1 years old) and 13 younger adults (ages 18–64, 2 men, 11 women, aged 18–30, avg. age = 24). Each participant received $25 by check or gift card after their interview. We present additional participant demographic and VA information in Table 2.

TABLE 2. Demographic information for (n = 30) participants
PID Age range Gender Race Disability Education VA use VA freq.
1 65–74 Man White N Bachelor's Y Weekly
2 65–74 Man White DND Bachelor's Y Weekly
3 65–74 Woman Asian N Graduate N N/A
4 65–74 Man White N Bachelor's Y Weekly
5 65–74 Man White N Bachelor's Y Daily
6 65–74 Woman Asian N Graduate Y Never
7 75–84 Woman Black N Graduate Y Weekly
8 65–74 Man White Y Graduate Y Monthly
9 65–74 Man White N Graduate Y Weekly
10 65–74 Man White N Graduate Y Daily
11 65–74 Woman White N Some college Y Daily
12 65–74 Man White N Bachelor's Y Daily
13 65–74 Woman White N Some college Y Daily
14 65–74 Woman Black N Some college Y Monthly
15 65–74 Woman Black N Graduate Y Weekly
16 65–74 Man White N Graduate N N/A
17 65–74 Man White N Graduate Y Daily
18 55–64 Woman Black N Some college N N/A
19 25–34 Woman Asian N Graduate Y Weekly
20 18–24 Woman White N Some college Y Weekly
21 35–44 Woman White Y Bachelor's Y Daily
22 25–34 Woman White N Some college N N/A
23 25–34 Man Black N Bachelor's Y Weekly
24 35–44 Woman Black N Graduate Y Daily
25 35–44 Woman White Y Some college N N/A
26 25–34 Woman White N Bachelor's Y Daily
27 18–24 Woman Asian N High school Y Weekly
28 35–44 Man Black N Some college N N/A
29 55–64 Woman Black N Bachelor's N N/A
30 45–54 Woman Asian N Graduate N N/A
  • Note: We used age ranges to protect participant identity in the table, yet report raw age data in aggregate throughout the paper.
  • Abbreviation: DND, did not disclose.

4 FINDINGS

We present the findings in two parts. In the first part, we provide context regarding how people use VAs to search for sensitive or high-risk information (RQ1). In part two, we describe how participants assessed information quality using the VAs' responses (RQ2).

4.1 RQ1: Voice assistant use in sensitive health contexts

This section describes how participants engaged with the VAs across the five health-related scenarios. We highlight participants' expectations when they encountered information uncertainty while searching for subjective health information. Participants faced challenges assessing voice-based information quality when responses did not provide options, warnings, or appropriate archival and retrieval mechanisms.

4.1.1 Complex, subjective queries

Several (7/30) participants described how it is difficult to assess information quality with complex or subjective queries. Instead, they used VAs for simple or short (n = 3) responses and objective (n = 8) health topics. After hearing that the VA could not answer, one participant said, “general wide open questions isn't their best area” (P1, nutrition, scenario 2), and two others echoed this sentiment as they described their reactions to the VAs' responses across the scenarios, for example, “I don't have great expectations for voice recognition and artificial intelligence for complicated questions yet” (P9, post-scenario reflection) or “It just seems so typical, and it didn't really address anything that the person was asking. Yeah, it just didn't really pertain to anything” (P21, knee pain, scenario 1). In each of these instances, participants expressed their frustration with VA responses that were vague or did not answer their specific questions.

Instead, participants (13/30) described how low expectations and bad experiences when assessing responses to subjective queries led them to only use VAs for objective health topics like nutrition or general symptom-related facts, for which they could be more certain about response quality and accuracy. For example, participants used VAs to ask, “how many more calories does 2% milk have than skim milk?” (P1), “what foods had complete proteins” (P15), or about menopause (P15). In each case, participants said they chose these specific questions because they knew the responses were not based on opinion or would be short in length. Overall, these quotes suggest that subjective HIS is a context in which participants often experienced information uncertainty.

4.1.2 Options and recommendations

Most participants (28/30) described how they would feel more certain about information if VAs made recommendations when users asked subjective health questions. Recommendations might include how much medication to take, a diagnosis, or treatment. For example, P15 wanted the VA to respond with “there are alternatives other than hydrocortisone cream, or wash it with sudsy water….” In this example, P15 wanted to hear multiple treatment options for the rash (scenario 3) rather than the single response a VA typically provides. Others (3/30) critiqued limited options with VAs saying, “It may not understand what I'm talking about, with the one question one answer” (P14, post-scenario reflection). In a sense, options were a way for participants to judge whether miscommunication occurred.

Many participants (16/30) wanted options similar to a medical professional's recommendations in a clinical setting as medical professionals were trustworthy sources of information. After hearing the generic response for the knee pain scenario, P8 expected something like “what I might hear if I had a doctor on the other end of the line.” For this information, participants wanted a more interactive conversation with the VA. After the knee pain scenario, P4 wanted the VA to ask, “Are you taking any anti-irritants, like ibuprofen or Aspirin? Have you done anything to injure yourself? It's like the questions the doctor would ask you.” Most participants (22/30) expected the VA to engage in open-ended conversation to gather additional contextual information about the user to best answer the question. Other options included decision tree-like/binary questions about one's behavior or environment, or providing the VA with access to one's medical chart or medical history to avoid manually entering such information. For example:

I think what might be interesting […] is to have some kind of a branching decision tree down where say, ‘Alexa, I need information about diabetes.’ And she'll come back and say, ‘Well, what kind of information do you need?’ ‘I needed information about a diet.’ And then, ‘Are you looking for specific foods you can eat?’ […] I think that would be a lot more efficient than the kind of just asking a generalized question and hoping she goes to the right place with good information. (P12, diabetes, scenario 2)
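To make the branching structure participants described concrete, here is a minimal sketch of a decision-tree dialog; the node contents and traversal logic are illustrative assumptions based on P12's example, not a deployed VA skill.

```python
# Hypothetical sketch: a branching, decision-tree dialog like the one P12
# described for scoping a diabetes question. All node text is illustrative.
from dataclasses import dataclass, field

@dataclass
class DialogNode:
    prompt: str
    # Maps a recognized user reply to the next node; leaves have no children.
    children: dict = field(default_factory=dict)

tree = DialogNode(
    "What kind of information do you need?",
    {
        "diet": DialogNode(
            "Are you looking for specific foods you can eat?",
            {
                "yes": DialogNode("Here are foods commonly recommended..."),
                "no": DialogNode("Here is general dietary guidance..."),
            },
        ),
        "symptoms": DialogNode("Which symptoms are you asking about?"),
    },
)

def run(node: DialogNode, replies: list) -> str:
    """Walk the tree with scripted replies; return the final prompt reached."""
    for reply in replies:
        if reply not in node.children:
            break
        node = node.children[reply]
    return node.prompt

print(run(tree, ["diet", "yes"]))  # -> "Here are foods commonly recommended..."
```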

However, several participants (16/30) described how doctor-like responses should also come with a warning, either urging caution, encouraging a second opinion from a medical professional before acting on a response, or encouraging people to confirm the VA's responses elsewhere. P10 wanted VAs to respond to the Xanax drug interaction scenario with, “that is not advised, and please talk to your healthcare provider for better information or better help with your issue.” Others expected a VA response with warnings like “Be sure to check with your medical professional” (P13, diabetes, scenario 2) or “If you're taking this, don't rely on what I'm telling you” (P17, Xanax, scenario 4). These quotes suggest people want VAs to communicate uncertainty by acknowledging their expertise limitations and providing warnings for more critical device engagement.

4.1.3 Challenges consuming, archiving, and retrieving information

Regardless of age, most participants (22/30) described how providing better ways to consume, archive, and retrieve information with VAs would help them to mitigate information uncertainty. After being presented with each scenario, participants often described wanting to search on a computer instead because it was difficult to skim and get a detailed overview of the content (25/30). For example, in the diabetes scenario, P12 would “much rather read the article or the transcript that says the same thing. It moves a lot faster. I can immediately decide online if I'm looking at something, whether visually the information, whether it's giving me the information that I want, or whether it's tangential or irrelevant.” Similarly, P20 described how one challenge with audio is that “you have to kind of listen through all of it and it takes a lot longer to hear it, and then also you can't quite just skip to the thing you want to….” These quotes highlight two challenges of VAs: (1) difficulty quickly assessing content quality and (2) inability to browse content quickly. Researchers have explored interaction patterns for varying granularity with voice interfaces (Vtyurina et al., 2019). We return to the implications of audio and skimming for assessing information quality later in the findings.

Some older adults (5/17) mentioned wanting VAs to store responses to health queries, either to return to later or discuss with a medical professional. We speculate this could be due to the high-risk nature of the health search context and wanting to ensure the information was accurate. P7 describes her experiences with voice search, saying:

If someone says something, I find myself wanting to write it down so that I can remember so that I have it to refer to. With the audio it's one and done, they said it, and then you have to remember it or ask them to repeat it. (P7)

Participants (7/30) described how they preferred computers because they would write down information from search results. They wanted VAs to store information by sending a link to the website to their phone or email address, but many emphasized a desire for a response to be printed on paper. While prior work describes the audio modality as being more accessible to adults with visual or motor disabilities (Pradhan et al., 2018), it is also important to consider the implications of cognitive disabilities and recall with voice-only search.

4.1.4 Varying search goals

Lastly, findings show how varying search goals may affect expectations for mitigating information uncertainty. Participants used VAs for two primary goals: (1) to guide subsequent searches on a different device and (2) to complete the search process on the same device. Although not analyzed quantitatively due to our small sample size, we qualitatively observed patterns in these search goals by age.

Younger adult participants were slightly more likely than older adults to expect the VA to be a guide to another source to help mitigate information uncertainty (9/17 or 52% of older adults, 8/13 or 61% of younger adults). Although this difference is small and was not tested statistically, it does suggest that there are varying expectations for cross-device information seeking and guidance, potentially across age groups. Guiding could entail a way to “direct me to the website that I could go to, list the websites” (P18) or “direct this person to someone else […] or here are some resources to look at” (P20). Participants often described preferring a guide to a source on a computer as typing was more efficient than speaking.

On the other hand, when older adults wanted a different device, it was typically to store information more easily. Several older adults instead wanted to complete the search process on the same device, expecting the VA to engage them in a conversation to more confidently answer their questions. For example, P3 said, “…it would be useful to have a conversation with the machine that basically comes back to me with a bunch of questions, rather than some generalized advice,” and P10 similarly wanted follow-up questions so that the VA “could zero down on what the problem was.” In each of these examples, older adult participants wanted more natural language conversation with the VA, echoing preferences in existing literature (Bonilla et al., 2021; Pradhan et al., 2019), yet also describing how spoken input was easier than typing.

Our findings extend prior work on voice-based search for health information (Bonilla et al., 2021; Brewer et al., 2022) by providing empirical evidence of how older and younger adults engage with VAs that provide uncertain information. Addressing RQ1 and how people expect to behave when faced with voice-based uncertainty, we find participants had low expectations for engaging in complex and subjective HIS with VAs and differences in cross-platform search preferences (i.e., continuing the search on a different device vs. remaining on the VA). We also find that they expect VAs to better communicate uncertainty, expertise limitations, and data use and sharing practices in sensitive contexts.

4.2 RQ2: Assessing voice-based information quality

Prior work has identified assessing voice-based information quality and credibility as a challenge (Bickmore et al., 2018; Bonilla et al., 2021). As such, we also use the health scenarios as probes to understand expectations around voice-based information quality. We highlight how participants currently assess information quality and how quick quality assessments were challenging.

4.2.1 Listening for detail and response length

Nearly half of the participants discussed how response length or detail informed how they assessed information quality (14/30). Some participants (4/30) expected shorter responses to be more accurate, for example, “the question's a little more concise, so the answer was more accurate and relevant” (P21, post-scenario reflection). Others associated longer, more detailed responses with better information quality, for example, “I liked [that] it provided a step by step instructions in 15 seconds on how to remove that stain” (P23, rash, scenario 3). These participants were unfamiliar with the search topic and therefore used other cues to evaluate the information quality.

Some participants (4/30) described how VAs could misinterpret their queries. Instead of response length, they assessed information quality based on whether the response repeated part of their query. For example, in the Xanax drug interaction scenario, P11 said, “…she misinterpreted what I said, because she doesn't repeat what I said.” However, as each of the scenarios intentionally provided off-topic or inaccurate responses, we see how associating information quality with response length, detail, or repeated keywords can be misleading.

4.2.2 Being attentive to sources

Most participants (25/30) listened for information about the response source to assess information quality. For example, “I guess, if it came with a name like Mayo Clinic, Cleveland Clinic, or whatever I might have more confidence in it here. This is sort of like […] getting an opinion from a neighbor” (P4, rash, scenario 3). After listening to each of the scenarios, P7 also highlighted the importance of sources, saying, “I'd like to know about who Siri hangs with, where is she getting her information?” Sources were useful for providing a way to engage in iterative search or confirm information, particularly for medical information, for example, “In this case, it did cite a source of the information, which I think is helpful. That gives me an opportunity to maybe drill down a little bit more, and go back to that original source, and read it for myself” (P3, Xanax, scenario 4). Participants described how sources would help them to iterate on their original query, extend their search process, and compare to information provided on a computer or mobile device.

Participants tried to use similar strategies to assess information when searching visually on a computer or mobile device, but this did not always work well. For example, “… you may have a source that you already trust, or you may go to a search engine and go down the top 10 choices and find one that you would find credible […] On a voice assistant, you pretty much have to go with what they pick out to give you what they throw at you” (P12, Xanax, scenario 4). In this example, P12 describes how evaluating information is easier on a computer when choosing specific websites or sources from a range of options. Others preferred options because they “want to be able to see what the choices are […]. I don't know where Alexa is getting its information […] It's almost like getting one answer on a Google search. You wouldn't just take one answer. You'd want to know, is this based on what? And you might want to look at two or three different sources” (P17). Echoing prior work on the importance of options (Vtyurina et al., 2019), participants highlighted their value specifically for high-risk information in a health context.

Participants (17/30) emphasized how options provided when searching on a computer or mobile device allowed them to gather different types of information. For example, “I think in the website you can go further and ask what other people feel about, reviews about the drug or […] it would have multiple links that you can look at different links and see what different places tell you about this…” (P6, post-scenario reflection). Although uncommon, two participants discussed confirming information credibility or accuracy based on customer ratings on shopping sites. In the Xanax drug interaction scenario, the source provided was Xanax-wiki.com. Participants (19/30) were conflicted as to whether this was a credible source. For example, P24 said, “As there was citation, and I believe that Wikipedia is monitored and checked for accuracy. I would say that I feel comfortable taking the combination,” but P10 described how:

A lot of people publish a lot of stuff on the internet. And of all the stuff that's published on the internet, probably, I don't know how much, a great deal it's inaccurate. […] It's tainted with their opinions or their motivations. So, I want something that is not influenced by people's opinions or desires to have me buy something … (P10)

Nearly one-third of those who spoke about Wikipedia's accuracy expressed concern (2 younger adults, 6 older adults). When searching on a computer, tablet, or smartphone, others would search for information about the source like recency or domain. For example, P15 would “look at the address. If it's like a .org or .edu or something like that […]” to determine credibility. P14 described looking “at dates. When [was] the article added? I don't want anything from 1996.” These quotes show how mental models of information quality break down when strategies from searching on a computer are carried over to VAs, as VA responses at the time of this study did not consistently report the source or include information about content recency.

4.2.3 Difficulty making quick assessments

Some participants (6/30) were unsure how to tell if information from a VA was accurate and, therefore, would not follow any medical or health-related advice. However, several participants (8/30) described how they would follow the advice because they were unsure of what to do otherwise:

…I'd have to take it a day at a time. And if the mixture together gave me a good night sleep with high anxiety, I wouldn't continue it. But if the mixture gave me a good night sleep with low anxiety, I would continue to mix the Xanax with the melatonin… (P2, Xanax, scenario 4)

In this example, P2 would mix melatonin and Xanax because they could not quickly assess if the information was accurate, even though medical professionals warn against taking a sedative with anxiety medication (Bickmore et al., 2018).

This speaks to the difficulty of quickly assessing information quality with a VA. On a computer, participants could look at the address or “see is it credible or not before I actually look at it. Or at least I can glance briefly at what they suggest to me, and then I will pay more attention and read through the whole piece…” (P30, knee pain, scenario 1). The opportunity to peek or “glance briefly” at information presented verbally is not possible with existing voice technology interaction mechanisms. Similarly, participants are unable to get a quick “feeling” about information accuracy like they do when viewing the search results page on a computer or mobile device.

In summary, participants use a range of evaluation cues and strategies to assess voice information quality (RQ2), including response length, detail, repeated keywords, and source presence. However, we also show how some strategies from visual search either continue to cause confusion (e.g., evaluating Wiki sources) or do not work well when transitioning to voice-based search (e.g., listening for source recency, quick skimming).

5 DISCUSSION

Our findings show how participants are unsure how to behave when they encounter unhelpful VA responses in sensitive or risky contexts (RQ1), either disregarding or overtrusting the response. Also, participants use limited cues about content source and response length to mitigate information uncertainty and assess information quality (RQ2). In this section, we discuss how these findings are relevant for informing critical and iterative search practices with voice technologies, and propose techniques for designers to embed information quality cues within voice responses. Lastly, we use information literacy research to reflect on how our study informs healthier information ecosystems.

5.1 Expressing uncertainty and assessing quality

Participants used a range of cues to assess information quality including response length, source presence, and source familiarity (RQ2). Leveraging cues about the content source aligns with prior work, which describes how people assess (primarily news) information quality with visual interfaces (Molina et al., 2021). With visual search, source, structural, or design cues strengthen information scent, or searchers' assumptions that content is valuable (e.g., Sundar et al., 2007; Zhang et al., 2015). However, these design cues are missing when searching with voice-only interfaces. For VAs with screens, situational access challenges (e.g., being in a different room from the assistant) or vision disabilities (e.g., low vision, blindness) may remain issues. Without any or with limited source information, participants were unsure how to interpret the information from the VA, either rejecting additional information from the assistant altogether (i.e., P10 with Wikipedia) or assuming the assistant's response was correct (i.e., P2 in the Xanax scenario). Neither behavior reflects the healthy search practices that could provide VA users with the tools they need to continue their search process, and both may lead to potentially dangerous outcomes (e.g., Bickmore et al., 2018), particularly for new VA users.

Further, our findings extend research on information retrieval and aging. Prior work has identified age-related differences in visual search processes. For example, Czaja et al. (2001) studied how older and younger adults completed a complex information retrieval task on a computer, finding differences based on processing and memory speed. Grahame et al. (2004) show age-related performance differences based on visual features of the page (clutter, link size). Johnson (1990) shows how older adults engage in noncompensatory search processes by scoping their search process and filtering out alternatives more often than younger adults, who use compensatory, comparison-based search strategies. Similarly, Wu and Li (2016) show how older adults were less likely to search beyond the initial few results, instead following hyperlinks within a page rather than searching across platforms or websites. In our study, we observed how younger adult participants (i.e., P18 and P20) expected to use the VA to supplement their iterative search process, and therefore preferred search results that provided them with source information to continue their search elsewhere (e.g., computer, smartphone, with a doctor). Older adult participants often described how they expected to start and complete their search process with the VA. They preferred longer responses that provided enough information to answer their question or responses that encouraged follow-up queries or conversational scoping (i.e., P3, P10). Although not statistically tested, these search goal preferences, in part, align with prior work on noncompensatory (older adults) and compensatory (younger adults) search behaviors (Johnson, 1990) and encourage future work that explores search goals by age. Both search goals suggest a need for more critical, iterative search practices with conversational and voice technologies.

To encourage critical, iterative search that supports varying search goals and user needs, we argue for personalized options for voice search responses. We recommend that developers and designers provide options for users who prefer lengthier and more conversational search results. This might mean providing users with options for longer responses or more conversational interaction by topic (e.g., sensitive or high-risk topics like health), by location (e.g., when someone is not located near their mobile device or computer), or for all search queries (e.g., for people with disabilities who face challenges engaging with visual interfaces, or older adults who want to prevent increasing demands on working memory). Further research could explore a sociophonetics approach to VA design, as in Sutton et al. (2019). Sociophonetics in voice interface research goes beyond designing voices to sound “natural” and instead seeks to incorporate an individual's unique speech characteristics. On one hand, applying a sociophonetics lens to designing for voice credibility could be system-driven, where the system projects trust and credibility using speech characteristics and phonetic details. On the other hand, this approach could be user-driven, where the VA detects when someone sounds unsure and provides tools to mitigate information uncertainty. We argue for the latter approach as there may be valid instances when the user should not trust information from a VA, and a system-driven approach can be misleading.
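As one sketch of the personalized response options recommended above, the hypothetical configuration below selects a response style from a user's stated preferences, the topic's sensitivity, and whether a screen is nearby; the preference dimensions and values are our assumptions, not an existing VA setting.

```python
# Hypothetical sketch of per-user voice response personalization.
from dataclasses import dataclass

@dataclass
class ResponsePreferences:
    verbosity: str = "brief"                         # "brief" | "detailed" | "conversational"
    high_risk_topics: tuple = ("health", "finance")  # topics treated as sensitive

def choose_style(prefs: ResponsePreferences, topic: str, near_screen: bool) -> str:
    # For high-risk topics with no screen nearby to continue the search,
    # favor conversational follow-ups over a single brief answer.
    if topic in prefs.high_risk_topics and not near_screen:
        return "conversational"
    return prefs.verbosity

prefs = ResponsePreferences(verbosity="detailed")
print(choose_style(prefs, "health", near_screen=False))  # -> "conversational"
print(choose_style(prefs, "weather", near_screen=True))  # -> "detailed"
```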

5.2 Expectations to mitigate voice information uncertainty

Participants overwhelmingly described how quickly assessing information quality was difficult when searching by voice, regardless of whether details about the source were present in the search result. If the source name was available but the participant was unfamiliar with or did not trust the source, they were unsure how to proceed. This uncertainty was often illustrated with scenario 4, in which some participants heard “xanax-wiki” referenced in the response and decided to trust the information because they perceived any source with the phrase “wiki” to be from “Wikipedia,” which they considered reliable. However, other participants had been instructed never to trust Wikipedia content or any “wiki”-related source that is user- rather than expert-generated. Prior work has suggested strategies for allowing people to explore more than one source with VAs (Vtyurina et al., 2019). However, we observe that mitigating voice information uncertainty must go beyond source presence or quantity. We argue that researchers and developers need to design VAs in ways that provide options for quick quality assessments.

With visual interfaces, people rely on visual design cues (e.g., colors, website structure, logos, advertisements, interactivity features) to assess trustworthiness and accuracy (Cho, 2019; Eysenbach, 2002; Kim et al., 2011; Montesi, 2021; Reinecke et al., 2013; Zhang et al., 2015). Lindgaard et al. (2006) find that people make a “reliable decision” about a web page within 50 milliseconds. While we did not quantitatively measure decision-making in this study, we observed that participants were unsure how to quickly assess the quality of voice-based search results (i.e., P30, who wanted to skim information), which extends prior work on visual HIS (Connaway et al., 2011) and could lead to deadly outcomes (i.e., P2, who would try mixing Xanax and melatonin). As voice interfaces afford audio interactions, we recommend using audio cues to support quick assessments. Borrowing from recent data visualization literature (Sharif et al., 2022), potential approaches include sonification, or playing non-text sounds to indicate a source's quality, playing a text summary or brief description of each source, or allowing users to ask questions about sources through conversation. Such an approach could involve collaborations between explainability and information visualization researchers. Open questions include how to generate summaries of data and how to objectively evaluate source quality to present to users.
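To illustrate the sonification idea, here is a minimal sketch that maps a hypothetical source-quality score to the pitch of a short earcon played before a response; the score itself and the pitch mapping are assumptions, and deriving an objective score remains the open question noted above.

```python
# Hypothetical sketch: sonify a source-quality score as a short tone whose
# pitch rises with quality. Uses only the Python standard library.
import math
import struct
import wave

def quality_earcon(score: float, path: str, rate: int = 44100, dur: float = 0.3) -> None:
    """Write a sine-tone WAV file; a score in [0, 1] maps to 300-900 Hz."""
    freq = 300 + 600 * max(0.0, min(1.0, score))
    with wave.open(path, "wb") as w:
        w.setnchannels(1)      # mono
        w.setsampwidth(2)      # 16-bit samples
        w.setframerate(rate)
        frames = bytearray()
        for i in range(int(rate * dur)):
            sample = int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / rate))
            frames += struct.pack("<h", sample)
        w.writeframes(bytes(frames))

# A high tone could precede a response from a well-known clinic site, and a
# low tone one from an unknown wiki, letting listeners assess quality quickly.
quality_earcon(0.9, "high_quality.wav")
quality_earcon(0.2, "low_quality.wav")
```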

Lastly, we observed how strategies for mitigating voice information uncertainty included repeating or rephrasing queries (as shown in Brewer et al., 2022) or expecting the VA to “repeat what I said” (P11), yet confirming or validating VA responses was challenging (P3, P10, P13, P17). Participants who were VA users often described how repeating or rephrasing a query would result in the same response, discouraging them from continuing their search with a VA. Instead, they wanted systems to recognize that they were repeating a query and to understand that they wanted a different response.
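A minimal sketch of that repeat-detection behavior might compare consecutive queries and vary the answer when they are near-duplicates; the similarity measure and threshold here are illustrative assumptions.

```python
# Hypothetical sketch: detect a repeated or rephrased query so the system can
# offer a different response instead of replaying the first one.
from difflib import SequenceMatcher

def is_repeat(prev: str, current: str, threshold: float = 0.7) -> bool:
    """Treat near-identical consecutive queries as a repeat/rephrase."""
    ratio = SequenceMatcher(None, prev.lower(), current.lower()).ratio()
    return ratio >= threshold

prev_query = "Can I mix Xanax and melatonin?"
new_query = "Is it safe to mix Xanax and melatonin?"
if is_repeat(prev_query, new_query):
    # e.g., cite a different source or ask a clarifying question instead
    print("Offer an alternative response rather than repeating the first one.")
```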

5.3 Critique and information ecosystems

Information studies scholars continue to emphasize a more holistic and systems approach to understanding information ecosystems (Madsen, 2016) and call for understanding people as “knowers, speakers, listeners, and informants” and not just “users” (Oliphant, 2021). As such, we engaged older and younger adults as experts (Ymous et al., 2020) to inform us about what they want in an information ecosystem that prioritizes voice-based interaction and content (RQ1). Our findings show that participants wanted to be active information seekers, confirming and analyzing responses, but that this was difficult to do without visual cues (RQ2). Recall P30, who described being able to “glance briefly” when assessing credibility on a computer; P17, who wanted the VA to immediately state whether the response could be relied upon; and several other participants who felt health information from a VA could never be trusted on its own. As such, we argue that building healthier information ecosystems is not solely about creating ways to access accurate content, but also about designing mechanisms for people to confirm and critique content.

Researchers have discussed healthy critique and discernment as a form of information literacy (e.g., Limberg et al., 2012; Limberg & Sundin, 2006). In one study, van der Vegt et al. (2021) show how information interpretation was a better measure of search success than search engine quality. Building on Hepworth's model of information behavior, Walton describes a theory of information discernment (Walton, 2017; Walton et al., 2018). By introducing an intervention encouraging college students to discuss how they evaluated information, he found that students entered a higher “cognitive questioning state.” Walton categorizes discernment by levels, where higher discernment levels included students considering “relevance, validity, reliability, and currency” and questioning “both sides of an argument.” Information literacy practices are shaped by lived experience, context, and environment (Limberg et al., 2012). Yet, most information literacy research focuses on youth and younger adults. Of note, there has been some research on older adults and health information literacy, but it focuses on visual HIS (Eriksson-Backa et al., 2018). In our study, we intentionally recruited older adults as healthier information ecosystems are important across the lifespan.

Specific to literacy, we argue that designing for healthy critique is important. Information and library researchers assert that “resilience in the face of disinformation” and promoting better literacy is crucial in the new era of information seeking (Haider & Sundin, 2022a, 2022b; Tripodi et al., 2023). This research highlights how encouraging critique, discernment, and a “discourse of doubt” are useful practices, regardless of whether someone trusts the information or source (Walton, 2017; Walton et al., 2018). As such, we push for more research on sound cue design to serve as a mechanism for healthy critique as we move toward a more multimodal information society. Such intentional design can also be useful for encouraging critique in other contexts such as with haptic information seeking (e.g., tactile graphics) and AI-forward information ecosystems when searching with large language models (e.g., ChatGPT).

5.4 Limitations and future work

We decided to design scenarios that included more direct information uncertainty contexts. As such, each scenario focused on unsuccessful VA responses. Future work could consider replicating this work with successful and unsuccessful VA responses, potentially comparing strategies across levels of (un)certainty.

Due to prior work motivating age-related information quality assessments, we purposefully recruited older adults in our study. However, we did not approach this work experimentally to directly compare older and younger adults, and therefore did not have an equal number of older and younger adults in the sample. Rather, we wanted our sample to be inclusive of an age group that is more at-risk for information uncertainty during the search process. Future research could engage in direct comparisons between age groups with an experimental recruitment approach (e.g., matched pairs) and quantitatively analyze search pattern preferences.

Lastly, we decided to use pre-recorded responses to control what was heard across participants. These recordings were helpful as this study was conducted remotely. However, this meant that participants did not have the chance to ask follow-up questions. While we acknowledge that conversations are not a single query and response, we did ask participants what they would do next, which may have included asking a follow-up question or searching by computer instead. If their next step was a follow-up question, we asked them to share that question or series of questions.

6 CONCLUSION

This paper highlights the challenges older and younger adults face when encountering information uncertainty with VAs. We find age-related preferences in voice-based search goals that affect how VAs can better facilitate critical, iterative search. Also, we offer recommendations for designing audio cues that allow users to quickly and accurately assess information quality. This work calls on researchers in information retrieval, accessibility, voice interface design, and explainability to work together to address open questions of mitigating information uncertainty with VAs.

Endnotes

  • 1 https://voicebot.ai/2019/06/21/voice-assistant-demographic-data-young-consumers-more-likely-to-own-smart-speakers-while-over-60-bias-toward-alexa-and-siri/.
  • 2 Note, there may be additional response types that are likely to induce information uncertainty.
  • 3 https://voicegenerator.io/.