In their paper [Lingxiang Hu, Shurun Yuan, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang (2025), MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf, https://arxiv.org/abs/2502.04376], the authors confront an increasingly common predicament in contemporary workplaces: meetings consume substantial time and resources, and they often produce inefficiencies or scheduling conflicts that sap productivity. Although these gatherings are integral to aligning teams, they can hamper overall workflows when key participants must juggle multiple overlapping commitments.
The paper investigates the viability of enlisting Large Language Models (LLMs) to autonomously represent an individual and participate in meetings on their behalf, revealing both the promise and the challenges of a setting where precise timing, context awareness, and conversational nuance are paramount.
Crucial to this study is the notion that these LLMs would not merely passively observe or serve as silent note-takers but rather function as active meeting participants. The authors propose and implement a prototype system to illustrate how a specialized agent, powered by advanced language models such as GPT-4, GPT-4o, Gemini 1.5, and Llama3, can monitor meeting transcripts, identify relevant cues, selectively interject at critical junctures, and deliver meaningful contributions. They encapsulate this functionality under the broader framework of a Meeting Delegate. By adopting varied engagement strategies—whether cautious or proactive—these delegate systems can provide clarifications, ask pertinent questions, or chime in with substantive insights relevant to the user’s interests. The ultimate objective is to relieve human attendees of the burden of physically (or virtually) attending each meeting, saving time without neglecting essential information.
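To make the idea of cautious versus proactive engagement strategies more tangible, the following is a minimal sketch of two hypothetical system-prompt variants for such a delegate agent. The wording, placeholders, and the `<SILENT>` convention are assumptions for illustration and are not taken from the paper's actual prompts.

```python
# Hypothetical system-prompt templates for a meeting-delegate agent.
# Illustrative only; the paper's real prompt strategy is not reproduced here.

CAUTIOUS_DELEGATE_PROMPT = """\
You attend this meeting on behalf of {principal}.
Speak ONLY when {principal} is addressed directly or when a question
clearly falls within their stated interests: {interests}.
If unsure, reply with the single token <SILENT>.
Keep responses to at most two sentences and never disclose information
outside the approved briefing: {briefing}.
"""

PROACTIVE_DELEGATE_PROMPT = """\
You attend this meeting on behalf of {principal}.
In addition to answering direct questions, chime in whenever the discussion
touches {principal}'s interests ({interests}) and you can add a concrete
fact from the approved briefing: {briefing}.
If you have nothing useful to add, reply with <SILENT>.
"""

def build_prompt(strategy: str, principal: str, interests: str, briefing: str) -> str:
    """Select and fill one of the two engagement-strategy templates."""
    template = CAUTIOUS_DELEGATE_PROMPT if strategy == "cautious" else PROACTIVE_DELEGATE_PROMPT
    return template.format(principal=principal, interests=interests, briefing=briefing)
```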
Throughout their exposition, the authors highlight that LLMs have demonstrated impressive natural language understanding (NLU) and natural language generation (NLG) capabilities, as documented in [OpenAI, 2023, GPT-4 Technical Report, https://arxiv.org/abs/2303.08774]. However, harnessing such power in a multi-participant, dynamic environment introduces substantial complexity. A meeting context is typically rich, often spans long stretches of time, and features a tapestry of speakers, topics, and linguistic styles. For an agent to respond at precisely the right moment, it must master the delicate art of distinguishing a direct query from a tangential mention. The authors further note the prevalence of transcription errors, especially those involving names, homophones, or domain-specific jargon, which can easily derail weaker models. In response, the study runs rigorous evaluations to benchmark the performance of LLMs under realistic conditions, incorporating noisy inputs to emulate real-world usage.

At the heart of the system’s architecture lie three major components: (1) Information Gathering, (2) Meeting Engagement, and (3) Voice Generation. The Information Gathering module aggregates the user’s meeting agenda, relevant background knowledge, and any materials the user wishes to share. This might include prior slide decks, numerical data, or contextual documents: anything the user deems permissible for the delegate to disclose. The Meeting Engagement module, underpinned by a carefully crafted prompt strategy, examines each utterance in the meeting transcript as it arrives. If a talk-turn or phrase appears to cue the user in question, or aligns with that user’s stated interests, the agent generates an appropriate response, ensuring that it neither interrupts prematurely nor overlooks critical questions. Finally, the Voice Generation module can convert the generated text to speech in real time, ideally replicating the user’s vocal intonation. According to the authors, streaming both the LLM output and the text-to-speech conversion reduces latency, making the delegate’s contributions feel more natural.
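To make the division of labor concrete, here is a minimal sketch of how the three modules could be wired together. The class and function names, as well as the `call_llm` and `stream_tts` stand-ins, are hypothetical; a real deployment would plug in an actual LLM client and TTS engine.

```python
from dataclasses import dataclass, field

@dataclass
class DelegateContext:
    principal: str                      # the person being represented
    agenda: str                         # meeting agenda gathered up front
    shareable_notes: list[str] = field(default_factory=list)  # user-approved materials

def gather_information(principal: str, agenda: str, notes: list[str]) -> DelegateContext:
    """Information Gathering: bundle only what the user allows the delegate to use."""
    return DelegateContext(principal, agenda, shareable_notes=list(notes))

def engage(ctx: DelegateContext, transcript_window: list[str], call_llm) -> str | None:
    """Meeting Engagement: decide whether the latest utterances warrant a response."""
    prompt = (
        f"You represent {ctx.principal}. Agenda: {ctx.agenda}\n"
        f"Approved notes: {ctx.shareable_notes}\n"
        "Recent transcript:\n" + "\n".join(transcript_window) +
        "\nIf you should speak, write the response; otherwise write <SILENT>."
    )
    reply = call_llm(prompt).strip()
    return None if reply == "<SILENT>" else reply

def speak(text: str, stream_tts) -> None:
    """Voice Generation: hand sentences to TTS as they become available to cut latency."""
    for sentence in text.split(". "):
        if sentence:
            stream_tts(sentence)
```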
One of the paper’s core contributions is a comprehensive benchmark derived from real-world meetings. Rather than rely on synthetic or contrived data, the researchers take genuine meeting transcripts, subdivide them into sample contexts, and label each portion based on whether it represents an Explicit Cue, Implicit Cue, or an opportunity to Chime In. They also present instances where the LLM should Keep Silence—cases in which the conversation neither addresses the user nor benefits from unsolicited input. Such classification fosters granular performance measurements: e.g., an agent’s capacity to respond correctly to a direct mention, or to refrain from speaking when uncalled for. Simultaneously, the paper addresses the difficulty of evaluating success in a domain where conversation is fluid and inherently subjective. Ultimately, they rely on multiple metrics, including (1) Response Rate, measuring how often the agent speaks when it should speak, (2) Silence Rate, measuring how often the agent avoids speaking when it should not, and (3) Recall or Precision metrics, capturing the agent’s coverage and accuracy in referencing the same main points as the ground-truth human response.
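As a back-of-the-envelope illustration of how Response Rate and Silence Rate could be computed from labeled samples, consider the sketch below. The field names, label strings, and toy records are assumptions for illustration, not the paper's actual data schema.

```python
# Hypothetical evaluation records: each sample carries a benchmark label
# ("explicit_cue", "implicit_cue", "chime_in", "keep_silence") and whether
# the agent actually produced a response for that context.
samples = [
    {"label": "explicit_cue", "agent_responded": True},
    {"label": "implicit_cue", "agent_responded": False},
    {"label": "chime_in", "agent_responded": True},
    {"label": "keep_silence", "agent_responded": False},
    {"label": "keep_silence", "agent_responded": True},
]

should_speak = [s for s in samples if s["label"] != "keep_silence"]
should_stay_silent = [s for s in samples if s["label"] == "keep_silence"]

# Response Rate: fraction of speak-worthy samples where the agent spoke.
response_rate = sum(s["agent_responded"] for s in should_speak) / len(should_speak)

# Silence Rate: fraction of keep-silence samples where the agent held back.
silence_rate = sum(not s["agent_responded"] for s in should_stay_silent) / len(should_stay_silent)

print(f"Response rate: {response_rate:.2f}, silence rate: {silence_rate:.2f}")
```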
Further complicating matters, the authors observe differences in engagement preferences across models. GPT-4 and GPT-4o typically strike a balanced posture, often responding when relevant but remaining silent otherwise. Gemini 1.5 Pro, however, veers more conservative, erring on the side of under-engagement. Meanwhile, Gemini 1.5 Flash and Llama3-based models adopt a more proactive approach, sometimes yielding responses even when silence would be preferable. According to the paper, the feasible cause for these disparities lies in the fundamental architectural choices behind each LLM and the specific fine-tuning or prompting that shapes their behavior. See [Google, 2024a, Gemini 1.5: Unlocking Multimodal Understanding across Millions of Tokens of Context, https://arxiv.org/abs/2403.05530], as one example of how the Gemini line has been trained for diverse but sometimes more aggressive stances in textual engagement.

In measuring the quality of the delegate’s generated content, the project compares the agent’s output to ground-truth references. The authors identify the “main points” in the human user’s actual meeting statements, then check how many of those main points appear in the agent’s generated text. They define a “loose recall” (does the generated text cover at least one of the main points?) and a “strict recall” (does it mirror most, if not all, of them?).
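A simple way to picture the difference between the two recall notions, assuming the main points have already been extracted as short strings, is the sketch below. The substring-matching rule here is a deliberately crude stand-in for whatever matching procedure the authors actually use.

```python
def main_point_hits(main_points: list[str], generated: str) -> int:
    """Count ground-truth main points that appear (even loosely) in the generated text."""
    text = generated.lower()
    return sum(point.lower() in text for point in main_points)

def loose_recall(main_points: list[str], generated: str) -> bool:
    """Loose recall: the response covers at least one main point."""
    return main_point_hits(main_points, generated) >= 1

def strict_recall(main_points: list[str], generated: str) -> bool:
    """Strict recall: the response covers all of the main points."""
    return main_point_hits(main_points, generated) == len(main_points)

# Example: two main points from the human's actual statement (invented for illustration).
points = ["budget is approved", "launch moves to June"]
reply = "Quick update: the budget is approved, and we are still reviewing the launch date."
print(loose_recall(points, reply), strict_recall(points, reply))  # True False
```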
According to their results, about 60% of responses address at least one main point from the ground truth, indicating that these systems, while not flawless, capture salient topics more often than not. Interestingly, the LLMs produced minimal hallucinations; their main failing was extraneous or repetitive content drawn from the recent transcript, which occasionally yielded overly verbose answers that might hold up a meeting. The authors surmise that further fine-tuning or more careful management of the context window could reduce this redundancy.
Of particular significance are the authors’ ablation studies on transcription errors. They craft “Noisy Name” scenarios in which an attendee’s name is mistranscribed into a phonetic near-match, as can easily happen in practice when an attendee’s name is transliterated (for example, from Chinese) or when the speech recognition system confuses “Jason” with “Jisen.” The experiments reveal that such confusions dramatically degrade performance, especially on “Explicit Cue” instances, because a model that fails to map “Jisen” back to the user’s name may never realize it has been directly addressed. The authors argue that more robust name recognition or specialized fine-tuning is needed to resolve such pitfalls before real-world deployment.
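To illustrate the kind of perturbation involved, a noise-injection step might look roughly like the following. The substitution table and function are invented for illustration and are not the paper's actual noise model.

```python
# Hypothetical "Noisy Name" perturbation: swap attendee names in a transcript
# for phonetic near-matches to mimic speech-recognition errors.
NOISY_NAME_MAP = {
    "Jason": "Jisen",
    "Claire": "Clair",
    "Wei": "Way",
}

def inject_name_noise(transcript_lines: list[str]) -> list[str]:
    """Replace each known attendee name with its phonetic near-match."""
    noisy = []
    for line in transcript_lines:
        for clean_name, noisy_name in NOISY_NAME_MAP.items():
            line = line.replace(clean_name, noisy_name)
        noisy.append(line)
    return noisy

print(inject_name_noise(["Jason, could you share the deployment status?"]))
# ['Jisen, could you share the deployment status?']
```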
In discussing the paper’s practical ramifications, the authors illustrate a phased approach to implementing meeting delegates responsibly. In Phase I (“Execute”), the system operates on user-defined data boundaries, strictly following instructions without autonomous decisions about what confidential data to disclose. In Phase II (“Assist”), the system can reason semi-autonomously and propose actions, yet still obtains final user approval for any significant steps. Phase III (“Delegate”) envisions a future of full autonomy, wherein the agent sets data-sharing boundaries, collects new information, and acts on the user’s behalf in real time, using advanced privacy filters.
Although Phase III is the ultimate aspiration, the authors are under no illusions about the complexity of the ethical, privacy, and trust issues that must be tackled. Indeed, they point to the broader discussion in [Yan et al., 2024, On Protecting the Data Privacy of Large Language Models (LLMs): A Survey, https://arxiv.org/abs/2403.05156], underscoring that such a system must incorporate robust data-protection techniques (encryption, differential privacy, or other means) to mitigate concerns about misrepresentation and improper data usage.

The authors also deploy prototype systems in sample “demo scenarios,” such as daily project stand-up meetings. The system joins a meeting with multiple people, each assigned a distinct role. Sometimes, participants intentionally ask questions targeted at the delegate, checking if it replies with relevant background knowledge. Other times, they direct queries to different individuals, testing whether the delegate understands it should remain silent.
These short experiments, though limited, surface real-world obstacles such as response latency: even a five-second lag can feel disruptive. The authors note that GPT-4o can deliver responses faster than GPT-4, though still not at the speed of instantaneous human conversation. They suggest that one remedy may be local inference with smaller LLMs such as Llama3-8B, which, if fine-tuned specifically for meeting scenarios, could address both latency and domain adaptation, bridging the gap between performance and practicality.
Ultimately, the paper underscores the considerable promise of an LLM-based Meeting Delegate while being candid about the road ahead. Cost, speed, reliability, data security, user acceptance, and trust all loom large among the challenges. Nevertheless, by crafting a benchmark dataset from real transcripts, offering a systematic evaluation methodology, and comparing multiple families of LLMs, the authors provide the research community with a blueprint for improvement. Their demonstration that around 60% of responses can already capture at least one main point from the ground-truth conversation is encouraging, hinting that as these models evolve, they may become increasingly adept at substituting for humans in routine or time-consuming gatherings without diluting the meeting’s substance.
In conclusion, MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf by Hu et al. (2025) [https://arxiv.org/abs/2502.04376] delivers a compelling view of how advanced language models might handle the labyrinth of multi-speaker dialogue. By systematically investigating the “Explicit Cue,” “Implicit Cue,” “Chime In,” and “Keep Silence” tasks, the authors elucidate the strengths and weaknesses of GPT-4, Gemini 1.5, and Llama3-based models in dynamic conversational settings. They note that improvements in transcription accuracy, name recognition, and domain adaptation could significantly shrink error margins, and they articulate a responsible, phased adoption plan that respects privacy and fosters user trust. While the system is only an initial foray, still constrained by limited real-time reasoning, potential hallucinations, and unavoidable delays, it already shows how such a delegate can lighten the meeting load, especially in contexts like standing project updates.
Looking forward, the methods and findings in this paper point toward a future in which large language models might fluidly, and perhaps autonomously, represent busy professionals across a broad spectrum of collaborative forums. By handling noisy transcription, recognizing cues judiciously, and continually refining context management, the field stands poised for deeper transformations in how we conduct, manage, and attend meetings in both remote and in-person workplaces.