EnronSR: A Benchmark for Evaluating AI-Generated Email Replies

Moran Shay; Roei Davidson; Nir Grinberg

doi:10.1609/icwsm.v18i1.31448

Authors

Moran Shay, Ben-Gurion University of the Negev
Roei Davidson, University of Haifa
Nir Grinberg, Ben-Gurion University of the Negev

DOI:

https://doi.org/10.1609/icwsm.v18i1.31448

Abstract

Human-to-human communication is no longer just mediated by computers; it is increasingly generated by them, including on popular communication platforms such as Gmail, Facebook Messenger, and LinkedIn. Yet little is known about the differences between human- and machine-generated responses in complex social settings. Here, we present EnronSR, a novel benchmark dataset that is based on the Enron email corpus and contains both naturally occurring human replies and AI-generated replies for the same set of messages. This resource enables the benchmarking of novel language-generation models in a public and reproducible manner, and facilitates comparison against the strong, production-level baseline of Google Smart Reply, which is used by millions of people. Moreover, we show that when language models produce responses, they could align more closely with human replies in terms of when responses should be offered and in their length, sentiment, and semantic meaning. We further demonstrate the utility of this benchmark in a case study of GPT-3, which shows significantly better alignment with human responses than Smart Reply, albeit without guarantees of quality or safety.
