Clarifying the Dialogue-Level Performance of GPT-3.5 and GPT-4 in Task-Oriented and Non-Task-Oriented Dialogue Systems

Shinya Iizuka; Shota Mochizuki; Atsumoto Ohashi; Sanae Yamashita; Ao Guo; Ryuichiro Higashinaka

doi:10.1609/aaaiss.v2i1.27668

Authors

Shinya Iizuka Graduate School of Informatics, Nagoya University
Shota Mochizuki Graduate School of Informatics, Nagoya University
Atsumoto Ohashi Graduate School of Informatics, Nagoya University
Sanae Yamashita Graduate School of Informatics, Nagoya University
Ao Guo Graduate School of Informatics, Nagoya University
Ryuichiro Higashinaka Graduate School of Informatics, Nagoya University

DOI:

https://doi.org/10.1609/aaaiss.v2i1.27668

Keywords:

Large Language Models, Task-oriented Dialogue, Non-task-oriented Dialogue, GPT-3.5, GPT-4, Evaluation

Abstract

Although large language models such as ChatGPT and GPT-4 have achieved superb performance in various natural language processing tasks, their dialogue performance is sometimes unclear because evaluation is often done at the utterance level, where the evaluation target is the quality of a single utterance given its context. Our objective in this work is to conduct human evaluations of GPT-3.5 and GPT-4 on MultiWOZ and persona-based chat tasks in order to verify their dialogue-level performance in task-oriented and non-task-oriented dialogue systems. Our findings show that GPT-4 performs comparably with a carefully created rule-based system and significantly outperforms other systems, including those based on GPT-3.5, in persona-based chat.
