Conversational Agents for Complex Collaborative Tasks

Copyright © 2020, Association for the Advancement of Artificial Intelligence. All rights reserved. ISSN 0738-4602. AI Magazine.
Even the most sophisticated current artificial intelligence (AI) systems cannot anticipate all the problem-solving scenarios to properly execute a solution without some human guidance. However, it becomes harder and harder for humans to provide this guidance as systems become more and more difficult to understand, control, and trust. For many black-box systems, the user has little control over the problem-solving strategies beyond parameter tuning, and the solutions are often given without grounds that are comprehensible to a human. In contrast, a system that could engage with humans in multimodal natural language dialogue, in which they jointly identify problems and refine, develop, and explore solutions, would constitute a major step forward toward expanding the repertoire of problems that can be (jointly) solved by leveraging both the human judgment and expertise and the system’s computational and analytic capabilities, and gaining the user’s trust that the solutions (jointly) arrived at are explainable and verifiable.
Dialogue is a very active area of research currently, both in developing new computational techniques for robust dialogue systems and in the active fielding of commercial conversational assistants such as Apple’s Siri and Amazon’s Alexa. This article argues that, while current techniques can be used to design effective dialogue-based systems for very simple tasks, they are unlikely to generalize to conversational interfaces that enhance human ability to solve complex tasks by interacting with artificial intelligence reasoning and modeling systems. We explore some of the challenges of tackling such complex tasks and describe a dialogue model designed to meet these challenges. We illustrate our approach with examples of several implemented systems that use this framework.

James Allen, Lucian Galescu, Choh Man Teng, Ian Perera
Most current dialogue systems only support interactions within very simple task models that can be represented as a short list of attributes. Once the values of these attributes are elicited, the task is essentially complete (see, for example, Williams et al. 2016 and the Dialogue State Tracking Challenge [DSTC] 1). Recent work has focused primarily on machine learning, motivated partly by the belief that machine learning leads to robustness and overcomes some of the brittleness found in earlier hand-engineered systems. However, the task models and dialogues supported remain very simple (for example, querying bus schedules, making restaurant reservations, booking plane tickets), and the task complexity has remained largely at the same level as that of systems developed decades ago, at least since the slot-filling dialogue managers described in Goddeau et al. (1996). 2
We will demonstrate that it is possible to create robust conversational systems in much more complex domains. As opposed to the tasks of prior systems, which often can be solved by identifying speech acts and arguments for slot filling, tasks in complex domains may have no clearly defined solution or end state, and the reasoning required for both the user and the system cannot be articulated easily by a nonlinguistic user interface with predefined functions. In addition, these tasks cannot be solved by a standalone optimization process or a deep-learning algorithm, because in complex problem solving the goal is often only vaguely outlined to begin with and becomes iteratively refined and modified as the problem-solving session progresses. Furthermore, even if such a black-box system could solve these tasks, the human user would have little trust in the results because the black box cannot provide explanations that the user can examine and understand.
To tackle these tasks, we need systems that can provide humans with relevant information, understand their intentions within context, and integrate the human expertise and judgment to jointly explore potential solutions and identify desirable ones, providing transparent reasoning throughout the process. Such systems operate as active collaborators in the problem-solving process, interacting with humans in the same ways that humans naturally interact among themselves when they work together, both in terms of the mode of communication and the protocol implicit in negotiating a joint endeavor.
While the main ideas behind our collaborative problem-solving (CPS) approach are not new, it was only in the last few years that we integrated the natural language components and the CPS-based dialogue model into a fully domain-independent framework, thereby allowing third parties to bootstrap from these components to independently develop their own dialogue systems in a variety of domains. Some of the technical aspects of the CPS-based model are outlined in Galescu et al. (2018). In this article we consolidate our entire approach, using examples from multiple domains, to give the reader a better understanding of the capabilities and the broad applicability of our framework, in particular to tasks of complexity higher than can be handled by the predominant state of the art. Here we also extend the discussion of the key domain-independent capabilities and show in detail how the framework can be instantiated for specific applications. We also include an evaluation of our approach using several metrics.

Examples of Complex Collaborative Tasks
As motivation, we describe two example tasks that require a level of human-machine interaction and collaboration that is well beyond the capability of standard dialogue systems. For each domain we introduce our dialogue system implemented using the CPS framework. More example dialogues of complex tasks and CPS-based systems capable of supporting these tasks can be found on the Institute for Human and Machine Cognition (IHMC) website. 3

Collaborative World Modeling
While researchers from a wide range of disciplines have developed complex simulation tools for exploring aspects of the world (for example, agriculture, economics, social stability, weather, hydrology), automated methods for combining such systems to answer larger questions about the world remain elusive. The state of the art in world modeling involves an extremely labor-intensive process, requiring person-years of effort by highly trained modelers to determine how a given scenario could correspond to a configuration of quantitative modeling engines, to identify or approximate the required data and parameter values, and then to actually run the resulting models over a set of scenario variations. For example, the Australian National Outlook (Hatfield-Dodds et al. 2015), which used mostly existing models for climate, land use, the economy, and other systems, took a large team of domain experts and scientists two years to produce.
The Collaborative World Modeling System (CWMS) is an effort to build a collaborative conversational agent to assist humans in building such models. CWMS has to understand vague user goals, suggest and understand discussions about the problem-solving strategy, be able to plan and execute simulations, explore alternatives, and present and explain analyses. Figures 1 and 2 show a sample interaction with the current prototype system. More details can be found in Allen et al. (2018b) and .
To get a better feeling for how humans should be able to interact with AI systems, consider what happens in the dialogue in figure 1. It starts with the user suggesting a goal (1), which the system accepts and takes initiative to refine (2), and then suggests an initial course of action (4). To support this interaction, the system reasons both that malnourishment is an indicator of food insecurity and that it is a value that can be computed by one of its reasoning engines. The system then constructs a plan to run a reasoning engine that can compute expected childhood malnutrition rates, based on baseline information on the expected availability of food and other variables. The answer computed is given in (6). The user then expresses a desire to elaborate the scenario based on predictions that there would be an El Niño event in the coming year. The system recognizes that the user's intention is to run a new analysis with the new assumption. The change in crop yield is computed (figure 2) and fed through an economic model to compute the effect on food availability, and the change in food availability is then fed through a food security model to compute malnourishment rates. The user follows up by asking about the effect of changing the planting date (11). The system knows a problem-solving strategy for exploring the effects of changing variables, and so asks whether it should construct a simulation experiment that estimates the crop yields for a range of different planting dates (12). Once the user concurs, the system builds and executes a plan for the experiment and shows a plot of the results to the user, identifying the best option. While this is as far as we have space to discuss it here, the user could easily continue on, for instance asking for elaboration (for example, "OK, and how would that affect the malnourishment rates?"), exploring other options (for example, "What if we could increase the amount of fertilizer that is available?"), or pursuing some new strategy for dealing with the problem, such as shipping more food aid to the region.

Biocuration
In molecular biology, finding explanations for biologic observations and phenomena requires building and visualizing complex causal models with varying granularity and conceptual elements, and running simulations on these models to detect their dynamic properties. To manage this wide range of complex problem-solving behaviors, the Blabbing on Biocuration (BoB) dialogue system integrates a variety of specialized agents with access to an extensive list of curated databases covering gene expression, protein activities, and molecular pathways, as well as knowledge extracted automatically from reading scientific publications in molecular biology (Allen et al. 2015; Gyori et al. 2017; Valenzuela-Escárcega et al. 2018). Figure 3 shows an excerpt from an actual dialogue with BoB, 4 where the user is attempting to find support for the hypothesis that a drug has a particular effect on a gene that it does not target directly. BoB can help to identify possible pathways by which the drug can affect the desired gene. There are often multiple such pathways, so the mere fact that such a pathway exists is in itself useful, but typically this has insufficient explanatory power. BoB can provide additional help by assembling such pathways into molecular models, identifying missing pieces in the model, and making suggestions for revising the model, all in collaboration with the user. These models can be analyzed statically to find mechanistic support for the hypothesis. In addition, their behavior over time can be analyzed by running simulations. While the parameters for these simulations are generally set automatically, the user can explore what-if scenarios by asking the system to alter them and compare the outcomes of the simulations under different conditions. In addition to carrying out a dialogue with the user, the system also uses a graphical interface (figure 4) to display multiple views of the model being built as well as supporting evidence from databases and the literature, with hyperlinks that the user can follow to assess their reliability and usefulness in the biologic context of their problem.
[Figure excerpt from the CWMS sample dialogue: System creates a plan to run a series of simulations that vary the planting date parameter, again estimating yields for an El Niño year by interpolating over results obtained from simulations of the past 30 years. The plan then assembles the results and presents them to the user in a chart. 14 S: The best scenario involves planting crops one week earlier than usual.]
Although CWMS and BoB involve substantially different types of AI reasoning systems and need to understand quite complex sentences expressing the user's intentions about different tasks, these two systems have been built on the same CPS framework, without having to develop separate language understanding and dialogue models for each domain. This shows that the CPS framework can serve as the basis for developing the next generation of conversational agents.

Background
In this section we review the current state of the art and show that significant extensions are needed to meet the challenges of complex domains such as world modeling and biocuration.

State-Based Dialogue Systems
Most current work in conversational agents is performed at essentially the same level of task complexity as systems dating back to at least the 1990s. To understand this notion of tasks, consider the representation used in the DSTC1, described in Williams et al. (2016). A dialogue system is formalized in terms of dialogue states, each consisting of a set of attributes (or slots) and their possible values organized as a frame. The focus of DSTC1 was providing information over the telephone about bus schedules. The slots that comprise a dialogue state are shown in figure 5, reproduced from Williams et al. (2016). The state of the conversation so far is captured by the values of the slots that have been instantiated. For instance, if the user said "I need to take a bus from City Hall tomorrow," then the dialogue state would have the TIME slot set to tomorrow's date (obviously relative to the time the conversation is occurring) and the ORIGIN POINT OF INTEREST slot set to City Hall. In the early conversational systems using this model (for example, Goddeau et al. 1996), the system would then examine the dialogue state to find an attribute that still needs to be filled. In this case, the system mainly needs to identify the destination to find appropriate bus routes. This would motivate a system response such as "Where do you want to go?" Once this question has been answered, the system would make an application-program-interface call to a bus-schedule database to identify the routes and present them to the user.
A significant advantage of such frame-based systems is that they enable robust semantic parsing (for example, using pattern matching or a trained neural network). The relevant frame significantly restricts the space of possible interpretations, and parsing can be reduced to essentially a keyword/named-entity recognition task. For example, no matter the surrounding context, if a fragment that can denote a day (for example, Saturday, tomorrow, …) is found, then it is used to fill the DATE slot. Likewise, time expressions (for example, 3 pm, in the afternoon, …) are used to fill the TIME slot. A simple grammar of addresses can recognize phrases to fill the DESTINATION_STREET slot. With a handful of special patterns, the frame can be instantiated even from ungrammatical or nonsensical sentences (whether initially misspoken or resulting from speech recognition errors).
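The slot-filling loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the DSTC system: the slot names are simplified from the bus-schedule frame, and the regular expressions, prompts, and function names (`update_frame`, `next_prompt`) are assumptions introduced here.

```python
import re

# Hypothetical patterns, one per slot; a real system would use richer grammars.
SLOT_PATTERNS = {
    "ORIGIN": re.compile(r"\bfrom ([A-Z]\w*(?: [A-Z]\w*)*)"),
    "DESTINATION": re.compile(r"\bto ([A-Z]\w*(?: [A-Z]\w*)*)"),
    "DATE": re.compile(r"\b(today|tomorrow|saturday|sunday)\b", re.IGNORECASE),
    "TIME": re.compile(r"\b(\d{1,2} ?(?:am|pm))\b", re.IGNORECASE),
}

# Hypothetical prompts for the slots assumed to be required before querying.
PROMPTS = {
    "ORIGIN": "Where are you leaving from?",
    "DESTINATION": "Where do you want to go?",
    "DATE": "What day are you traveling?",
}
REQUIRED = ["ORIGIN", "DESTINATION", "DATE"]


def update_frame(frame, utterance):
    """Fill every slot whose pattern matches, ignoring the surrounding words."""
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            frame[slot] = match.group(1)
    return frame


def next_prompt(frame):
    """Return the question for the first unfilled required slot, or None."""
    for slot in REQUIRED:
        if slot not in frame:
            return PROMPTS[slot]
    return None  # frame complete: the back-end query could be issued now


frame = update_frame({}, "I need to take a bus from City Hall tomorrow")
question = next_prompt(frame)
```

Because each pattern is matched independently of the rest of the sentence, the same code fills slots from fragments and ungrammatical input, which is the robustness property noted above.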
Much of the research in dialogue systems in the past two decades has focused on creating better dialogue management strategies. A clear step forward was to track multiple possible dialogue states rather than a single one (Pulman 1997  or the DESTINATION POINT OF INTEREST slots in the frame. Assuming the speech recognition slightly preferred City Hall and destinations are more commonly stated before origins, the system can compute a distribution over the four possible dialogue states. See Kim et al. (2008) and Williams et al. (2016) for good examples of systems following this approach.
At this stage, the dialogue manager would need to decide between a number of plausible continuations: ask the user to confirm the top speech recognition hypothesis (for example, "Did you say City Hall?"), ask the user to confirm an entire interpretation (for example, "You want to go to City Hall?"), ask the user to identify or confirm a slot (for example, "Where are you going to?"), or ask the user to repeat the statement (for example, "I didn't understand. Would you repeat that?"). In any of these situations, when the user provides the next response, the possible interpretations of that utterance would be combined with the current context to provide updated probabilities over the possible states.
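To make the idea of a distribution over dialogue states concrete, the following Python sketch builds a belief over four states and updates it after a confirmation question. All numbers are illustrative assumptions, as is the competing recognition hypothesis "City Mall"; the independence assumption and the function names are likewise introduced here and are not taken from any cited system.

```python
from itertools import product


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}


def initial_belief(asr_scores, slot_prior):
    """Distribution over (slot, value) states, assuming slot and value are independent."""
    return normalize({
        (slot, value): slot_prior[slot] * asr_scores[value]
        for slot, value in product(slot_prior, asr_scores)
    })


def update_belief(belief, likelihood):
    """Bayes update: reweight each state by how well it explains the new user turn."""
    return normalize({state: p * likelihood(state) for state, p in belief.items()})


# Assumed recognizer scores and slot prior (destination slightly more likely).
asr_scores = {"City Hall": 0.6, "City Mall": 0.4}
slot_prior = {"DESTINATION": 0.7, "ORIGIN": 0.3}

belief = initial_belief(asr_scores, slot_prior)
top = max(belief, key=belief.get)

# The system asks "You want to go to City Hall?" and hears "yes"; we assume a
# "yes" is nine times more likely if the top state is the true one.
confirmed = update_belief(belief, lambda state: 0.9 if state == top else 0.1)
```

A dialogue manager could then choose among the continuations listed above by comparing the probability of the top state against thresholds, which is one simple hand-built policy among many.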
To determine the next step, the earliest frame-based systems used hand-constructed rules (for example, Zue et al. 2000; Larsson and Traum 2000), and some of those dialogue managers have performed well in DSTC (for example, Wang and Lemon 2013). Much of the effort over the past decade, however, has focused on using machine learning to learn dialogue management strategies. There is a long history of using Partially Observable Markov Decision Process models (see the excellent review in Young et al. 2013) and more recently also neural net approaches (for example, Serban et al. 2016). These techniques generally require a large corpus of sample dialogues in the domain, ideally annotated with the correct current dialogue state (represented, for example, as a partially instantiated frame with the currently known values).
While we do not know the details of the implementations of commercial conversational assistants, such as Apple's Siri, Google's Google Assistant, and Amazon's Alexa, it is safe to say that they represent the conversational state using something equivalent in expressive power to the frame representations. Each task (for example, set an alarm, find a restaurant, navigation) has an associated set of information that needs to be acquired, and once that information has been collected an application-program-interface call is made to the back-end application that supplies the functionality.
There are some generalizations of frame-based systems that extend the range and complexity of conversations they can support. For instance, the system may accommodate multiple tasks, each represented as a separate frame. To identify the intended task and frame from the first utterance (or when the user switches to a different task), the system associates with each frame keywords and phrases that are useful indicators for the frame, and checks which frames allow the extraction of the most information from the given sentence. For example, a sentence that mentions the word "bus" would be a strong indicator that the utterance should be interpreted with respect to the frame in figure 5, rather than, for example, another frame that involves making a restaurant reservation. Many neural network-based approaches use joint models for simultaneously predicting the intent (task frame) and the slot fillers (for example, Liu and Lane 2016; Zhang and Wang 2016; Wang, Shen, and Jin 2018). Recently, DSTC included a track on developing multitask dialogue systems (Li et al. 2020), although there have been earlier forays into this area (for example, Mrkšić et al. 2015; Wen et al. 2017; Nouri and Hosseini-Asl 2018). While tackling multitask dialogues does pose additional challenges compared with handling single-task dialogues, the increase in task complexity is fairly minimal.
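A keyword-indicator version of frame selection can be sketched as follows. This is a deliberate simplification: it counts indicator words rather than measuring how much slot information each frame can extract, and the frame names and indicator sets are assumptions made for illustration.

```python
import re

# Hypothetical indicator words for two hypothetical frames.
FRAME_INDICATORS = {
    "bus_schedule": {"bus", "route", "stop"},
    "restaurant_reservation": {"table", "restaurant", "dinner", "reservation"},
}


def select_frame(utterance):
    """Pick the frame with the most indicator words in the utterance, or None."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    scores = {name: len(words & cues) for name, cues in FRAME_INDICATORS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

For example, `select_frame("When is the next bus to City Hall?")` selects the bus frame because "bus" is one of its indicators, while an utterance with no indicator words selects no frame at all.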
Another generalization is to represent a task as a series or transition network of frames, where once all the information in one frame is acquired, the system can transition to a new state (with a new frame). This model is captured in the industry standard VoiceXML 5 and can be used to develop commercial systems such as the customer service systems one encounters over the telephone. In practice, these systems are typically driven by system prompts, that is, the system asks a question or provides options for the user to select, and then moves on to the next state based on the answer. Such systems are often referred to as system-initiative. In contrast, the frame-based systems described above are typically user-initiative, that is, the user initiates the conversation as opposed to answering system queries.
Currently, most work on modeling dialogue is based on neural net models for all or most of the system components (for example, Wen et al. 2017; Lin et al. 2019). These systems are trained on transcripts of dialogues, with or without annotation of the dialogue states (that is, frames). They then attempt to generate the next turn in a dialogue, given the complete dialogue history up to that turn. Most such end-to-end systems are evaluated on datasets of complete dialogues, either generated automatically or crowdsourced, but are never put to a real test with human users (some evaluation schemes use simulated users). DSTC8 (Li et al. 2020) introduced a form of human evaluation via Amazon's Mechanical Turk, where users converse with the system to achieve a given task. However, unlike traditional user experiments, this type of evaluation suffers from a significant weakness: the task is described in detail, step by step, in natural language, which users can follow almost exactly to conduct a reasonably successful conversation.
Of note, while there has been a large effort in developing more robust and effective state-based dialogue management techniques, there has not been much effort in developing systems for more complex tasks. As we will discuss below, as task models become more complex, there are significant hurdles to overcome. First, relatively simple parsing techniques and methods of information extraction (for example, pattern recognition for slot filling and intent detection) start to lose their effectiveness as the domain of discourse becomes larger. Second, both probabilistic and neural network-based frameworks rely on having a fairly simple notion of state, and a relatively simple set of choices that can be made as the system's contribution. Both of these spaces increase substantially with the increase in task complexity. In addition, related to this second point, the more complex the dialogue states, the larger the corpora that need to be collected and, for some approaches, annotated to train the probabilistic models.

Task-Based Dialogue Systems
While state-based systems dominate the literature, there has been a steady development of conversational systems supporting more complex task models. These systems explicitly model the task being performed using hierarchical task representations, and engage the user in more complex, longer-term tasks such as tutoring, collaborative planning and control, and task learning. The notion that the structure of the dialogue reflects the structure of the underlying task was noted early on by Grosz (1974). As we will see later, this is not strictly a one-to-one correspondence, but the task structure does provide significant insight into the dialogue structure. The more complex relationship between the two was elaborated in Grosz and Sidner (1986). A number of dialogue systems are driven directly from task models. For instance, STEVE teaches students about physical tasks using a virtual environment (Rickel and Johnson 1998). COLLAGEN (Rich and Sidner 1998) and, more recently, Disco (Rich and Sidner 2012), provide general frameworks for building systems that engage in collaborative conversation based on a library of explicit tasks (not tied to any specific domain). Another task-based model that supports multitasking in dialogue is described in Lemon et al. (2002).
Let us examine a task-based dialogue system in more detail. RavenClaw (Bohus and Rudnicky 2009) is driven by a hierarchical task model such as the one shown in figure 6 for booking rooms for meetings. The task (ROOMLINE) consists of four subtasks: logging in, in which the system welcomes the user, asks if the user is already registered, and asks for their name; obtaining specifics of the reservation, including location, time, and other information such as room size; querying the back-end reservation system; and presenting and discussing the results.
The dialogue engine is task-independent and includes a number of generic conversational skills, including language interpretation, response generation, clarification requests, and error detection and management. The system manages the dialogue by moving through the task tree from subtask to subtask as each is completed. Each task is executed based on its specification, which may include the following general behaviors:
• Giving the user some information (for example, the WELCOME task involves saying "Welcome")
• Requesting information from the user (for example, ASKREGISTERED involves asking if the user is registered; DATETIME involves asking the user for the desired time of the meeting)
• Making a call to a back-end reasoner (for example, GETRESULTS involves querying the room reservation database)
As in frame-based systems, each task has a set of slots and semantic patterns for interpreting values for those slots. For example, the LOGIN task has a slot for the username, while the ASKREGISTERED task has a Boolean slot indicating whether the user is registered. The dialogue manager operates by executing tasks in the task tree, maintaining a stack of active tasks, much like the attentional stack in Grosz and Sidner (1986). To execute the current task, the system either performs one of the actions on the stack and, if appropriate, waits for a response; or, if there are no pending actions, pushes a subtask onto the stack and then executes that subtask. For instance, after the user logs in, the system asks when the user wants a room (subtask DATETIME). At that time, there are three tasks on the stack: the top is DATETIME, with a slot for reservation time. Below that is the task GETQUERY, with a time slot which it shares with DATETIME, plus the location slot and other slots from its other subtasks. At the bottom of the stack is the root goal, ROOMLINE. In interpreting the answer, all the slots on the stack are candidates to be filled. For instance, if the user said "Gavett Hall at 3PM Tuesday" in answer to the time question, the system would fill in not only the requested TIME slot, but also the LOCATION slot even though it has not asked about this (yet). This architecture thus supports the same robust parsing as phrase/keyword spotting in frame-based systems.
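The way an answer can fill slots anywhere on the task stack can be sketched as follows. This is a minimal illustration and not RavenClaw's implementation: the `Task` class, the `fill_from_stack` function, and the already-parsed `extracted` dictionary are assumptions, and the shared TIME slot is modeled simply as a slot of the same name on two tasks.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    # Slot name -> value; None marks a slot that is still unfilled.
    slots: dict = field(default_factory=dict)


def fill_from_stack(stack, extracted):
    """Offer every extracted (slot, value) pair to every task on the stack."""
    for task in stack:
        for slot, value in extracted.items():
            if slot in task.slots:
                task.slots[slot] = value


# Stack at the moment the system asks for the meeting time (bottom to top).
stack = [
    Task("ROOMLINE"),
    Task("GETQUERY", {"TIME": None, "LOCATION": None}),
    Task("DATETIME", {"TIME": None}),
]

# Assumed output of slot patterns applied to "Gavett Hall at 3PM Tuesday".
extracted = {"TIME": "3PM Tuesday", "LOCATION": "Gavett Hall"}
fill_from_stack(stack, extracted)
```

After the call, the LOCATION slot of GETQUERY is filled even though the question on top of the stack only asked about the time, so the system need not ask for the location later.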
Task-based models can engage in dialogues over tasks more complex than can be easily modeled in state-based systems. In addition, because of the explicit task model the system has a representation of what its current goals and intentions are (so, in principle, it could answer questions about what it is doing). These models work well in instructional environments where the system designer essentially lays out a concrete lesson plan, or in applications where there are well-defined tasks to be accomplished, such as in a room reservation or a travel agent domain. However, task-based dialogue models are relatively underrepresented in current research because domains where they are most effective are typically too complex to construct a sizable corpus, and thus are not amenable to machine learning approaches.

Dialogue Systems Based on CPS
While task-based models support more complex dialogues than state-based models, they still fall short for a wide range of applications involving agent interactions. Consider an example of an actual human-human dialogue collected by one of the authors (figure 7). Participant A was trying to mail a letter to Mexico. Participant B was an administrator in the office. While it would be easy to build a dedicated state- or task-based system to handle the task of mailing a letter, this dialogue does not track any predefined task and none of the approaches discussed above could explicitly model the discussion and modification of the goal and subsequent novel solutions. (In fact, most of the conversation was to define exactly what the goal was, not to develop or perform a task!) A system participating in this conversation needs to be able to generate new tasks on demand, thus requiring reasoning capabilities similar to AI planners (for example, Ghallab, Nau, and Traverso 2004), which can generate novel task models by combining operators from a plan library. In addition, building mutually agreeable plans requires intention recognition (for example, Allen and Perrault 1980; Kautz and Allen 1986) and mixed-initiative planning (Ferguson and Allen 2007).
Early work on SharedPlans (Grosz and Kraus 1996;Lochbaum et al. 1990) and plan-based dialogue systems (for example, Allen and Perrault 1980) laid good theoretical foundations for such systems. A core underlying principle was that communication acts can be formalized and reasoned about in terms of their effects on beliefs and goals of the dialogue participants. Perhaps the best developed formalism is described by Cohen and Levesque (1990). However, while there were some interesting demonstrations, these approaches have not been effective in building robust dialogue systems in practice. It is just too difficult to account for all the different actions that could occur in a dialogue from first principles alone.
When agents are engaged in solving problems together, they need to communicate to agree on what goals to pursue, what steps to take to achieve those goals, and to negotiate roles, resources, and more. In other words, the dialogue agents must take into account the nature of the joint activity itself. We call this collaborative problem solving, or CPS. Examples of early systems that took this approach include Chu-Carrol and Carberry (1998) and Litman and Allen (1987). In the early 2000s, Allen and colleagues described a preliminary plan-based CPS model of dialogue based on an analysis of an agent's collaborative behavior at various levels (Allen et al. 2002). A dialogue system based on this model is described in Blaylock and Allen (2005). Even earlier systems, such as the original the rochester Interactive Planning System (Ferguson and Allen 1998;Allen et al. 2000), operated using similar intuitions but implemented the behaviors directly, without an explicit CPS model. the CPS model consists of the following four conceptual levels: An individual problem-solving level, where each agent manages its own problem-solving state, and plans and executes individual actions. this level includes the AI systems that implement the functionality of the specific domain, as well as the overall management of these systems; A CPS level, which models and manages the joint or CPS state (shared goals, resources, situations) and is independent of any specific domain; An interaction level, where individual agents negotiate changes in the joint problem-solving state, independent of the particular domain; and A communication level, where language and/or other forms of communication realize the interaction level acts. While this may be domain-specific, in our systems we use generic semantic parsing and interpretation that applies  to any domain. 
For natural language generation (NLG), communication acts are expressed in terms of standard, domain-independent speech acts, whose content is expected to be domain-specific.

Consider first the general structure of a problem-solving state, either the collaborative problem-solving state established in the dialogue, or the individual problem-solving state of a single agent. Figure 8 shows the management of goals and ways to achieve goals (that is, tasks and solutions). We see there an encoding of the life cycle of goals and solutions: a new goal/solution may be adopted (ADOPT); an existing goal/solution may be focused on (SELECT) for pursuing in the subsequent dialogue; the current goal/solution may be deferred for now, possibly to be resumed later (DEFER); a goal/solution may be abandoned (ABANDON); or it may be accomplished and dismissed (RELEASE). When problem-solving acts are used to interpret the reasoning of a single agent, they characterize key parts of an intelligent agent's behavior. For instance, I might describe my behavior as follows: I decided to buy a rib-eye steak for dinner (ADOPT a goal), but after I found out how expensive it was I decided to buy a hamburger instead (ABANDON then ADOPT a new goal).
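The life cycle just described can be made concrete with a small sketch. The following Python fragment is only an illustration: the act names come from the text, but the status labels, the `Goal` class, and the rule that abandoned or released goals accept no further acts are assumptions introduced here, not details of the systems discussed in this article.

```python
from dataclasses import dataclass, field

# Act names are from the text; the status labels are illustrative assumptions.
ACT_TO_STATUS = {
    "ADOPT": "adopted",
    "SELECT": "in_focus",
    "DEFER": "deferred",
    "ABANDON": "abandoned",
    "RELEASE": "released",
}
# Assumption: once a goal is abandoned or released, no further acts apply.
TERMINAL = {"abandoned", "released"}


@dataclass
class Goal:
    description: str
    status: str = "proposed"
    history: list = field(default_factory=list)

    def apply(self, act: str) -> None:
        """Apply one problem-solving act and record it in the goal's history."""
        if act not in ACT_TO_STATUS:
            raise ValueError(f"unknown problem-solving act: {act}")
        if self.status in TERMINAL:
            raise ValueError(f"goal is already {self.status}")
        self.status = ACT_TO_STATUS[act]
        self.history.append(act)


# The dinner example: ADOPT a goal, then ABANDON it and ADOPT a new one.
steak = Goal("buy a rib-eye steak for dinner")
steak.apply("ADOPT")
steak.apply("ABANDON")
burger = Goal("buy a hamburger for dinner")
burger.apply("ADOPT")
```

The same structure could serve for solutions as well as goals, since the text describes one life cycle for both.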
For shared goals in a collaborative situation, these problem-solving acts can only be accomplished via communication. In other words, the two agents have to agree before something becomes shared. Consider again the dialogue in figure 7. Before the conversation, agent A has the goal to mail the letter (A0). Utterance A1 attempts to introduce a shared goal to establish the price of the postage to Mexico, which agent A believes would allow successful completion of the goal (a proposal of an ADOPT GOAL act). Agent B does not accept the proposed goal because they do not have the information to accomplish this goal (B2). In response, agent A proposes a related goal, namely, to find a method to determine the postage (utterance A3). Agent B does not address the request directly but asks what the exact goal is (utterance B4). Thus, we describe B4 as a REQUEST CLARIFY GOAL. Agent A answers this question, which in this context is interpreted as identifying a new goal (IDENTIFY GOAL). With a joint goal finally established, agent B suggests the simple solution of placing a charge slip on the letter for the post office. Utterance B6 is interpreted as both an implicit acceptance and a proposal of a solution to the agreed-upon goal (PROPOSE SOLUTION). In response, agent A accepts the proposed solution (utterance A7). The dialogue ends with the accomplishment of the joint goal as agent A hands over the letter.
The way the individual problem-solving state is implemented and managed is idiosyncratic to each application domain and typically involves specialized reasoning engines to execute the actual tasks. In the CPS framework, the individual problem-solving state is managed by a component called the behavioral agent (BA). For example, the BA in CWMS encodes modeling tasks that create and execute simulations to support activities such as intervention planning and prediction. These tasks can then be executed by invoking back-end specialized reasoning engines, such as a crop modeling system for agricultural simulations. In the BoB system the BA coordinates the activities of many knowledge sources and reasoning engines, for building and reasoning with mechanistic molecular models, simulation and analysis of dynamic molecular models, pathway analysis, and specialized database lookups.
A key insight is that the collaborative problem-solving state can be task- or domain-independent and implemented in a general fashion, given a suitable interface to the BA, which will perform the domain-specific reasoning at the individual problem-solving level. The domain-independent dialogue manager coordinates the interpretation of the dialogue, interacting with the BA as needed. It bridges the divide between how humans interact when problem solving, and how the back-end systems perform the problem-solving processes for the domain (see figure 9).
Challenges for Complex Dialogue Systems
To build dialogue systems for complex tasks, we face a number of challenges: the language is relatively unconstrained; the exact nature of the tasks cannot be anticipated and coded in advance; and the system behavior cannot be characterized by static policies. We will discuss how these considerations impact slot-filling and state/task-based systems and the CPS systems.
Language Understanding
The slot-filling approach to language understanding allows robust interpretation of sentences even in the presence of speech recognition errors and ungrammatical utterances. It is highly limited, however, as it is based on predefined (by explicit definition or by learning) extraction patterns that are associated with each slot. This works fine for slots such as TIME. We clearly need compositional semantic parsing that captures complex relations between objects, as well as complex objects such as events and nominals with relative clauses.
Specifying Task Models
Tasks are often not well defined at the start and have to be constructed incrementally during the dialogue itself. Consider a relatively basic transportation planning domain, namely the original TRIPS system (Ferguson and Allen 1998), in which a human and system collaboratively build a schedule of transport actions to accomplish some goal (for example, Use truck 3 to get the people in the city Abyss, then go to Bath and get the people there. Meanwhile, use the helicopter to get the people in Calypso). Here the task is constructed in the dialogue and then executed and monitored and may be revised in subsequent dialogue if the situation changes. The actual task is arbitrarily complex, and it is not feasible to enumerate all possibilities in advance. Rather, the dialogue interactions can be characterized by metalevel actions, for example, add this step to the plan, change a parameter value (for example, the vehicle) in this planned action, replace this action in the plan with a different action.

Determining System Behavior
In the state/task-based dialogue models, the range of system actions is quite limited, typically consisting of a few actions: asking the user for a slot value, performing a clarification or confirmation, or performing a back-end application-program-interface call. In the CPS domains, the possible system actions are the result of a complex problem-solving process, where the system needs to recognize the user's intention, system planning may occur on the basis of this intention, and problems may arise in planning that need to be resolved. While the range of possible actions can be enumerated at the meta-level, as we discussed, the actual actions are essentially unlimited given they could be based on any possible aspect of any possible plan that can be constructed.

A Framework for Collaborative Problem-Solving Systems
One of the main goals of our recent work has been to create tools for generic linguistic interpretation and intention recognition, and to provide a dialogue shell independent of domain-specific problem solving. The domain-specific BA then instantiates the higher-level intentions into concrete problem-solving actions and verifies that such actions make sense in the domain context. As a consequence, in the CPS model the back-end problem solvers are relatively insulated from the need to worry about linguistic issues of sentence understanding as well as discourse and dialogue management.
We will describe the CPS framework in more detail in this section and return to the problems of natural language understanding in complex tasks in the next section.

[Figure 9 diagram labels: User; Language & GUI gestures; Domain-Independent Collaborative Problem Solving Dialogue Manager; API calls & responses; Domain-Specific Behavioral Agent]
Figure 9. The CPS Model Bridges the Divide between Intuitive Human Behavior and Specific AI Reasoners.
Operations on the Problem-Solving State
The interaction level consists of an interaction speech act where the content of the act is an operation on the problem-solving state. As the simplest example, a shared goal can be established between agents A and B if A proposes a goal and B accepts it.
(1) PROPOSE A ADOPT G1 :as (GOAL)
    Let's analyze the food security situation in Sudan next year
(2) ACCEPT B ADOPT G1 :as (GOAL)
    OK
There are two key parts: the communicative act (for example, PROPOSE, ACCEPT), and the interaction act (for example, ADOPT), which identifies what action the communicative act is attempting to perform on the collaborative state. The main communication acts are shown in table 1. These acts are augmented by the specific operation on the CPS state that is being proposed or accepted, for example, a new top-level GOAL in the above. One might also suggest refining a current goal. For example, the second part of utterance 2 in figure 1 proposes adopting G2 (looking at malnourishment rates) as a subgoal to goal G1 (analyzing food security), formalized as follows:
(3a) PROPOSE B ADOPT G2 :as (SUBGOAL :of G1)
    Shall we look at child malnourishment rates?
Another key relation involves refining or changing an existing goal. This commonly occurs during clarifications. For instance, the dialogue might have continued as follows:
(3b) PROPOSE A ADOPT M1 :as (MODIFICATION :of G1)
    Focus on the eastern part of the country
In this case, agent A refined the goal to a more specific region to be analyzed. Once agent B accepts this, the shared goal will be updated. Table 2 shows the different relations between the new act and the existing CPS state.
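To make the two-part structure of these acts concrete, the sketch below encodes acts such as (1), (2), and (3a) as records and adds an item to the shared state only when a proposal is accepted by the other agent. The class name `CPSAct`, its field names, and the function `commit_if_accepted` are hypothetical and chosen for illustration; they do not reproduce the message format of any implemented system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CPSAct:
    speech_act: str            # communicative act, e.g. PROPOSE or ACCEPT
    agent: str                 # agent performing the act
    operation: str             # operation on the CPS state, e.g. ADOPT
    what: str                  # identifier of the goal or modification
    relation: str              # GOAL, SUBGOAL, or MODIFICATION
    of: Optional[str] = None   # the existing goal this relates to, if any


def commit_if_accepted(shared: dict, proposal: CPSAct, reply: CPSAct) -> bool:
    """Update the shared state only if a different agent accepts the proposal."""
    agreed = (
        proposal.speech_act == "PROPOSE"
        and reply.speech_act == "ACCEPT"
        and reply.agent != proposal.agent
        and (reply.operation, reply.what) == (proposal.operation, proposal.what)
    )
    if agreed:
        shared[proposal.what] = (proposal.relation, proposal.of)
    return agreed


shared_state = {}
p1 = CPSAct("PROPOSE", "A", "ADOPT", "G1", "GOAL")                # act (1)
a1 = CPSAct("ACCEPT", "B", "ADOPT", "G1", "GOAL")                 # act (2)
p3a = CPSAct("PROPOSE", "B", "ADOPT", "G2", "SUBGOAL", of="G1")   # act (3a)

established = commit_if_accepted(shared_state, p1, a1)
# Act (2) accepts G1, not G2, so the subgoal proposal is still open.
still_open = not commit_if_accepted(shared_state, p3a, a1)
```

Keeping the communicative act separate from the operation means the same record type covers proposals, acceptances, and any other act in table 1.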

Managing Domain-Specific Intentions: The EVALUATE-COMMIT Cycle
The CPS manager interprets and drives the interactions that embody the collaborative problem-solving negotiation between the user and the system (that is, the interaction level in the above discussion). This cannot be done accurately without an ability to reason about the domain-specific intentions as well. For instance, the sentence Can you analyze food insecurity in Sudan next year in figure 1, after appropriate semantic parsing, could be identified as likely to be a PROPOSE of a new top-level goal. This hypothesis can be derived solely based on the current problem-solving context (no goal has been agreed to yet, this being the first utterance) and the form of the speech act (a REQUEST), but it cannot be confirmed without checking that analyzing food insecurity is a reasonable collaborative goal in the current context, using domain-specific knowledge and reasoning. To make the system as domain-independent as possible, the CPS manager generates a ranked list of candidate CPS acts based on general knowledge, and requests the BA to evaluate the likelihood of each in turn given its domain-specific knowledge about the current problem-solving state. If the BA deems a hypothesis acceptable, the CPS manager commits to the act and changes the CPS model, thereby identifying what the system believes was the intended interpretation. This interchange proceeds as follows, where G1 is the hypothesized goal derived from a user utterance. The CPS manager requests that the BA evaluate a hypothesis about the CPS act of adopting G1 as a goal. On receiving from the BA that it deems the hypothesis ACCEPTABLE, the CPS manager commits to the interpretation (and issues a confirmation to the user). Once the acceptance is generated, G1 becomes a shared joint goal for both the system and the user. On the other hand, the BA might not find the hypothesis acceptable.
If the BA cannot infer an appropriate intention underlying an utterance, it would respond with FAILURE. If the BA can identify the intention but refuses (or is unable) to agree to it, it would respond with UNACCEPTABLE, with an optional reason. In the systems we have implemented so far, the most common reason is that the agent cannot perform the requested task or action because there are not sufficient resources.
With a FAILURE, the CPS manager can suggest an alternative from its list of candidates and the BA will evaluate its appropriateness. This EVALUATE-COMMIT cycle (figure 10) is critical for enabling intention recognition that exploits both the linguistic context (that is, the exact phrasing of utterances and the discourse context) and the domain-specific problem-solving context.
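The control flow of this cycle can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function name, the string encoding of hypotheses, the returned status labels (`COMMITTED`, `REJECTED`, `NO-INTERPRETATION`), and the choice to stop at the first UNACCEPTABLE verdict are all introduced here and are not taken from the implemented CPS manager.

```python
from typing import Callable, Iterable, Optional, Tuple


def evaluate_commit(
    candidates: Iterable[str],
    evaluate: Callable[[str], str],
    commit: Callable[[str], None],
) -> Tuple[str, Optional[str]]:
    """Offer ranked hypotheses to the BA until one is accepted or refused."""
    for hypothesis in candidates:
        verdict = evaluate(hypothesis)
        if verdict == "ACCEPTABLE":
            commit(hypothesis)          # update the shared CPS state
            return ("COMMITTED", hypothesis)
        if verdict == "UNACCEPTABLE":
            # Assumption: the intention was recognized but refused, so stop here.
            return ("REJECTED", hypothesis)
        # FAILURE: the BA could not interpret this hypothesis; try the next one.
    return ("NO-INTERPRETATION", None)


# A toy BA that only knows how to treat G1 as a top-level goal.
ranked = ["ADOPT G1 :as (SUBGOAL :of G0)", "ADOPT G1 :as (GOAL)"]
toy_verdicts = {ranked[0]: "FAILURE", ranked[1]: "ACCEPTABLE"}
committed = []

result = evaluate_commit(ranked, toy_verdicts.get, committed.append)
```

Passing the BA's evaluation and commit operations in as callables mirrors the separation described in the text: the loop itself contains no domain knowledge.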
In certain cases, the structure of the utterance and the problem-solving context might not be sufficient for the CPS manager to identify the problem-solving intention. The CPS manager can then send an underspecified intention to the BA and have it identify the intention. For instance, consider a dialogue in a collaborative blocks-world task where either the user or the system can manipulate the blocks:
User: Let's build a tower.
System: OK
User: I will move the blocks.
Without specific knowledge of the domain, the CPS manager might not be able to determine the intention behind the assertion I will move the blocks. But it does hypothesize that the assertion is relevant in some way to the current established goal of building a tower (call this G3). In the resulting message exchange, the BA proposes that the assertion modifies the shared goal G3 (filling in a constraint about who will move the blocks to build the tower). If the CPS manager accepts this proposal, it commits to the creation of the modified shared goal. Although it does not apply in this case, another possible reply is that the BA determines that the assertion should introduce a subgoal of G3.
Managing the BA's Contribution to the Dialogue: WHAT-NEXT
So far we have discussed how user utterances in a dialogue are interpreted to update the collaborative state. Here we will discuss how to manage the BA's utterances. Unlike a majority of other dialogue systems, the CPS framework does not enforce strict turn taking. Both the user and the system may produce multiple utterances in a row (see for example the dialogue in figure 3). For the CPS manager to control and coordinate system behavior, the BA has to wait until asked before it can contribute to the joint CPS state and dialogue. We call this request the WHAT-NEXT message. Every time the interpretation of a user utterance is completed, the CPS manager evaluates the current state and sends a WHAT-NEXT message if the system is allowed to take the turn. For instance, if there is pending user input, the user utterance takes priority and is processed first. This ensures that the system takes into account all the information from the user that could supplement or even modify the current state, to avoid situations where the system plans a response to the user's first utterance before considering the second one. For example:
User: Let's build a tower.
User: It should be 5 blocks tall.
System: How tall should it be? <responding before considering the second user utterance>
The turn management allows both the user and the system to plan and execute multiple utterances within their turn.
In response to the WHAT-NEXT request, the BA has a number of options. For example, it might invoke its planners and reasoners, or it might take an inventory of available resources. Eventually, however, it should respond to the CPS manager, even if just to say it is waiting for a task to finish. The CPS manager can then proceed to coordinate the next step in the collaborative problem-solving process.
Table 3 summarizes a range of responses that might be returned.
[Table 3, partial: example answers include Five blocks (answer to above question) and The nominal target of SB525334 is TGFBR1. (answer to above question)]
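The priority given to pending user input can be illustrated with a short sketch. The `TurnManager` class, its method names, and the toy interpretation and response functions below are hypothetical; they only show the ordering constraint described above, in which the BA is asked what to do next only once no user utterance is waiting.

```python
from collections import deque


class TurnManager:
    """Toy turn management: pending user input always precedes WHAT-NEXT."""

    def __init__(self, interpret, what_next):
        self.pending = deque()
        self.interpret = interpret    # updates the shared state from an utterance
        self.what_next = what_next    # asks the BA for its contribution

    def hear(self, utterance):
        self.pending.append(utterance)

    def step(self):
        """Interpret one pending utterance; query the BA only if none remain."""
        if self.pending:
            self.interpret(self.pending.popleft())
        if self.pending:
            return None               # more user input is waiting, so the BA waits
        return self.what_next()


state = {}


def interpret(utterance):
    if utterance == "Let's build a tower.":
        state["goal"] = "tower"
    elif utterance == "It should be 5 blocks tall.":
        state["height"] = 5


def what_next():
    if "height" not in state:
        return "How tall should it be?"
    return f"OK, I will build a tower {state['height']} blocks tall."


manager = TurnManager(interpret, what_next)
manager.hear("Let's build a tower.")
manager.hear("It should be 5 blocks tall.")
first = manager.step()     # second utterance still pending: no system turn yet
second = manager.step()    # nothing pending: the BA responds with full information
```

With this ordering the system never asks How tall should it be? in the example dialogue, because the height is already in the state when the BA is consulted.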
A Generic Shell for CPS
The framework described in the last section defines the interface between the language interpretation/dialogue management components and the AI reasoning systems but makes no commitment to how these components are implemented. Given the diversity of applications where CPS could be used, BAs may differ significantly in their structure, just as the back-end reasoning engines they use may differ dramatically. As long as they support this interface, they can be integrated into the CPS framework. The same is true of the language and discourse processing components. However, a key strength of our approach is that much of the language and discourse processing can be implemented in a domain-independent fashion and be used in multiple collaborative systems in different domains and with radically different AI reasoning systems.
We have created a generic dialogue shell for systems that support collaborative problem-solving dialogues. This is described in detail in Galescu et al. (2018) and the code for the system (called Cogent) is publicly available on GitHub (the link is provided in the sidebar). This dialogue shell has been used to build systems in a range of domains, including a mixed-initiative system for planning and execution in blocks world (Perera et al. 2017); learning about structures in blocks world; an assistant to a biologist for building, visualizing, running, and modifying complex biologic causal models (Gyori et al. 2017; Burstein et al. 2020); helping a human composer create and edit music scores (Quick and Morrison 2017); playing cooperative games; and World Modeling (Allen and Teng 2019). Each one of these systems uses very different forms of domain-specific reasoning, but all use the same CPS framework and interface to the generic dialogue shell.

Generic Language Understanding for CPS
One of the unique strengths of Cogent, as exemplified in its instantiations in vastly different domains, is that the language and discourse processing can be constructed in a domain-independent manner. In contrast, virtually all current dialogue systems use domain-specific slot-filling parsers (whether hand-built or learned), and a new parsing system needs to be custom-constructed for each new domain, often starting by collecting (and annotating) a large corpus. As mentioned earlier, the slot-filling approach is not viable for such tasks as the complexity and variety of possible utterances require a compositional analysis of meaning. Progress will be greatly hampered unless we can build such a parsing system once and reuse it in new domains. In this section we briefly describe such a system. More details can be found in Allen and Teng (2017) and Allen et al. (2018a).

The core engine for processing language is the TRIPS parser. The name reflects its roots in the original TRIPS system (Ferguson and Allen 1998), which focused on a transportation domain. In the twenty-plus years since, TRIPS has been developed into a domain-general, broad coverage, deep semantic parser for both dialogue and open text such as web pages and scientific articles. By broad coverage, we mean that the lexicon substantially covers typical English usage (on the scale of WordNet; Fellbaum 1998). By deep, we mean that all words are assigned senses that are organized into an ontology, and that each sense has associated semantic roles, semantic preferences, syntactic linking templates, and axiomatization. By ontology, we mean not only a hierarchy of concepts with inheritance of properties, but also axioms that capture the relationships between concepts, especially temporal and causal relationships. The TRIPS parser produces a rich representation of the sentence's meaning, assigning word senses from its ontology to most words and linking them with well-founded semantic roles.
Figure 11 shows a sample parse in graphical form. At a superficial level, the TRIPS representation looks similar to the abstract-meaning representation (Banarescu et al. 2013), but there are fundamental differences. Most importantly, all words in the TRIPS representation have senses in the TRIPS ontology, whereas, in abstract-meaning representation, for the most part only verbal forms and their derivational forms have sense tags. In addition, there is no distinction in abstract-meaning representation between a statement that a particular peach is juicy (for example, The peach is juicy) and a statement that all peaches are juicy (for example, Peaches are juicy). Such distinctions about quantifiers and others are critical for effective intention recognition. The TRIPS representation has a formal semantics that generalizes other well-known formalisms such as minimal recursion semantics (Copestake et al. 2005), hole semantics (Bos 2002), and dominance constraints (Koller et al. 2003). Thus, while the basic logical form does not scope quantifiers, operators, and adverbials, it can encode scoping constraints and supports tractable algorithms for scope disambiguation (Manshadi et al. 2018).

The output from the parser is passed through a series of graph-based transformations, which rewrite and simplify the graphs into deeper representations. All of them use a subsystem that matches and rewrites graphs using rules defined in terms of the ontology. We say a pattern graph P matches a target graph T if there is a one-to-one mapping of the nodes and arcs in P to a subset of T such that the ontology type of each node in P is equal to or a supertype of the ontology type of the corresponding node in T.
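The matching condition above can be sketched as a small backtracking search. The toy ontology, the node identifiers, and the role label `:object` below are assumptions made for illustration (only ONT::EVENT-OF-CHANGE appears in this article), and the code is not the matcher used in the TRIPS system.

```python
# Toy ontology as child -> parent links. Types other than
# ONT::EVENT-OF-CHANGE are hypothetical stand-ins.
PARENT = {
    "ONT::BUILD": "ONT::EVENT-OF-CHANGE",
    "ONT::EVENT-OF-CHANGE": "ONT::ROOT",
    "ONT::TOWER": "ONT::PHYS-OBJECT",
    "ONT::PHYS-OBJECT": "ONT::ROOT",
}


def subsumes(general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT.get(specific)
    return False


def match(p_nodes, p_arcs, t_nodes, t_arcs):
    """Return a one-to-one pattern-node -> target-node mapping, or None.

    Nodes are {id: ontology_type}; arcs are (source, label, destination).
    """
    p_ids = list(p_nodes)

    def extend(i, mapping):
        if i == len(p_ids):
            # Every pattern arc must be present between the mapped target nodes.
            ok = all((mapping[s], lab, mapping[d]) in t_arcs for s, lab, d in p_arcs)
            return dict(mapping) if ok else None
        pid = p_ids[i]
        for tid, ttype in t_nodes.items():
            if tid in mapping.values():
                continue                      # keep the mapping one-to-one
            if subsumes(p_nodes[pid], ttype):
                mapping[pid] = tid
                found = extend(i + 1, mapping)
                if found is not None:
                    return found
                del mapping[pid]              # backtrack
        return None

    return extend(0, {})


# Target graph for a "build a tower" event; pattern asks for any change event.
target_nodes = {"e1": "ONT::BUILD", "o1": "ONT::TOWER"}
target_arcs = {("e1", ":object", "o1")}
pattern_nodes = {"?ev": "ONT::EVENT-OF-CHANGE", "?x": "ONT::PHYS-OBJECT"}
pattern_arcs = [("?ev", ":object", "?x")]

binding = match(pattern_nodes, pattern_arcs, target_nodes, target_arcs)
```

Because the type test walks up the ontology, a rule written against a general type applies to every more specific type beneath it without being restated.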
For example, a rule that identifies a likely intended speech act using conventional linguistic signals can be summarized as follows and shown graphically in figure 12 (terms with the prefix ONT:: are types in the TRIPS ontology): If node A has a :content link to node B and B has a :modality link to node C, and (1)

Matching the parse for Can you analyze food insecurity in Sudan next year (figure 11) against the above rule, we can derive that the sentence should be interpreted as an interaction act PROPOSE. This transformation mechanism is used in successive phases of interpretation described below.
[Table 3, partial:
We can't run the simulation as we have no rainfall data.
EXECUTION-STATUS
BA reports that a goal/task is (successfully) completed: I've built the tower. The simulation is complete.
BA reports that a goal/task is in progress: I'm still working on it. I'm running the simulation now.
BA reports that it is waiting for the user: I'm waiting for your decision.]

Conventional Speech Act Interpretation
The first stage involves mapping the parsing output to a ranked set of possible intended speech act interpretations based on its lexical/syntactic/semantic structure, building from work originally by Hinkelman and Allen (1989). The example discussed above and shown in figure 12 depicts a simple rule that maps sentences of the form Can you do X (for example, the sentence in figure 11) to a proposal to adopt X as a shared goal. There are approximately 100 hand-built rules that identify common phrasings with likely intentions in English. For instance, there are multiple ways in which goals might be proposed. Other rules relate to common forms of acceptance, agreement, and rejection, as well as sentence fragments as answers (for example, two green blocks). Some of the more complex rules are patterns that match conditional statements and map to speech acts such as conditional ASK-IF (Is X if Y?) and ASK-WHAT-IS (What is X if Y?) acts. These patterns encode conventional language use for English and are independent of any domain, but in conjunction with domain-specific named entity recognition they can be deployed for utterances with specialized vocabularies (for example, Is Erk inactivated if I add Selumetinib?).
It is also important to note that multiple patterns might match an input. For instance, Do you know how to open the door? might be simply a yes-no question about the hearer's abilities to open the door or, more likely, a proposal that the hearer actually open the door. The output of this first phase is a ranked list of possible conventional interpretations.

Reference Resolution
Another phase of processing rewrites subgraphs that capture definite descriptions and other referring expressions to terms that they refer to. A typical case involves pronominal reference to objects in the discourse, for example,
U: Take a block out of the box.
U: Then put another block in it.
Other cases refer to described events and activities, for example,
S: Should I compute a baseline estimate?
U: How long will it take?
TRIPS provides a rudimentary reference resolution capability that identifies likely antecedents of referring expressions by considering semantic compatibility and salience heuristics based on recency and grammatical role. Often the semantic constraints are derived from knowledge of the types of arguments the relations can take. For instance, in the first example, the word in is disambiguated to a relation ONT::IN-LOC in the TRIPS ontology, which is defined as a relation between two physical objects, the second of which is a container of some sort. To interpret the it reference, the system looks for the most recent mention that could be a container, in this case the box. In the second example, it is the subject of take, which is disambiguated to ONT::TAKE-TIME. The semantic constraints on this ontology type indicate that its subject should refer to an event or plan. Thus, it is resolved to the most salient event in the discourse, that is, the proposed action of computing a baseline. Although the reference resolution mechanism does not operate using rewriting rules, it does rewrite the relevant terms by adding appropriate referential chains from referring expressions to their antecedents.
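The two examples can be reproduced with a deliberately reduced heuristic: choose the most recent mention whose type is compatible with the constraint imposed by the relation. The type names and the `resolve` function below are hypothetical, and the sketch uses recency only; it leaves out the grammatical-role salience mentioned above.

```python
# Toy type hierarchy (child -> parent); all names are illustrative assumptions.
PARENT = {
    "BOX": "CONTAINER",
    "CONTAINER": "PHYS-OBJECT",
    "BLOCK": "PHYS-OBJECT",
    "COMPUTE": "EVENT",
}


def is_a(specific, general):
    """True if `specific` equals `general` or descends from it."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT.get(specific)
    return False


def resolve(required_type, mentions):
    """Return the most recent mention compatible with `required_type`.

    `mentions` is a list of (text, type) pairs in order of mention.
    """
    for text, mention_type in reversed(mentions):
        if is_a(mention_type, required_type):
            return text
    return None


# "Take a block out of the box. Then put another block in it."
# The relation for "in" requires its second argument to be a container.
box_dialogue = [("a block", "BLOCK"), ("the box", "BOX"), ("another block", "BLOCK")]
it_in_first_example = resolve("CONTAINER", box_dialogue)

# "Should I compute a baseline estimate?" / "How long will it take?"
# The subject of the time-taking relation should be an event or plan.
plan_dialogue = [("compute a baseline estimate", "COMPUTE")]
it_in_second_example = resolve("EVENT", plan_dialogue)
```

In the first example the more recent mention another block is skipped because it is not a container, which is the effect of combining semantic compatibility with recency.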

Domain-Specific Ontology Simplification
This optional stage is domain-dependent, and allows the semantic representation obtained from the above graph rewriting to be further transformed into different structures, even in terms of a different (domain-specific) ontology. This allows the representation to be simplified and customized to facilitate reasoning in the BA. For instance, in biocuration, a wide range of verb senses can be used to indicate the causal relation of regulation, which in the TRIPS ontology can be expressed as instances of ONT::CONTROL-MANAGE (for example, controls), ONT::OBJECTIVE-INFLUENCE (for example, affects, impacts) and others. Rather than having the BA deal with all these variations, we can define a single transformation rule that maps any node labeled with one of these types to a new node with a domain-specific relation named, for example, BOB::REGULATE. Furthermore, ONT::CONTROL-MANAGE has dozens of subtypes that are senses of additional relevant verbs (for example, govern, which belongs to ONT::GOVERNING, a grandchild of ONT::CONTROL-MANAGE). These descendant types are automatically also included in the transformation. Such canonicalization and transformation can greatly simplify the reasoning the BA has to perform to interpret the user's utterances. A detailed description of the ontology mapping mechanism as applied to event extraction in the biology domain can be found in Allen et al. (2015).
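A minimal sketch of such a mapping rule follows. The ontology types and the target relation BOB::REGULATE are taken from the text; the intermediate type between ONT::GOVERNING and ONT::CONTROL-MANAGE is not named in this article, so a placeholder is used, and the lookup function is an illustration rather than the mapping mechanism of Allen et al. (2015).

```python
# Child -> parent links. ONT::GOVERNING is described as a grandchild of
# ONT::CONTROL-MANAGE; the intermediate type name is a placeholder assumption.
PARENT = {
    "ONT::GOVERNING": "ONT::PLACEHOLDER-INTERMEDIATE",
    "ONT::PLACEHOLDER-INTERMEDIATE": "ONT::CONTROL-MANAGE",
}

# One rule: each listed type (and every descendant) maps to one domain relation.
RULES = {
    "ONT::CONTROL-MANAGE": "BOB::REGULATE",
    "ONT::OBJECTIVE-INFLUENCE": "BOB::REGULATE",
}


def map_type(ont_type):
    """Map a type via the nearest ancestor that has a rule, else leave it as is."""
    current = ont_type
    while current is not None:
        if current in RULES:
            return RULES[current]
        current = PARENT.get(current)
    return ont_type


mapped = [map_type(t) for t in ("ONT::CONTROL-MANAGE", "ONT::OBJECTIVE-INFLUENCE", "ONT::GOVERNING")]
```

All three verb senses arrive at the BA as the same relation, so the BA needs a single handler for regulation rather than one per sense.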

Managing the Collaboration
Starting from a semantic parse of the user input, an utterance is successively transformed into deeper, and if desired more domain-specific, representations using several levels of graph-based rewriting rules. The resulting output is then passed to the CPS manager, whose job is to manage the shared problem-solving state by coordinating the interactions between the human and the BA. The CPS manager maintains the state regarding the negotiation of goals. Each state has a set of graph patterns that determine the appropriate action at the CPS level as well as a transition to the next state. The graph-matching mechanism described above is used to determine the active transitions. On entering a new state, the system performs the actions associated with the state. For example, it may send a message to the BA, and its transitions would interpret the response from the BA. As an illustration, consider the user utterance Let's build a tower and a fragment of the state and transition specification that deals with a simple propose-and-accept interaction to establish a new goal (figure 13). The CPS manager starts in the state labeled START. One of the transitions from START specifies a pattern that matches if the user proposes an event of type ONT::EVENT-OF-CHANGE. This matches the building event obtained from the utterance, and the system moves to state S1 and issues a call to the BA to evaluate the hypothesis that the user is proposing to ADOPT a new goal. Two transitions leave S1, matching the ACCEPTABLE and UNACCEPTABLE responses from the BA, respectively. Following the ACCEPTABLE transition to S2, the system sends a COMMIT act to the BA and generates an ACCEPT act to inform the user, thus establishing the shared goal of building a tower.
From S2, one of the transitions (among others not shown) can be followed if there are no pending speech acts (that is, the user has not said anything else since we started this processing), in which case the system moves to S4 where it issues a WHAT-NEXT request to the BA to allow the system to take initiative for a response. The complete transition network to manage the CPS interactions consists of about thirty states and 120 transitions. As with language interpretation, the collaboration management transition network is fully domain-independent and used in all applications.
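The fragment walked through above can be written down as a small transition table. The states START, S1, S2, and S4 and the messages sent on entering them follow the text; the event labels, the target state S3 for the UNACCEPTABLE branch, and its rejection action are assumptions added so that the sketch is complete. In the actual system, transitions are selected by graph patterns rather than by string labels.

```python
# (state, event) -> (next state, actions performed on entering it)
TRANSITIONS = {
    ("START", "USER-PROPOSES-EVENT-OF-CHANGE"): ("S1", ["EVALUATE -> BA"]),
    ("S1", "ACCEPTABLE"): ("S2", ["COMMIT -> BA", "ACCEPT -> user"]),
    # Assumption: the article does not name this branch's target or action.
    ("S1", "UNACCEPTABLE"): ("S3", ["REJECT -> user"]),
    ("S2", "NO-PENDING-SPEECH-ACTS"): ("S4", ["WHAT-NEXT -> BA"]),
}


def run(events, state="START"):
    """Follow the transitions for a sequence of events, logging the actions."""
    log = []
    for event in events:
        state, actions = TRANSITIONS[(state, event)]
        log.extend(actions)
    return state, log


# "Let's build a tower", accepted by the BA, with no further user input.
final_state, actions = run(
    ["USER-PROPOSES-EVENT-OF-CHANGE", "ACCEPTABLE", "NO-PENDING-SPEECH-ACTS"]
)
```

A table of this kind scales to the full network by adding rows, which is consistent with the network being shared unchanged across applications.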
The actual system is more complex as the CPS manager also needs to consider the current shared state to make decisions, such as whether a proposal should be considered a new top-level goal or a subgoal to an existing goal. The CPS manager can rank the hypotheses based on the current state as well as linguistic and other domain-independent constraints, but ultimately, the decision of which hypothesis to accept can only be made after consulting the BA with its domain-specific reasoning.

Instantiating a Cogent-Based System
Because Cogent provides generic natural language understanding and (CPS-based) dialogue management, the main effort in creating a new dialogue system is the development of a BA that coordinates the domain-specific back-end AI systems and of an NLG component. There are no requirements for how these two components should be implemented, except that the BA must implement the protocol described in the "Framework for Collaborative Problem-Solving Systems" section for managing the CPS state (the EVALUATE-COMMIT cycle for goals, handling WHAT-NEXT for taking initiative), and be able to map from CPS acts (ADOPT, ABANDON, RELEASE) to individual problem-solving acts it can execute. The BA also needs to be able to map the semantic interpretations produced by Cogent into possibly idiosyncratic representations used by the back-end AI systems. This process can be simplified by using the ontology mapping mechanism described in the "Generic Language Understanding for CPS" subsection.
The NLG component is highly domain-specific. Other than some domain-general speech acts (for example, greetings, acknowledgments, reports of failure during natural language understanding) from the CPS manager, the content to be generated mainly comes from the BA itself; it therefore makes sense for the BA and the NLG component to be developed jointly. All Cogent-based systems currently implemented use some type of template-based NLG.
Regarding language understanding, the generic TRIPS parser and ontology provide reasonable coverage of utterances encountered in any domain, except for specialized lexical items such as organization acronyms and protein names. Thus, for most domains it is essential to implement a named entity recognizer. In Cogent, a preprocessing component called TextTagger takes databases of domain-specific names and augments the parser input with appropriate ontology types together with standardized identification information relevant to the domain (for example, International Organization for Standardization codes). In some domains, such as the blocks world, this could be as simple as a list of names of blocks and other objects in the domain. In others, such as the biocuration domain, TextTagger reads and processes millions of terms from biologic terminology and ontology resources, including names of proteins, chemicals, and other relevant objects. In the World Modelers domain, TextTagger reads large resources providing information about geographical regions (for example, local districts, towns, states, regions, and countries). An application programming interface for augmenting parser input is provided such that developers can build their own named entity recognizers into the language-processing pipeline.
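As a rough illustration of what name-based augmentation of parser input involves, the sketch below looks up names from a small table and returns character spans with a type and, where available, a standardized identifier. This is not TextTagger's interface: the `tag` function, the type labels, and the output format are assumptions, and the only identifier used is the ISO 3166-1 alpha-2 code for Sudan.

```python
import re

# Hypothetical name table: surface form -> type label and optional identifier.
GAZETTEER = {
    "Sudan": {"type": "COUNTRY", "id": "ISO3166:SD"},
    "Selumetinib": {"type": "DRUG", "id": None},
}


def tag(text):
    """Return name matches as dicts with character offsets, ordered by position."""
    spans = []
    for name, info in GAZETTEER.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append({"start": m.start(), "end": m.end(), **info})
    return sorted(spans, key=lambda s: s["start"])


sentence = "Can you analyze food insecurity in Sudan next year"
tags = tag(sentence)
```

A domain such as the blocks world would need little more than a table like this one, whereas the biocuration and World Modelers domains described above draw on far larger resources.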
While our motivation for this framework is to facilitate the development of dialogue systems for domains of high complexity, we note that a BA at the level of complexity used in current dialogue-state tracking systems can easily work within Cogent. It would simply never use the PROPOSE act itself, because these systems never take the initiative. The user may PROPOSE goals (I am looking for a cheap restaurant in Cambridge), ask questions (via ASK-IF/ASK-WHAT-IS), or make ASSERTIONs to specify values for various parameters (slots). The BA may use ASK-IF/ASK-WHAT-IS acts for getting values for its required slots, ANSWER for relaying answers to questions, and EXECUTION-STATUS to update the user when an action (for example, a booking) is done.
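A slot-filling BA of this kind could be sketched as follows. The act names come from the text above, but the class, the slot names, the tuple-shaped replies, and the "done" status value are hypothetical and chosen only for illustration; they are not Cogent's message format.

```python
class SlotFillingBA:
    """Toy slot-filling BA: it asks and answers but never PROPOSEs."""

    def __init__(self, required_slots):
        self.required = tuple(required_slots)
        self.slots = {}

    def handle(self, act, content):
        if act == "ASK-WHAT-IS":
            # User question about a slot: relay the current value, if any.
            return ("ANSWER", self.slots.get(content))
        if act in ("PROPOSE", "ASSERTION"):
            # User goal or statement: record any slot values it carries.
            self.slots.update(content)
        missing = [s for s in self.required if s not in self.slots]
        if missing:
            return ("ASK-WHAT-IS", missing[0])   # request the next slot
        return ("EXECUTION-STATUS", "done")      # e.g., the search was run


ba = SlotFillingBA(required_slots=("food", "price", "area"))
# "I am looking for a cheap restaurant in Cambridge"
print(ba.handle("PROPOSE", {"price": "cheap", "area": "Cambridge"}))
print(ba.handle("ASSERTION", {"food": "thai"}))
```

All initiative here stays with the user: the BA only reacts, which is why PROPOSE never appears among its replies.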

Evaluation
In this article we have described a framework for developing systems that support dialogue-based interaction between humans and complex intelligent systems. It is hard to imagine how a conceptual framework could be evaluated directly. Rather, the worth of a framework is revealed by the breadth and depth of the systems that can be implemented in it, and to some extent the ease with which such systems can be developed and used. To draw an analogy with the state-based framework, researchers do not evaluate the state-based model of dialogue on its own, but rather they evaluate functional systems that are implemented using said state-based model. Similarly, here we describe a few studies of the BoB system, which is the most extensively studied and used system based on the CPS framework to date.
The BoB system described in the "Examples of Complex Collaborative Tasks" section was built on Cogent and involved a number of research teams in AI and biology. The BA, conforming to the specification described in the "Framework for Collaborative Problem-Solving Systems" section, was developed by SIFT (Smart Information Flow Technologies) [...] specialized named entity taggers for biology (including genes, proteins, drugs, drug-protein and protein-protein interactions, cellular processes, and diseases), and with some domain-specific ontology mappings as described in the "Generic Language Understanding for CPS" section. Note, however, that parsing and interpretation are not specialized for the domain. The same domain-general parsing and CPS manager are used across all Cogent-based systems, including BoB. 7
Figure 13. A Fragment of the CPS Manager Model (Simplified).
BoB is under active development and has regularly undergone different types of evaluations, including some user studies. In one such study conducted using an early version of the system, eight biologists (most of whom were well versed in molecular biology but not necessarily experts in biologic modeling) were recruited by Harvard Medical School, Tufts University, and Oregon Health and Science University to test the system on three types of problems. For two of the problems the goal was to formulate and test a hypothesis that explained an observation about the effect of a drug on one or more molecular targets. The user was to first construct a plausible biologic mechanism, and then build a model for this hypothesized mechanism and check its validity by running simulations with this model (figure 3 provides an example of a dialogue while solving this kind of problem). The third problem was more open-ended; it was to look for a drug candidate that had a desirable outcome on a set of genes involved in a particular type of cancer. The users were given a short video introduction to BoB and a list of typical questions and statements that BoB could understand, although they were encouraged to express themselves in any way they found natural.
The subjects worked under three scenarios: using BoB plus the occasional assistance of a BoB expert to help with the interaction (but not with solving the problem itself); using BoB alone; and using information sources available online (a large list of resources was provided and the subjects were free to use any additional resources they were familiar with). In all scenarios, subjects were limited to thirty minutes per problem. Due to this time constraint and the complexity of the problems, full task completion was not expected (only one of the users, who was an expert modeler, completed his first two problems in full). Thus, standard evaluation metrics such as task-completion time could not be used. Instead, a third-party evaluator (MITRE Corporation) devised a set of metrics for success in four subtasks (each with several milestones, which we will not go into here): discovering relevant information; finding complete molecular paths between the drug and the measured protein(s); building a biologically plausible model that addressed the experimental result; and successfully simulating the experimental result. The user performance for each of the four subtasks was assigned a score of 1 for completion, 0.5 when some but not all of the subtask's milestones were achieved, and 0 for no milestone achieved. These scores were summed up to compute the final task completion score, which ranged from 0 to 4. The results for all three evaluation scenarios are summarized in table 4. An analysis of the transcripts found that, for a total of 491 user inputs, the system responded appropriately to seventy-two percent of them. In addition, the system was robust enough that users could continue (for example, by reformulating a request) even when the system did not understand or did not have the necessary problem-solving capability to respond appropriately initially.
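The scoring rule just described can be written out as a short sketch. The function names and the milestone counts in the example are illustrative assumptions; the actual milestones are not given in this article, and the numbers below are not study data.

```python
def subtask_score(achieved: int, total: int) -> float:
    """1 if all milestones were achieved, 0.5 if some were, 0 if none were."""
    if total <= 0 or not 0 <= achieved <= total:
        raise ValueError("need 0 <= achieved <= total and total > 0")
    if achieved == total:
        return 1.0
    return 0.5 if achieved > 0 else 0.0


def task_completion_score(milestones) -> float:
    """Sum the per-subtask scores; with four subtasks the range is 0 to 4."""
    return sum(subtask_score(achieved, total) for achieved, total in milestones)


# Hypothetical (achieved, total) milestone counts for the four subtasks:
# information discovery, path finding, model building, simulation.
example = [(3, 3), (1, 2), (0, 4), (2, 2)]
print(task_completion_score(example))  # 1 + 0.5 + 0 + 1 = 2.5
```

Because partial progress always earns 0.5 regardless of how many milestones were met, the score is coarse by design; it measures how far along a user got rather than how quickly.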
The magnitude of the scores reflected the difficulty of the problems the users had to tackle, particularly for users who, while knowledgeable of molecular biology, were not modeling experts. However, all users were able to accomplish at least some of the subtasks. Based on the scores of the three test scenarios and responses from user surveys after the tasks, it was clear that the users found using BoB far more efficient than using internet resources alone. They were able to progress much faster and further along toward solving the problems. They also found BoB fairly easy to use. It was reported that users had little need for assistance, and where assistance was provided (in the first test scenario), the users found it helpful and straightforward, which suggested that users inexperienced with BoB would need relatively little training with the system to be able to use it productively. Many of the users judged that BoB was a very helpful tool that they would like to use. Indeed, some of the biologists at Harvard Medical School and Tufts have integrated BoB into their regular suite of tools used during their research.
MITRE is also carrying out periodic stress tests on BoB, using the following series of hallmarks for guidance. Robustness: the system's language understanding and conversational capabilities are able to handle variations in how users might express themselves (lexical and syntactic variation, spelling errors) and conversational breakdowns (for example, the user not answering or providing incorrect answers to a question from the system, or switching topics). Explainability: the system can provide reasons for its behavior and explain its failures. Context awareness: the system uses dialogue context to improve understanding. Habitability: the system guides and enables users to use language naturally within the constraints of the system. Bidirectionality: the system actively and meaningfully contributes to the problem solving (and the conversation), rather than simply responding to questions and directives.
While explainability and habitability are, by and large, functions of the BA and the system's graphical user interface, Cogent's natural language understanding capabilities play a large role in the system's robustness. They also play a role in linguistic context awareness, although the testing was focused more on the task context (again, a function of the BA). The CPS model is crucial in enabling bidirectionality, although the content of the system's contributions is based on its domain-specific problem-solving capabilities. Figure 14 shows an example of the system taking initiative to make suggestions on how to improve a mechanistic model. The system also kept in sight the overall goal (finding how ERBB3 activates JUN) over the course of the dialogue. When the model under construction became capable of explaining the goal, BoB actively detected this change and informed the user of it, summarizing the causal explanation derived from the constructed model.
We will focus on robustness and bidirectionality in our discussion, as these metrics were most relevant to the domain-independent dialogue and CPS model. MITRE designed a set of 124 test inputs (in the context of a dialogue). Overall, the system scored eighty-eight percent on handling the robustness tests (including indirect ways of asking questions, typos, and other spelling mistakes in everyday as well as biology-specific words, such as synonyms for biologic entities). On bidirectionality, the system achieved a score of seventy-five percent. From MITRE's experience in evaluating such systems, it was judged that scores above sixty percent were expected to lead to a good user experience with the system. These evaluations indicated that third parties were able to integrate sophisticated domain-specific AI systems within the Cogent shell, and build an efficient and effective dialogue system capable of helping users solve complex problems. The underlying CPS model and the system's domain-independent language understanding and dialogue management capabilities were a viable approach to solving complex tasks in collaboration with the human user.
It should be noted that, because these were stress tests, with deliberately ill-formed phrasings and spelling mistakes designed to resemble specific alternatives, they reflected an expectation of system performance under a worst-case scenario. Nonetheless, although the results from both these tests and the user studies mentioned above were encouraging, further and more extensive evaluation would be needed to better understand the behavior and performance of the model. 8

Concluding Remarks
We have discussed a framework for dialogue systems that can partake in dialogues for tasks significantly more complex than possible with current state-based and machine learning-based approaches. This model supports dialogue systems in which humans can collaborate with AI reasoning systems to jointly tackle complex problem-solving tasks. By developing a system that exploits an abstraction of the CPS process that is portable across domains, we provide a rich environment for building a wide range of new applications without the need to develop each system from scratch. Significantly, this model can be used in any domain that can be cast as CPS, including applications in which the task models cannot be defined in advance, which broadens the repertoire and complexity of tasks that can be addressed by conversational agents.
Our solution provides a well-defined interface between the generic dialogue system and the domain-specific AI reasoning systems that vary from domain to domain. What is required to implement a new system is the development of a domain-specific BA that interacts with the generic CPS model and coordinates the back-end reasoning systems. More details on the BA for CWMS and how it drives various agricultural and economic reasoning engines can be found in Allen and Teng (2019). More details on the BA in the BoB system, and how it drives multiple biologic simulators and reasoning agents, can be found in Burstein et al. (to appear). The code for our generic dialogue components, including the parser, is available on GitHub and is described in Galescu et al. (2018).
While the framework has proven to be effective for building dialogue systems across a variety of complex domains, there is still much room for improvement. For language interpretation, the parser currently exploits only very simple features of the discourse context. Possible interpretations are ranked mostly based on static semantic preferences for arguments for each predicate, as well as domain-specific preferences for senses. Taking better account of the nuances of the dynamically evolving context would result in more accurate parsing. In addition, more complex discourse interactions, such as answering multiple-choice clarification questions, are currently handled fairly formulaically, putting the burden of fine disambiguation on the BA.
For CPS state management, the system relies on a state transition network to determine the permissible changes and the actions to be performed. The CPS manager can handle many common interaction scenarios, including clarification dialogues and redirection and modification of the problem-solving subtasks. However, unexpected responses and interruptions can sometimes derail the problem-solving process. In most cases the system can recover and continue, but with some loss of context and state information. Refining and expanding the transition network to better manage the problem-solving states is one of our highest priorities.
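As a rough illustration of how a state transition network can encode both the permissible changes and the actions to perform, consider the sketch below. The states, events, and actions are invented for this example and do not reproduce the CPS manager's actual network; the ACCEPTABLE and UNACCEPTABLE replies in particular are assumed names.

```python
# (current state, incoming event) -> (next state, action to perform)
TRANSITIONS = {
    ("idle", "user:PROPOSE"): ("evaluating", "send EVALUATE to BA"),
    ("evaluating", "ba:ACCEPTABLE"): ("committed", "send COMMIT to BA"),
    ("evaluating", "ba:UNACCEPTABLE"): ("idle", "report failure to user"),
    ("committed", "ba:EXECUTION-STATUS"): ("idle", "report status to user"),
}


def step(state, event):
    """Return (next_state, action); an unlisted pair is not permissible."""
    if (state, event) in TRANSITIONS:
        return TRANSITIONS[(state, event)]
    return state, None   # stay put; the caller decides how to recover


state = "idle"
for event in ["user:PROPOSE", "ba:ACCEPTABLE", "ba:EXECUTION-STATUS"]:
    state, action = step(state, event)
    print(f"{event:22} -> {state:10} | {action}")
```

An unexpected event simply finds no entry in the table, which is one way to see why interruptions are hard for this style of manager: every recovery path has to be added to the network explicitly.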
Although our framework and system significantly reduce the efforts required to build AI systems that can collaborate with humans, this is not to say that building such systems is now easy. The development of the BA remains a complex task, as most of the common-sense inference needed to understand the user intent and plan responses must be encoded there. Building a BA is, strictly speaking, outside the scope of Cogent. However, we will use our experience in interfacing with a variety of BAs (including some we built ourselves) to improve and support their integration and development.
While many challenges remain for building truly robust collaborative systems, we believe that the partition of responsibilities we have outlined in this article, with the dialogue being modeled by our domain-general model of CPS, will provide a framework for building truly useful systems in the future, systems that are capable of meaningful collaboration with humans to tackle tasks of a complexity found in real-life problems.
Ben is the user; Bob is the system.

Links to TRIPS-Based Systems
For further details on TRIPS-based systems, the reader is referred to the following links:
trips.ihmc.us/parser/cgi/lex-ont, for browsing the TRIPS lexicon and ontology
trips.ihmc.us/parser, for online interfaces to the TRIPS parser customized for different domains, including CWMS, BoB, and Cabot (blocks world)
trips.ihmc.us/cogent/video, for examples of demos and dialogues carried out with several Cogent-based dialogue systems
github.com/wdebeaum/cogent, for source code for the generic dialogue shell based on CPS (Cogent), which includes the parser, dialogue management, and the CPS manager, but no BA