A biomedical open knowledge network harnesses the power of AI to understand deep human biology

Knowledge representation and reasoning (


BACKGROUND
Advanced machine learning (ML) has been deployed successfully across a wide range of applications. However, ML has seen far less success in "semantically rich domains" such as the biomedical sciences, where the specification of knowledge is more abstract and fluid than in other hard sciences. According to Herbert Simon, one of the founding fathers of AI, these domains typically lack mechanistic rules, and the complexity of heterogeneous, deep human domain expertise cannot be statistically aggregated (Simon 1970). Big Data must be converted into Big Knowledge if we are to harness the data revolution, and knowledge representation and reasoning (KR&R) represents a timely and exciting avenue to achieve this goal. KR&R, a field of AI, includes work that strives to emulate human learning by creating a cognitive network of semantically related concepts in which context and previous experience determine the emergence of knowledge (Croitoru et al. 2018). Early efforts to develop advanced data management systems, including EBI's SRS server (Zdobnov et al. 2002) and Kleisli (Chung and Wong 1999), somewhat anticipated the data (and information) deluge that would follow in subsequent years and clearly highlighted the need for additional efforts in this area.
Health care costs make up almost one-fifth of the entire US GDP and affect every US citizen. The opportunity, indeed the imperative, to tap into the wisdom latent in Big Data can no longer be overlooked. The "one-size-fits-all" approach is a major reason for patient treatment failures and costs. However, public biomedical data and factual knowledge repositories are physically, technically, and thematically compartmentalized, posing a significant challenge when attempting to connect the dots across the domains of specialization in biomedicine.
Under the aegis of an NSF Convergence Accelerator award (Track A), we have developed concrete applications for our biomedical open knowledge network (OKN), named the Scalable PrecisiOn Medicine Knowledge Engine (SPOKE), following the hypothesis that connecting relevant information will enable the emergence of knowledge and yield otherwise unattainable insights for understanding diseases, discovering drugs, and proactively improving personal health. Finally, by studying how human experts use SPOKE, we take a step toward a next generation of AI based on big knowledge, stepping beyond deep learning on data (Langley 2000).

GRAPH CONSTRUCTION AND CONTENT
SPOKE is a property graph containing more than 3 million nodes (of twenty-one types) and more than 15 million edges (of fifty-five types). (A detailed description of the SPOKE architecture is in preparation at the time of this writing and will be published elsewhere.) The OKN has so far integrated 37 data sources, listed at https://spoke.ucsf.edu/data-tools. Much of this data consists of genomic associations with disease, chemical compounds and their binding targets, and metabolic reactions from select bacterial organisms of relevance to human health. Also included are perturbagen-gene, food-chemical, and protein-cell type relationships (Figure 1). Several of the key concepts are mapped to biomedical ontologies (including disease, molecular pathways, and taxonomy, among others) to provide an organizational framework and facilitate user navigation. All ontologies in SPOKE were incorporated from NCBO's BioPortal repository, which contains more than 900 controlled vocabularies spanning various aspects of biomedicine (Martinez-Romero et al. 2017; Noy et al. 2009). SPOKE also uses ontologies to mark up incoming datasets for consistent linking, and it strives to align with Biolink, a biomedical semantic standard currently being established by the NIH/NCATS Biomedical Translator Consortium (Consortium 2019).
As a stated aim of our present NSF-CA proposal, over time we will continue to grow SPOKE by the integration of hundreds of data sources in the public domain including those from EPA, CDC, DHSS, and the FDA.
Of note, to enhance its relevance to human health, SPOKE focuses on experimentally determined information. Thus, computational predictions and literature curation are not currently prioritized in SPOKE.
Some of the specific areas on which this NSF award focuses include:
Proteins, by domain and including their three-dimensional shapes, to answer questions such as which potential targets of a drug cause side effects, how an existing drug can be repurposed for new indications, or whether a protein target involved in a specific disease is suitable for drug discovery (that is, druggable).
Drug discovery capabilities, such as adverse drug effects, drug-drug interactions, over a billion small-molecule compounds that are readily available from make-on-demand vendors, and interactions between drugs and proteins, a rich source of information for drug repurposing.
Geospatial measurement data, to bring in sociodemographic, economic, and environmental factors in health and disease.

FIGURE 1 Scalable PrecisiOn Medicine Knowledge Engine (SPOKE) metagraph. Nodes denote biological concepts and links show how data is related and connected in the graph

Users can interact with the data remotely and build applications powered by the graph either interactively via Cypher queries or programmatically via one of the REST Application Programming Interfaces (APIs).

Scientific evaluation and stress-testing of the biomedical OKN
As of this writing, the network structure and balance of SPOKE have been characterized and preserved via a series of computationally intensive graph-theoretical "knowledge mining" methods, including shortest-path search, motif discovery, and metabolic cycle discovery.
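As a rough illustration of what cycle discovery involves, the sketch below finds one directed cycle in a small graph using depth-first search. It is a toy under stated assumptions: the node names are placeholders, the function `find_cycle` is hypothetical, and nothing here reflects how the SPOKE team actually implemented its analysis.

```python
from typing import Dict, List, Optional

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one directed cycle (first node repeated at the end), or None."""
    color = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # Back edge to a node on the current path closes a cycle.
                start = path.index(nxt)
                return path[start:] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found is not None:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            found = visit(node)
            if found is not None:
                return found
    return None


# Hypothetical toy "reaction" graph: A -> B -> C -> A, plus a dead end C -> D.
toy_reactions = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}
cycle = find_cycle(toy_reactions)
acyclic = find_cycle({"A": ["B"], "B": []})
```

A recursive search like this is adequate for small subgraphs; a graph with millions of nodes would call for an iterative or database-side implementation.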

Scientific validation -the road ahead
In order for SPOKE to be the basis of further scientific inquiry or new products, a series of "stress-tests" simulating real-world utility needs to be conducted. While anecdotal accounts of successful drug discovery guided by smaller knowledge networks reveal the potential utility of biomedical OKNs, the very concept of biomedical OKNs still must be subject to a systematic, scientific evaluation. As SPOKE continues to grow, evaluation will take place both at the structural level of the knowledge network and by benchmarking specific queries and use cases against medical reality.

Empirical relationships between graph node concepts and paths
In addition to the generic graph-theoretical analysis (for example, centrality, degree, and others), we tested the utility of the specific node content using empirical data. For a set S of N concepts represented by nodes in SPOKE (for example, "blood glucose," "gene variant X," and "protein Y"), we asked whether their values measured in real life exhibit a statistical relationship to a particular structure of the subgraph in SPOKE spanned by the nodes in S. In the simplest case of sets of N = 2 nodes we ask: "Are two blood metabolites observed to be highly correlated in a cohort, on average, connected by a shorter path in SPOKE than two randomly chosen nodes of the same type?"
To address this question, we took advantage of a recent wellness study that collected "multiomics" data in a cohort of 108 healthy individuals, in which thousands of omics data points (genomics, blood proteomics, metabolomics, and clinical phenotype), including the abundance of circulating proteins and metabolites, were measured for each individual (Price et al. 2017). In total, 8888 pairs of these variables were found to be correlated with high statistical significance (r² > 0.9) (Price et al. 2017). We next mapped these correlated proteins or metabolites onto nodes in the SPOKE OKN and found that, remarkably, they were connected by a path that was significantly shorter than that connecting two random nodes of the same type (Figure 3). This result offers the first empirical evidence that the graph structure of the SPOKE network, which was computationally assembled from diverse biomedical databases, preserves meaningful information about mechanistic pathways that traverse various domains, most of them never explicitly mentioned in the literature.
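The comparison described above can be sketched in a few lines: compute the mean shortest-path length for the empirically correlated pairs and compare it with the mean over all pairs of nodes. The graph, the node names, and the helper functions below are all hypothetical, and the exhaustive all-pairs baseline stands in for the random sampling and significance testing that a graph of SPOKE's size would require.

```python
from collections import deque
from itertools import combinations
from statistics import mean


def shortest_path_length(adj, source, target):
    """Breadth-first hop count in an unweighted graph; None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def mean_path_length(adj, pairs):
    """Mean hop count over the pairs that are connected at all."""
    lengths = [shortest_path_length(adj, a, b) for a, b in pairs]
    return mean(d for d in lengths if d is not None)


# Hypothetical toy graph: six analytes arranged in a chain m1 - m2 - ... - m6.
edges = [("m1", "m2"), ("m2", "m3"), ("m3", "m4"), ("m4", "m5"), ("m5", "m6")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# Pairs assumed (for illustration only) to be correlated in a cohort.
correlated_pairs = [("m1", "m2"), ("m3", "m4")]

observed = mean_path_length(adj, correlated_pairs)
baseline = mean_path_length(adj, combinations(sorted(adj), 2))
```

In this toy setting the correlated pairs sit one hop apart while the average over all pairs is noticeably longer, which is the qualitative pattern the text reports for SPOKE.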
Based on our preliminary data, we argue that SPOKE use-cases themselves serve as stress tests; we illustrate some such AI applications below.

Network visualization
A complex knowledge network like SPOKE can be visualized through the Neighborhood Explorer (NE) Tool (Huang, Morris, and Branzini 2017), which lets experts and citizen scientists interact with the graph for knowledge exploration (for example, to support basic research), optimization (for example, to resolve data problems), and communication (for example, to better inform patients and physicians).
While standard network visualizations of large real-world networks often resemble "hairballs" that provide little actionable insight, these interactive, multilevel SPOKE visualizations compute and display clusters of related nodes and backbones between major nodes at each level of detail (Saket et al. 2014). These additional visualizations (now under construction) resemble geospatial maps at mid-fidelity resolutions (Figure 4), with continents of similar nodes and real paths (backbones) for each level, similar to geographic maps that show real cities and real roads at every level of detail.

Knowledge graph analysis
FIGURE 3 Analysis of blood proteomics and metabolomic data in healthy participants shows that pairs of blood analytes (protein or metabolite levels in circulation) that are correlated are connected by, on average, a shorter path in the Scalable PrecisiOn Medicine Knowledge Engine (SPOKE) graph than pairs of randomly chosen nodes

FIGURE 4 Initial rendering of a subgraph of Scalable PrecisiOn Medicine Knowledge Engine (SPOKE) using a multi-level, map-like network visualization. Diseases are denoted in the top layer and they cluster by symptom and genetic similarity. The inset shows how additional details appear when zooming over an area (for example, zooming on immune system disease uncovers more details about additional diseases that belong to that category)

FIGURE 5 A view of the Scalable PrecisiOn Medicine Knowledge Engine (SPOKE) neighborhood explorer. The top panel shows the controls that allow a user to select nodes/edges for expansion as well as other key parameters. The bottom panel shows an example of the graph neighbors of the SARS-CoV-2 spike protein (light blue), which includes three human proteins (green) and the genes encoding them (blue). One such protein (ACE2_HUMAN) has edges connecting it to three compounds (two of them approved and one, ORE-100, in experimental phase)

For the contemporary biomedical researcher, in need of accessing vast amounts of trusted information, SPOKE provides the NE (Figure 5). Relevant facts are scattered across specialized resources: for example, ClinicalTrials.gov links diseases with drugs; the GWAS Catalog contains genetic associations for thousands of phenotypes and diseases; and ChEMBL contains binding information of pharmacological compounds to their protein targets. However, if an investigator seeks to identify all existing (approved and nonapproved) drugs that target proteins encoded by genes containing SNPs associated with a given disease (to repurpose drugs for rare genetic disease, for instance), this will involve a cumbersome manual search in a number of pertinent databases separately. Furthermore, serial queries for a group of diseases or drugs would require repeated and complicated programmatic queries in various databases and assembling the results. NE solves this need. In the future, a robust, well-supported commercial product, powered by SPOKE, with a superior UI and performance, will enable investigators to perform smart queries and return actionable information, either for hypothesis generation or to inform concrete experimental approaches.
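To make the drug-repurposing query described in this section concrete, the following sketch walks a disease-to-gene-to-protein-to-compound chain over a handful of in-memory triples. Every node name, every relationship label (`ASSOCIATES`, `ENCODES`, `BINDS`), and the function `compounds_for_disease` are hypothetical placeholders chosen for illustration; they are not SPOKE's actual schema or API.

```python
from collections import defaultdict

# Hypothetical (source, relationship, target) triples standing in for facts
# that would otherwise live in separate databases.
EDGES = [
    ("DiseaseX", "ASSOCIATES", "GeneA"),
    ("DiseaseX", "ASSOCIATES", "GeneB"),
    ("GeneA", "ENCODES", "ProteinA"),
    ("GeneB", "ENCODES", "ProteinB"),
    ("Compound1", "BINDS", "ProteinA"),
    ("Compound2", "BINDS", "ProteinB"),
    ("Compound3", "BINDS", "ProteinC"),  # binds a protein unrelated to DiseaseX
]


def build_index(edges):
    """Index edges by (source, relationship) and by (target, relationship)."""
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for src, rel, dst in edges:
        outgoing[(src, rel)].add(dst)
        incoming[(dst, rel)].add(src)
    return outgoing, incoming


def compounds_for_disease(disease, edges):
    """Map each candidate compound to the (gene, protein) evidence behind it."""
    outgoing, incoming = build_index(edges)
    results = {}
    for gene in outgoing[(disease, "ASSOCIATES")]:
        for protein in outgoing[(gene, "ENCODES")]:
            for compound in incoming[(protein, "BINDS")]:
                results.setdefault(compound, set()).add((gene, protein))
    return results


candidates = compounds_for_disease("DiseaseX", EDGES)
```

Once the facts share one graph, the question reduces to a single three-hop traversal, and the returned evidence pairs show why each compound was suggested.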

FROM KNOWLEDGE TO INSIGHTS: AI APPLICATIONS
We envision a vast and integrated knowledge network connecting up to hundreds of millions of biomedical facts, with potential utility in a broad diversity of practical applications for specialists and informed general public alike. Its value is best harnessed by apps that are designed to extract useful information (for example, mine the OKN) for specific applications.

FIGURE 6 Scalable PrecisiOn Medicine Knowledge Engine (SPOKE)-enabled reconstruction of the hypothesis that dexamethasone might help recovery of patients with COVID-19. Multiple sources of evidence were required to formulate this scenario without human intervention

SPOKE was used to predict a possible treatment to reduce mortality of COVID-19 patients placed on mechanical ventilation. We constructed a chain of causation, a path in the SPOKE network that connects the ACE2 protein, the cell surface protein used by the SARS-CoV-2 virus to enter the host, to the use of dexamethasone (a corticosteroid). SPOKE exposed a pharmacological connection that no literature or Google search would have unearthed: through the analysis of gene expression profiles, we discovered that mechanical tissue stress induced by ventilation caused upregulation of ACE2 (Figure 6) and that dexamethasone suppresses the tissue hormone midkine (MK), which is critically involved in transducing mechanical stress to further upregulation of ACE2. Therefore, there exists a vicious cycle: mechanical ventilation used to combat respiratory distress caused by the virus would itself also facilitate the spread of the virus in the lungs. These results suggest that administration of corticosteroids, which was debated in the early days of the pandemic, could improve the outcome of severe (that is, ventilated) COVID-19 cases. Indeed, clinical studies have since reported that corticosteroids reduced ICU mortality, specifically for patients on ventilators, by 30 percent (Wu et al. 2020; Group et al. 2021). Here SPOKE, allowing seamless search across domains of knowledge, showed its unique power in "connecting the dots," alleviating the core problem of "database selection" in complex disciplines with countless specialties.
Another example of "connecting dots" is provided by integrating the role of bradykinin in COVID-19. Again, the entry point for the virus is ACE2, which has a direct connection to the bradykinin receptor BRKB2, and hence to its protein BKRB1_HUMAN, which represents the intersection between endocrine and immune regulation systems. This triggers proteolysis of the KNG1_HUMAN protein, which gets cleaved into kininogen. Kininogen has a large number of connections and effects, one of which is bradykinins, which have potent vasodilatory activity (Garvin et al. 2020). Thus, elevated bradykinin levels likely cause increases in vascular dilation, vascular permeability, and hypotension, all features observed in severe COVID-19 patients.

Repurposing pharmaceutical drugs
Pharmaceutical and biotechnology drug development is an expensive endeavor, and some estimates put the current cost of a new drug at $2.6 billion (DiMasi, Grabowski, and Hansen 2016). Only one of every 20 products that enter phase I clinical trials ever becomes a commercialized product; fully 50 percent fail in the costly last stage of clinical trials or fail to meet the proposed clinical endpoints in a significant part of the patient population.
SPOKE shows promise in repurposing existing drugs or discovering new therapeutic applications for them. Its predecessor, Hetionet, was stress-tested to find concrete examples of drug repurposing in two retrospective studies:
A. Bupropion, first approved for depression in 1985, was approved for smoking cessation in 1997 (Harmey, Griffin, and Kenny 2012). Predictions based on SPOKE clearly highlight this new indication (Himmelstein et al. 2017).
B. SPOKE evaluated the top 100 scoring compounds for epilepsy seizure control, successfully classifying seventy-seven compounds as anti-ictogenic (seizure suppressing), eight as unknown (no established effect on the seizure threshold), and fifteen as ictogenic (seizure generating). Notably, the predictions contained twenty-three of the twenty-five disease-modifying antiepileptics in PharmacotherapyDB v1.0 (Himmelstein et al. 2017).
The therapeutic effect at genomic, metabolomic, proteomic, physiological, or toxicological level may help identify additional uses for an existing drug. SPOKE can also determine ideal patient profiles and population targets for new therapeutic drugs prior to entering late-stage clinical trials.

Predicting new chemical biology from a small molecule's OKN neighborhood
In another planned application, we intend to encode a small molecule's OKN-derived biological context, instead of its raw chemical structure, into an "OKN fingerprint." Such small molecules are "drug-like compounds." Similar structures have been observed to exhibit similar bioactivities across a standardized panel of wet-lab assays, and this phenomenon can be exploited to identify new drugs with desired activities. Too little information exists to construct experimentally derived fingerprints, and hence computational predictions of such fingerprints have been proposed (Martin and Sullivan 2008).
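The text does not specify how an OKN fingerprint would be encoded. One simple possibility, shown here purely as an assumption, is a binary vector marking which graph concepts are direct neighbors of a compound, compared across compounds with the Tanimoto coefficient commonly used for chemical fingerprints. The compounds, concept names, and functions below are hypothetical.

```python
def okn_fingerprint(neighbors_of, compound, vocabulary):
    """Binary vector: 1 where a vocabulary concept neighbors the compound."""
    neighbors = neighbors_of.get(compound, set())
    return [1 if concept in neighbors else 0 for concept in vocabulary]


def tanimoto(fp_a, fp_b):
    """Shared set bits divided by bits set in either fingerprint."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0


# Hypothetical graph neighborhoods: proteins (P*) and side effects (SE*).
neighbors_of = {
    "cmpdA": {"P1", "P2", "SE1"},
    "cmpdB": {"P1", "P2", "SE2"},
    "cmpdC": {"P3"},
}
vocabulary = sorted(set().union(*neighbors_of.values()))

fp_a = okn_fingerprint(neighbors_of, "cmpdA", vocabulary)
fp_b = okn_fingerprint(neighbors_of, "cmpdB", vocabulary)
fp_c = okn_fingerprint(neighbors_of, "cmpdC", vocabulary)

similarity_ab = tanimoto(fp_a, fp_b)
similarity_ac = tanimoto(fp_a, fp_c)
```

Compounds that share biological context score high even if their chemical structures differ, which is the property a context-based fingerprint is meant to capture. A richer encoding might weight concepts by edge type or include multi-hop neighbors.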

Delivering SPOKE to the clinician: BRIDGE
For clinicians to be able to ingest the ever-expanding volumes and types of information available for their patients, data and algorithms such as those enabled by SPOKE must be delivered in a clear, actionable format that is workflow friendly and will enable them to respond adequately (and in real time) to complex scenarios to optimize patient outcomes. BRIDGE is a platform that launches directly from a patient's chart in the electronic health record (EHR) and assembles relevant clinical, laboratory, imaging, and patient-generated data to visualize an individual's trajectory and support clinical discussions and decision-making. Live since March 2019, it has supported a number of ongoing clinical validation projects.
The SPOKE-BRIDGE integration (Figure 7), due to complete in Fall 2022, will be thoroughly evaluated in the neurosciences using a research roadmap evaluating both in-clinic adoption and near- and long-term key clinical outcomes. The integration computes personalized biomedical profiles by selecting variables from a patient's clinical record and propagating (embedding) them through the entirety of the OKN (potentially billions of concepts) to provide a deep description of the patient's health status. Such network embeddings operate by learning low-rank vector representations of graph nodes and edges that preserve the graph's inherent structure. Embedding variables from hundreds of thousands of EHRs onto SPOKE showed that new knowledge (that is, biomedical discoveries) can emerge from such a process (Nelson, Butte, and Baranzini 2019; Nelson et al. 2021). Similar approaches have been used to analyze knowledge networks from different domains, where they showed superior performance and accuracy compared to previous graph exploratory approaches (Bordes et al. 2013; Mohamed and Nováček 2019; Nickel, Tresp, and Kriegel 2011; Yang et al. 2015).
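The idea of propagating a patient's variables through the graph can be illustrated with a random walk with restart, a simple diffusion technique. This is a swapped-in stand-in chosen for brevity, not the low-rank embedding methods cited above and not necessarily what the SPOKE-BRIDGE integration uses. The toy graph, node names, and `propagate` function are hypothetical, and the sketch assumes an undirected graph in which every node has at least one neighbor.

```python
def propagate(adj, entry_nodes, restart=0.3, iterations=100):
    """Random walk with restart: diffuse weight outward from the entry nodes.

    adj maps every node to the set of its neighbors (undirected, no isolated
    nodes). Returns a score per node; the scores sum to 1.
    """
    nodes = sorted(adj)
    entry = set(entry_nodes)
    seed = {n: (1.0 / len(entry) if n in entry else 0.0) for n in nodes}
    scores = dict(seed)
    for _ in range(iterations):
        # A fraction `restart` of the weight always returns to the entry nodes.
        updated = {n: restart * seed[n] for n in nodes}
        for node in nodes:
            # The rest is split evenly among the node's neighbors.
            share = (1.0 - restart) * scores[node] / len(adj[node])
            for neighbor in adj[node]:
                updated[neighbor] += share
        scores = updated
    return scores


# Hypothetical chain: a lab value linked to a gene, a disease, and a second disease.
adj = {
    "lab:glucose": {"gene:G1"},
    "gene:G1": {"lab:glucose", "disease:D1"},
    "disease:D1": {"gene:G1", "disease:D2"},
    "disease:D2": {"disease:D1"},
}

profile = propagate(adj, ["lab:glucose"])
```

The resulting `profile` assigns the most weight to concepts near the patient's measured variable and progressively less to distant ones, giving a graph-wide vector that downstream models can compare across patients.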
Dimensionality reduction makes such a complex biomedical profile useful and actionable for the clinician, who is alerted only to relevant clinical processes, medications, contraindications, or differential diagnostic considerations that arise from the embeddings with the OKN. The clinician queries whether their patient's biomedical profile is mathematically closer to one of their multiple diagnostic considerations on their differential, or leverages insights from other patients to predict which medication is a more precise metabolic fit for that individual. Other models are being constructed to identify biologically similar individuals (using distance measures for multifactor data at deep granularity) to surface undiagnosed conditions, as well as for critically important disease progression predictions. This approach is also being used to study the histories of patients formally diagnosed with a complex neurological condition (for example, Parkinson's disease) to explore how far in advance this outcome could have been predicted, and on the basis of which clinical markers.
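Asking whether a patient's profile is "mathematically closer" to one candidate diagnosis than another amounts to ranking reference vectors by a similarity measure. The sketch below uses cosine similarity, which is one common choice and an assumption here; the vectors, diagnosis labels, and function names are invented for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_diagnoses(patient, references):
    """Sort candidate diagnoses from most to least similar to the patient."""
    scored = [(name, cosine_similarity(patient, vec)) for name, vec in references.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical three-dimensional profiles; real profiles would span far more concepts.
patient_profile = [0.9, 0.1, 0.4]
reference_profiles = {
    "diagnosis_A": [1.0, 0.0, 0.5],
    "diagnosis_B": [0.0, 1.0, 0.2],
}

ranking = rank_diagnoses(patient_profile, reference_profiles)
```

Here `ranking` lists `diagnosis_A` first because the patient's vector points in nearly the same direction as that reference. The same ranking routine could surface biologically similar patients if the references were other patients' profiles.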

FIGURE 7 Prototype of the potential applications of BRIDGE-SPOKE. (Left) Data from the patient's EHR can be used as access points to SPOKE to provide estimates of disorders the patient may be at risk for over a selected timeframe. (Middle) Through BRIDGE, the clinician can select data points to submit to SPOKE, such as laboratory data or specific symptoms, to inform differential diagnosis. The results are shown as a network of disease probabilities and risk factors giving insight into why SPOKE selected these disorders. (Right) For a specific diagnosis, SPOKE could be used to identify which treatments are most likely to generate the desired outcomes, while informing about the most likely side effects. SPOKE, Scalable PrecisiOn Medicine Knowledge Engine

SUMMARY
Knowledge is an emergent property of the interconnected web of trusted information and known facts. To mine for "unknown knowns," we must "connect the dots" from several information sources. When heterogeneous networks are connected at a massive scale, new knowledge can be extracted as an emergent property of the network. Here, the paradigm of knowledge networks, amply proven in search, and KR&R are applied to biomedicine, a discipline that, we argue, is inherently graph-theoretic.
Machine and deep learning models such as neural networks were traditionally "black boxes," capable of delivering new data (predictions), but in and of themselves, no new knowledge. This perceived limitation has hampered their adoption in a range of chemical and biological contexts, under the sensible argument that a recommendation, prediction or prognosis a scientist or clinician cannot understand will provide no guarantee of correctness in a true discovery context. SPOKE enables the use of explanatory (that is, "clear box") ML approaches with the ability to predict biomedical outcomes in a biologically meaningful manner. It has the potential to support a host of "explainable AI" techniques (see DARPA's XAI program).
At the same time, it is important for this body of knowledge to contain all the right data to create realistic and equitable models that factor in the full diversity of population and result in better health outcomes and treatments for all members of society. We believe technology can help change the current equation of designing for the "majority," and be a great leveler.

ACKNOWLEDGMENTS
The development of SPOKE and its applications are being funded by grants from the National Science Foundation (NSF_2033569), NIH/NCATS (NIH_NOA_1OT2TR003450), and the Marcus Program in Precision Medicine Innovation. SEB holds the Heidrich Family and Friends Endowed Chair of Neurology and the Distinguished Professorship in Neurology I at UCSF.

CONFLICT OF INTEREST
S.E.B. is co-founder of Mate Bioservices, a start-up company set out to commercialize applications based on the Knowledge graph.

Sui Huang first studied medicine, followed by molecular biology and physical chemistry at the University of Zurich in the 1990s. He was a faculty member at the Harvard Medical School/Children's Hospital and then at the University of Calgary, conducting studies on cell fate control and tumor angiogenesis. He has championed the embrace of complex systems theory by biomedical research. His current work at the Institute for Systems Biology, which he joined in 2011, uses new technologies, including single-cell omics, along with the theory of non-linear stochastic dynamical systems to better understand the dynamics in health and disease, including cancer drug resistance, stem cell differentiation, and wellness-disease transitions in Personal Medicine.