Source Attribution: Recovering the Press Releases Behind Health Science News

Ansel MacLaughlin; John Wihbey; Aleszu Bajak; David A. Smith

doi:10.1609/icwsm.v14i1.7312

Authors

Ansel MacLaughlin Northeastern University
John Wihbey Northeastern University
Aleszu Bajak Northeastern University
David A. Smith Northeastern University

Abstract

We explore the task of intrinsic source attribution: inferring which portions of a derived document were adapted from an unobserved source document. Specifically, we model the relationship between news articles and their press release sources using a dataset of 64,784 health science news articles and 23,068 press releases. We approach the problem at the sentence level and work with science journalism professors to develop a four-point Likert scale describing the extent to which a news article sentence is derived from the content in the corresponding press release. Because manual annotation of pairs of news articles and press releases is time-consuming, we turn to a mix of expert, non-expert, and heuristic-based annotation to label our dataset. After a small pilot study, which found that humans who can view only the text of the news article struggle to identify which content is derived, we compare four different sentence regression models on the task. We find that modeling a sentence's context in the entire document is important: the best-performing model, a sequence regression model with BERT token representations, achieves a Spearman's ρ of 0.49 and an NDCG@1 of 0.60 on the expert-labeled test set. Examining the model's predictions, we find that it successfully identifies copied or closely paraphrased sentences in articles with a mix of derived and original content, but struggles to differentiate between loosely paraphrased and original sentences in articles with mostly original writing.
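For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how Spearman's ρ and NDCG@1 can be computed for the sentences of a single article. It is not the authors' evaluation code. The 0 to 3 label encoding, the example scores, the linear gain in NDCG, and the function names are all assumptions made for illustration.

```python
import math

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(gold, pred):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rg, rp = average_ranks(gold), average_ranks(pred)
    n = len(gold)
    mg, mp = sum(rg) / n, sum(rp) / n
    cov = sum((a - mg) * (b - mp) for a, b in zip(rg, rp))
    var_g = sum((a - mg) ** 2 for a in rg)
    var_p = sum((b - mp) ** 2 for b in rp)
    if var_g == 0 or var_p == 0:
        return float("nan")  # undefined when either side is constant
    return cov / math.sqrt(var_g * var_p)

def ndcg_at_k(gold, pred, k=1):
    """NDCG@k with linear gain (an assumption; 2**rel - 1 is also common)."""
    ranked = sorted(range(len(gold)), key=lambda i: pred[i], reverse=True)[:k]
    dcg = sum(gold[i] / math.log2(pos + 2) for pos, i in enumerate(ranked))
    ideal = sorted(gold, reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical article with five sentences: gold labels on an assumed
# 0-3 scale (3 = most derived) and made-up model scores.
gold = [0, 3, 1, 0, 2]
pred = [0.2, 2.1, 2.4, 0.1, 1.0]

rho = spearman_rho(gold, pred)
ndcg1 = ndcg_at_k(gold, pred, k=1)
```

With k = 1, NDCG reduces to the gold label of the sentence the model scores highest, divided by the highest gold label in the article, so it rewards placing a strongly derived sentence at the top of the ranking.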
