You Have the Floor: A Speaker-Aligned Corpus Derived from the Congressional Record

Authors

  • Jennifer L. Bochenek, Drexel University
  • Jake Ryland Williams, Drexel University

DOI:

https://doi.org/10.1609/icwsm.v19i1.35941

Abstract

The United States Congressional Record serves as a comprehensive archive of legislative discourse, yet its sheer volume and unstructured format pose significant challenges for researchers interested in analyzing political language, speaker behavior, and ideological framing. This paper presents a new dataset that organizes Congressional speeches by individual speaker. The full text of the Congressional Record was retrieved using a combination of the Congress.gov Application Programming Interface (API) and web scraping. After extracting roll-call votes and standardizing the transcripts to remove noisy artifacts and normalize formatting, the speeches are separated into individual files by speaker, and each speaker is annotated with metadata including name, political affiliation, years active, district or state represented, and professional social media accounts, if known. This enables fine-grained analysis of rhetorical patterns and linguistic strategies across political groups and time periods. By making the dataset publicly available, we aim to support interdisciplinary research that uses natural language processing (NLP).
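The speaker-separation step described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name split_by_speaker and the heading pattern SPEAKER_RE are hypothetical, and the pattern assumes a simplified form of the Congressional Record convention in which a speaker's turn opens with a title and an upper-case surname (for example "Mr. SMITH." or "Ms. JONES of Ohio."). Real transcripts contain many more heading variants, such as presiding officers and procedural text, that this sketch ignores.

```python
import re
from collections import defaultdict

# Hypothetical, simplified pattern for a speaker heading at the start of a
# paragraph, e.g. "Mr. SMITH." or "Ms. JONES of Ohio." (an assumption, not
# the authors' actual rule set).
SPEAKER_RE = re.compile(
    r"^\s*(?:Mr|Mrs|Ms)\.\s+"
    r"([A-Z][A-Z'\-]+(?: [A-Z][A-Z'\-]+)*)"       # upper-case surname
    r"(?: of [A-Z][a-z]+(?: [A-Z][a-z]+)*)?"      # optional "of <State>"
    r"\.\s+"
)


def split_by_speaker(transcript):
    """Group the paragraphs of a cleaned transcript by speaker surname."""
    speeches = defaultdict(list)
    current = None
    for paragraph in transcript.split("\n\n"):
        # Collapse internal whitespace left over from line wrapping.
        text = " ".join(paragraph.split())
        if not text:
            continue
        match = SPEAKER_RE.match(text)
        if match:
            # A new heading starts a new turn; drop the heading itself.
            current = match.group(1)
            text = text[match.end():]
        # Paragraphs without a heading continue the current speaker's turn.
        if current is not None and text:
            speeches[current].append(text)
    return dict(speeches)


sample = (
    "Mr. SMITH. I rise today in support of the bill.\n\n"
    "It is a good bill.\n\n"
    "Ms. JONES of Ohio. I thank the gentleman.\n\n"
    "Mr. SMITH. I yield back."
)
by_speaker = split_by_speaker(sample)
```

Keying on surname alone is a simplification. A dataset like the one described would need to disambiguate members who share a surname, for instance by joining on the speaker metadata (state, chamber, years active).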

Published

2025-06-07

How to Cite

Bochenek, J. L., & Williams, J. R. (2025). You Have the Floor: A Speaker-Aligned Corpus Derived from the Congressional Record. Proceedings of the International AAAI Conference on Web and Social Media, 19(1), 2385–2395. https://doi.org/10.1609/icwsm.v19i1.35941