You Have the Floor: A Speaker-Aligned Corpus Derived from the Congressional Record
DOI:
https://doi.org/10.1609/icwsm.v19i1.35941
Abstract
The United States Congressional Record serves as a comprehensive archive of legislative discourse, yet its sheer volume and unstructured format pose significant challenges for researchers interested in analyzing political language, speaker behavior, and ideological framing. This paper presents a new dataset that organizes Congressional speeches by individual speaker. The full text of the Congressional Record was retrieved using a combination of the Congress.gov Application Programming Interface (API) and web-scraping techniques. After extracting roll-call votes and standardizing the transcripts to remove noisy artifacts and normalize formatting, each speaker's speeches are separated into individual files and annotated with metadata including name, political affiliation, years active, district or state represented, and professional social media accounts, if known. This enables fine-grained analysis of rhetorical patterns and linguistic strategies across political groups and time periods. By making the dataset publicly available, we aim to support interdisciplinary research that uses natural language processing (NLP).
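The speaker-alignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the turn-marker pattern (a title followed by an upper-case surname and a period, as in "Mr. SMITH."), the helper names `normalize` and `split_by_speaker`, and the sample text are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed turn marker: a title, an upper-case surname, then a period,
# e.g. "Mr. SMITH. I rise today ...". Real transcripts may need more cases.
TURN_RE = re.compile(r"^\s*(?:Mr|Ms|Mrs)\.\s+([A-Z][A-Z'-]+)\.\s*(.*)$")


def normalize(line):
    """Collapse runs of whitespace and trim the line."""
    return re.sub(r"\s+", " ", line).strip()


def split_by_speaker(transcript):
    """Group transcript text by speaker surname (hypothetical helper)."""
    speeches = defaultdict(list)
    current = None  # surname of the speaker who currently has the floor
    for raw in transcript.splitlines():
        line = normalize(raw)
        if not line:
            continue
        match = TURN_RE.match(line)
        if match:
            # A new turn starts: switch speaker, keep the rest of the line.
            current = match.group(1)
            line = match.group(2)
        if current is not None and line:
            speeches[current].append(line)
    return {name: " ".join(parts) for name, parts in speeches.items()}


# Invented sample text, used only to exercise the sketch.
sample = (
    "  Mr. SMITH.   I rise today.\n"
    "  I thank the Chair.\n"
    "\n"
    "  Ms. JONES. I yield.\n"
    "  Mr. SMITH. Thank you.\n"
)
by_speaker = split_by_speaker(sample)
```

Each value in `by_speaker` could then be written to its own file and paired with a metadata record (name, affiliation, years active, and so on); how the published dataset stores these is not specified here.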
Published
2025-06-07
How to Cite
Bochenek, J. L., & Williams, J. R. (2025). You Have the Floor: A Speaker-Aligned Corpus Derived from the Congressional Record. Proceedings of the International AAAI Conference on Web and Social Media, 19(1), 2385–2395. https://doi.org/10.1609/icwsm.v19i1.35941
Section
Dataset Papers