Author Homepage Discovery in CiteSeerX


  • Krutarth Patel Kansas State University
  • Cornelia Caragea University of Illinois at Chicago
  • Doina Caragea Kansas State University
  • C. Lee Giles Pennsylvania State University



Keywords: Researcher Homepage Classification, Researcher Homepage Discovery, Search-then-classify Framework, Digital Libraries, Deep Learning


Scholarly digital libraries provide access to scientific publications and are valuable resources for researchers. CiteSeerX is one such digital library search engine, providing access to more than 10 million academic documents. We propose a novel search-driven approach to build and maintain a large collection of homepages that can be used as seed URLs in any digital library, including CiteSeerX, to crawl scientific documents. Specifically, we integrate Web search and classification in a unified approach to discover new homepages: first, we use publicly available author names and research paper titles as queries to a Web search engine to find relevant content; then, we identify the correct homepages among the search results using a deep learning classifier based on Convolutional Neural Networks. Moreover, we use self-training to reduce the labeling effort and to exploit unlabeled data when training the researcher homepage classifier. Our experiments on a large-scale dataset highlight the effectiveness of our approach and position Web search as an effective method for acquiring authors' homepages. We also describe the development and deployment of the proposed approach in CiteSeerX, along with its maintenance requirements.
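To make the search-then-classify idea and the self-training step concrete, the following is a minimal Python sketch, not the authors' implementation. Everything named here is an assumption for illustration: the query format in `build_queries`, the injected `search` and `score` callables (standing in for a Web search engine and the CNN classifier), the `fit` callable, and the threshold values. The abstract does not specify any of these details.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# a search backend maps a query string to candidate URLs, and a scorer
# maps an item to an estimated probability that it is a homepage.
SearchFn = Callable[[str], Sequence[str]]
ScoreFn = Callable[[str], float]
Labeled = List[Tuple[str, int]]


def build_queries(author: str, titles: Iterable[str]) -> List[str]:
    """Assumed query scheme: the author name alone, then name plus each title."""
    return [author] + [f"{author} {title}" for title in titles]


def discover_homepages(author: str, titles: Iterable[str], search: SearchFn,
                       score: ScoreFn, threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Search, then classify: keep unique candidate URLs scored >= threshold."""
    seen = set()
    kept = []
    for query in build_queries(author, titles):
        for url in search(query):
            if url in seen:
                continue  # score each candidate URL only once
            seen.add(url)
            p = score(url)
            if p >= threshold:
                kept.append((url, p))
    # Highest-scoring candidates first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def self_train(labeled: Labeled, unlabeled: Iterable[str],
               fit: Callable[[Labeled], ScoreFn],
               rounds: int = 3, confidence: float = 0.9) -> Tuple[ScoreFn, Labeled]:
    """Generic self-training loop: fit, pseudo-label confident items, repeat."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        score = fit(labeled)
        confident, rest = [], []
        for x in pool:
            p = score(x)
            if p >= confidence:
                confident.append((x, 1))   # confident positive
            elif p <= 1 - confidence:
                confident.append((x, 0))   # confident negative
            else:
                rest.append(x)             # stays unlabeled
        if not confident:
            break  # nothing new to learn from
        labeled.extend(confident)
        pool = rest
    return fit(labeled), labeled


# Toy stand-ins so the sketch runs without network access or a trained model.
def toy_search(query: str) -> Sequence[str]:
    index = {
        "Ada Lovelace": ["http://example.edu/~ada", "http://news.example.com/ada"],
        "Ada Lovelace Notes on the Analytical Engine":
            ["http://example.edu/~ada", "http://example.org/paper.pdf"],
    }
    return index.get(query, [])


def toy_score(url: str) -> float:
    return 0.9 if "~" in url else 0.1


result = discover_homepages("Ada Lovelace", ["Notes on the Analytical Engine"],
                            toy_search, toy_score)
print(result)  # [('http://example.edu/~ada', 0.9)]
```

Passing the search backend and the scorer in as callables keeps the pipeline independent of any particular search API or model, so a real classifier could replace `toy_score` without changing the control flow.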




How to Cite

Patel, K., Caragea, C., Caragea, D., & Giles, C. L. (2021). Author Homepage Discovery in CiteSeerX. Proceedings of the AAAI Conference on Artificial Intelligence, 35(17), 15146-15155.



IAAI Technical Track on Highly Innovative Applications of AI