TY  - JOUR
AU  - Patel, Krutarth
AU  - Caragea, Cornelia
AU  - Caragea, Doina
AU  - Giles, C. Lee
PY  - 2021/05/18
Y2  - 2024/03/28
TI  - Author Homepage Discovery in CiteSeerX
JF  - Proceedings of the AAAI Conference on Artificial Intelligence
JA  - AAAI
VL  - 35
IS  - 17
SE  - IAAI Technical Track on Highly Innovative Applications of AI
DO  - 10.1609/aaai.v35i17.17778
UR  - https://ojs.aaai.org/index.php/AAAI/article/view/17778
SP  - 15146-15155
AB  - Scholarly digital libraries provide access to scientific publications and comprise useful resources for researchers. CiteSeerX is one such digital library search engine that provides access to more than 10 million academic documents. We propose a novel search-driven approach to build and maintain a large collection of homepages that can be used as seed URLs in any digital library including CiteSeerX to crawl scientific documents. Precisely, we integrate Web search and classification in a unified approach to discover new homepages: first, we use publicly-available author names and research paper titles as queries to a Web search engine to find relevant content, and then we identify the correct homepages from the search results using a powerful deep learning classifier based on Convolutional Neural Networks. Moreover, we use Self-Training in order to reduce the labeling effort and to utilize the unlabeled data to train the efficient researcher homepage classifier. Our experiments on a large scale dataset highlight the effectiveness of our approach, and position Web search as an effective method for acquiring authors' homepages. We show the development and deployment of the proposed approach in CiteSeerX and the maintenance requirements.
ER  - 