Data Labeling for Machine Learning Engineers: Project-Based Curriculum and Data-Centric Competitions


  • Anastasia Zhdanovskaya, Toloka
  • Daria Baidakova, Toloka
  • Dmitry Ustalov, Toloka



Education, Data Labeling, Data Collection, Machine Learning, Data-centric AI, Project-based Learning, Career Development


Training and evaluating machine learning (ML) models relies on high-quality, timely annotated datasets. While a significant portion of academic and industrial research is focused on creating new ML methods, these communities rely on open datasets and benchmarks. However, practitioners often find that data specific to their domain is unlabeled or unavailable. We believe that building scalable and sustainable processes for collecting high-quality data for ML is a complex skill that needs focused development. To fill the need for this competency, we created a semester course on Data Collection and Labeling for Machine Learning, integrated into a bachelor's program that trains data analysts and ML engineers. The course design and delivery illustrate how to overcome the challenge of putting university students with a theoretical background in mathematics, computer science, and physics through a program that differs substantially from their educational habits. Our goal was to motivate students to focus on practicing and mastering a skill that was considered unnecessary for their work. We created a system of inverse ML competitions that showed the students how high-quality and relevant data affect their work with ML models, and by the end their mindset had changed completely. Project-based learning with increasing complexity of conditions at each stage helped to raise the satisfaction index of students accustomed to difficult challenges. During the course, our invited industry practitioners drew on their first-hand experience with data, which helped us avoid overtheorizing and made the course highly applicable to the students' future career paths.




How to Cite

Zhdanovskaya, A., Baidakova, D., & Ustalov, D. (2023). Data Labeling for Machine Learning Engineers: Project-Based Curriculum and Data-Centric Competitions. Proceedings of the AAAI Conference on Artificial Intelligence, 37(13), 15886-15893.