Education Sciences (Apr 2025)

Messy Data in Education: Enhancing Data Science Literacy Through Real-World Datasets in a Master’s Program

  • Iraklis Varlamis

DOI
https://doi.org/10.3390/educsci15040500
Journal volume & issue
Vol. 15, no. 4
p. 500

Abstract

The increasing importance of data science highlights the need to prepare students for the complexities of real-world data. This paper presents insights and findings from 15 years of teaching Data Mining and Business Intelligence in a Computer Science Master’s program, where a key component of the course is a semester-long assignment involving publicly available, messy, and often incomplete datasets. Examples include open data on accidents or fines from data.gov.uk, datasets from data contest platforms such as Kaggle, and house rental data from platforms such as Airbnb. In these assignments, students must not only apply algorithmic tools but also address challenges such as missing information, noisy inputs, and inconsistencies. They also learn the importance of finding and integrating supplementary open data sources to enhance the value and depth of their analyses. The primary objective is to strengthen students’ problem-solving abilities by engaging them in complex, real-world data scenarios in which they must resolve issues of data quality and completeness. The approach cultivates critical skills such as data wrangling, preprocessing, and the extraction of meaningful insights, along with the ability to understand and articulate the business value of the data. Working hypotheses, such as the impact of data quality on analysis outcomes, are explored, and the paper demonstrates how addressing these challenges improves students’ decision-making in data-driven tasks. By engaging with real-world datasets, students develop the resilience and adaptability that are essential for navigating the complexities of data science in professional settings.
This paper highlights the educational benefits of using messy data to bridge the gap between theoretical knowledge and real-world application while also demonstrating how this method explicitly improves students’ problem-solving and critical thinking skills in the context of data science.

Keywords