Transparent Data Preprocessing for Machine Learning

Data preprocessing plays a crucial role in most data science applications. It covers the cleaning, transformation and general manipulation of datasets to bring the data into a format suitable for downstream machine learning models. Preprocessing for machine learning is often iterative, and the right steps are usually found through trial and error. In this project, we aim to support data scientists and domain experts in building their preprocessing pipelines. To this end, we develop tools that make the changes introduced by each individual step explicit, which helps users understand how preprocessing alters the data. We also plan to develop approaches that visualize how different preprocessing configurations affect the model trained on the processed data.

Research Challenges

The envisioned system extracts metadata from the preprocessing pipeline, such as the operators used and the data processed in the pipeline. The overhead added by the metadata extraction component has to be kept minimal. In addition, to make the transparency system easy to adopt, it should require as little manual effort as possible. We keep both requirements in mind when developing and evaluating approaches for metadata extraction.
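To illustrate what low-effort metadata extraction could look like, the following is a minimal sketch in plain Python. It is not the project's implementation: the names `traced` and `METADATA_LOG`, the two example operators, and the representation of a dataset as a list of dictionaries are all assumptions made for this example. A decorator records the operator name, row counts, column names and operator runtime for each call, so the only manual effort is adding one line per operator.

```python
import time
from functools import wraps

# Hypothetical in-memory metadata store (an assumption for this sketch).
METADATA_LOG = []


def traced(operator):
    """Wrap a preprocessing operator so that each call records metadata."""
    @wraps(operator)
    def wrapper(rows, *args, **kwargs):
        start = time.perf_counter()
        result = operator(rows, *args, **kwargs)
        # Measure the operator itself, before any metadata is computed.
        elapsed = time.perf_counter() - start
        METADATA_LOG.append({
            "operator": operator.__name__,
            "rows_in": len(rows),
            "rows_out": len(result),
            "columns_in": sorted({key for row in rows for key in row}),
            "columns_out": sorted({key for row in result for key in row}),
            "seconds": elapsed,
        })
        return result
    return wrapper


@traced
def drop_missing(rows, column):
    """Remove rows in which the given column is missing."""
    return [row for row in rows if row.get(column) is not None]


@traced
def add_age_in_months(rows):
    """Derive a new column from an existing one."""
    return [{**row, "age_months": row["age"] * 12} for row in rows]


data = [{"age": 31, "city": "A"}, {"age": None, "city": "B"}]
cleaned = add_age_in_months(drop_missing(data, "age"))

for entry in METADATA_LOG:
    print(entry["operator"], entry["rows_in"], "->", entry["rows_out"])
```

In this sketch the recorded metadata is deliberately cheap to compute (counts and column names only); collecting richer information, such as value distributions, would increase the overhead that the requirement above aims to limit.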
To give insight into the effects of different preprocessing operators on the data, we aim to construct abstractions that clearly show users what kind of changes each operator makes.
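One possible form of such an abstraction is a compact change summary that compares the data before and after an operator. The sketch below is an illustration only, not the project's design: the function `summarize_changes`, its output fields, and the list-of-dictionaries data representation are assumptions chosen for this example.

```python
def summarize_changes(before, after):
    """Describe how a preprocessing step changed a dataset.

    Both arguments are lists of dictionaries (one dictionary per row).
    """
    cols_before = {key for row in before for key in row}
    cols_after = {key for row in after for key in row}
    summary = {
        "rows_before": len(before),
        "rows_after": len(after),
        "columns_added": sorted(cols_after - cols_before),
        "columns_removed": sorted(cols_before - cols_after),
        "values_changed": {},
    }
    # A cell-level comparison is only attempted when the row count is
    # unchanged, assuming rows then still align by position.
    if len(before) == len(after):
        for col in sorted(cols_before & cols_after):
            changed = sum(
                1 for old, new in zip(before, after)
                if old.get(col) != new.get(col)
            )
            if changed:
                summary["values_changed"][col] = changed
    return summary


before = [{"income": 1000, "label": "a"}, {"income": None, "label": "b"}]
# Example step: the missing income value is imputed with 0.
after = [{"income": 1000, "label": "a"}, {"income": 0, "label": "b"}]

report = summarize_changes(before, after)
print(report)
```

A summary of this kind separates structural changes (rows and columns added or removed) from value changes, which is one way to make the effect of a single step explicit without showing the full dataset.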

Resources

Git-Repository: https://github.com/bastistrasser/hawk

Publication: Sebastian Strasser, Meike Klettke: Transparent Data Preprocessing for Machine Learning, HILDA@SIGMOD, 2024 (DOI)

Chair of Data Engineering