Publications

Automating Data Quality Validation for Dynamic Data Ingestion

Published in the 24th International Conference on Extending Database Technology (EDBT), 2021

Data quality validation is a crucial step in modern data-driven applications. Errors in the data lead to unexpected behavior of production pipelines and downstream services, such as deployed ML models or search engines. Typically, unforeseen data quality issues are handled via manual and tedious debugging processes in a reactive manner. The problem becomes more challenging in scenarios where large growing datasets have to be periodically ingested into non-relational stores such as data lakes. This is even worse when the characteristics of the data change over time, and domain expertise to define data quality constraints is lacking. We propose a data-centric approach to automate data quality validation in such scenarios. In contrast to existing solutions, our approach does not require domain experts to define rules and constraints or provide labeled examples, and self-adapts to temporal changes in the data characteristics. We compute a set of descriptive statistics of new data batches to ingest, and use a machine learning-based novelty detection method to monitor data quality and identify deviations from commonly observed data characteristics. We evaluate our approach against several baselines on five real-world datasets, on both real and synthetically generated errors. We show that our approach detects unspecified errors in many cases, outperforms other automated solutions in terms of predictive performance, and reaches the quality of baselines that are hand-tuned using domain expertise.
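The pipeline described in the abstract (profile each incoming batch with descriptive statistics, then flag batches whose profile deviates from previously accepted ones) can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation. The chosen statistics, the nearest-neighbour distance rule, and the names `batch_profile`, `is_novel`, `k`, and `slack` are all assumptions made for the example.

```python
import math
import statistics


def batch_profile(values):
    """Descriptive statistics of one numeric column in a batch (None = missing).

    The three statistics used here are an illustrative choice.
    """
    present = [v for v in values if v is not None]
    completeness = len(present) / len(values)
    return (completeness, statistics.fmean(present), statistics.pstdev(present))


def _scale(history):
    """Per-statistic (min, range) over the accepted history, for min-max scaling."""
    columns = list(zip(*history))
    return [(min(c), (max(c) - min(c)) or 1.0) for c in columns]


def _normalize(profile, scale):
    return [(x - lo) / span for x, (lo, span) in zip(profile, scale)]


def is_novel(history, candidate, k=2, slack=1.5):
    """Flag `candidate` if it lies farther from the accepted profiles than those
    profiles lie from each other (hypothetical k-nearest-neighbour rule)."""
    scale = _scale(history)
    points = [_normalize(p, scale) for p in history]
    cand = _normalize(candidate, scale)

    def knn_distance(point, others):
        distances = sorted(math.dist(point, o) for o in others)
        return statistics.fmean(distances[:k])

    # Largest leave-one-out distance seen among the accepted batches.
    baseline = max(
        knn_distance(p, points[:i] + points[i + 1:]) for i, p in enumerate(points)
    )
    return knn_distance(cand, points) > slack * baseline
```

In this sketch, an accepted batch would be appended to `history`, so the notion of "commonly observed" moves along with the data. No rules or labeled examples are involved, only the profiles of earlier batches.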

Recommended citation: S. Redyuk, Z. Kaoudi, V. Markl, S. Schelter (2021) Automating Data Quality Validation for Dynamic Data Ingestion. EDBT’21, Nicosia, Cyprus

Towards Unsupervised Data Quality Validation on Dynamic Data

Published in the International Workshop on Explainability for Trustworthy ML Pipelines (ETMLP), 2020

Validating the quality of data is crucial for establishing the trustworthiness of data pipelines. State-of-the-art solutions for data validation and error detection require explicit domain expertise (e.g., in the form of rules or patterns) or manually labeled examples. In real-world applications, domain knowledge is often incomplete and data changes over time, which limits the applicability of existing solutions. We propose an unsupervised approach for detecting data quality degradation early and automatically. We will present the approach, its key assumptions, and preliminary results on public data to demonstrate how data quality can be monitored without manually curated rules and constraints.

Recommended citation: S. Redyuk, V. Markl, S. Schelter (2020) Towards Unsupervised Data Quality Validation on Dynamic Data. ETMLP’20, Copenhagen, Denmark

Learning to Validate the Predictions of Black Box Machine Learning Models on Unseen Data

Published in the 4th Workshop on Human-In-the-Loop Data Analytics (HILDA), 2019

When end users apply a machine learning (ML) model to new unlabeled data, it is difficult for them to decide whether they can trust its predictions. Errors or shifts in the target data can lead to hard-to-detect drops in the predictive quality of the model. We therefore propose an approach to assist non-ML experts working with pretrained ML models. Our approach estimates the change in prediction performance of a model on unseen target data. It does not require explicit distributional assumptions on the dataset shift between the training and target data. Instead, a domain expert can declaratively specify typical cases of dataset shift that she expects to observe in real-world data. Based on this information, we learn a performance predictor for pretrained black box models, which can be combined with the model and automatically warns end users in case of unexpected performance drops. We demonstrate the effectiveness of our approach on two models, logistic regression and a neural network, applied to several real-world datasets.
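The idea of learning a performance predictor from expert-specified shifts can be sketched as follows. It is a toy illustration under stated assumptions, not the method from the paper. The stand-in `black_box` model, the `zero_out` corruption, the use of mean prediction confidence as the only feature, and the least-squares line are all hypothetical choices.

```python
import math
import statistics


def black_box(x):
    """Stand-in for a pretrained model: returns P(y = 1 | x). Purely illustrative."""
    return 1.0 / (1.0 + math.exp(-3.0 * x))


def zero_out(xs, tenths):
    """Hypothetical expert-specified shift: overwrite tenths/10 of the values with 0."""
    return [0.0 if i % 10 < tenths else x for i, x in enumerate(xs)]


def mean_confidence(xs):
    """A statistic of the model's outputs that can be computed without labels."""
    return statistics.fmean([max(p, 1.0 - p) for p in map(black_box, xs)])


def accuracy(xs, ys):
    """Share of correct predictions; needs labels, so only usable on held-out data."""
    return statistics.fmean(
        [float((black_box(x) >= 0.5) == y) for x, y in zip(xs, ys)]
    )


def fit_performance_predictor(xs, ys, levels=range(6)):
    """Corrupt labeled held-out data at several intensities, then fit a
    least-squares line from mean confidence to observed accuracy."""
    conf, acc = [], []
    for tenths in levels:
        shifted = zero_out(xs, tenths)
        conf.append(mean_confidence(shifted))
        acc.append(accuracy(shifted, ys))
    mc, ma = statistics.fmean(conf), statistics.fmean(acc)
    slope = sum((c - mc) * (a - ma) for c, a in zip(conf, acc)) / sum(
        (c - mc) ** 2 for c in conf
    )
    intercept = ma - slope * mc
    # The returned function estimates accuracy on unlabeled target data.
    return lambda target_xs: intercept + slope * mean_confidence(target_xs)
```

The returned predictor takes only unlabeled target inputs, so a wrapper around the model could compare its estimate with the accuracy measured on held-out data and warn the user when the gap exceeds a chosen tolerance.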

Recommended citation: S. Redyuk, S. Schelter, T. Rukat, V. Markl, F. Biessmann (2019) Learning to Validate the Predictions of Black Box Machine Learning Models on Unseen Data. HILDA’19, Amsterdam, Netherlands

Automated Documentation of End-to-End Experiments in Data Science

Published in the 35th IEEE International Conference on Data Engineering (ICDE), 2019

Reproducibility plays a crucial role in experimentation. However, the modern research ecosystem and the underlying frameworks are constantly evolving, which makes it extremely difficult to reliably reproduce scientific artifacts such as data, algorithms, trained models, and visualizations. We therefore aim to design a novel system for assisting data scientists with rigorous end-to-end documentation of data-oriented experiments. Capturing data lineage, metadata, and other artifacts helps to reproduce and share experimental results. We summarize this challenge as automated documentation of data science experiments. We aim to reduce manual overhead for experimenting researchers and intend to create a novel approach to dataflow and metadata tracking based on the analysis of the experiment source code. The envisioned system will accelerate the research process in general and enable capturing fine-grained meta information by deriving a declarative representation of data science experiments.

Recommended citation: S. Redyuk (2019). Automated Documentation of End-to-End Experiments in Data Science. In Ph.D. Symposium track, IEEE 35th International Conference on Data Engineering (ICDE’19), Macau, China

Sergey Redyuk
