Abstract
Data validation is the process of checking the internal consistency, correctness and quality of a dataset. The role of data validation in the broader context of data quality and data cleansing is described. In particular, problems relating to syntactic and semantic errors are defined, and the concept of a validation model is introduced. The role of machine learning in building validation models is described, and a range of machine learning techniques is surveyed. A novel machine learning strategy that combines genetic algorithms and association rules to generate data validation models is proposed. An algorithm is developed to discover validation rules from numeric datasets and is implemented as a Java toolset called eaVal. A series of experiments using eaVal for data validation is carried out, showing that it can successfully discover validation rules that identify records within a dataset which have a high probability of containing errors. A method of post-processing the results from eaVal is proposed. This utilises Bayesian networks, derived directly from the validation rules discovered by eaVal, to identify which fields within an invalid record set have the highest probability of being invalid. Experimental evidence of the efficacy of the technique is presented, and the post-processing phase is shown to be a major step towards semantic data validation. A case study is also described that uses the tools and techniques developed in this work to perform a data validation exercise on a clinical dataset. The case study indicates that the methods developed can provide useful information to a data analyst when validating numerical datasets. Furthermore, it is shown that the discovery of validation rules is a useful mechanism for identifying records which are interesting or unusual. Finally, current limitations and future directions of this work are discussed.