Abstract
Data validation is the process of checking the internal consistency, correctness and quality of a dataset. The role of data validation in the broader context of data quality and data cleansing is described. In particular, problems relating to syntactic and semantic errors are defined, and the concept of a validation model is introduced. The role of machine learning in building validation models is described, and a range of machine learning techniques is surveyed. A novel machine learning strategy that combines genetic algorithms and association rules to generate data validation models is proposed. An algorithm is developed to discover validation rules from numeric datasets and is implemented as a Java toolset called eaVal. A series of experiments using eaVal for data validation is carried out, showing that it can successfully discover validation rules that identify records within a dataset which have a high probability of containing errors. A method of post-processing the results from eaVal is proposed. This utilises Bayesian networks, derived directly from the validation rules discovered by eaVal, to identify which fields within an invalid record set have the highest probability of being invalid. Experimental evidence of the efficacy of the technique is presented, and the post-processing phase is shown to be a major step towards semantic data validation. A case study is also described that uses the tools and techniques developed in this work to perform a data validation exercise on a clinical dataset. The case study indicates that the methods developed can provide useful information to a data analyst when validating numerical datasets. Furthermore, it is shown that the discovery of validation rules is a useful mechanism for identifying records which are interesting or unusual. Finally, current limitations and future directions of this work are discussed.