Poor data quality is a serious and costly problem affecting organizations across all industries. Real data is often dirty, containing missing, erroneous, incomplete, and duplicate values. It is estimated that poor data quality costs organizations between 15% and 25% of their operating budgets. Existing data cleaning solutions focus on identifying inconsistencies with prescribed data formats, under the assumption that the data remains relatively static. As modern applications move towards more dynamic search, analytics, and visualization, new data quality solutions that support dynamic data cleaning are needed. An increasing number of data analysis tools, such as Watson Analytics, provide flexible data browsing and querying abilities. To ensure reliable, trusted, and relevant data analysis, these tools require dynamic data cleaning solutions. In particular, current data quality tools fail to adapt to: (1) fast-changing data and data quality rules (for example, as new datasets are integrated); (2) new data governance rules that may be imposed for a particular industry; and (3) industry-specific terminology and concepts that can refine data quality recommendations for greater accuracy and relevance. In this project, we will develop a system for dynamic data cleaning that adapts to changing data and rules, and that considers industry-specific models for improved data quality.

Industry Partner(s): IBM Canada Ltd.

Academic Institution: McMaster University

Academic Researcher: Fei Chiang

Focus Areas: Cybersecurity, Digital Media

Platforms: Cloud