IBM RESEARCH SPOTLIGHT
A dynamic and scalable data cleaning system for Watson Analytics
Fei Chiang, McMaster University
SOSCIP, IBM Canada Ltd., OCE, NSERC
In today’s age of information, big data plays a crucial role in decision-making for many organizations, a role that will only grow in the coming years as organizations compete to stay ahead.
Big data yields insights that give organizations an edge, helping them make informed decisions that improve profit margins and better understand customers and employees.
Data, however, is often incomplete and rife with errors. That’s where Prof. Fei Chiang of McMaster University comes in. Her research team, which includes developers, software architects, students and statisticians from IBM labs in Ottawa, Chicago and Germany, works on improving data quality to achieve trusted and accurate results from data analysis tasks.
The project is part of the Smart Computing R&D Challenge, a $7.5 million initiative between OCE, NSERC and SOSCIP.
“Poor data quality costs businesses and organizations millions of dollars a year in operational inefficiencies, poor decision-making and wasted time,” she explained.
“Real data often contains missing, duplicate, inconsistent and empty values. The currency and timeliness of data is important because many decisions need to be made with the most recent data.”
Prof. Chiang is working with IBM to improve data quality metrics in Watson Analytics, IBM’s cloud-based data analytics platform. The first step is to build a set of detailed quality metrics that provide in-depth information on the data quality problems. The metrics are aggregated to provide customized data quality scores to users based on their data analysis task.
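The aggregation step described above might be sketched as follows. This is an illustrative example, not IBM’s implementation: the metric names (completeness, consistency, uniqueness, currency) and the task-specific weights are assumptions chosen to show how per-metric scores could be combined into a single customized quality score.

```python
def quality_score(metrics, weights):
    """Combine per-metric quality scores (each in [0, 1]) into one
    score using a weighted average. Higher weights emphasize the
    metrics that matter most for the user's analysis task."""
    total = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total


# Hypothetical per-metric scores computed for a dataset.
metrics = {
    "completeness": 0.92,  # fraction of non-missing values
    "consistency": 0.85,   # fraction of rows passing integrity rules
    "uniqueness": 0.97,    # fraction of rows that are not duplicates
    "currency": 0.70,      # how up-to-date the records are
}

# A time-sensitive reporting task might weight currency heavily;
# these weights are purely illustrative.
reporting_weights = {"completeness": 2, "consistency": 1,
                     "uniqueness": 1, "currency": 3}

score = quality_score(metrics, reporting_weights)
print(f"Task-customized quality score: {score:.2f}")
```

A different task (say, duplicate-sensitive customer matching) would supply different weights over the same underlying metrics, yielding a different score from the same dataset.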
The aim is to provide organizations with a means of “cleaning” their data and streamlining that process from months to days.
The project also provides valuable hands-on training for data scientists, filling a significant gap in the Canadian market. One such trainee is Yu Huang, a PhD student at McMaster and the lead developer designing and testing algorithms.
“I’m responsible for developing new data quality metrics that help users better understand and correct data quality problems, and exploring how our techniques can be incorporated with IBM software tools,” he explained.
He’s hoping to apply those skills to meet the future needs of the private sector.
“I’ve learned how to integrate state-of-the-art research with the needs of industry,” he explained.