McMaster researcher to “clean” big data

Researcher: Fei Chiang, McMaster University
Project title: A Dynamic and Scalable Data Cleaning System for Watson Analytics
Industry Partner: IBM
Supported by: IBM Canada Ltd., SOSCIP, OCE, NSERC


In today’s age of information, big data plays a crucial role in the decision-making process for many organizations, and that role is likely to grow in the years to come as organizations struggle to remain competitive. According to a survey released by PwC Canada in 2016, at least 33% of organizations already define themselves as “highly data-driven.”[i]

Big data can yield insights that give an organization an edge in the global marketplace. It can inform decisions that improve profit margins, help leaders better understand patients, customers and employees, and streamline inefficient and costly processes.

Data, however, can also be rife with errors, duplications, inconsistencies and incomplete information. That’s where Fei Chiang and her team come in. Chiang is an assistant professor in the Department of Computing and Software in the Faculty of Engineering at McMaster University. Her SOSCIP project, which aims to improve data quality so that data analysis tasks produce trusted and accurate results, is funded through the OCE-SOSCIP Smart Computing R&D Challenge, a collaboration between OCE, NSERC and SOSCIP that provides $7.5M in funding to SOSCIP projects.

“Poor data quality costs businesses and organizations millions of dollars a year in operational inefficiencies, poor decision-making and wasted time,” she explained.

“Real data often contains missing, duplicate, inconsistent and empty values. The currency and timeliness of data is also important as many decisions need to be made with the most recent data.”

Chiang is working with IBM to improve the data quality metrics in Watson Analytics, IBM’s cloud-based data analytics platform. The first step is to build a set of detailed quality metrics that provide in-depth information on the data quality problems. The metrics are aggregated to provide customized data quality scores to users based on their data analysis task.
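To make the idea of aggregated quality metrics concrete, here is a minimal sketch in Python. It is an illustration only, not the metrics or the API used in Watson Analytics: the two metrics (completeness and uniqueness), the function names, the sample records and the task weights are all assumptions chosen for the example.

```python
from typing import Any, Mapping, Sequence

# Values treated as "missing" in this sketch (an assumption for illustration).
MISSING = (None, "")

Row = Mapping[str, Any]


def completeness(rows: Sequence[Row]) -> float:
    """Fraction of cells that hold a non-missing value."""
    cells = [value for row in rows for value in row.values()]
    if not cells:
        return 1.0
    return sum(value not in MISSING for value in cells) / len(cells)


def uniqueness(rows: Sequence[Row]) -> float:
    """Fraction of rows that remain after exact duplicates are collapsed."""
    if not rows:
        return 1.0
    distinct = {tuple(sorted(row.items())) for row in rows}
    return len(distinct) / len(rows)


def quality_score(metrics: Mapping[str, float],
                  weights: Mapping[str, float]) -> float:
    """Weighted average of the metrics, with weights chosen per analysis task."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(metrics[name] * w for name, w in weights.items()) / total


# Hypothetical records: one missing city, one exact duplicate row.
rows = [
    {"id": 1, "city": "Hamilton"},
    {"id": 2, "city": ""},
    {"id": 1, "city": "Hamilton"},
    {"id": 3, "city": "Toronto"},
]

metrics = {
    "completeness": completeness(rows),  # 7 of 8 cells filled
    "uniqueness": uniqueness(rows),      # 3 distinct rows out of 4
}

# A task that is sensitive to missing values weights completeness more heavily.
task_weights = {"completeness": 3.0, "uniqueness": 1.0}
print(metrics, quality_score(metrics, task_weights))
```

Changing the weights changes the score for the same data, which is the sense in which a score can be customized to the user's analysis task.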

Once the research is complete, the aim is to give organizations a means of “cleaning” their data, a process that can currently take weeks to months. Chiang’s research could trim that time down to days, saving organizations money and the vital time needed to make big decisions in a fast-paced marketplace.

The research also gives students valuable hands-on training as data scientists, who are in short supply in the Canadian market.

One of those students is Yu Huang, a PhD student at McMaster. Huang is the lead developer, responsible for designing and testing algorithms.

“I’m responsible for developing new data quality metrics that can help users better understand and correct their data quality problems, and exploring how our techniques can be incorporated with IBM software tools,” he explained.

Huang’s aspirations to become a data scientist are closer to fruition thanks to the strong technical and problem-solving skills he’s gained from the project. He’s hoping to apply those skills to meet the future needs of the private sector.

“I learned how to integrate state-of-the-art research with the needs of industry,” he explained.

The full team includes developers, software architects and statisticians from IBM labs in Ottawa, Chicago and Germany.

Chiang grew up in Toronto and studied computer science at the University of Toronto and the University of Waterloo.

In her spare time, she enjoys spending time with her family, travelling, and playing competitive tennis.

SOSCIP is a research and development consortium that pairs academic and industry researchers with advanced computing tools to fuel Canadian innovation. SOSCIP supports projects that have the potential to have a considerable impact on the lives of Canadians, within areas such as water, cities, health and cybersecurity. The consortium includes 15 of Ontario’s most research-intensive academic institutions as well as Ontario Centres of Excellence and the IBM Canada Research and Development Centre.