
Machine Learning Data Catalogs

By Cassandra Balentine

Data is arguably a company’s most important asset. It is essential that every business finds the right mix of solutions to help manage that data so it is both safe and accessible.

Machine learning data catalogs (MLDC) address the data management and governance challenge. According to Forrester Research, MLDCs are defined as machine learning (ML)-powered metadata catalogs that maintain traits of data within a data fabric for activation within systems of insights.

G2 Crowd, in its assessment of the best MLDC software, points out that these solutions allow companies to categorize, access, interpret, and collaborate around company data across multiple data sources, while maintaining a high level of governance and access management. The firm notes that artificial intelligence (AI) is critical to many features of ML data catalogs, enabling functionality such as ML recommendations, natural language querying, and dynamic data masking for enhanced security.

Companies use MLDCs to make data discovery easier for business users and to let IT enact controls that ensure data security.

The Role of MLDC
It is clear that the importance of data within any organization continues to grow—along with the amount of data that must be managed and accessed.

“MLDCs provide data intelligence and insight capabilities with metadata. They eliminate the inefficiencies, delays, and inaccuracies of manual metadata management and governance processes,” says Manish Sood, CEO, Reltio. MLDCs allow companies to use ML to profile, categorize, and collaboratively maintain data assets while providing necessary governance and access control.

A data catalog is primarily designed to replace manual data tagging with automatic tagging, which shortens the time it takes to get data identified, governed, and made accessible for analytics. “Without a data catalog, the only other way to show what data the organization has to ensure it’s properly governed for the General Data Protection Regulation, HIPAA, and other regulations is to catalog it manually, an impossible undertaking for today’s petabyte enterprises and one reason why so many organizations have yet to become data driven,” says Alex Gorelik, founder/CTO, Waterline Data.
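To make the idea of automatic tagging concrete, here is a minimal sketch in Python. It is not how any vendor quoted in this article implements tagging: commercial catalogs rely on ML models, while this example substitutes a few hand-written regular expressions. The pattern set, the function names `tag_column` and `tag_table`, and the 80 percent threshold are all assumptions made for illustration.

```python
import re

# Hypothetical patterns for illustration only; a real catalog would
# learn or tune its classifiers rather than hard-code them.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_phone": re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def tag_column(values, threshold=0.8):
    """Return tags whose pattern matches at least `threshold` of the non-empty values."""
    samples = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not samples:
        return []
    tags = []
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for s in samples if pattern.match(s))
        if hits / len(samples) >= threshold:
            tags.append(tag)
    return tags


def tag_table(table):
    """Map each column name to the tags inferred from its sample values."""
    return {name: tag_column(values) for name, values in table.items()}


if __name__ == "__main__":
    # Made-up sample data standing in for a scanned source table.
    customers = {
        "contact": ["ann@example.com", "bob@example.org"],
        "tax_id": ["123-45-6789", "987-65-4321"],
        "notes": ["call back Monday", "n/a"],
    }
    print(tag_table(customers))
```

A column tagged this way could then be routed to the appropriate governance policy, for example masking anything labeled as a Social Security number.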

Many experts say a big part of achieving the cultural shift necessary for digital transformation is to ensure there’s a central source of high-quality data that the entire company has access to. “This is a tall order. For one, you’ve got mountains of siloed data scattered across the organization. Most of this data hasn’t been classified, so it can’t be governed or be made searchable, much less put to use. Meanwhile, more data is pouring in by the minute,” comments Gorelik.

An MLDC is a curated and organized collection of data assets in which users can efficiently find assets and assess how well they fit their needs. “Offering visibility to the schema/metadata through tracking and search is just the tip of the catalog. An MLDC must provide the capability to mine the data for discovery of relationships between assets and enforce data quality and maturity of assets,” says Emily Washington, senior VP, product management, Infogix, Inc.

“Simply knowing what data is available, and understanding that data, has become a huge challenge for enterprises,” admits Dharma Kuthanur, senior director, product marketing for enterprise data catalog, Informatica. An MLDC lets enterprises classify and organize data assets across cloud, on-premises, and big data environments, so users can easily find and understand relevant data for their business needs. “It is a well-worn cliché to state that analysts and data scientists spend 80 percent of their time searching for the right data and only 20 percent doing actual analysis.”

AI-powered data catalogs can flip this around by delivering intelligence and automation in the form of recommendations, data similarity detection, automatic identification of data domains, automatic business term associations, and a holistic view of data relationships. “Because of this, data catalogs are emerging as a foundational capability to drive key business initiatives like self-service analytics, data governance, and IT/cloud modernization,” says Kuthanur.
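Data similarity detection, one of the capabilities listed above, can be illustrated with a simple set-overlap measure. The sketch below uses Jaccard similarity on column values as a stand-in; the article does not say which technique any particular product uses, and the function names, the column names, and the 0.5 threshold are assumptions chosen for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two value collections: |A & B| / |A | B|."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # Treat two empty columns as dissimilar rather than dividing by zero.
    return len(set_a & set_b) / len(set_a | set_b)


def similar_columns(columns, threshold=0.5):
    """Return (name_1, name_2, score) for column pairs scoring at or above the threshold."""
    names = sorted(columns)
    pairs = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            score = jaccard(columns[left], columns[right])
            if score >= threshold:
                pairs.append((left, right, score))
    # Highest-scoring pairs first, so the strongest candidates surface at the top.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


if __name__ == "__main__":
    # Made-up columns from three hypothetical source systems.
    columns = {
        "crm.state": ["NY", "CA", "TX"],
        "billing.region": ["NY", "CA", "TX", "WA"],
        "hr.department": ["Sales", "IT"],
    }
    print(similar_columns(columns))
```

Comparing every pair of columns grows quadratically with the number of columns, so a catalog operating at enterprise scale would need sampling or sketching techniques rather than this exhaustive loop.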

Andy Sheldon, VP marketing, Unifi Software, adds that an MLDC can help an organization become more data literate by enabling business users to find the data they need to answer questions or test hypotheses.

Driving Growth
A variety of factors drive market demand for MLDCs. These include an increasing volume of data, data privacy and regulatory demands, and a change in the way businesses view and utilize data.

“In a digital economy when many enterprises are going through various transformation initiatives, managing data as a strategic asset is becoming critical,” shares Sood. The need for reliable data, whether to deliver personalized, connected customer experiences or to meet new compliance regulations, pushes enterprises to adopt a holistic data strategy and leverage technologies such as ML to improve data quality and governance. “Standalone data catalogs that store metadata across systems may not fully serve today’s enterprise needs. Companies are thinking holistically and strategically about how the actual data (master data, reference data, interactions and relationships) ties with metadata. ML-powered modern data management platforms provide a more comprehensive solution to meet the needs of digital transformation.”

Gorelik also sees increased demand for data catalogs driven by compliance concerns. He says demand has soared as organizations move to bring their data assets into compliance with regulations and strengthen data-driven decision making.

Sheldon says the rapid creation of a searchable and valuable source of data discovery is a key driver for MLDC adoption. “The modern enterprise is awash with data and there’s more on the way every day. At the same time, many more business users require access to some or all of that data in order to do their job. The perennial problem of IT being stuck in the middle, handling interrupt-driven requests for data, simply doesn’t scale operationally. Therefore, you need a way for business users at all levels to find data and get answers to their questions through natural language queries from all that data.”

Kuthanur points to three drivers. First, there is an explosion in the volume, variety, and velocity of data. “It is big and fast data from an increasing and ever-changing set of data sources,” he describes. Second, there is business pressure from a growing set of different user types to leverage this data to address a plethora of business needs and digital transformation priorities. “They all want quick access to relevant data in a self-serve model, and they need to have trust and confidence in the data, which requires data governance,” he offers. Finally, there is growing scrutiny on data privacy and protection and the need to comply with regulations around that. “As enterprises embark on their digital transformation priorities, they realize that good data underpins all of these priorities. The ability to easily discover and understand the data you need is emerging as a foundational capability to address all of these disparate business requirements.”

When speaking about MLDCs, Washington typically sees two primary enterprise use cases that drive the need: implementing a governance framework and managing analytics, KPIs, and metrics. The first is typically implemented by an enterprise with an existing siloed data governance initiative based on unstructured files and multiple tools. The latter is typically implemented by an enterprise that is under regulatory pressure to report financials accurately but has no means to establish a standardized book of KPIs and metrics or to perform end-to-end data quality checks.

Benefits of MLDC
MLDCs provide an efficient way to manage, monitor, and improve the use of enterprise data assets. “Use cases range from compliance and data profiling to ease of search and reporting, as well as enabling processes for continuous improvement through workflow and collaborative data curation,” says Sood.

Gorelik says data catalogs provide insight into the organization’s data assets for greater visibility, quality, and control. “This ensures data is properly governed for compliance as well as making sure only the right people have access,” he shares. “By automating the processes that identify data, classify data, govern data, and provide the right access of data to the right people, data catalogs provide all kinds of benefits like fast, self-service analytics, rooting out redundant data, and ensuring higher quality data. All of this in turn supports stronger, data-driven decision-making throughout the enterprise.”

When implemented correctly, an MLDC provides a consistent, enterprise-wide view of highly valuable data assets across all lines of business and functions of the organization. Washington points out that this allows users to quickly access meaningful information that is fit for purpose and provides an efficient way to understand who owns the data and whether it’s relevant for an individual’s use case. “This prevents users from having to search across systems, people, and processes for the information they need. By leveraging ML, the volume of information available, especially within data lakes, can be more easily harvested, related, and prepared for the organization.”

“An MLDC democratizes data for users across the enterprise by opening up visibility into and understanding of enterprise data beyond a small, closed circle of data owners and subject matter experts,” says Kuthanur. It does this by cataloging all data assets and providing simple, search-based discovery to find relevant data, along with a holistic view that helps users understand the data: where it is coming from, how it’s being used, what other data it’s related to, its business context, and its quality.

“An MLDC also enables enterprises to bring otherwise siloed or tribal data knowledge to the forefront by enabling users to add rich usage and business context to the data and enabling this shared data knowledge to be easily shared across the enterprise,” adds Kuthanur. “It’s no wonder that data cataloging is emerging as the critical first step for all digital transformation priorities from next-generation analytics to data governance and cloud modernization.”

A primary benefit is the rapid creation of a searchable and valuable source of data discovery. Sheldon says the ML part profiles data connected to the catalog and builds a business glossary or ontology that extracts tribal knowledge inherent in IT and makes the data more comprehensible to a wide audience. “AI within the catalog can make intelligent recommendations about other datasets that might be of interest based on the user’s search criteria, or show similar datasets, surface trusted data, or data that has been curated. The implementation of an MLDC increases the value of all data and potentially opens new revenue streams such as data as a service,” he offers.

Sheldon points out that knowledge graph technology helps users understand the provenance of the data or the relationships between data sets and attributes. “Comprehensive lineage that also displays any transformations that have occurred on the data to create derived attributes is another aspect fueled by ML or AI.”
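The lineage idea Sheldon describes can be pictured as a walk over a graph of datasets. The following Python sketch is a simplified illustration, not any product's implementation: it assumes lineage is stored as a plain dictionary that maps each dataset to the datasets it was directly derived from, and the dataset names and the `upstream_sources` function are hypothetical.

```python
from collections import deque


def upstream_sources(lineage, dataset):
    """Return every dataset that `dataset` depends on, directly or indirectly.

    `lineage` maps a dataset name to the list of datasets it was derived from.
    """
    seen = set()
    queue = deque(lineage.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # Already visited; also guards against cycles.
        seen.add(node)
        queue.extend(lineage.get(node, []))
    return seen


if __name__ == "__main__":
    # Made-up lineage: a dashboard built on a summary table built on raw sources.
    lineage = {
        "revenue_dashboard": ["sales_summary"],
        "sales_summary": ["orders", "customers"],
        "orders": ["erp_export"],
    }
    print(sorted(upstream_sources(lineage, "revenue_dashboard")))
```

A fuller lineage model would also record the transformation applied on each edge, which is what lets a catalog show how a derived attribute was produced.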

Challenges
There are a few considerations and challenges an organization may face when implementing an MLDC.

Washington says populating the catalog with valuable metadata across systems is critical to the success of an MLDC. “No matter how automated the MLDC is to crawl for metadata, you need to ensure the data is leverageable,” offers Washington.

Assuming the catalog is appropriately populated, the other primary challenge is usage/adoption. “As you open up a broader audience to access the catalog, it must be easy to use and value must be well understood in order to encourage users to keep coming back and getting comfortable with the quality of information it presents,” says Washington.

An organization needs an MLDC that will connect to and search all of its data. “There is no point having a data lake catalog unless you intend to put 100 percent of your data into the lake, something that no enterprise plans to do. Otherwise, how will data not in the lake be discovered? In another, separate catalog?” asks Sheldon. He says this risks the disastrous outcome of a catalog of catalogs, in which users must know which catalog to search before they can find what they want. “Data discovery should not stop with source systems. Modern business intelligence and analytics servers should be discoverable so that dashboards or visualizations can be searched just like Google serves up images that match search terms.”

Many enterprises leverage an MLDC to execute on strategic projects like driving a data-driven digital transformation of their business. “Scale is one of the common challenges for these enterprises—the ability to scan and gather metadata from a range of data sources, and support millions of data objects,” says Kuthanur. The ability to curate and enrich data at this scale is another key requirement and AI and ML play a critical role in this. “For instance, data domains can be automatically associated to physical data assets. While MLDCs can be effective in harnessing shared data knowledge from users, that approach in isolation will not scale. It has to be combined with AI/ML-driven automation.”

Kuthanur explains that understanding an end-to-end view of where the data is coming from and how it gets used is also critical to deepen understanding of this data. “This requires connecting to different data sources and extracting lineage from commonly used data integration tools as well as BI tools. This can be a challenge in enterprises with complex, evolving data landscapes.”

Sood points to another challenge: thinking beyond the catalog to formulate a holistic data management strategy. “Standalone initiatives on data catalogs, master data management, compliance, and analytics may lead to limitations and rework. An MLDC solution should not be selected based purely on cataloging, which is just one step of the overall journey for compliance or digital transformation, but rather with a comprehensive set of objectives that tie into the end business value.”

Leveling Up with Data
Data management and governance are critical and growing concerns for organizations today. MLDCs provide a solution, enabling users to categorize, access, interpret, and collaborate around data across multiple data sources. In addition to improving discoverability and accessibility, they help organizations comply with new data regulations.

Nov2019, Software Magazine
