Data is being created faster than ever, and in more locations than organizations can track. Data may be stored on-premises, across different public clouds, and in many software-as-a-service (SaaS) applications. Despite its importance, most organizations aren’t prepared to properly govern all this data. Many don’t know what data they have, let alone where to find it.
An organization that doesn’t have a holistic understanding of its data estate across the different sources will find it increasingly challenging to adapt quickly to changing market and regulatory conditions. Tracking all this data manually in Excel or in another spreadsheet will become increasingly difficult and cumbersome. And without this insight, employee efficiency is bound to suffer.
In a recent Global Data Transformation survey from McKinsey & Co., respondents said an average of 30 percent of their total enterprise time was spent on non-value-added tasks because of poor data quality and availability. No one can afford to lose that level of productivity.
Importance of Data Governance
Organizations have an ever-increasing appetite to leverage their data for business advantage, either through internal collaboration, data sharing across ecosystems, or as the basis for AI-driven business decision-making. However, while leveraging this data, organizations must take care to maintain employee, partner, and customer trust in their approach. This requires data governance and data governance solutions that help organizations handle data responsibly, ethically, compliantly, and accountably.
These solutions must perform well in data acquisition, cataloging, lineage, and analysis to deliver high-quality trusted data and provide insights into the location and movement of sensitive data across the entire data estate.
Here’s where Microsoft Purview comes in. Let’s look at how Purview works and why it’s getting so much attention.
What is Microsoft Purview?
Microsoft Purview is a unified data governance service that helps you manage and oversee your on-premises, multi-cloud, and software-as-a-service (SaaS) data. Built on Apache Atlas, an open-source project for metadata management and governance, Microsoft Purview lets you find, understand, govern, and consume data sources across your on-premises and cloud data estate.
Purview consists of three components: the data map, data catalog, and data insights. We’ll look at each of these elements individually. But first, here’s a high-level architectural diagram:
Microsoft Purview Data Map provides the foundation for data discovery and effective data governance. It is a cloud-native PaaS service that captures metadata about enterprise data held in analytics and operational systems, both on-premises and in the cloud. It powers the Purview Data Catalog and Purview Data Insights as a unified experience within Microsoft Purview Studio.
Registering your Data Source
Each data source has specific requirements for authentication and configuration to allow for scanning of its assets. For example, if you have data stored in an Amazon S3 standard bucket, you’ll need to provide configuration from the data source to the Purview scanner via a ‘connection.’ The Purview scanner uses this connection to your Amazon S3 buckets to read your data, and then report the scanning results back to Purview.
Microsoft continues to increase the number of supported data source connections to ensure that all your data sources can be pulled into the catalog and made accessible to users. You’ll need to check the most recent support information for both data sources and file types.
Collections are used to organize your sources and assets into main categories. By using collections, you can manage and maintain data sources, scans, and assets in a hierarchical model of your data landscape based on how your organization plans to use Microsoft Purview to govern that landscape. Access to collections, data sources, and metadata is set up and maintained based on the collections hierarchy, and follows a least-privilege model for access.
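To make the hierarchy idea concrete, here is a minimal sketch in plain Python. It does not call Microsoft Purview, and the `Collection` class, the role names, and the collection names are all hypothetical. It also assumes that roles granted on a parent collection are inherited by its children, which is one way to picture how a least-privilege model can be laid over a hierarchy.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Collection:
    """A node in a hypothetical collection hierarchy (not a Purview API type)."""
    name: str
    parent: Optional["Collection"] = None
    # Roles granted directly on this collection: user -> set of role names.
    roles: Dict[str, Set[str]] = field(default_factory=dict)

    def grant(self, user: str, role: str) -> None:
        """Grant a role to a user on this collection only."""
        self.roles.setdefault(user, set()).add(role)

    def effective_roles(self, user: str) -> Set[str]:
        """Roles granted here plus roles inherited from ancestors (an assumption)."""
        found = set(self.roles.get(user, set()))
        if self.parent is not None:
            found |= self.parent.effective_roles(user)
        return found


# Example hierarchy: Contoso > Finance > Payroll (names are made up).
root = Collection("Contoso")
finance = Collection("Finance", parent=root)
payroll = Collection("Payroll", parent=finance)

root.grant("ana", "reader")      # broad, low-privilege access at the top
finance.grant("ben", "curator")  # elevated access only where it is needed

print(payroll.effective_roles("ana"))
print(root.effective_roles("ben"))
```

In this sketch, granting the elevated role as low in the tree as possible keeps it from spreading to unrelated branches, which is the point of planning the hierarchy before registering sources.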
In Microsoft Purview, you identify data assets by assigning labels or classifications. It’s the way to apply sensitivity, compliance, industry, business, and company-specific metadata to your data assets so the data catalog can be populated according to the categorizations you want. Classification is based on the business context of the data. Microsoft has provided 200+ pre-defined system classifications, and you can also easily create your own custom classifications.
By categorizing sensitive data, you can use Microsoft Purview’s data owner access policies and your collection hierarchy to restrict who can search for and find that data once it is in the catalog. This is particularly important for sensitive data so you can establish and ensure appropriate governance. You’ll get the benefits of Purview whenever sensitive or protected data is involved, and you can still maintain regulatory compliance.
Examples of sensitive information include social security numbers, addresses, credit card numbers, and other personally identifiable information that is governed by regulations, such as GDPR.
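As a rough illustration of how a pattern-based custom classification might behave, here is a small Python sketch. It is not Purview's classification engine. The rule names, the regular expressions, and the 60 percent match threshold are all assumptions chosen for the example, and the patterns are deliberately simplistic.

```python
import re

# Hypothetical rules: label -> pattern that a single value must fully match.
RULES = {
    "US_SSN_LIKE": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    # 16 digits in groups of four, with one consistent separator (space, dash, or none).
    "CARD_NUMBER_LIKE": re.compile(r"^\d{4}([ -]?)\d{4}\1\d{4}\1\d{4}$"),
}


def classify_column(values, rules=RULES, threshold=0.6):
    """Return the labels whose pattern matches at least `threshold` of non-empty values."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return []
    labels = []
    for label, pattern in rules.items():
        hits = sum(1 for v in non_empty if pattern.match(v))
        if hits / len(non_empty) >= threshold:
            labels.append(label)
    return labels


# Sample column values (made up for the example).
ssn_column = ["123-45-6789", "987-65-4321", "n/a"]
card_column = ["4111 1111 1111 1111", "4111-1111-1111-1111"]

print(classify_column(ssn_column))
print(classify_column(card_column))
```

A threshold like this keeps a handful of look-alike values from labeling an entire column as sensitive. Real rules would need tighter patterns and validation than the ones shown here.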
The pre-built system classification rules for sensitive information are the same as in Microsoft 365, so you can extend your sensitivity labeling policies in the Microsoft 365 Compliance Center to Microsoft Purview-supported data stores. Learn how here.
In Microsoft Purview Studio, once you have created your collection hierarchy, registered a data source, and configured a scan, you can trigger the scan to gather your data source’s metadata. During the scan, assets are automatically labeled based on the classifications and sensitivity labels you’ve set up. When the scan completes, the data map is populated with the metadata, and you can find the scanned sources under the collections you’ve created.
It’s important to note that scanning will impact the data source’s performance, so you should schedule scans accordingly.
Purview’s data catalog is a detailed visual of your data estate. The catalog makes data sources easily discoverable and understandable. Once a search is entered, Purview returns a list of data assets matched to the relevant keywords, collections, and classifications you’ve selected.
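The way a keyword combines with collection and classification filters can be pictured with a few lines of Python. This is only a sketch over an in-memory list. The `Asset` type, the `search` function, and the sample assets are hypothetical and are not part of any Purview SDK.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Asset:
    """A hypothetical catalog entry (not a Purview type)."""
    name: str
    collection: str
    classifications: frozenset = field(default_factory=frozenset)
    description: str = ""


def search(assets, keyword="", collection=None, classification=None):
    """Return assets that match the keyword AND every filter that was supplied."""
    needle = keyword.lower()
    results = []
    for asset in assets:
        text = f"{asset.name} {asset.description}".lower()
        if needle and needle not in text:
            continue
        if collection is not None and asset.collection != collection:
            continue
        if classification is not None and classification not in asset.classifications:
            continue
        results.append(asset)
    return results


# Sample catalog (made up for the example).
catalog = [
    Asset("customers", "Sales", frozenset({"Email Address"}), "Customer master table"),
    Asset("orders", "Sales", frozenset(), "Customer orders by day"),
    Asset("employees", "HR", frozenset({"Email Address"}), "Staff directory"),
]

for hit in search(catalog, keyword="customer", classification="Email Address"):
    print(hit.name)
```

Each filter only narrows the result set, so a search for "customer" limited to assets classified as containing email addresses returns a single table in this sample.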
You can read a lot more about searching the data catalog here.
When you’ve found the data asset you want to view, its data lineage should be visible. Data lineage is broadly understood as the lifecycle of the data: its origin, where it moves over time across the data estate, and how it was transformed along the way. It can be extremely useful for visually tracing a data asset. Without it, users can spend a lot of time tracing the root cause of problems created by upstream data pipelines that are owned by other teams.
Lineage is supported only for some data sources. For a list of services that support pushing lineage to Purview, look at the Microsoft Purview documentation, which is updated regularly. Custom lineage reporting is also supported via Atlas hooks and the REST API.
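For custom lineage, the Apache Atlas model represents a transformation as a process entity whose `inputs` and `outputs` point at data set entities. The Python sketch below only builds such a payload and prints it. It does not send anything. The helper names, the qualified names, and the use of the generic `Process` and `DataSet` type names are assumptions for illustration. The endpoint URL and authentication are left out on purpose and should be taken from the Microsoft Purview documentation.

```python
import json


def object_ref(type_name, qualified_name):
    """Reference an existing entity by type and qualifiedName (Atlas object-id style)."""
    return {
        "typeName": type_name,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def build_process_entity(name, qualified_name, inputs, outputs, type_name="Process"):
    """Build an Atlas-style entity payload linking input data sets to output data sets."""
    return {
        "entity": {
            "typeName": type_name,
            "attributes": {
                "name": name,
                "qualifiedName": qualified_name,
                "inputs": [object_ref(t, q) for t, q in inputs],
                "outputs": [object_ref(t, q) for t, q in outputs],
            },
        }
    }


# Hypothetical nightly job that reads a raw table and writes a curated one.
payload = build_process_entity(
    name="nightly_customer_load",
    qualified_name="etl://example/nightly_customer_load",
    inputs=[("DataSet", "example://raw/customers")],
    outputs=[("DataSet", "example://curated/customers")],
)

print(json.dumps(payload, indent=2))
```

Referencing inputs and outputs by qualified name rather than by internal ID means the pipeline that reports lineage does not need to look the assets up first, provided the names match what the scan registered.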
To make the information more valuable to your business users, you can add business-related terms to your assets or select terms from a default glossary. This process allows users to easily browse and search the data catalog using familiar terms. It also eliminates the need for Excel data dictionaries. The glossary also allows terms to be categorized so that they can be understood in different contexts. For example, a customer could also be referred to as client, purchaser, or buyer. These terms can then be mapped to assets like a database, tables, columns, etc.
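The synonym idea can be sketched in a few lines of Python. This is a toy model, not the Purview glossary API. The `Glossary` class, its methods, and the asset names are hypothetical.

```python
class Glossary:
    """A toy business glossary: synonyms resolve to one canonical term."""

    def __init__(self):
        self._canonical = {}  # lowercase term or synonym -> canonical term
        self._assets = {}     # canonical term -> set of asset names

    def add_term(self, term, synonyms=()):
        """Register a canonical term together with any synonyms."""
        self._assets.setdefault(term, set())
        for label in (term, *synonyms):
            self._canonical[label.lower()] = term

    def resolve(self, label):
        """Return the canonical term for a label, or None if it is unknown."""
        return self._canonical.get(label.lower())

    def tag_asset(self, label, asset):
        """Map an asset (database, table, column, ...) to a term or one of its synonyms."""
        term = self.resolve(label)
        if term is None:
            raise KeyError(f"unknown glossary term: {label}")
        self._assets[term].add(asset)

    def assets_for(self, label):
        """List assets mapped to the term, whichever synonym was used to ask."""
        term = self.resolve(label)
        return sorted(self._assets[term]) if term is not None else []


glossary = Glossary()
glossary.add_term("Customer", synonyms=["client", "purchaser", "buyer"])
glossary.tag_asset("Customer", "sales_db.dbo.customers")
glossary.tag_asset("buyer", "crm.contacts.email")

print(glossary.assets_for("Client"))
```

Because every synonym resolves to the same canonical term, a user who searches for "client" sees the same assets as one who searches for "customer".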
Purview’s Data Insights is a single window into your data assets, scans, glossary, classifications, and sensitivity labeling. It gives you the ability to see audit trails of your data categorized by sensitivity and business relevance. Regarding sensitive information, it’s a simplified compliance risk assessment across all your operational and transactional data sources.
Here’s a helpful blog from Microsoft on the different insights available through Purview Data Insights.
More to Come
Microsoft continues to build out Purview’s functionality. In an October webcast, Microsoft showed the following roadmap:
Future areas specifically identified on that roadmap were Data Sharing, Data Quality, and Data Policy. While little more has been said about these areas or the timelines to general availability, it’s probably safe to assume that data sharing will incorporate some form of interactive data access within Purview Studio. The data quality element will highlight quality concerns, so you can give problematic areas your immediate attention. And lastly, data policy functionality could enhance options for data governance and include workflows and notifications, which would be a welcome addition.
Microsoft Purview is reimagining data discovery and governance in the cloud as data volume grows at exponential rates, takes on a variety of different forms, and is hosted in many different locations.
It answers the age-old questions:
- What data am I tracking?
- Where is it located?
- Is it safe?
- How do I get to it?
At Hitachi Solutions, our Azure data experts can help you gain strategic insight across your entire data estate with Purview. You can check out our solution offer here or contact us for more information. We look forward to joining you on your digital data transformation.