Microsoft announced public preview of Fabric, their data platform for the age of AI. This is an incredible move by Microsoft as the last 40 years have produced very few end-to-end data and analytics platforms, and even fewer that have successfully stood the test of time.
Hitachi Solutions has been along for the journey since September 2021, providing testing and feedback to Microsoft. We wanted to share our insights into Microsoft Fabric: its path from conception to public preview, its technical principles and key features, and our thoughts on its medium-to-long-term market outlook.
Back in 2020, Microsoft’s Azure Intelligence Platform – or AIP, as it was briefly known – had an existential identity crisis. And I don’t mean the kind of dilemma that comes from having four different acronyms in two years (MDP, AIP, MIP and finally MIDP). Instead, it was the kind of dilemma that comes from not knowing what its Synapse platform wanted to be when it grew up: a traditional data warehouse that duplicates data lake data into the compute layer, like the old-fashioned “generation 2” systems, or a trendy Lakehouse that ephemerally spins up compute to query an open data lake on demand, like Databricks, Dremio, and Synapse Serverless. Trying to do both at once is a recipe for disaster, and AIP was a textbook case of a mediocre multitasker – attempting everything, winning at nothing.
Fast forward to mid-2021: a re-org at Microsoft merged the industry-dominating Power BI team with the Azure Synapse, data integration, and messaging portfolios. This provided a transformational opportunity. Enter Project Trident, Microsoft’s internal code name for what is known today as Microsoft Fabric.
As one of a handful of experts invited into the conceptual phase, I initially understood Project Trident as just another product codename, like Project Babylon (Purview), or a feature, like Project Athena (Synapse Link). But I quickly discovered it was something else entirely. It was a bold plan by the Product Group (PG) to reimagine the entire Microsoft data and AI stack. The core mission was made clear and has changed little over the past two years: SaaS-ify all analytics capabilities into a unified analytics fabric that supports both low-code and pro-dev users.
In practical terms, this meant expanding the Power BI SaaS platform to encompass not just reporting, but also Data Factory, Data Engineering, Data Warehousing, Data Science, Real-time Analytics, and a novel rule-based data-actioning engine called Reflex (later renamed to Data Activator).
Interview feedback evolved into Figma designs, recorded demos, a 6-month private preview, and finally the big reveal of the platform name: Microsoft Fabric. This 18-month journey was a masterclass in product design thinking and a testament to the PG’s unwavering commitment to quality.
For instance, recognizing that Azure Data Factory is respected in the data engineering community for its orchestration abilities, its Pipelines feature was wisely cherry-picked over less desirable features, such as Mapping Data flows and its attempt at Power Query. The sunk cost fallacy was avoided by simplifying Microsoft Fabric Data Factory to two key products: Data Factory Pipelines for orchestration and the crowd-favorite data transformation engine, Power Query Dataflows.
The best-of-breed design decisions didn’t stop there.
Now, imagine that you have a data warehouse full of tables that you want to query. How do you do it? You connect to the vendor database, write some SQL, join some tables, and voilà – there is your data. That’s how it works, and that’s how it has always worked.
But what if there was a way to query tables without connecting to the database? A way that lets you query your own data without being locked to a specific technology, vendor, or even programming language. A way that lets you select the ideal compute engine for the task at hand, whether it’s inside or outside of your platform. A way that lets you future-proof your data estate by adding or removing compute engines as you please, without messing up your data or paying for expensive migrations.
To those not paying attention for the past few years, this sounds like madness. However, this is precisely the promise of the “Lakehouse”, a relatively recent design principle from which Fabric was woven. Lakehouse means that your data lives in one place: a data lake. And not just any data lake, but OneLake: a single open-data location for your entire tenant that can be accessed by any engine. No more copying data from your data lake to your data warehouse, then to your semantic layer, then to your report. No more data hops. No more data silos. Just one lake and many engines.
There’s a lot more to Fabric than Lakehouse (we’ll dive deeper into some of its other features in a future blog post). In addition to Data Factory detailed above, here is a quick summary of some key technical components:
**OneLake** is a single logical data lake for the entire enterprise. Conceptually, think of OneLake as an Azure Data Lake Storage gen2 (ADLSgen2) resource or a hard drive on your computer.
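Because OneLake exposes the same API surface as ADLSgen2, existing ADLSgen2-aware tools can address it with a standard `abfss://` URI against the documented `onelake.dfs.fabric.microsoft.com` endpoint. A minimal sketch (the workspace and lakehouse names are hypothetical placeholders):

```python
def onelake_uri(workspace: str, lakehouse: str, relative_path: str) -> str:
    """Build an ADLSgen2-style abfss:// URI pointing into a Fabric lakehouse.

    Workspace and lakehouse names are illustrative; the endpoint is the
    OneLake DFS endpoint that mirrors an ADLSgen2 account.
    """
    return (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{lakehouse}.Lakehouse/{relative_path}"
    )

# Any ADLSgen2 client (Spark, azure-storage-file-datalake, etc.) can then
# treat this path as if it were a regular storage account.
uri = onelake_uri("Sales", "Contoso", "Tables/orders")
print(uri)
```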
**Lakehouse** – Microsoft has taken the concept of the Lakehouse and made it a tangible feature of Fabric. You can think of a Lakehouse as a partition of your OneLake – a fiefdom carved out to satisfy a specific purpose, such as a project, product, team, or department. It provides virtualization capabilities, called Shortcuts, which make ADLSgen2, AWS, and GCP data lakes look like they’re natively integrated behind the Lakehouse’s ADLSgen2 API. Moreover, replication capabilities to stream database data directly into OneLake are on the backlog. Conceptually, think of a Lakehouse as an ADLSgen2 container or the C:\ drive on your computer.
**Delta Lake** is the data storage format for all structured data. This is the Rosetta Stone allowing all types of compute (T-SQL, Spark, Kusto, Databricks, etc.) to interoperate by agreeing on a single file format. All of the following compute engines natively store their data in the Delta Lake format.
**The T-SQL Data Warehouse** replaces the Synapse dedicated and serverless SQL pools. It offers net-new features above and beyond what is available today, such as:

- Performant cross-database queries that generate a single execution plan, allowing teams to explore decentralized Data Mesh designs and break away from giant monolithic data warehouses.
- Performance by default, without the need to create indexes or assign table distributions, allowing developers to focus on valuable insights instead of hand-cranking their screwdrivers. (Note for the particularly technically savvy: indexing and table distributions will come later.)
- Delta Lake-backed SQL tables, enabling streaming and fast interoperability with other compute types, such as Spark, without incurring the cost of copying data between technologies.
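Cross-database queries use familiar T-SQL three-part naming, so one warehouse can join another’s tables directly. A hedged sketch of what such a query could look like – the warehouse, schema, and table names are invented for illustration, and the query is only composed as a string here; in practice you would submit it to the warehouse’s SQL endpoint with any T-SQL client (pyodbc, SSMS, etc.):

```python
# Hypothetical warehouses "SalesWH" and "FinanceWH" in the same workspace.
# Three-part names (database.schema.table) let a single query span both
# warehouses and still produce one unified execution plan.
query = """
SELECT s.order_id, s.amount, f.cost_center
FROM   SalesWH.dbo.Orders      AS s
JOIN   FinanceWH.dbo.CostCodes AS f
       ON s.cost_code_id = f.cost_code_id;
"""

# e.g. pyodbc.connect(conn_str).execute(query)  # requires a live endpoint
print(query.strip())
```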
**KQL Databases** provide extremely fast performance on streaming and time-series data.
**Power BI Direct Lake mode** provides all the performance and functionality of Import mode, but without the hassle of setting up schedules to refresh the data. This is particularly useful in event-driven, streaming, or complex data-refresh scenarios.
**Spark** needs little introduction, as it is the dominant technology in the analytics industry. It is the Swiss Army knife of code-heavy data transformation engines, providing best-of-breed capabilities for every use case, such as streaming, big data, data engineering, and data science.
**Purview** will automatically capture, catalog, and tag all data assets in Fabric. This is an underrated feature that will prove to be one of the most important value adds.
**OneSecurity** – Currently, the modern data estate is littered with options for where and how to define granular security: at the data lake, warehouse, semantic, or reporting layer. It’s confusing. OneSecurity promises to replace these options with a one-stop shop for defining Row-Level Security (RLS), Object-Level Security (OLS), and Dynamic Data Masking (DDM) where your data begins: in OneLake. More importantly, these security rules will automatically propagate to all downstream systems.
**Copilot** is available across most experiences, auto-generating DAX code or even entire reports. It will continue to be embedded in further experiences, such as Data Wrangler – a feature that auto-generates Pandas code for common data-cleaning activities such as filtering, renaming, and replacing values.
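The code such tooling emits is ordinary, readable Pandas. A hedged sketch of the kind of filter/rename/replace snippet Data Wrangler might generate (the column names and values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Cust Name": ["alice", "bob", None, "dana"],
    "Country":   ["US", "U.S.", "UK", "US"],
})

# Typical auto-generated cleaning steps: drop rows with missing names,
# rename a column to a friendlier identifier, normalize inconsistent values.
df = df.dropna(subset=["Cust Name"])
df = df.rename(columns={"Cust Name": "customer_name"})
df["Country"] = df["Country"].replace({"U.S.": "US"})

print(df.to_dict(orient="list"))
```

The point is less the code itself than that it is plain Pandas: generated steps can be inspected, edited, and version-controlled like any hand-written notebook cell.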
**Git integration** – branches can be assigned to a workspace, allowing version control for selected Fabric items, including Power BI datasets and reports. This is particularly important for enterprise customers with mission-critical workloads.
The history of Analysis Services provides a roadmap for how Fabric will be rolled out in the next few years.
- 2012: Analysis Services Tabular (IaaS)
- 2015: Power BI (SaaS)
- 2017: Azure Analysis Services Tabular (PaaS)
- 2019: Microsoft officially champions Power BI for all net-new workloads
As shown, there was a 4-year gap between Power BI’s release and its officially becoming Microsoft’s default semantic layer. When considering that Fabric is simply adding new technology to a mature SaaS platform (Power BI) by an experienced PG, we predict that Fabric will be the official “default” for all Microsoft data and analytics workloads within 2-3 years – or by 2026.
Fabric is an end-to-end data platform that lets customers skip the tedious and mundane parts of the data supply chain and jump straight to the fun and valuable parts: analyzing data, generating insights, and – with Data Activator – taking meaningful action.
This first-mover advantage, in conjunction with Microsoft’s access to one billion people who trust and use its products daily (otherwise known as Microsoft Office users), means it has the ingredients for a cake that its competitors can’t make. And it appears that Microsoft just turned on the oven: as of July 1st, 2023, Fabric will be baked into all Power BI tenants, worldwide.