Businesses are continuously generating and collecting more data. To keep up with and manage this information, a data lake provides a powerful method for storing and leveraging data. From a massive Internet of Things (IoT) operation streaming data from millions of devices globally, to a small business trying to combine social media data with their sales transactions for analysis, data lakes provide actionable insights for your company when it comes to collecting, processing, and analyzing data.
What is an Azure Data Lake?
A data lake is a repository that holds a huge amount of raw data in structured, semi-structured and/or unstructured form. This differs from the structured and defined environment of a data warehouse. While data lakes can be found upstream from data warehouses, they can also be used separately for analysis and exploration. Here, data scientists, BI developers and analysts can dive in for ad-hoc exploration and analysis, or develop production processes fed by the data lake. Because a data lake is highly agile, it is easy to continuously reconfigure it or add new data sources, unlike a data warehouse, where changes to production environments often involve a lengthy process.
1. Utilize Any Data Source
The agility and power of a data lake come from its ability to store three different kinds of data: structured data, unstructured data and a hybrid of the two called semi-structured data. Because a data lake is highly scalable, minimal effort is needed to ramp up the volume of data being transferred into it.
You can quickly and easily store just about anything:
- Unstructured data: This is often text-heavy data with structures that are not uniform, including raw sensor log data from millions of devices constantly streaming data to an IoT hub, PowerPoint files, images, audio, videos, emails, and raw social media data.
Example use: Processing IoT sensor data through a hub to reduce the data volume and derive defined metrics from the raw data.
- Semi-Structured: This is typically when the data has metadata tags associated with it, giving it some sort of loosely defined structure.
Example use: Storing CCTV video footage or PowerPoint files of images with metadata tags for location, date, time and other attributes, for use with image recognition technology built with TensorFlow.
- Structured data: This data has a rigid format, and includes data residing in traditional databases such as SQL Server and Analysis Services cubes. It also includes processed IoT hub data with billions of rows.
Example use: Storing and processing historical extracts or restores of transactional or multidimensional production databases in cold storage, in case they are needed for an audit or after a system failure.
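As a rough illustration of the three categories above, the following sketch sorts incoming files into landing folders by file extension. The extension mapping, the folder names and the `landing_path` helper are all assumptions made for this example; a real pipeline would more likely inspect content or metadata than rely on file names alone.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the three categories above.
# The assignments are illustrative, not an Azure convention.
CATEGORY_BY_EXTENSION = {
    ".csv": "structured",
    ".parquet": "structured",
    ".json": "semi-structured",
    ".xml": "semi-structured",
    ".log": "unstructured",
    ".mp4": "unstructured",
    ".pptx": "unstructured",
    ".jpg": "unstructured",
}


def landing_path(filename: str, root: str = "raw") -> str:
    """Return a hypothetical lake folder path for an incoming file."""
    path = PurePosixPath(filename)
    # Unknown extensions fall back to "unstructured", the loosest category.
    category = CATEGORY_BY_EXTENSION.get(path.suffix.lower(), "unstructured")
    return f"{root}/{category}/{path.name}"


if __name__ == "__main__":
    for name in ["sales/2020/orders.csv", "cctv/clip.MP4", "tags/footage.json"]:
        print(landing_path(name))
```

Keeping the raw files separated by category in this way is only one possible layout; the point is that all three kinds of data can land in the same lake.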
2. Supports Hadoop and Other Powerful Tools
Hadoop is a software framework that excels at batch processing unstructured data at scale, and is synonymous with powerful and efficient big data environments. Hadoop is useful as an initial landing place to host all kinds of data in low-level or raw form in front of downstream data warehouse and business intelligence (BI) tools. It is well suited to processing large amounts of unstructured data, analyzing structured data and streaming data in near-real time. Azure Data Lake supports Hadoop and allows users to leverage its power through a non-technical front end. Highly technical users and developers can also hand-code Hadoop jobs.
Other tools supported in the Azure Data Lake include:
- U-SQL: This language is a hybrid of SQL and C# for processing massive amounts of data. Being scalable and distributed makes it ideal for analyzing data across relational databases and unstructured data stores.
- Apache Spark: Spark is an analytics engine for big data environments and machine learning. It is famously used at many internet giants including Netflix and eBay. Azure HDInsight Spark clusters allow you to run Spark jobs that read from an Azure Data Lake source.
- Apache Hive: This is the standard tool used for SQL queries over massive amounts of data in Hadoop. Hive is used to query and analyze data.
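Tools such as Spark and Hive typically address files in the lake by URI. The sketch below builds those URIs for both generations of the service, using the `adl://` scheme for Gen1 and the `abfss://` scheme for Gen2. The account, container and file names are hypothetical, and the helper functions are written for this example rather than taken from any Azure SDK.

```python
def gen1_uri(account: str, path: str) -> str:
    """Build an adl:// URI for a file in a Data Lake Storage Gen1 account."""
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


def gen2_uri(account: str, container: str, path: str) -> str:
    """Build an abfss:// URI for a file in a Data Lake Storage Gen2 account."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


if __name__ == "__main__":
    # "contosolake" and "raw" are made-up names used only for illustration.
    print(gen1_uri("contosolake", "/iot/2020/readings.csv"))
    print(gen2_uri("contosolake", "raw", "/iot/2020/readings.csv"))
    # On a cluster already configured with access to the account, a Spark job
    # could pass such a URI to a reader, for example spark.read.csv(uri).
```

Authentication and cluster configuration are deliberately left out here; they depend on how the storage account and the HDInsight cluster are set up.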
3. Cost-Efficient Solution
Ultimately, a data lake is designed for low-cost storage. It costs a fraction of what on-premises storage costs, and is cheaper than traditional cloud storage, making it possible to economically store high-definition video or data from devices that stream continuously. With cold and archive storage pricing available, immense savings are possible for data that is not constantly needed on demand. A tiered pricing system also offers large economies of scale, with the cost per GB coming down as storage volume grows. Note, however, that an Azure Data Lake incurs costs for operations and data transfers, not just for storage. More information on pricing can be found on the Microsoft website:
Gen1: https://azure.microsoft.com/en-us/pricing/details/data-lake-storage-gen1/
Gen2: https://azure.microsoft.com/en-us/pricing/details/storage/data-lake/
Note that Gen2 is the more modern version, and allows combining Hadoop with Azure Blob Storage.
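To see how tiered pricing lowers the average cost per GB as volume grows, here is a minimal sketch of the arithmetic. The tier sizes and prices are invented for illustration and are not Microsoft's actual rates; operation and data transfer charges are not modeled.

```python
# Hypothetical tiers: (tier size in GB, price per GB per month).
# A size of None marks the final, unbounded tier. These are NOT real prices.
TIERS = [
    (100_000, 0.040),
    (900_000, 0.035),
    (None, 0.030),
]


def monthly_storage_cost(total_gb: float, tiers=TIERS) -> float:
    """Charge each slice of the stored volume at the price of its own tier."""
    remaining = total_gb
    cost = 0.0
    for size, price in tiers:
        # Fill the current tier before spilling into the next, cheaper one.
        in_tier = remaining if size is None else min(remaining, size)
        cost += in_tier * price
        remaining -= in_tier
        if remaining <= 0:
            break
    return cost


if __name__ == "__main__":
    for volume in (50_000, 150_000, 2_000_000):
        total = monthly_storage_cost(volume)
        print(f"{volume:>9} GB: {total:,.2f} per month, {total / volume:.4f} per GB")
```

Under these made-up rates, the average price per GB falls as the stored volume moves into the cheaper tiers, which is the economy of scale described above.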
4. Create a Universal Platform for all Data
Combining data for analysis or processing can be difficult if it is stored in separate silos across an organization, which in turn makes big data projects difficult for analysts and business end users. Leveraging big data analytics requires combining different sets of data to analyze for correlation or for reporting purposes. It is not uncommon for companies to use several different data storage platforms at the same time, such as SAS, SQL, Teradata and Redshift servers. A data lake is an ideal platform for storing this data at a low cost, making it easier to combine and analyze, and ultimately allowing analysts to keep enhancing your customer experience.
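Once data from separate silos sits in one place, combining it becomes a simple join on a shared key. The sketch below merges sales transactions with social media mentions by date. The rows, field names and the `combine_by_date` helper are all made up for illustration.

```python
from collections import defaultdict

# Made-up rows standing in for extracts from two separate silos.
sales = [
    {"date": "2020-01-01", "amount": 120.0},
    {"date": "2020-01-01", "amount": 80.0},
    {"date": "2020-01-02", "amount": 50.0},
]
mentions = [
    {"date": "2020-01-01", "count": 14},
    {"date": "2020-01-02", "count": 3},
    {"date": "2020-01-03", "count": 7},
]


def combine_by_date(sales_rows, mention_rows):
    """Aggregate both sources per day so they can be compared side by side."""
    combined = defaultdict(lambda: {"sales": 0.0, "mentions": 0})
    for row in sales_rows:
        combined[row["date"]]["sales"] += row["amount"]
    for row in mention_rows:
        combined[row["date"]]["mentions"] += row["count"]
    # Return a plain dict ordered by date; days missing from one source keep 0.
    return {date: combined[date] for date in sorted(combined)}


if __name__ == "__main__":
    for date, totals in combine_by_date(sales, mentions).items():
        print(date, totals)
```

At data lake scale the same join would typically run in an engine such as Spark or Hive rather than in plain Python, but the logic is the same.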
5. Seamless Integration with Microsoft’s Big Data Platform
The Azure Data Lake integrates easily with the following Azure services:
- Azure Data Lake Analytics: This Microsoft tool allows you to perform analytics on exabytes of data.
- Azure Data Factory: This cloud-based tool is used for ETL processes that feed data warehouses and SQL databases for reporting. Azure Data Factory can load data into the data warehouse, or use the data warehouse as a data source.
- Azure HDInsight: This highly secure and low-cost managed cloud service by Microsoft allows processing of data using Hadoop, Spark, Hive, R and other tools for your data lake.
6. Highly Secure
Like all parts of the Azure cloud platform, the Data Lake is extremely secure. Microsoft uses some of the most advanced security technology and is trusted by hundreds of large and well-known companies across industries. Additionally, the Azure cloud holds more security and privacy certifications than any other cloud provider, ranging from the UK Government G-Cloud framework to China’s Information System Classified Security Protection standard.
Conclusion
Azure Data Lake is an easy-to-use tool that helps propel organizations into a data-driven culture. With inexpensive pricing and powerful big data technologies available on Azure Data Lake, there’s no reason why you cannot leverage big data technology in the same fashion as major technology giants. With powerful self-service abilities and a simple user interface, businesses can save time, resources and money on their big data platforms. In the age of analytics, it’s vital for organizations to intelligently use the data they collect so they can evolve rapidly with changing business and consumer habits. The future is bright for analytics: companies are swarming to collect and derive insights from data, and with some quick retooling you too can be a superstar for your department and organization.
Have questions on how you can use a data lake in your organization, or about any of the above features? Contact Hitachi Solutions today.