Microsoft Azure cloud services enable enterprises to manage data at scale in the cloud. This also opens up massive possibilities for predictive analytics, AI, and real-time applications. Apache Spark has become the platform of choice for building these applications, but deploying and managing Spark at scale has remained challenging, especially for enterprises with large numbers of users and strong security requirements.
Databricks, designed by the founders of Apache Spark, is integrated with Azure to provide an end-to-end, managed Apache Spark platform that is optimized for the cloud and enables collaboration between data scientists, data engineers, and business analysts. Azure Databricks features one-click deployment, autoscaling, and an optimized Databricks Runtime that can improve the performance of Spark jobs.
Apache Spark-based analytics platform
Azure Databricks provides a complete Apache Spark cluster environment. It includes the following Spark components:
- Spark SQL and DataFrames: Spark SQL is a Spark module for structured data processing. A DataFrame is a distributed collection of data organized into named columns. It is similar to a table in an RDBMS or a data frame in Python/R.
- Streaming: Real-time data processing and analysis. Integrates with HDFS, Flume, and Kafka.
- MLlib: Machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.
- GraphX: Graphs and graph computation for a broad scope of use cases from cognitive analytics to data exploration.
- Spark Core API: Includes support for R, SQL, Python, Scala, and Java.
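As a minimal sketch of the Spark SQL and DataFrames component, the Python snippet below builds a small DataFrame with named columns and computes the same aggregate twice, once through the DataFrame API and once through Spark SQL. The sample rows, the column names (`city`, `temp_c`), and the view name `readings` are made up for illustration. It assumes a PySpark environment such as a Databricks notebook, where `getOrCreate()` returns the session that already exists.

```python
# Hypothetical sample data: (city, temperature in Celsius).
SAMPLE_ROWS = [
    ("Oslo", 4.0),
    ("Oslo", 6.0),
    ("Cairo", 30.0),
    ("Cairo", 34.0),
]
COLUMNS = ["city", "temp_c"]

# Spark SQL query over a temporary view named "readings" (illustrative name).
AVG_QUERY = (
    "SELECT city, AVG(temp_c) AS avg_temp_c "
    "FROM readings GROUP BY city ORDER BY city"
)


def average_by_city():
    """Return [(city, avg_temp_c), ...] computed two ways with Spark."""
    # Imported here so the constants above can be inspected without Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

    # A DataFrame is a distributed collection organized into named columns.
    df = spark.createDataFrame(SAMPLE_ROWS, schema=COLUMNS)

    # 1) DataFrame API.
    via_api = (
        df.groupBy("city")
        .agg(F.avg("temp_c").alias("avg_temp_c"))
        .orderBy("city")
    )

    # 2) Spark SQL over the same data, exposed as a temporary view.
    df.createOrReplaceTempView("readings")
    via_sql = spark.sql(AVG_QUERY)

    api_rows = [(r["city"], r["avg_temp_c"]) for r in via_api.collect()]
    sql_rows = [(r["city"], r["avg_temp_c"]) for r in via_sql.collect()]
    assert api_rows == sql_rows  # both routes describe the same computation
    return api_rows
```

Calling `average_by_city()` on a cluster returns one row per city. Either route can be used on its own; the DataFrame API tends to suit programmatic pipelines, while the SQL form is convenient for analysts who already know SQL.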
Fully managed Apache Spark clusters
Azure Databricks provides a reliable environment in the cloud. It comes with the following benefits:
- Spin up clusters and build quickly in a managed Apache Spark environment. Clusters are set up, configured, and fine-tuned to ensure high reliability and performance.
- Reduce the resources and costs associated with scaling clusters manually by autoscaling up and down as your needs change. Automatically terminate inactive clusters to save resources.
- An interactive workspace enabling data scientists, engineers and business users to collaborate as a team.
- Integrate directly with Azure data stores and services, including Azure SQL Data Warehouse, Cosmos DB, Azure Data Lake Storage, Event Hubs, and Azure Data Factory. It also enables single sign-on through Azure Active Directory (Azure AD).
- Azure Databricks also provides Power BI integration that allows you to discover and share your impactful insights quickly and easily. You can use other BI tools as well, such as Tableau Software via JDBC/ODBC cluster endpoints.
- Easily build, train, and deploy AI models at scale using GPU-enabled clusters. Use the Databricks Runtime for Machine Learning, which comes preinstalled and preconfigured with deep learning frameworks and libraries such as TensorFlow, Keras, and XGBoost.
- Azure Databricks supports languages like Python, Scala, R, and SQL so you can use your existing skills to start building. Target any amount of data or any project size using a comprehensive set of analytics technologies including SQL, Streaming, MLlib, and GraphX.
- Create and manage clusters programmatically through REST APIs.
- Secure data integration capabilities built on top of Spark that enable you to unify your data without centralization.
- Instant access to the latest Apache Spark features with each release.
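To make the autoscaling, auto-termination, and REST API points above more concrete, here is a minimal Python sketch that prepares a cluster-creation request for the Databricks Clusters REST API (`POST /api/2.0/clusters/create`). The workspace URL, token, cluster name, runtime version string, and VM size are placeholders assumed for illustration, so substitute the values that are valid for your workspace. The request is only built here, not sent.

```python
import json
import urllib.request

# Placeholders (assumptions): replace with your workspace URL and access token.
WORKSPACE_URL = "https://example.azuredatabricks.net"
TOKEN = "<personal-access-token>"


def build_cluster_spec(name, min_workers, max_workers, idle_minutes):
    """Build a cluster spec that autoscales and auto-terminates when idle."""
    if not 0 < min_workers <= max_workers:
        raise ValueError("need 0 < min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": "<runtime-version>",  # placeholder, workspace-specific
        "node_type_id": "Standard_DS3_v2",     # example Azure VM size
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": idle_minutes,
    }


def build_create_request(spec):
    """Wrap the spec in an HTTP POST request for the Clusters API."""
    return urllib.request.Request(
        url=WORKSPACE_URL + "/api/2.0/clusters/create",
        data=json.dumps(spec).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + TOKEN,
            "Content-Type": "application/json",
        },
        method="POST",
    )


spec = build_cluster_spec(
    "analytics-demo", min_workers=2, max_workers=8, idle_minutes=30
)
request = build_create_request(spec)
# To submit it against a real workspace: urllib.request.urlopen(request)
```

With a spec like this, the cluster would scale between two and eight workers and shut itself down after 30 idle minutes, which is the cost-saving behavior described in the list above.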
Enterprise security
Azure Databricks provides enterprise-grade Azure security to protect your data and your business.
- Integrates with Azure Active Directory and enables you to run complete Azure-based solutions using Azure Databricks.
- Provides role-based access control for users over notebooks, clusters, jobs, and data.
- Enterprise-grade SLAs.
End-to-end Azure integration
Azure Databricks maps onto the broader Azure cloud offerings as follows:
- Diversity of VM types: All existing VM types are available, including F-series for machine learning scenarios, M-series for massive-memory scenarios, and D-series for general-purpose workloads.
- Azure SQL Data Warehouse, Azure SQL Database, and Azure Cosmos DB: Azure Databricks easily and efficiently uploads results into these services for further analysis and real-time serving, making it simple to build end-to-end data architectures on Azure.
- Azure Power BI: Direct connection to Power BI from Databricks clusters using JDBC.
- Azure Active Directory: Provides access control for resources and is already in use in most enterprises. Azure Databricks workspaces deploy in customer subscriptions, so Azure AD can naturally be used to control access to sources, results, and jobs.
- Security and privacy: Azure Databricks provides the same compliance certifications that the rest of Azure adheres to.
- Flexibility in network topology: Azure Databricks supports a diversity of network infrastructure needs.
- Internally, Azure Databricks is implemented on Azure Container Services, which runs the Azure Databricks control plane and data planes in containers.
- Azure Databricks uses Accelerated Networking, which provides the fastest virtualized network infrastructure in the cloud and further improves Spark performance.
- Azure Databricks runs on the latest generation of Azure hardware (Dv3 VMs), with NVMe SSDs capable of 100 µs I/O latency, which improves Databricks I/O performance.