Introduction
Databricks is experiencing extremely rapid growth and adoption as a platform for data innovation. In this article, we cover some of the reasons for this, the key benefits, and the differentiators as we see them.
A Common Environment For Data Analysts, Data Scientists and Data Engineers
Typically, the different job roles in a data team are siloed, each using their own languages, tools and approaches. For instance, big data engineers often use Scala as a result of the Hadoop/Spark evolution, whilst Data Scientists prefer Python for their numerical work. Likewise, Data Analysts might prefer Tableau or Power BI, whereas Data Scientists might prefer working with Notebook based UIs.
The real power of Databricks is that it gets all of these people onto a common platform, logging into the same system and interacting with the same datasets through the same Notebook based UI.
Not only does this avoid the significant cost of building and managing separate tooling stacks, it also gives these people a common language and frame of reference. For the first time, they can genuinely collaborate and cross over in terms of skills and responsibilities.
Fully Managed Spark
Though deploying Spark isn't the hardest thing in the world, it is not trivial: clusters take real effort to deploy, monitor, maintain and upgrade.
Using Databricks, you simply select an auto-scaling cluster size and create it. You know the cluster is configured according to best practices and security standards, and because the platform manages it for you, upgrades and maintenance are hugely simplified.
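As a flavour of how little is involved, here is a minimal sketch of creating an auto-scaling cluster through the Databricks Clusters REST API. The workspace URL, token, runtime version and node type below are placeholder assumptions for illustration, not a prescription.

```python
# Sketch: create an auto-scaling Databricks cluster via the Clusters API.
# The URL, token, runtime version and node type are hypothetical values.
import requests

WORKSPACE_URL = "https://my-workspace.cloud.databricks.com"  # hypothetical workspace
TOKEN = "dapi-xxxxxxxx"                                       # hypothetical personal access token

cluster_spec = {
    "cluster_name": "analytics-shared",
    "spark_version": "13.3.x-scala2.12",   # choose a runtime available in your workspace
    "node_type_id": "i3.xlarge",           # cloud-specific instance type
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 60,          # shut down automatically when idle
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
response.raise_for_status()
print("Created cluster:", response.json()["cluster_id"])
```

The same specification can of course be expressed through the workspace UI or infrastructure-as-code tooling; the point is that sizing, auto-scaling and auto-termination are declared rather than built.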
The real benefit here is time to value: you avoid the need to build out on-premises or cloud infrastructure, and can redirect that time and budget straight into business deliverables.
In Your Own Cloud Account
Often, people are concerned about using SaaS products for their data, due to regulatory requirements, information security, or the value of the data assets.
Databricks solves this elegantly, by storing all data and processing within your own cloud account, where it can be securely managed and controlled. This makes information security and governance much easier, whilst still giving you a SaaS like experience.
Unification Of Data Warehouse And Data Lake
Within the enterprise, there will often be a number of data lakes for unstructured data and a data warehouse for more business intelligence type scenarios. There will also be a lot of ETL moving data between the databases and the data lakes.
Databricks have a concept of a “Data Lakehouse”, whereby you organise your data as a data lake but can then query it using SQL and data warehouse style semantics.
This can massively simplify the technology estate and allow you to consolidate down onto one data store.
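To make this concrete, here is a minimal sketch of the lakehouse pattern, assuming a Databricks notebook where `spark` is already available; the storage path, table name and columns are hypothetical.

```python
# Sketch: land raw data in the lake as a Delta table, then query it with SQL.
# Path, table name and columns (region, amount) are hypothetical.
sales = spark.read.json("s3://my-landing-bucket/raw/sales/")

# Store the raw data in the lake as a Delta table...
sales.write.format("delta").mode("overwrite").saveAsTable("sales_bronze")

# ...and immediately query it with ordinary SQL, data warehouse style.
spark.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales_bronze
    GROUP BY region
""").show()
```

The same files remain open-format data in your own cloud storage; the warehouse-style semantics are layered on top rather than requiring a separate copy of the data.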
Unification Of Batch and Streaming
As with data lakes and data warehouses, there is also a mixture of batch and streaming workloads within enterprises today.
Databricks and Spark have a number of features, most notably Structured Streaming, which let you process batch data and streams through the same APIs, reducing complexity in the data infrastructure.
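Here is a minimal sketch of what that unification looks like in practice with Structured Streaming; the source path and the columns referenced are hypothetical.

```python
# Sketch: the same transformation logic applied to a batch read and a
# streaming read of the same (hypothetical) directory of JSON events.
from pyspark.sql.functions import col

# Batch: read everything currently in the directory.
batch_events = spark.read.format("json").load("s3://my-bucket/events/")

# Streaming: the same directory, processed incrementally as new files arrive.
stream_events = (spark.readStream.format("json")
                 .schema(batch_events.schema)
                 .load("s3://my-bucket/events/"))

# Identical logic works on both DataFrames.
def errors_per_minute(df):
    return df.filter(col("level") == "ERROR").groupBy("minute").count()

# Batch result, computed once.
errors_per_minute(batch_events).show()

# Streaming result, maintained continuously in a Delta table.
(errors_per_minute(stream_events).writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/errors/")
    .start("s3://my-bucket/gold/errors_per_minute/"))
```

Because the transformation is the same function in both cases, moving a workload from nightly batch to near-real-time streaming becomes a change of plumbing rather than a rewrite.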
Exit Strategy Through Open Source Spark
For the most part, if you become unhappy with Databricks for any reason, there is usually a fairly simple exit path to open source or another managed Spark offering. As you will typically store your data on S3 in an open format such as Parquet, this really minimises lock-in.
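As a small illustration of that exit path, the sketch below reads the same data with plain open-source Spark outside Databricks; the bucket name is hypothetical and the S3 connector and credential configuration are omitted for brevity.

```python
# Sketch: reading data written from Databricks with plain open-source Spark.
# Bucket is hypothetical; hadoop-aws/S3 credentials setup is not shown.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("post-databricks")
         .getOrCreate())

# Parquet written from Databricks is just Parquet; any Spark distribution
# (or other Parquet-aware engine) can read it directly.
sales = spark.read.parquet("s3a://my-landing-bucket/curated/sales/")
sales.show()
```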
Overall, the benefits of Databricks are very clear to us. Less overhead and improved opportunities to collaborate, without taking on the pain of vendor lock-in, is a really attractive proposition.