
Giulia Lanzafame
on 10 June 2025

Apache Spark security: start with a solid foundation


Everyone agrees security matters, yet when it comes to big data analytics with Apache Spark, it's not just another checkbox. Spark's open source, Java-based architecture introduces specific security concerns that, if neglected, can quietly expose sensitive information and disrupt critical operations. Unlike standard software, Spark's design allows user-provided code to execute with extensive control over cluster resources, which calls for strong security measures to prevent unauthorized access and data leaks.

Securing Spark is key to maintaining enterprise business continuity, safeguarding data in memory as well as at rest, and defending against emerging vulnerabilities unique to distributed, in-memory processing platforms. Unfortunately, securing Spark is far from a trivial task; in this blog we’ll take a closer look at what makes it so challenging, and the steps that enterprises can take to protect their big data platforms.

Why enterprises struggle with Spark security

Closing CVEs is fundamental for any software: it is one of the most effective ways to reduce the risk of being compromised through known vulnerabilities. However, closing CVEs in Java applications like Spark is uniquely challenging, for a number of reasons.

The first issue is the complexity of dependency management: a typical Java application may include more than 100 third-party libraries, each with different versions and dependencies of its own. When a vulnerability is found in one library, updating or downgrading it can break compatibility with other dependencies that rely on specific versions, making remediation complex and risky. This tangled web of dependencies can make some vulnerabilities practically impossible to fix without extensive testing and refactoring.
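To make the dependency problem concrete, here is a minimal sketch, written as an sbt build file since Spark itself is a Scala project, of how a team typically pins a vulnerable transitive library to a patched release. The library and version numbers are illustrative assumptions, not recommendations for any particular build.

```scala
// build.sbt -- illustrative sketch of pinning a vulnerable transitive dependency.
// Library names and version numbers below are placeholders, not advice.

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.5.1" % Provided
)

// Force every module that transitively pulls in jackson-databind to resolve
// to a single, patched version instead of whatever each library declares.
// This is exactly the step that can break other libraries expecting an
// older version, which is why such overrides need regression testing.
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.15.4"
```

Every such override has to be validated against the rest of the dependency tree, which is why a single CVE fix can turn into a lengthy testing effort.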

Apart from this, Java codebases tend to be verbose, and the language is heavily used in corporate applications, typically in highly complex monolithic architectures. As a result, a single vulnerability can affect millions of Java applications around the world, creating a huge attack surface. The ease of exploitation and the reach of these vulnerabilities make them hard to eradicate entirely when affected versions are deeply embedded in many systems. Consequently, developers typically face a massive volume of CVE reports that is difficult to prioritize, which delays remediation.

Research shows that delayed patching is a major cause of security breaches in enterprise environments. For example, the IBM 2024 Cost of a Data Breach report puts the average cost of breaches caused by known, unpatched vulnerabilities at $4.33M, and the Canonical and IDC 2025 state of software supply chains report indicates that 60% of organizations have only basic or no security controls to safeguard their AI/ML systems. These challenges create significant risk: delays in applying security patches leave systems exposed to known vulnerabilities; compatibility issues can force organizations to choose between security and stability; and vulnerabilities in widely used Java components can compromise millions of applications simultaneously, causing disruption when critical fixes are needed right away.

These Java-related challenges have a deep impact on Apache Spark. In the first place, Apache Spark has thousands of dependencies, so fixing a CVE (whether by patching or by bumping a version) is difficult because the fix can easily break compatibility. This huge number of dependencies also affects the number and severity of vulnerabilities: Spark has experienced several critical and high-severity vulnerabilities over the years, many of them traceable to its Java origins. In 2022, developers discovered a command injection vulnerability in the Spark UI (CVE-2022-33891) that reached an exploitation probability of 94.2%, placing it in the top 1% of known exploitable vulnerabilities, and in 2024 alone two new critical vulnerabilities emerged, clearly showing the threat posed by slow patch adoption in the Java ecosystem. These issues are not only a security concern for Spark clusters; they also force companies to make hard choices between applying the latest security updates and prioritizing the stability of their infrastructure.
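As an illustration of the kind of configuration a hardening pass covers, the sketch below enables a few of Spark's built-in security controls: RPC authentication and encryption, encryption of data spilled to disk, and UI access controls. The option names are standard Spark properties, but the application name and ACL values are hypothetical, and configuration alone is no substitute for running patched versions (CVE-2022-33891, for example, sat in the UI's ACL code path itself).

```scala
import org.apache.spark.sql.SparkSession

// Minimal hardening sketch: standard Spark configuration options with
// placeholder values, not a complete or authoritative security baseline.
val spark = SparkSession.builder()
  .appName("hardened-job")
  // Require a shared secret for RPC connections between driver and executors;
  // depending on the cluster manager you may also need to provide
  // spark.authenticate.secret yourself.
  .config("spark.authenticate", "true")
  // Encrypt RPC traffic using the negotiated authentication secret.
  .config("spark.network.crypto.enabled", "true")
  // Encrypt temporary data that Spark spills and shuffles to local disk.
  .config("spark.io.encryption.enabled", "true")
  // Turn on UI ACLs and restrict who can view or modify running applications.
  .config("spark.acls.enable", "true")
  .config("spark.ui.view.acls", "data-team")     // placeholder user list
  .config("spark.modify.acls", "spark-admins")   // placeholder user list
  .getOrCreate()
```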

Our efforts to ensure Spark's security posture

At Canonical, we believe that robust security should be an integral part of your data analytics platform, not a secondary element – and with Charmed Spark, we aim to address the traditional complexity of securing enterprise Spark deployments. 

We maintain a steady release pace of roughly one new version per month, while simultaneously supporting two major.minor version tracks, which as of today are 3.4.x and 3.5.x. This dual-track support ensures stability for existing users while allowing for ongoing feature development and security improvements. In addition, our proactive vulnerability management has led us, in the past year, to close 10 critical CVEs, resolve 84 high-severity vulnerabilities, and address 161 medium-severity vulnerabilities in Spark and its dependencies, extending the same focus to related projects such as Hadoop and its dependencies.

By investing in automated, self-service security testing, we accelerate the detection and fixing of vulnerabilities, minimizing downtime and manual intervention. Our comprehensive approach to security includes static code analysis, continuous vulnerability scans, rigorous management processes, and detailed cryptographic documentation, as well as hardening guides to help you deploy Spark with security in mind from day one.

Charmed Spark is a platform where security is a central element. It benefits users by reducing exposure to breaches through timely updates and fixes for known vulnerabilities, and by providing the tools and documentation needed to install and operate Spark securely. In an environment where Java applications are a frequent target of attacks and dependency complexity can slow the rollout of patches, Canonical's approach maintains a higher level of protection, letting users analyze and use data without undue concern about security weaknesses. This ultimately enables enterprises to focus on their core business applications and deliver value to their customers without having to worry about external threats.

Canonical for your big data security

While the complexity of Java applications and their extensive dependency ecosystems presents ongoing challenges, Charmed Apache Spark gives you a securely designed open source analytics engine without the vulnerability burden that typically comes with such a large Java-based project. Moving forward, these foundational security practices will continue to play a vital role in protecting the Spark ecosystem and supporting the broader open source community.

To learn more about securing your Spark operations, watch our webinar.
