
Giulia Lanzafame
on 10 June 2025

Apache Spark security: start with a solid foundation


Everyone agrees security matters, yet when it comes to big data analytics with Apache Spark, it's not just another checkbox. Spark's open source, Java-based architecture introduces specific security concerns that, if neglected, can quietly expose sensitive information and disrupt critical operations. Unlike standard software, Spark's design allows user-provided code to execute with extensive control over cluster resources, which calls for strong security measures to prevent unauthorized access and data leaks.

Securing Spark is key to maintaining enterprise business continuity, safeguarding data in memory as well as at rest, and defending against emerging vulnerabilities unique to distributed, in-memory processing platforms. Unfortunately, securing Spark is far from a trivial task; in this blog we’ll take a closer look at what makes it so challenging, and the steps that enterprises can take to protect their big data platforms.

Why enterprises struggle with Spark security

Closing CVEs is fundamental for any software: it is one of the most effective ways to reduce the risk of being compromised through known vulnerabilities. However, closing CVEs in Java applications like Spark is uniquely challenging, for a number of reasons.

The first issue is the complexity of dependency management: a typical Java application may include more than 100 third-party libraries, each with different versions and dependencies of its own. When a vulnerability is found in one library, updating or downgrading it can break compatibility with other dependencies that rely on specific versions, making remediation complex and risky. This tangled web of dependencies can make some vulnerabilities practically impossible to fix without extensive testing and refactoring.
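To make the dependency problem concrete, here is a minimal sketch, written as an sbt build file since Spark itself is a Scala project, of how a team typically pins a vulnerable transitive library to a patched release. The library and version numbers are illustrative assumptions, not recommendations for any particular build.

```scala
// build.sbt -- illustrative sketch of pinning a vulnerable transitive dependency.
// Library names and version numbers below are placeholders, not advice.

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.5.1" % Provided
)

// Force every module that transitively pulls in jackson-databind to resolve
// to a single, patched version instead of whatever each library declares.
// This is exactly the step that can break other libraries expecting an
// older version, which is why such overrides need regression testing.
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.15.4"
```

Every such override has to be validated against the rest of the dependency tree, which is why a single CVE fix can turn into a lengthy testing effort.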

Apart from this, Java codebases tend to be verbose, and the language is heavily used in corporate applications, typically in highly complex monolithic architectures. As a result, a single vulnerability can affect millions of Java applications around the world, creating a huge attack surface. The ease of exploitation and the reach of these vulnerabilities make them hard to eradicate entirely when affected versions are deeply embedded in many systems. Consequently, developers typically face a massive volume of CVE reports that is difficult to prioritize, which delays remediation.

Research shows that delayed patching is a major cause of security breaches in enterprise environments. For example, the IBM 2024 Cost of a Data Breach report puts the average cost of breaches caused by known, unpatched vulnerabilities at $4.33M, and the Canonical and IDC 2025 state of software supply chains report indicates that 60% of organizations have only basic or no security controls to safeguard their AI/ML systems. These challenges create significant risk: delays in applying security patches leave systems exposed to known vulnerabilities; compatibility issues can force organizations to choose between security and stability; and vulnerabilities in widely used Java components can compromise millions of applications simultaneously, causing disruption when critical fixes are needed right away.

These Java-related challenges have a deep impact on Apache Spark. In the first place, Apache Spark has thousands of dependencies, so fixing a CVE (whether by patching or by bumping a version) is difficult because the fix can easily break compatibility. This huge number of dependencies also affects the number and severity of vulnerabilities: Spark has experienced several critical and high-severity vulnerabilities over the years, many of them traceable to its Java origins. In 2022, developers discovered a command injection vulnerability in the Spark UI (CVE-2022-33891) that reached an exploitation probability of 94.2%, placing it in the top 1% of known exploitable vulnerabilities, and in 2024 alone two new critical vulnerabilities emerged, clearly showing the threat posed by slow patch adoption in the Java ecosystem. These issues are not only a security concern for Spark clusters; they also force companies to make hard choices between applying the latest security updates and prioritizing the stability of their infrastructure.
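As an illustration of the kind of configuration a hardening pass covers, the sketch below enables a few of Spark's built-in security controls: RPC authentication and encryption, encryption of data spilled to disk, and UI access controls. The option names are standard Spark properties, but the application name and ACL values are hypothetical, and configuration alone is no substitute for running patched versions (CVE-2022-33891, for example, sat in the UI's ACL code path itself).

```scala
import org.apache.spark.sql.SparkSession

// Minimal hardening sketch: standard Spark configuration options with
// placeholder values, not a complete or authoritative security baseline.
val spark = SparkSession.builder()
  .appName("hardened-job")
  // Require a shared secret for RPC connections between driver and executors;
  // depending on the cluster manager you may also need to provide
  // spark.authenticate.secret yourself.
  .config("spark.authenticate", "true")
  // Encrypt RPC traffic using the negotiated authentication secret.
  .config("spark.network.crypto.enabled", "true")
  // Encrypt temporary data that Spark spills and shuffles to local disk.
  .config("spark.io.encryption.enabled", "true")
  // Turn on UI ACLs and restrict who can view or modify running applications.
  .config("spark.acls.enable", "true")
  .config("spark.ui.view.acls", "data-team")     // placeholder user list
  .config("spark.modify.acls", "spark-admins")   // placeholder user list
  .getOrCreate()
```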

Our efforts to ensure Spark's security posture

At Canonical, we believe that robust security should be an integral part of your data analytics platform, not a secondary element – and with Charmed Spark, we aim to address the traditional complexity of securing enterprise Spark deployments. 

We maintain a steady release pace of roughly one new version per month, while simultaneously supporting two major.minor version tracks, which as of today are 3.4.x and 3.5.x. This dual-track support ensures stability for existing users while allowing for ongoing feature development and security improvements. In addition, our proactive vulnerability management has led us, in the past year, to close 10 critical CVEs, resolve 84 high-severity vulnerabilities, and address 161 medium-severity vulnerabilities in Spark and its dependencies, extending the same focus to related projects such as Hadoop and its dependencies.

By investing in automated, self-service security testing, we accelerate the detection and fixing of vulnerabilities, minimizing downtime and manual intervention. Our comprehensive approach to security includes static code analysis, continuous vulnerability scans, rigorous management processes, and detailed cryptographic documentation, as well as hardening guides to help you deploy Spark with security in mind from day one.

Charmed Spark is a platform where security is a central element. It benefits users by reducing exposure to breaches through timely updates and fixes for known vulnerabilities, and by providing the tools and documentation needed to install and operate Spark securely. In an environment where Java applications are a frequent target of attacks and dependency complexity can slow the rollout of patches, Canonical's approach maintains a higher level of protection, letting users analyze and use data without undue concern about security weaknesses. This ultimately enables enterprises to focus on their core business applications and deliver value to their customers without having to worry about external threats.

Canonical for your big data security

While the complexity of Java applications and their extensive dependency ecosystems presents ongoing challenges, Charmed Apache Spark gives you a securely designed open source analytics engine without the vulnerability burden that typically comes with such a large Java-based project. Moving forward, these foundational security practices will continue to play a vital role in protecting the Spark ecosystem and supporting the broader open source community.

To learn more about securing your Spark operations, watch our webinar.
