Published August 30, 2022

Architecting for the Future: The Next Generation of Hadoop Workflows for Hyperscale Data 

By Jonny McCormick, Solutions Architect at Ocient 

As we recently learned through a survey of 500 industry leaders managing some of the world’s most complex datasets, the world is moving beyond the buzzwords of big data. Today it’s imperative for business leaders to find the right tools for the right jobs and to identify competitive advantages that drive revenue in rapidly changing environments.

It wasn’t long ago that Apache Hadoop, an open-source framework used to store and process large datasets ranging from gigabytes to petabytes, became close to an industry standard. Hadoop was a breakthrough in clustering multiple computers to analyze massive datasets in parallel and at a reasonable cost.

The Hadoop ecosystem has grown significantly over the years due to its extensibility. Today, it includes many tools and applications that help collect, store, process, analyze, and manage big data, such as Spark, Presto, and Hive. It is a data analysis toolbox that has become well established for data scientists across industries.

We’re big fans of these tools here at Ocient, but we also want you to understand that there is a more cost-effective hammer for your most difficult data jobs. The trade-offs with each technology need to be understood. Hadoop, while continuing to be effective for large, complex batch workloads, is not viable for delivering real-time insights into constantly streaming datasets. On the other hand, while Spark’s fast compute engine can deliver excellent response times, it comes with the challenge of spiraling costs as the data footprint grows into the hyperscale range.

Why the industry embraced Hadoop 

When Hadoop had its initial release in 2006, dataset sizes were growing rapidly. Advances in hardware meant that large organizations were routinely able to store hundreds of terabytes to petabytes of data. However, standard data processing options, such as conventional relational databases, became a bottleneck due to their linear nature and slow response times.

Hadoop quickly developed into a stable, fast, and efficient way to process large datasets, with Yahoo helping it lead the way in scale and benchmarks. Distribution under Apache open-source licenses made the software accessible and inexpensive.

Since then, the ecosystem has continued to grow, with new projects providing key capabilities to the platform. For example, Hive and Pig provide more accessible SQL and developer layers, YARN contributes to scalability, and Spark supplies a performance layer that had been missing until then.

The huge community and rich ecosystem around Hadoop means that it will continue to be a dominant player for big data solutions. However, there are growing datasets and organizations out there for which it will never be a perfect fit. 

What’s changed about modern workloads 

There is no question that the Hadoop ecosystem has provided invaluable components that, when brought together, have delivered insight into some of the world’s richest and most complex datasets. However, multi-petabyte datasets are no longer the preserve of genomics institutes, governments, and social media monoliths. Large data storage and analysis challenges exist in most enterprise businesses today.

The way that data is being delivered has also changed. Insights are expected quickly and frequently, and the fire hose of data being pushed toward warehouse and analysis platforms grows ever larger. Streaming is much more the norm, and with that come higher expectations on accuracy and response time.

This combination of changing data ownership and expectations has put pressure on traditional operational IT teams to become experts in big data and analytics platforms. The reality is that while the Hadoop ecosystem has many tools able to address some of the challenges these teams face, being able to identify, integrate, architect, and deliver a working solution given all the choices can be challenging.

Even when turning to the few experts out there who have experience in building systems for truly hyperscale requirements, businesses must still consider the cost and complexity of the solutions required to address these needs.  

This is why at Ocient we are dedicated to bringing ease, manageability, and efficiency to the largest and most complex datasets, pipelines, and analysis problems.

How Ocient replaced costly Spark and Hadoop jobs for MediaMath, accelerating innovation and reducing costs by 50% 

Our customer, MediaMath, is an ad tech company that handles more than six million bid opportunities per second with 10-12 petabytes of new records per day. Ocient’s solution allowed MediaMath to analyze higher-resolution data sets and deliver faster, more accurate insights on demand. Prior to working with Ocient, it was technically infeasible for MediaMath to query raw bidder log data at high enough resolution and in interactive time. This led to complex pre-aggregation and estimating pipelines that added cost and introduced technical limitations. With Ocient, MediaMath now returns rapid, more granular insights to improve campaign effectiveness and return on investment (ROI) for its customers. 
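To put those figures in perspective, a quick back-of-envelope calculation (a sketch using only the numbers stated above: ~6 million bid opportunities per second and 10–12 petabytes of new records per day; the per-record sizes it derives are illustrative, not figures from MediaMath) shows the scale of each day’s ingest:

```python
# Rough arithmetic on the scale described in the article.
# Assumed inputs: ~6M bid opportunities/sec, 10-12 PB of new records/day.

SECONDS_PER_DAY = 86_400
bids_per_second = 6_000_000
petabyte = 10**15  # decimal petabyte, in bytes

# Sustained bid rate implies over half a trillion records per day.
records_per_day = bids_per_second * SECONDS_PER_DAY
print(f"records per day: {records_per_day:,}")  # 518,400,000,000

# Dividing the daily data volume by the record count gives an
# average footprint per record at each end of the stated range.
for pb_per_day in (10, 12):
    bytes_per_record = pb_per_day * petabyte / records_per_day
    print(f"{pb_per_day} PB/day -> ~{bytes_per_record / 1000:.0f} kB per record")
```

Even at the low end of the range, that is hundreds of billions of records per day arriving continuously, which is the kind of sustained streaming volume the pre-aggregation pipelines mentioned above were built to work around.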

Why migrate your Hadoop workload to Ocient? 

Ocient, from our inception, has been designed to tackle the most difficult data problems. We are seeing more and more customers looking for help dealing with multiple, complex data sources generating millions of records per second and processing hundreds of terabytes per day.

One key benefit of the Ocient solution is that our software delivers every stage of a data pipeline. Whether data needs to be pulled from an object store or is being pushed into a message queue, our load and transform layer is designed to take data at speed and can scale to meet the largest requirements.

An ETL layer that is a fundamental part of our software provides significant benefits in managing data load and, of course, in ensuring performance can be delivered right down to the Foundation (storage) layer.

The Foundation layer is core to our capabilities. Ocient is built from the ground up to be NVMe native and brings data together securely, effortlessly, and with more flexibility than any other tool available. The Compute Adjacent Storage Architecture (CASA) and horizontal scalability of the platform provide Tbit/s data transfer speeds within our clusters.

Critically, just like Hadoop’s, these capabilities are only valuable if they can be harnessed in a way that meets customer needs; the most difficult part of any data pipeline is defining it. Hadoop has tools to help, but ensuring they are both effective and efficient is no easy task.

Ocient has proven able to replace existing Hadoop environments, improving performance while reducing footprint and cost. We achieve this in partnership with our customers through an engagement model designed to demonstrate the Ocient value from day one. Delivering success into production is something Ocient Architects and Engineers are passionate about. The expertise and experience within Ocient ensure the most difficult problems have solutions that can be proven and delivered.

Get in touch and let us know how we can help solve the problems of the past and position your organization for the future.