Published September 5, 2023

Big Data is Dead. Long Live Big Data.

Why the new era of ML and AI fuels the hyperscale future in AdTech.

By Jonathan Kelley, Director of User Experience at Ocient

Is Big Data dead? That’s the claim from Jordan Tigani, one of Google BigQuery’s founding engineers and now Founder and CEO at MotherDuck.

Tigani (whose team is doing incredibly innovative work) kicked off a debate in the world of Big Data that I’ve been following for a few months, essentially arguing that Big Data is not a challenge for the vast majority of enterprises and organizations out there.

Tigani no doubt has a vast knowledge base and wealth of experience to draw upon, and I appreciate him sharing such a thought-provoking post. At Ocient, however, we see the opposite end of the spectrum: our customers face very large Big Data processing challenges, and we’ve begun using the term “hyperscale” to describe the level of complexity and always-on data processing they require to drive innovation and grow their businesses.

Tigani has a point, though. We almost did kill Big Data—collectively, through the last decade of best practices aimed at “making Big Data small.” Rollups, summarizations, and downsampling are all tactics customers have adopted to work around technology limitations.

Trying to process Big Data at its full richness and dimensionality—not to mention analyzing it in real-time—can be a sure-fire way to break things. With the widespread practice of summarizing data and storing aggregates, it’s hardly a wonder that many organizations feel like the promise of Big Data never materialized. 

Enter a wave of new developments in artificial intelligence (AI) and machine learning (ML) where the promise of 10 years ago now has very real solutions. We’re standing at the edge of a new age of practical, accessible, and scalable AI and ML tools that will finally bring the original promise of Big Data back to life. And nowhere is this more critical—and more exciting—than in the AdTech world. 

Big Data was never dead 

We certainly see a world of challenges that customers face in processing data at scale, but in many industries—and perhaps most notably in AdTech—Big Data never died.

Perhaps that claim is valid for businesses in certain sectors, but the AdTech industry is built on data—as are the increasing number of industries leveraging IoT, network, and other real-time data streams to generate valuable insights.

Let’s dive into some of the points driving this conversation:  

 “Most people [companies] don’t have that much data.” 

As much as any other industry, AdTech grew up embracing the Big Data challenge. The most successful AdTech companies have always lived by the “more data is better” ethos. And the already-massive volume of ad exchange data continues growing steadily. In fact, 100% of AdTech leaders in a recent Ocient survey expect their data to grow “very fast” or “fast” in the years ahead.  

In some cases, we’ve even seen AdTech customers leave the cloud and host their large-scale data processing engines on premises. These workloads are too compute intensive and too expensive for them to run in the cloud. 

The continued proliferation of devices and media sources provides the AdTech industry with an enormous volume of structured and semi-structured data. But due to a variety of challenges, including legacy tech limitations, resourcing, and extreme data volumes, many DSPs and SSPs opt to sample a small percentage of that data. In AdTech, however, even a “small” sample may still represent billions or even trillions of records.

These records are often multi-dimensional: Each row might contain hundreds of sub-data points, like device type, geographic location, third-party audience data, etc. And that data is almost never in relational format; it’s coming through as a nested, semi-structured JSON object, for example. 
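To make that concrete, here’s a minimal sketch in Python of what one of those records might look like and what it takes to flatten it into relational columns. The field names and values are purely illustrative, not an actual exchange schema:

```python
import json

# A purely illustrative (and heavily trimmed) ad-exchange record; real payloads
# can carry hundreds of nested sub-fields per impression.
bid_record = {
    "auction_id": "a1b2c3",
    "timestamp": "2023-09-05T12:00:00Z",
    "device": {"type": "mobile", "os": "iOS", "model": "iPhone14,2"},
    "geo": {"country": "US", "region": "IL", "dma": 602},
    "audience": {"segments": ["auto_intenders", "sports_fans"]},
    "win": {"price_cpm": 4.25, "currency": "USD"},
}

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into dot-delimited relational columns."""
    row = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            row[name] = json.dumps(value)  # arrays often land as serialized strings
        else:
            row[name] = value
    return row

print(flatten(bid_record))
# {'auction_id': 'a1b2c3', ..., 'device.type': 'mobile', 'geo.dma': 602, ...}
```

Multiply that flattening work by billions of records per day, and the gap between “data we collect” and “data we can actually query” starts to become clear.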

All of this is making AdTech one of the biggest drivers of demand for hyperscale data solutions—according to our estimation, a nearly $10B market that’s growing at 35% annually. 

Clearly, AdTech companies do have a lot of data (or could process a lot of data). Yet the reality is that many AdTech companies have, until very recently, had no choice but to make significant compromises at hyperscale—letting large amounts of data fall through the cracks. Legacy tech limits innovation and growth: companies running legacy systems can’t effectively analyze the full scale and dimensionality of their data. So some don’t collect all of it, while others sample only the tiniest fraction of it.

Take a typical campaign testing example: There might be tens of millions of digital ads going up for auction every second. An AdTech company would love to estimate the value of inventory as an input to its pricing algorithm, especially by looking at win data from the last three months, but given the volume and velocity, that’s just too much data. So, instead, they’re likely sampling a small percentage of their data. They know this sampling approach hurts the accuracy and fidelity of their results, but they’re hamstrung by legacy tech that leaves them no other choice.
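The cost of that workaround is easiest to see when the data has to be sliced finely. Here’s a rough sketch, with made-up volumes and segment counts rather than customer figures, of how a 1% sample leaves most fine-grained segments (say, device × geo × audience combinations) with too few win records to estimate reliably:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical stand-in for win records: each row carries a fine-grained
# segment key (e.g., device x geo x audience). Real data would be billions of rows.
n_rows, n_segments = 10_000_000, 50_000
segments = rng.integers(0, n_segments, size=n_rows)

# The typical workaround when the full set is too big to query: keep ~1% of rows.
mask = rng.random(n_rows) < 0.01

# Count sampled rows per segment; most segments end up with only a handful of
# observations (or none at all), so per-segment value estimates get noisy.
sample_counts = np.bincount(segments[mask], minlength=n_segments)
share_thin = np.mean(sample_counts < 5)
print(f"segments with < 5 sampled win records: {share_thin:.0%}")  # roughly 95%
```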

Like with many data challenges, this isn’t a problem until it’s a problem. At the point where AdTech companies face business challenges—like the potential for increased customer churn, falling behind the competition, or missing opportunities to launch new products and services powered by data—then suddenly, their data problems become big data processing problems.  

“Customers with giant data sizes almost never queried huge amounts of data.” / “Most data is rarely queried.”

While this point rings true in some instances, it doesn’t account for the opportunities missed, or the cost-benefit tradeoffs ignored, when large amounts of data are left unqueried. So, while it may currently be true that most data is rarely queried and that huge data stores do not mean huge data queries, that status quo is not the ideal. Rather, the campaign testing example shows how these assertions might stem from a core challenge: Legacy tech limitations mean AdTech companies can’t query all their data, so they’re incentivized to develop workarounds to sample and summarize.

Several of our customers are also facing challenges stemming from technical debt, and that’s changing behaviors inside their organizations. It can leave teams unable to access more of their data, which in turn limits possibilities for data teams and narrows the kinds of data-driven applications that get discussed, prototyped, and launched.

Even the assertion that most queried data is less than 24 hours old still presents a big Big Data problem for AdTech companies: There’s an enormous volume of data coming in at incredible velocity—millions of digital ad auctions every second, with billions of multi-dimensional records behind them. AdTech companies’ critical need is to make sense of that data in real time so they can help their clients make predictive decisions that anticipate what’s happening in the market. 

It’s worth saying again: the reason many AdTech solutions aren’t querying enormous data sets isn’t that they don’t want to. The Big Data systems of the past limited how much data we could process, either blocking the ability to work with the data entirely or introducing lots of friction around what could be queried.

“Data is a liability.” 

Increasing regulatory concerns make collecting and storing huge volumes of data a liability that must be addressed. Yet, as explained quite well in this counterpoint post from Ponder, most companies want to do more, not less, with their data. To be sure, organizations are more mindful than ever of data security and data privacy concerns. Stricter regulations mean organizations must handle their data within constraints—yet compliance often means enriching that data with the metadata needed to meet those rules, which further compounds the size and scale of the data to be stored and analyzed.

I found this point from Ponder’s post particularly compelling: Forward-thinking companies can’t afford to delete data or store only aggregates, because you don’t know what future-you will need to know. As markets evolve, businesses need to be able to slice and dice their data in new ways to define new metrics and gain perspective on new concerns. And, once again, more data leads to better models and more accurate insights. Losing the full richness of data risks losing all of this future insight and value. On the flip side, having the full breadth and depth of retrospective data to work with creates a future-ready foundation for agility and innovation.

Which leads to the new opportunities driving us beyond big data into hyperscale data processing and analysis… 

AI and ML breathe new life into Big Data 

The simple truth is it’s hard for humans to even fathom hyperscale data. Hyperscale typically involves extremely complex data types or vast amounts (e.g., petabytes) of data. For the sake of argument, let’s say that a petabyte represents a trillion rows in a data table. If you printed it out, that table would circle the earth—73 times. It would take you centuries just to scroll through that table—even at speeds where you’d hardly be able to see each row. 
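The arithmetic behind that image is rough but easy to check. Here’s a back-of-the-envelope sketch; the bytes-per-row, printed line height, and scroll speed below are illustrative assumptions, not Ocient figures:

```python
# Back-of-the-envelope check of the "printed table circles the Earth" image.
PETABYTE_BYTES = 1e15
ROWS = 1e12                                   # "a petabyte represents a trillion rows"
bytes_per_row = PETABYTE_BYTES / ROWS         # ~1,000 bytes (1 KB) per row

LINE_HEIGHT_M = 0.003                         # assume ~3 mm per printed row
EARTH_CIRCUMFERENCE_M = 40_075_000
laps = ROWS * LINE_HEIGHT_M / EARTH_CIRCUMFERENCE_M

ROWS_PER_SECOND = 100                         # fast enough that rows blur past
SECONDS_PER_YEAR = 31_557_600
years_to_scroll = ROWS / ROWS_PER_SECOND / SECONDS_PER_YEAR

print(f"~{bytes_per_row:.0f} bytes per row")
print(f"~{laps:.0f} trips around the Earth if printed")                 # on the order of 73
print(f"~{years_to_scroll:.0f} years to scroll at {ROWS_PER_SECOND} rows/sec")  # centuries
```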

This ungraspable quantity of data points to the biggest limiting factor in realizing the promise of Big Data: it’s often hard to know where to start—what challenges can be solved, what questions can be answered, and what insights can be uncovered from the Big Data inside your organization. When data is too complex and too dimensional for us to easily scroll through, filter, or explore, it can become a challenge to comprehend many of the potential connections and correlations (and the value within them).

And while industries continue to implement the predictive capabilities of ML and AI, in the last year Large Language Models (LLMs) have proven to be an industry-wide gamechanger, bringing a new level of usability. This new ease of use and creative potential, combined with the analytical capabilities of ML, marks a new opportunity to solve the challenges of multi-dimensional analysis that are so critical for hyperscale applications in AdTech.

Hyperscale data analysis engines fuel new AI & ML possibilities 

At the same time, we’re seeing rapid adoption of the hyperscale data processing solutions needed to feed these advanced AI and ML tools. For example, one of the challenges is taking terabits per second of input data—which is never in an easy-to-digest state—and transforming it from a semi-structured, messy state into a relational schema that is indexed, compressed, encrypted, and queryable within seconds. That just has not been possible with legacy technology.
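To put that ingest rate in perspective, a quick unit conversion (the 1 Tbit/s figure here is chosen purely for illustration) shows how much raw data per day has to be parsed, flattened, indexed, compressed, and encrypted before it can be queried:

```python
# Rough unit arithmetic for a hypothetical 1 terabit-per-second ingest stream.
TERABITS_PER_SECOND = 1
bytes_per_second = TERABITS_PER_SECOND * 1e12 / 8     # 125 GB/s
petabytes_per_day = bytes_per_second * 86_400 / 1e15  # ~10.8 PB of raw input per day

print(f"{bytes_per_second / 1e9:.0f} GB/s  ->  ~{petabytes_per_day:.1f} PB per day")
```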

But new tech breakthroughs now power software data analysis engines that can finally enable this kind of real-time, hyperscale analysis. The difference is the evolution of NVMe solid-state drives (SSDs), boosted by high-core-count CPUs and 100-gigabit NICs everywhere. These are the building blocks of Ocient’s hyperscale data processing and analysis solutions, which deliver 10x—and often even 100x—the price-performance that was previously available.

This level of processing and analysis power means sophisticated ML tools can not only run real-time analysis of the data that’s pouring in (to tell AdTech companies what’s happening) but also deliver prescriptive insights (to tell them how to deal with it).

The most exciting part here isn’t which specific insights are being unlocked; it’s how these tools uncover entirely new questions to ask and identify new correlational relationships to mine for actionable insight and value. For example, enriching historical data with third-party data and other sources—piling new data upon data to unlock even more value and opportunity for AdTech companies and advertisers.

This really goes back to the original foundation on which the AdTech industry was built: creating value from new and better ways to process and analyze data, and delivering differentiated products that unravel that complexity. 

The hyperscale future is here in AdTech—are you ready? 

There’s no doubt that Big Data is an overused buzzword with a troubled past: so much promise; so much frustration. Yet an honest analysis shows that frustration can’t be blamed on Big Data—it’s just taken our tech stacks this long to catch up.   

But today, AdTech companies finally have access to hyperscale data processing solutions that can ingest, prepare, and process petabytes of data in real time. And they can pair that with a fast-growing generation of AI and ML tools that can make sense of that hyperscale data in all its richness and dimensionality—to find completely unconsidered correlations, and increasingly deliver prescriptive insights. 

With hard-learned lessons from the recent past, there has never been a more exciting time to be building AdTech solutions for hyperscale data sets. Leaders should build and maintain modern tech stacks designed to thrive in the hyperscale future, then prepare for new sources of value to emerge from their hyper-dimensional data.

If you’d like to learn more about our capabilities at Ocient, reach out and ask for a demo.