Published June 1, 2026

Graph analytics at scale without Spark bottlenecks

Run graph algorithms where the data already lives

By Jason Arnold, Co-Founder and Distinguished Engineer

Welcome back. In previous posts for this “Inside the Ocient Engine” series, we’ve looked at casting and timestamps, arrays, tuples, and the entire surface of the Earth. So far the throughline has been syntax: how to ask the engine to do things using familiar Postgres-flavored SQL plus a few Ocient-specific extras.

Today I want to take a step away from pure syntax and look at something a little different: graph analytics. This is the one area in the series where the answer isn’t a SQL keyword. It’s an API. Specifically, I want to walk through the OCGraph library (our Spark GraphX–inspired framework for running large-scale graph algorithms directly inside the OcientAIQ™ Unified Data Platform), show you how easy it is to use if you’ve ever written a line of GraphFrames or GraphX code, and then dump some performance numbers we have been collecting that, frankly, surprised even me.

Two flavors of the library exist:

  • Java: ships as part of the Ocient JDBC driver, available on Maven Central as com.ocient:ocient-jdbc4. Pull in the JDBC dependency and the com.ocient.jdbc.graph.OCGraph class is right there alongside the SQL driver, with no separate artifact to manage.

  • Python: distributed on PyPI as ocient-graph. Run pip install ocient-graph and from ocient_graph import … and you’re done.

The two libraries expose identical sets of algorithms. Same names, same semantics, same outputs. I’ll use the Python flavor for the examples below because it’s shorter on the page, but every example has a one-to-one Java equivalent.

Let’s dig in.

The Mental Model: Graphs Are Just Two Tables

If you have used GraphX or GraphFrames, this part will feel familiar.

A graph in OCGraph is two relational tables:

  • A vertices table with a BIGINT NOT NULL column called id, plus any other columns you want (name, weight, attributes).

  • An edges table with BIGINT NOT NULL columns called srcid and destid, plus any other columns you want (weight, timestamp, type).

That’s it. There is no separate graph storage format, no specialized graph DDL, no second engine. Your graph is a pair of regular Ocient tables. That means you can index them, cluster them, join them with non-graph tables, query them with WHERE clauses, and so on. Everything you already know how to do with Ocient tables works on the graph.

Let’s set up a tiny example we can run actual algorithms against:

We populate the two tables, cities and roads, with 9 cities and 20 directed road segments. The first six cities are in the upper Midwest (Chicago, Milwaukee, Madison, Indianapolis, Detroit, Cleveland) and they’re all reachable from each other via interstate roads. The last three are towns on Hawaii’s Big Island (Hilo on the east coast, Kona on the west, Waimea in the north), and they’re connected to each other by Hawaii’s state highways. The Big Island is, well, an island, so the two clusters are physically disconnected: there is no road from Cleveland to Hilo. For this toy graph that is exactly what we want: two disconnected components, and within each component, three of the cities form a triangle.

OCGraph treats edges as directed. If your data is conceptually undirected, store each edge twice (one row each way). All of my road examples below have a row in each direction.
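As a runnable stand-in for that setup, here is a minimal sketch that builds the two tables in an in-memory SQLite database rather than Ocient. The table names cities and roads and the id, srcid, and destid columns follow the post; the name and miles columns, the ids other than Chicago = 1 and Hilo = 7, and every mileage not quoted in the shortest-paths section are assumptions for illustration.

```python
import sqlite3

# Ids: Chicago = 1 and Hilo = 7 come from the post; the rest are assumed
# to follow the order in which the cities are listed.
CITIES = [(1, "Chicago"), (2, "Milwaukee"), (3, "Madison"),
          (4, "Indianapolis"), (5, "Detroit"), (6, "Cleveland"),
          (7, "Hilo"), (8, "Kona"), (9, "Waimea")]

# Undirected roads as (city_a, city_b, miles).
UNDIRECTED_ROADS = [
    (1, 2, 92), (1, 3, 147), (2, 3, 79),  # Midwest triangle (placeholder miles)
    (1, 4, 180), (4, 5, 290),             # Chicago-Indianapolis-Detroit (miles from the post)
    (1, 6, 345), (6, 5, 170),             # Chicago-Cleveland-Detroit (miles from the post)
    (7, 8, 76), (8, 9, 39), (7, 9, 55),   # Big Island triangle (placeholder miles)
]

conn = sqlite3.connect(":memory:")  # stand-in engine, not Ocient
conn.execute("CREATE TABLE cities (id BIGINT NOT NULL, name TEXT)")
conn.execute(
    "CREATE TABLE roads (srcid BIGINT NOT NULL, destid BIGINT NOT NULL, miles DOUBLE)"
)
conn.executemany("INSERT INTO cities VALUES (?, ?)", CITIES)

# Edges are directed, so each undirected road is stored once per direction.
directed = UNDIRECTED_ROADS + [(b, a, m) for a, b, m in UNDIRECTED_ROADS]
conn.executemany("INSERT INTO roads VALUES (?, ?, ?)", directed)

n_cities = conn.execute("SELECT COUNT(*) FROM cities").fetchone()[0]
n_roads = conn.execute("SELECT COUNT(*) FROM roads").fetchone()[0]
print(n_cities, n_roads)
```

Ten undirected roads stored in both directions give the 20 directed segments described in the text.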

Algorithm #1: Connected Components

Connected components, sometimes called WCC (Weakly Connected Components), assigns every vertex a label such that two vertices share a label if and only if there is a path between them when edge direction is ignored. It is the bread and butter of identity stitching, fraud rings, telco subscriber graphs, and ad-tech audience consolidation.

Here it is in OCGraph:

That’s the entire program. Notice what isn’t there: no Spark session, no executor configuration, no driver memory tuning, no checkpoint directory, no HDFS staging, no spark-submit wrapper script. You have a database connection. Call a function. The function writes its result back into the database as a table you can query.

Let’s see what landed:

Two components. The Midwest cluster is labeled with the smallest vertex ID in the component (1, Chicago), and the Big Island cluster is labeled with the smallest Big Island ID (7, Hilo). That labeling convention (minimum vertex ID per component) matches GraphX.
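To make the minimum-vertex-ID convention concrete, here is a minimal pure-Python sketch of the semantics on the toy graph. It is not the OCGraph call or its implementation: the function name and the in-memory edge list are assumptions, and the topology is inferred from the post's description of the roads.

```python
# Toy graph: ids 1-6 are the Midwest cities, 7-9 the Big Island towns.
# Chicago = 1 and Hilo = 7 come from the post; other ids are assumed.
UNDIRECTED = [(1, 2), (1, 3), (2, 3), (1, 4), (4, 5),
              (1, 6), (6, 5), (7, 8), (8, 9), (7, 9)]
EDGES = UNDIRECTED + [(b, a) for a, b in UNDIRECTED]  # one row per direction
VERTICES = list(range(1, 10))


def connected_components(vertex_ids, edges):
    """Label every vertex with the smallest vertex id in its component."""
    label = {v: v for v in vertex_ids}
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            # Direction is ignored: both endpoints take the smaller label.
            low = min(label[src], label[dst])
            if label[src] != low or label[dst] != low:
                label[src] = label[dst] = low
                changed = True
    return label


components = connected_components(VERTICES, EDGES)
print(components)
```

Labels only ever decrease, so the loop terminates once both endpoints of every edge agree, which happens exactly when each component has settled on its smallest id.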

Pro Tip: The library writes to a brand-new table that you name. It does not alter your cities table. This is intentional. Algorithms are pure functions over input tables: the input is never mutated, the output is a fresh table, and you can index that output up front via the result_vertices_indexes parameter. If you want WCC results joined back to your dimension table, write the JOIN yourself; the library is not in the business of guessing your schema.
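A minimal sketch of writing that JOIN yourself, using in-memory SQLite as a stand-in for Ocient. The result table name cities_wcc and its component column are hypothetical; the real output schema is whatever the library writes under the name you choose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine, not Ocient
conn.execute("CREATE TABLE cities (id BIGINT NOT NULL, name TEXT)")
# Hypothetical result table: one row per vertex with its component label.
conn.execute("CREATE TABLE cities_wcc (id BIGINT NOT NULL, component BIGINT NOT NULL)")

names = ["Chicago", "Milwaukee", "Madison", "Indianapolis", "Detroit",
         "Cleveland", "Hilo", "Kona", "Waimea"]
conn.executemany("INSERT INTO cities VALUES (?, ?)", list(enumerate(names, start=1)))
# Component label is the minimum vertex id: 1 for the Midwest, 7 for the Big Island.
conn.executemany("INSERT INTO cities_wcc VALUES (?, ?)",
                 [(i, 1 if i <= 6 else 7) for i in range(1, 10)])

# The input table is untouched; the join happens at query time.
joined = conn.execute("""
    SELECT c.name, w.component
    FROM cities AS c
    JOIN cities_wcc AS w ON w.id = c.id
    ORDER BY c.id
""").fetchall()
print(joined)
```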

Algorithm #2: PageRank

PageRank assigns each vertex a score representing its “importance”, i.e., how likely a random walker is to land on it.

Chicago wins because it is the only Midwest city directly connected to four others. The three Big Island cities tie because their triangle is perfectly symmetric: each one has the same in-degree and out-degree, and they have no incoming PageRank from the Midwest to break the tie (the islands are disconnected from the mainland in our toy graph). Indianapolis and Cleveland tie because their roles in the Midwest graph are also symmetric.

If you want the Personalized PageRank variant (random walker always resets to a specific source vertex), just pass that vertex ID as personalization_src_id. Same function, same arguments, one extra integer. There is also a dynamic_page_rank for graphs where you want a tolerance-based stop condition instead of a fixed iteration count.
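To make the random-walker definition concrete, here is a minimal pure-Python power-iteration sketch on the toy graph. It is not the OCGraph implementation: the function name, the scores normalized to sum to 1, the 0.85 damping factor, and the tol parameter (a tolerance-based stop in the spirit of the dynamic variant) are all assumptions.

```python
# Toy graph: ids 1-6 are the Midwest cities, 7-9 the Big Island towns.
# Chicago = 1 and Hilo = 7 come from the post; other ids are assumed.
UNDIRECTED = [(1, 2), (1, 3), (2, 3), (1, 4), (4, 5),
              (1, 6), (6, 5), (7, 8), (8, 9), (7, 9)]
EDGES = UNDIRECTED + [(b, a) for a, b in UNDIRECTED]  # one row per direction
VERTICES = list(range(1, 10))


def page_rank(vertex_ids, edges, damping=0.85, max_iter=100, tol=None):
    """Power iteration; assumes every vertex has at least one outgoing edge."""
    n = len(vertex_ids)
    out_degree = {v: 0 for v in vertex_ids}
    for src, _ in edges:
        out_degree[src] += 1
    rank = {v: 1.0 / n for v in vertex_ids}
    for _ in range(max_iter):
        # Teleport share first, then each vertex splits its rank over its out-edges.
        nxt = {v: (1.0 - damping) / n for v in vertex_ids}
        for src, dst in edges:
            nxt[dst] += damping * rank[src] / out_degree[src]
        delta = sum(abs(nxt[v] - rank[v]) for v in vertex_ids)
        rank = nxt
        if tol is not None and delta < tol:
            break  # tolerance-based stop instead of a fixed iteration count
    return rank


rank = page_rank(VERTICES, EDGES)
for v in sorted(rank, key=rank.get, reverse=True):
    print(v, round(rank[v], 4))
```

On this graph the sketch reproduces the pattern described in the text: Chicago ranks highest, the three Big Island towns tie, and Indianapolis ties with Cleveland.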

Algorithm #3: Triangle Count

Triangle count, for each vertex, returns the number of triangles that vertex participates in. It’s used for clustering coefficient calculations, bot detection, and characterizing graph density. It is also one of the algorithms that famously breaks Spark on large graphs. More on that in the performance section.

Six cities have one triangle, three have zero. The six are the members of the Midwest triangle (Chicago/Milwaukee/Madison) and the Big Island triangle (Hilo/Kona/Waimea), exactly as we expect. There is also triangle_count_pre_canonicalized for cases where your edges table is already canonicalized (each undirected edge appears in only one direction with srcid < destid); it skips the canonicalization step and runs faster.
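The canonicalization step is easier to see in code. Here is a minimal pure-Python sketch of per-vertex triangle counting on the toy graph; the function name and in-memory representation are assumptions, and it illustrates the idea rather than how OCGraph executes it.

```python
# Toy graph: ids 1-6 are the Midwest cities, 7-9 the Big Island towns.
# Chicago = 1 and Hilo = 7 come from the post; other ids are assumed.
UNDIRECTED = [(1, 2), (1, 3), (2, 3), (1, 4), (4, 5),
              (1, 6), (6, 5), (7, 8), (8, 9), (7, 9)]
EDGES = UNDIRECTED + [(b, a) for a, b in UNDIRECTED]  # one row per direction
VERTICES = list(range(1, 10))


def triangle_count(vertex_ids, edges):
    """Number of triangles each vertex participates in."""
    # Canonicalize: drop self-loops and keep each undirected edge once, src < dst.
    canonical = {(min(s, d), max(s, d)) for s, d in edges if s != d}
    neighbors = {v: set() for v in vertex_ids}
    for a, b in canonical:
        neighbors[a].add(b)
        neighbors[b].add(a)
    counts = {v: 0 for v in vertex_ids}
    for a, b in canonical:
        for c in neighbors[a] & neighbors[b]:
            if c > b:  # visit each triangle once, in a < b < c order
                counts[a] += 1
                counts[b] += 1
                counts[c] += 1
    return counts


triangles = triangle_count(VERTICES, EDGES)
print(triangles)
```

Without the a < b < c ordering, every triangle would be found once per edge and counted three times, which is the kind of canonicalization bug the step exists to prevent.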

Algorithm #4: Weighted Shortest Paths

Shortest paths is where the cracks in GraphFrames’ built-in API really show. GraphFrames’ shortestPaths algorithm does not natively support weighted edges; you have to drop down to GraphX (Scala) and roll your own Bellman-Ford to use a weight column. Hank Calzaretta from our customer solutions team had to do exactly this when benchmarking us against Spark on the LDBC Graphalytics SSSP test. He wrote a 100-line Scala program to do what GraphFrames couldn’t.

In OCGraph, weighted shortest paths is one function call, in Python, with the weight column passed by name:

Detroit is correctly computed as 470 miles via Indianapolis (180 + 290), not 515 via Cleveland (345 + 170). The three Big Island cities don’t appear because they’re unreachable from Chicago. Only finite distances are stored, which is exactly the behavior we want at scale.

The landmarks parameter takes a list, so landmarks=[1, 7] would compute distances to both Chicago and Hilo in a single call. The output is a long-form table with srcid, destid, and distance columns: easy to pivot, easy to filter, indexable.
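To check the Detroit arithmetic independently, here is a minimal pure-Python sketch that runs Dijkstra's algorithm from each landmark (not the Bellman-Ford approach mentioned for the Spark side, and not OCGraph's implementation). The function name, the choice of the landmark as srcid, and every mileage other than the four quoted in the post are assumptions; because each road is stored in both directions, the direction convention does not change the numbers.

```python
import heapq

# Undirected roads as (city_a, city_b, miles). Chicago = 1, Hilo = 7 per the post.
UNDIRECTED_ROADS = [
    (1, 2, 92), (1, 3, 147), (2, 3, 79),  # Midwest triangle (placeholder miles)
    (1, 4, 180), (4, 5, 290),             # Chicago-Indianapolis-Detroit (miles from the post)
    (1, 6, 345), (6, 5, 170),             # Chicago-Cleveland-Detroit (miles from the post)
    (7, 8, 76), (8, 9, 39), (7, 9, 55),   # Big Island triangle (placeholder miles)
]
DIRECTED = UNDIRECTED_ROADS + [(b, a, m) for a, b, m in UNDIRECTED_ROADS]


def weighted_shortest_paths(edges, landmarks):
    """Return long-form (srcid, destid, distance) rows; unreachable pairs are omitted."""
    adjacency = {}
    for src, dst, weight in edges:
        adjacency.setdefault(src, []).append((dst, weight))
    rows = []
    for landmark in landmarks:
        dist = {landmark: 0}
        heap = [(0, landmark)]
        while heap:
            d, v = heapq.heappop(heap)
            if d > dist[v]:
                continue  # stale heap entry, a shorter path was already found
            for nxt, weight in adjacency.get(v, []):
                candidate = d + weight
                if candidate < dist.get(nxt, float("inf")):
                    dist[nxt] = candidate
                    heapq.heappush(heap, (candidate, nxt))
        rows.extend((landmark, v, d) for v, d in sorted(dist.items()))
    return rows


rows = weighted_shortest_paths(DIRECTED, landmarks=[1])
distance = {dest: d for _, dest, d in rows}
print(distance)
```

Only vertices with a finite distance produce a row, so the Big Island towns are simply absent when Chicago is the only landmark.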

Algorithm #5: Label Propagation

Label propagation is community detection: every vertex starts with its own label, and at each iteration adopts the most common label among its neighbors, ties broken by the smaller label. It converges to a partition of the graph into communities, useful for fraud rings, social network clusters, and weakly-modular structures where Louvain would be overkill.

In our toy graph, label propagation finds the same partition as connected components, which makes sense, because each component is roughly homogeneous. On a real social graph it will find the more interesting fine-grained communities inside each connected component.
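The update rule is short enough to sketch directly. Here is a minimal synchronous label propagation in pure Python on the toy graph; the function name and the max_iter cap are assumptions, and it illustrates the rule as described rather than the library's implementation.

```python
from collections import Counter

# Toy graph: ids 1-6 are the Midwest cities, 7-9 the Big Island towns.
# Chicago = 1 and Hilo = 7 come from the post; other ids are assumed.
UNDIRECTED = [(1, 2), (1, 3), (2, 3), (1, 4), (4, 5),
              (1, 6), (6, 5), (7, 8), (8, 9), (7, 9)]
EDGES = UNDIRECTED + [(b, a) for a, b in UNDIRECTED]  # one row per direction
VERTICES = list(range(1, 10))


def label_propagation(vertex_ids, edges, max_iter=20):
    """Each round, every vertex adopts its neighbors' most common label."""
    neighbors = {v: [] for v in vertex_ids}
    for src, dst in edges:
        neighbors[dst].append(src)  # a vertex listens to the sources of its in-edges
    label = {v: v for v in vertex_ids}  # every vertex starts with its own label
    for _ in range(max_iter):
        nxt = {}
        for v in vertex_ids:
            counts = Counter(label[u] for u in neighbors[v])
            if counts:
                # Highest count wins; ties go to the smaller label.
                nxt[v] = min(counts, key=lambda lab: (-counts[lab], lab))
            else:
                nxt[v] = label[v]
        if nxt == label:
            break  # converged
        label = nxt
    return label


communities = label_propagation(VERTICES, EDGES)
print(communities)
```

Synchronous updates can oscillate on some graphs, which is why the sketch caps the number of rounds; on this toy graph it settles after a handful of iterations on the same two groups as connected components.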

A Quick Detour

Most graph workflows start with degree counts: find the hubs, find the dead ends, build a histogram. OCGraph has in_degrees, out_degrees, and degrees (which gives you both):

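Chicago has degree 4 because four Midwest cities have direct roads to it. Everyone else is degree 2 because each connects to exactly two neighbors, whether it sits in a triangle (Milwaukee, Madison, and the Big Island towns) or on the loop from Chicago through Indianapolis, Detroit, and Cleveland.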
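Degree counts are simple group-bys over the edges table. A minimal pure-Python sketch on the toy graph, returning an (in-degree, out-degree) pair per vertex; the function name and return shape are assumptions, not the library's output schema.

```python
from collections import Counter

# Toy graph: ids 1-6 are the Midwest cities, 7-9 the Big Island towns.
# Chicago = 1 and Hilo = 7 come from the post; other ids are assumed.
UNDIRECTED = [(1, 2), (1, 3), (2, 3), (1, 4), (4, 5),
              (1, 6), (6, 5), (7, 8), (8, 9), (7, 9)]
EDGES = UNDIRECTED + [(b, a) for a, b in UNDIRECTED]  # one row per direction
VERTICES = list(range(1, 10))


def degrees(vertex_ids, edges):
    """Map each vertex to (in_degree, out_degree)."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    # Counter returns 0 for vertices that never appear on that side.
    return {v: (in_deg[v], out_deg[v]) for v in vertex_ids}


deg = degrees(VERTICES, EDGES)
print(deg)
```

Because every road is stored in both directions, in-degree and out-degree match for each city.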

Why a Library, Not Just SQL?

A reasonable question at this point is: why is this a library at all? Recursive SQL exists, Common Table Expressions exist, Ocient has a powerful Dataflow runtime; couldn’t all of this just be a few WITH RECURSIVE queries?

In some cases, yes, and we use exactly that for some of these algorithms internally. But:

  1. Some workloads aren’t expressible in standard recursive SQL. ANSI WITH RECURSIVE requires monotonicity and linearity, which rules out things like Stable Marriage (which deletes intermediate state) and Bidirectional Search (which couples two recursive frontiers). Ocient’s Dataflows relax both constraints, but you still have to write the procedural code.

  2. PageRank needs floating-point convergence checks. Iterate until the delta drops below epsilon. There’s no clean way to express that in standard recursion.

  3. Even the algorithms that can be written in pure SQL are tedious and easy to get wrong. Triangle count via raw self-joins and group-bys is a 30-line query that is a magnet for off-by-one canonicalization bugs.
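For contrast, here is the kind of traversal that does fit standard recursive SQL: plain reachability, which is monotonic and linear. This is a minimal sketch run through Python's sqlite3 module as a stand-in engine, not Ocient; the roads table follows the post's srcid/destid convention and the topology is inferred from the post's description.

```python
import sqlite3

# Toy graph: ids 1-6 are the Midwest cities, 7-9 the Big Island towns.
UNDIRECTED = [(1, 2), (1, 3), (2, 3), (1, 4), (4, 5),
              (1, 6), (6, 5), (7, 8), (8, 9), (7, 9)]
EDGES = UNDIRECTED + [(b, a) for a, b in UNDIRECTED]  # one row per direction

conn = sqlite3.connect(":memory:")  # stand-in engine, not Ocient
conn.execute("CREATE TABLE roads (srcid BIGINT NOT NULL, destid BIGINT NOT NULL)")
conn.executemany("INSERT INTO roads VALUES (?, ?)", EDGES)

# UNION (not UNION ALL) de-duplicates, so the recursion stops on cyclic graphs.
reachable = [row[0] for row in conn.execute("""
    WITH RECURSIVE reach(id) AS (
        SELECT 1
        UNION
        SELECT r.destid
        FROM roads AS r
        JOIN reach ON r.srcid = reach.id
    )
    SELECT id FROM reach ORDER BY id
""")]
print(reachable)
```

The result set only ever grows and each step references the recursive table once, which is why this case works; an algorithm that must delete state or iterate to a floating-point tolerance has no equivalent shape.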

So the library exists for the same reason the GraphX library exists on top of Spark RDDs: to give you a clean, narrow, well-tested API for the algorithms that actually come up in production. The library translates each call into the appropriate sequence of Dataflow blocks (or, for legacy compatibility, JDBC-orchestrated SQL) and runs the work where the data already lives. This is the ELT (Extract-Load-Transform) pattern in its purest form: the data never moves, the computation goes to it.

Side-by-Side: GraphFrames vs. OCGraph

To make the API comparison concrete, here is what running PageRank on a graph looks like in PySpark+GraphFrames. This is essentially the program our customer solutions team used for the benchmark numbers I’ll show in a moment:

…plus a spark-submit wrapper that sets spark.executor.memory=200G, spark.executor.memoryOverhead=30G, spark.driver.memory=150G, spark.executor.cores=5, spark.executor.instances=5, spark.driver.cores=20, and spark.default.parallelism=500. These are the values our team had to hand-tune to get GraphFrames to even run on a 1.96-billion-edge graph without crashing.

Here is the OCGraph equivalent, assuming the data is already loaded into Ocient:

That’s it. No SparkSession, no checkpoint directory, no schema definitions, no CSV ingestion, no repartitioning, no cache management, no executor sizing. The cluster is already configured for the data; the data is already on it; the function just runs. Query the result table with SQL:

Both approaches conceptually do the same thing. One of them has 30+ lines of plumbing per algorithm. The other has 1.

Performance: Twitter Benchmark

So what does the speed look like compared to Spark on a real-sized graph?

Earlier this year, Ocient Engineer Hank Calzaretta and the customer solutions team benchmarked OCGraph against GraphFrames/GraphX on the LDBC Graphalytics XL twitter_mpi dataset (52.6M vertices, 1.96B edges). The Ocient cluster was 6 Foundation nodes (older Skylake, 40 hyperthreaded cores, and 768 GB of RAM each). The Spark cluster was 5 worker nodes plus a driver (16-core workers with 256 GB, plus a 52-core/192 GB driver). Spark also got hand-tuned executor/driver sizing (200 GB executor heap, 30 GB overhead, 150 GB driver, etc.). Ocient just had its data sitting in tables.

Each algorithm was run three times; the table reports the average elapsed wall-clock time:

A few notes on those numbers:

  • PageRank, CC, Label Propagation: all three are iterative whole-graph algorithms; OCGraph runs them roughly 7× to 17× faster on hardware that is comparable to the Spark side rather than lavish.

  • Triangle Count failed to complete on Spark. GraphFrames blew up trying to materialize adjacency lists for the high-degree vertices; GraphX (the older Scala API) hit the same out-of-memory wall on its exact triangle counting algorithm. The approximate sampling version did complete, but with materially wrong top-K results. OCGraph computed the exact answer in 31 minutes, validated against the LDBC ground truth file.

  • Shortest paths used a slightly different dataset (datagen-9.4-fb, ~29M vertices, ~5.2B bidirectional edges) because we needed weighted edges, and as I mentioned earlier, GraphFrames’ built-in shortestPaths doesn’t support weights. We had to drop down to a hand-written Scala GraphX/Bellman-Ford program for the Spark side, and even then it was 55× slower.

In every case where Spark did complete, OCGraph also matched the LDBC validation file for the result, so this is not a “fast but wrong” comparison. Same answers, much less time, much less plumbing.

Hyperscale: Where It Really Diverges

The benchmark above was at the “big enough to be uncomfortable for Spark” scale, about 2 billion edges. Internal testing has pushed Connected Components much further, on a higher-capacity cluster (dual EPYC 9654 / 2.3 TB RAM per node), to find out where each system falls off a cliff:

To be clear on the 100B-edge entry: Spark didn’t crash; it just kept going. After 13 hours we killed the run because we had no signal that it was going to finish in any reasonable timeframe. Ocient finished the same workload in 23 minutes.

Yes, the Ocient cluster has roughly 2× the hardware (11 nodes vs 5), and yes, that matters at small scales: at 100M edges the per-node throughputs are nearly identical (~0.14M edges/node-s either way). But normalize per node and the two systems diverge as the graph grows: by around 10B edges, Ocient delivers about 2.3× the per-node throughput. At 100B edges and beyond, Spark could not complete in any timeframe we were willing to wait for, so the per-node comparison stops being meaningful.
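For a sense of the units, per-node throughput is just edges divided by node-seconds. A minimal sketch applying that to the 100B-edge run, assuming the 11-node and 5-node counts apply to those runs; because the Spark job was killed at 13 hours without finishing, its figure is only an upper bound.

```python
def per_node_throughput(edges, nodes, seconds):
    """Edges processed per node per second."""
    return edges / (nodes * seconds)


# Ocient: 100B edges in 23 minutes on 11 nodes (figures from the post).
ocient = per_node_throughput(100e9, nodes=11, seconds=23 * 60)

# Spark: killed after 13 hours on 5 nodes, so it achieved at most this rate.
spark_upper_bound = per_node_throughput(100e9, nodes=5, seconds=13 * 3600)

ratio_lower_bound = ocient / spark_upper_bound

print(f"Ocient: {ocient / 1e6:.2f}M edges/node-s")
print(f"Spark:  under {spark_upper_bound / 1e6:.3f}M edges/node-s")
print(f"Ratio:  over {ratio_lower_bound:.1f}x")
```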

The reason isn’t a smarter algorithm. It is the architecture. Spark’s working set spills to local disk via JVM serialization, which dominates execution time once you exceed RAM. Ocient’s Dataflow runtime spills intermediate state directly to NVMe in the same on-disk format the engine already uses, so on the latest software releases spilling to NVMe is typically less than 2× slower than staying in memory, a penalty small enough that we can keep adding workload without falling off a cliff. That difference is invisible at 100M edges and decisive at 100B+. We were able to run WCC against a literal trillion-edge graph in under 5 hours; on the Spark cluster we couldn’t even get the data loaded.

How This Stacks Up Against Unified Data Platforms

A reasonable next question is: how does this compare to graph features in other cloud data warehouses and operational databases? A handful of them have announced graph support recently. The patterns I see, in the order they tend to surprise people:

  • Algorithms come in different shapes. Some vendors expose graph features as a new query language dialect (typically GQL or a Cypher-flavored syntax), and you traverse the graph with MATCH-pattern syntax. Others (notably one of the big-three cloud warehouses’ operational DB sibling product) just ship the algorithms as stored procedures you call from regular SQL. OCGraph is closer to the second model (composable library calls from the host language), but it is not a SQL/stored-proc API. It is a Java / Python library that emits Dataflow blocks under the hood.

  • Most ship a small fixed set of algorithms. Typically you’ll find PageRank, connected components, shortest paths, and maybe a community detection or two. None of the warehouses I have looked at ship temporal graph algorithms, or k-shortest-paths, or matrix-style modularity optimization out of the box. If you need those, you write them yourself on top of whatever traversal primitives the vendor gives you.

  • Most treat single-digit billions or tens of billions of edges as a marquee number. A major cloud vendor recently announced that its graph capabilities scale to tens of billions of edges and treated it as a major capability, which is fair, because it is a hard problem. But that is the same range (roughly 2 billion to 10 billion edges) in which the benchmarks above were already running and in which Spark is just starting to fall off, while internal testing already has Ocient running 100B-edge and 1T-edge workloads in minutes to hours, not days or never. The question shouldn’t just be “do you scale to tens of billions?” (at this point that’s table stakes). It’s “what is your story at hundreds of billions, and what is your story at a trillion?”

It’s also worth being precise about what OCGraph is and isn’t. OCGraph is the graph analytics layer: iterative algorithms that compute over the whole graph or large subsets of it. That is a different concern from a graph query language layer (GQL, Cypher, SPARQL), where you express pattern matching, graph traversal, and pathfinding in a declarative dialect. The two are complementary: the query language is for “find me all customers who follow someone who bought product X in the last 30 days” and the analytics library is for “compute the PageRank of every customer, run Louvain over the entire follow graph, find the k-cores.” This post is about the analytics side.

Run Graph Algorithms Where the Data Already Lives

OCGraph (Java) and ocient_graph (Python) give Ocient users a Spark GraphX–shaped API for running graph algorithms directly inside Ocient: no Spark cluster, no executor tuning, no CSV staging, no separate engine path, no separate query language. The Java library ships as part of the Ocient JDBC driver on Maven Central; the Python library is on PyPI. Both expose the same set of algorithms.

Vertex and edge tables are regular Ocient tables. Algorithms are regular function calls. Results are regular tables. You can SQL-query everything in and out.

The library covers all the standard algorithms you’d recognize from GraphX, plus more than a dozen that GraphX never had: Louvain, Leiden, k-core, A*, Yen’s k-shortest, Jaccard/cosine similarity, fast random projection embeddings, max bipartite matching, stable marriage, cycle detection, and a small temporal family (time_respecting_shortest_path, temporal_page_rank) that nobody else seems to ship.

On the LDBC Graphalytics benchmarks (twitter_mpi at 1.96B edges, plus datagen-9.4-fb for weighted shortest paths) we are seeing 6.6×–55× speedups over hand-tuned Spark, on comparable hardware, with same-as-Spark or better answer quality. At trillion-edge scale, we have run Connected Components in under 5 hours; Spark could not finish the same algorithm in 13 hours at one tenth the size.

If you’ve ever written a GraphFrames script and felt like you spent 80% of your time wrestling with executor configuration, give OCGraph a look. Pull the JDBC dependency from Maven Central for Java, or pip install ocient-graph for Python, point it at an Ocient connection, and start running algorithms.

Experiment with the cities/roads schema in this post against your own Ocient cluster; once you have the muscle memory for the API, swap in your real vertex/edge tables and turn the algorithm count knob.

The hardest part should be choosing which algorithm to run next—not configuring the infrastructure underneath it.

As always, you can find full documentation of Ocient here.