By Steve Sarsfield, Developer Advocate at Ocient
A key area of expertise among database professionals lies in their special knowledge of helping companies speed through tricky analytics. For example, comparing strings on many database platforms can use significant computing resources and even grind operations to a standstill.
This blog will explore one technique for speeding up data analysis, such as long strings, text, and domain names.
Querying Performance on Domain Names and Other Strings
Although they are less performant, string searches are necessary for many use cases. Database teams face huge challenges in managing large volumes of domain name data. For example, telecommunications companies must store years’ worth of CDR and domain data. This data is meant to meet regulatory requirements for the company to analyze, help catch criminals, and prevent national security events. Telcos must often meet requirements that require sub-second speed on petabytes of data for timely interventions.
In another example, security applications built on a database may require quick analysis of domain name data. If your company is working with law enforcement to find the activities of “bad guys,” searching through logs and strings will come into play. It’s crucial to find the bad players quickly and take action.
Of course, this is not an exhaustive list. Requirements from other teams may also require analysis of strings, domains, and so on. Your database needs efficient string search capabilities and optimizations that can help. One such optimization is n-gram indexes, which can significantly enhance the speed of this type of workload.
Why are Strings Slower than Numbers?
Handling strings requires more processing power due to a couple of factors. Strings can dramatically vary in length, requiring databases to manage variable storage, which adds overhead for the database. By comparison, numeric types generally have fixed sizes, making them faster to process and store. Databases use data types like VARCHAR or TEXT for strings, which take up more storage space compared to compact formats for numbers. This increases the amount of data that needs to be read from memory, increasing query times. String searches often involve complex matching techniques (like pattern matching and wildcards), which require more computational resources than exact numeric matches. Finally, strings are compared character by character, whereas numbers can be compared in a single operation. String comparisons are usually significantly slower, a fact that is amplified by big data.
What is an N-Gram Index?
N-gram indexes start by automatically breaking down strings into contiguous sequences of ‘n’ characters. For instance, the domain name “example.com” can be split into trigrams (n=3) as “exa”, “xam”, “amp”, “mpl”, “ple”, “le.”, “e.c”, “.co”, and “com”. These n-grams allow more efficient searching and matching, particularly in large datasets, by quickly eliminating rows that don’t match the pattern. Rather than matching on every string, n-grams will enable you to eliminate the time it takes to compare wildly different strings, say “example.com” to “facebook.com,” and allow your database CPU cycles to focus on similar ones.
The n-grams are automatically created and stored as secondary indexes within the Ocient Hyperscale Data Warehouse™. Like other indexes, it is used to quickly locate and access the desired rows without scanning the entire table, drastically reducing the number of comparisons needed to find a specific value.
Why Use N-Gram Indexes for Strings
Using n-grams to expedite access to strings makes sense in scenarios where you have large data and need to speed up the retrieval process by narrowing down the potential matches before applying more precise filtering. In our testing, n-gram indexes can speed up certain queries by 23X or more.
In the test scenario, we used queries to SELECT relevant data that contained “Google” in the domain name. For example, they followed the basic format of:
SELECT <columns> FROM <table> WHERE (DNS_Domainname LIKE (‘%Google%’))
Some queries had additional components on the WHERE clause because that’s what reality dictates. If you’d like to see our queries and data, or test this on your own data, please contact us.
Improving String Query Performance
The benefits of using n-gram indexes can reach far into your business with some stunning value.
- Enhanced Search Efficiency – Traditional database string matching can be slow, especially with large volumes of data. With n-gram indexes, the database can quickly compare these smaller n-grams rather than entire strings, speeding up the search process.
- Handle more data – At some point, data can become so large that it is complicated to muster all the resources needed to compare long strings. With n-gram indexes, memory, and other resources are less taxed during string comparison analytics, making bigger data sets manageable where they were untenable in the past.
- Fewer concurrency issues – When you use this technique, queries’ get out of the way’ for other queries in the queue behind it. Free up the database for other essential workloads.
- Lower costs – Traditional string comparison often demands additional servers, increasing costs. In contrast, utilizing n-gram indexes can reduce the number of servers needed, thereby lowering them.
What’s the Downside of N-Gram Indexes?
While indexes significantly improve query performance, they incur additional storage overhead and can slow down write operations, as the index must be updated whenever data is inserted or, updated, or deleted.
Database Administrators (DBAs) need to balance the performance benefits of n-gram indexes with the additional storage requirements. Tuning involves carefully selecting the value of ‘n’. n-grams (e.g., trigrams or higher) reduce the index size but may not provide the same level of search efficiency.
In our example, increasing the value of N (e.g., shifting from 3-grams to 4-grams or higher) typically enhances the accuracy of finding precise matches, as longer sequences capture more context from the data. However, this also results in larger index sizes since the length grows substantially, potentially consuming more memory and storage resources. Performance may also degrade with larger N values due to the increased search complexity through a more detailed index. Conversely, smaller N values produce smaller indexes and faster lookups but may lead to less accurate matching, especially with larger datasets or noisy data, since shorter n-grams may be less distinct. In our example above, some strings that would be further compared by a 3-gram include any URL ending in ORG, and that could be a large set, taking time and memory resources. By comparison, the 5-gram would only compare URLs ending in A.ORG, a much more refined set.
The key is to balance N according to the specific use case and data characteristics. To mitigate the storage impact, you can employ several optimization techniques, like selecting which strings you index and managing data lifecycle to tier off older or less used data.
Speeding-up String Operations with N-gram Indexes
N-gram indexes are a powerful tool for DBAs, offering substantial performance benefits in domain name analysis. While they come with minor storage implications, these can be managed through careful planning and optimization techniques. By incorporating n-gram indexes, organizations can achieve faster search capabilities, improved pattern matching, and better overall performance in their database operations, making it an essential technique for modern database management.
Ocient and N-gram Indexes
N-gram indexes are available in the Ocient Hyperscale Data Warehouse enabling customers to fly through queries and string operations solution. Our platform is engineered with features like n-gram indexes so that you can ingest, store, and analyze petabytes of data at unprecedented speeds. Indexes in a database application provide both performance and efficiency gains. By creating an index, any database can quickly locate specific rows in a table, significantly reducing the time required for query execution and allowing for faster data retrieval. This is particularly important in applications with large datasets, where searching through each row would otherwise take too long. With properly designed indexes, databases can handle more complex queries and support higher transaction loads, all while minimizing the need for extensive scanning.
N-gram indexes are special, however. Unlike most cloud data warehouses, Ocient N-gram indexes are just one of the ways that the Ocient Hyperscale Data Warehouse enables businesses to obtain fast results on time-critical queries. N-gram indexes are valuable for optimizing string search in databases, especially when dealing with large datasets containing text fields. The performance boost for string search often outweighs this cost, making them a powerful tool for text-heavy applications. Please get in touch with us for more information or to see how n-gram indexes can help you with your unique situation.