
Below are some highly personal observations after four years working with Drill, Impala, and a proprietary query engine at a company with a cloud-based application. These are my own, very limited, opinions on where things stand, seen from the narrow perspective of a tools developer; the view of a user or analyst is likely quite different.

TLDR: Drill was originally conceived as a data warehouse solution, but that market is long gone. I believe there is, instead, an opportunity for Drill to evolve to become the query component in any large-scale data application; to become the "Kafka" of query.

Fading of On-Prem Big Data

There have been multiple articles of late about the fate of Hadoop and of MapR and Cloudera. In short, on-prem Hadoop may have proven too complex and costly for all but the largest players such as Facebook or Twitter. Couple that with the low return on investment reported by several sources, and the future of on-prem big data appears to be dimming, at least for the data warehouse use case.

Instead, the cloud appears to be the better solution. Whether it is EMR, Google's equivalent, Databricks, Snowflake, or another offering, it is far easier to simply let someone else manage the complexity.

Economies of Scale

An unsurprising result is the rise of proprietary solutions such as Athena, Redshift, BigQuery, Snowflake and more. My experience showed that creating a good distributed query engine is hard. Both Drill and Impala simplified the task by focusing on read-only applications. Hive has tried to add ACID support for updates, with a resulting Frankenstein hodge-podge of technologies.

Open source turns out to be a poor venue for tackling these complex challenges: there are too many solutions (Drill, Impala, Presto, Hive, Spark SQL) with too little support for most. The result is that Drill is good, but has many dark corners that are of dubious quality. Neither Drill nor Impala can add updates (ACID) without major architectural changes (or without adopting the Frankenstein approach of Hive.)

A successful proprietary solution can tackle these challenges if revenues are sufficient to fund the development staff needed to solve the complex problems. Oracle's been at it for 40 years. Databases are very complex, with many systems that must work well together. They cannot be built well by an understaffed community. Presto and Spark seem to have critical mass, Impala was losing it, and Drill never really had it. Of course, critical mass is far easier for the cloud vendors: one can guess that Google's BigQuery, for example, has far more paying users than any open source equivalent and can thus afford the investment needed to build a solid product.

Companies large enough to do on-prem big data tend to self-support. It appears Presto has emerged as the preferred query solution because it is not dominated by one of the major Hadoop vendors. During a recent job search, I came across Presto and Hive, but few folks had ever heard of Drill. The unfortunate conclusion is that, without MapR sponsorship, Drill probably cannot remain a healthy open source project.

Performance

The general consensus seems to be that, when it comes to a SQL engine, there are really only three features that matter: Performance, Performance, and Performance. The reason is simple: most users of SQL are not engineers; they just need to power dashboards, Tableau, or other BI tools. They need queries to "just work" with whatever SQL their tool generates. The only way one engine appears different from another is performance. This is, sadly, the reality of SQL engines: they are highly complex, but their users are non-technical and care only about the performance of their particular queries.

Impala had performance as its top three goals (probably more if one breaks performance down into query planning, query execution, code generation, file format, short-circuit reads, etc.) Once performance was achieved, attention shifted to stability, adding features and so on. The result was a pretty good run until Cloudera's recent merger and financial troubles.

Drill has many very clever ideas (schema-on-read, no Hive Metastore, a Java execution engine, vectorized processing), all of which should help performance in theory but do the opposite in practice. Drill's code is very complex. Attempts at optimization (e.g. preferring very large functions over smaller ones to avoid function call overhead) just make the problem worse. Even vectorized processing never achieved its aims: internal processing is all row-based, so each row loop works across the entire "batch" of vectors, destroying any hope of CPU cache efficiency, even though cache efficiency was a key reason for vectorizing in the first place. Exchanges are overly memory-intensive and costly because of the need to slice vectors into rows for distribution. And so on.
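
To make the cache argument concrete, here is a deliberately simplified sketch: plain Java arrays stand in for value vectors, and nothing here is Drill's actual code. It contrasts a row-at-a-time loop over a batch of column arrays with a column-at-a-time loop; with a wide batch, the row-wise loop touches one value from every column on each step, so no single array stays hot in cache.

```java
// A simplified stand-in for a "batch" of value vectors: one Java array per column.
// Illustrative only; this is not Drill's vector code.
public class VectorAccess {

  // Row-at-a-time: each pass of the outer loop touches one value from *every*
  // column array. With a wide batch, the working set spans many arrays, so no
  // single array stays resident in the CPU cache.
  static long rowWiseSum(int[][] columns, int rowCount) {
    long sum = 0;
    for (int row = 0; row < rowCount; row++) {
      for (int[] column : columns) {
        sum += column[row];
      }
    }
    return sum;
  }

  // Column-at-a-time: streams through one contiguous array at a time, which is
  // the access pattern vectorization was supposed to enable.
  static long columnWiseSum(int[][] columns, int rowCount) {
    long sum = 0;
    for (int[] column : columns) {
      for (int row = 0; row < rowCount; row++) {
        sum += column[row];
      }
    }
    return sum;
  }
}
```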

A related issue is that Drill's execution DAG is nice and clean: each relational operator (scan, filter, project, selection-vector removal, and so on) has its own implementation. While elegant, this approach is inefficient: we must read all the data before we filter it, and each operator adds overhead. By contrast, Impala combines functions. Its scan operator, for example, filters out unwanted rows at read time, eliminating the need for a separate filter or selection-vector-removal operator. Row layout is worked out at plan time and implemented in each operator, eliminating the need for a project operator. And so on.
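
A rough illustration of the difference, using hypothetical operators rather than Drill's or Impala's real classes: separate operators force the scan to materialize a full batch that a downstream filter then re-reads, while a fused scan applies the predicate as rows are read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical operators, not Drill's or Impala's actual APIs.
public class ScanFilterExample {

  // Separate-operator pipeline: the scan materializes a full batch, then a
  // distinct filter operator makes a second pass over that batch.
  static List<Integer> scanThenFilter(int[] source, IntPredicate keep) {
    List<Integer> batch = new ArrayList<>();
    for (int value : source) {
      batch.add(value);                      // scan operator: read everything
    }
    List<Integer> out = new ArrayList<>();
    for (int value : batch) {
      if (keep.test(value)) out.add(value);  // filter operator: second pass
    }
    return out;
  }

  // Fused scan: the predicate is applied as rows are read, so unwanted rows
  // are never materialized and no separate filter operator is needed.
  static List<Integer> fusedScan(int[] source, IntPredicate keep) {
    List<Integer> out = new ArrayList<>();
    for (int value : source) {
      if (keep.test(value)) out.add(value);
    }
    return out;
  }
}
```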

The point is not to critique Drill's implementation: all code is our best attempt to solve complex problems given the time and resources available. The current versions mostly work, but they are complex and not performance-optimized. In a well-funded project, we would learn from the current version to create a new, improved one that solves today's problems. Insufficient development resources mean we cannot do that in Drill: we have to work with what we have, even though the approach has known shortcomings.

Schema-on-Read, An Idea Whose Time Has Passed (for Data Warehouse Data)

Perhaps the single largest flaw in Drill is the one thing that is supposed to differentiate it from similar projects: schema-on-read. The idea is wonderful: Drill reads any kind of file without the need for ETL or a defined schema. What's not to like? Think of Drill as a SQL-based, open source Splunk.

The problem is, the very concept has been wrong from the beginning, at least for data warehouse applications. (Yes, these are fighting words.)

We just suggested that performance is the most important feature of a SQL engine. Yet schema-on-read requires lots of run-time complexity to interpret data. A schema-based approach (such as Impala's) makes those decisions at planning time, where they are made once, rather than at run time, where they must be made over and over in each operator of each distributed thread (minor fragment). That is, schema-on-read is the enemy of performance for the same reason that Python is not fast: Python is dynamic, doing type work over and over. C, on the other hand, is fast: type checking is done at compile time (at the cost of convenience.)
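
A tiny, hypothetical sketch of the cost difference (not Drill or Impala code): with schema-on-read, type decisions leak into the per-value inner loop; with a schema fixed at plan time, execution is a tight, typed loop.

```java
import java.util.List;

// A toy contrast, not real engine code.
public class TypeDispatch {

  // Schema-on-read flavor: the type of each value is only known at run time,
  // so every value pays for a type check inside the inner loop.
  static double sumDynamic(List<Object> values) {
    double sum = 0;
    for (Object v : values) {
      if (v instanceof Integer) {
        sum += (Integer) v;
      } else if (v instanceof Double) {
        sum += (Double) v;
      } else if (v instanceof String) {
        sum += Double.parseDouble((String) v);
      }
    }
    return sum;
  }

  // Fixed-schema flavor: the planner decided the column is DOUBLE, so
  // execution is a tight, typed loop with no per-value decisions.
  static double sumTyped(double[] values) {
    double sum = 0;
    for (double v : values) {
      sum += v;
    }
    return sum;
  }
}
```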

Stability is perhaps the second most important feature of a SQL engine. But schema-on-read is vastly complex. Drill's first-generation support was already quite complex, yet it still had the problems described in Chapter 8 of the "Learning Apache Drill" book. The recent schema-provisioning (EVF) feature added run-time schema support, which seems to work well, but is very, very complex. Given the lack of development resources, it is not clear that EVF can be rolled out to all data sources and to other operators. The result is that, to achieve workable schema-on-read, Drill becomes too complex to maintain.

Once one starts reading "data in the wild", one realizes that data is messy. Canned rules (whether SQL casts expressed in views, or the features of schema provisioning) are insufficient to handle all cases. One, in fact, needs powerful tools and even a programming language. Further, messy data requires debugging, something that is very hard to do in a SQL engine.

Instead, to handle messy data, the old ETL solution really does work. Use Spark (or Hive or Informatica or whatever) to clean up the data into an efficient format such as Parquet or ORC. Then, create a simple, efficient runtime engine to read the known schema. The result is both better performance and simpler (easier to maintain) code.
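
As an example, here is a minimal sketch of that ETL step using Spark's Java API. The bucket paths, the PERMISSIVE read mode, and the event_date partition column are illustrative assumptions, and real clean-up logic would replace the single na().drop() call.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class MessyJsonToParquet {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("messy-json-to-parquet")
        .getOrCreate();

    // Read the messy source data; Spark infers a schema and, in PERMISSIVE
    // mode, tolerates malformed records instead of failing the job.
    Dataset<Row> raw = spark.read()
        .option("mode", "PERMISSIVE")
        .json("s3a://my-bucket/raw/events/");   // hypothetical path

    // Real clean-up belongs here: casts, renames, deduplication, bad-record handling.
    Dataset<Row> cleaned = raw.na().drop();

    // Write a compact, typed, columnar copy that a simple engine can scan efficiently.
    cleaned.write()
        .mode(SaveMode.Overwrite)
        .partitionBy("event_date")              // hypothetical column
        .parquet("s3a://my-bucket/warehouse/events/");

    spark.stop();
  }
}
```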

The problem is split in two: ETL tools are good at handling messy data; the query engine is good at reading clean data. This is the approach taken by Impala and Snowflake. We now have enough experience with Drill to know that, as appealing as schema-on-read is as a marketing message, it is not technically feasible given the development resources available and the need for optimal performance.

Perhaps Drill might find its place not as a performance-critical data warehouse tool, but as a tool for exploring data: the old "data exploration" use case. Still, this does not address the messy data or debugging issues: it is very hard to debug data problems in Drill when all one gets is a large, uninformative stack trace. Writing custom code is arduous, and is not something the Drill user can realistically do. My advice? Use Spark for data exploration and for reading messy data. Spark has a huge community and lots of resources. Consider Databricks for enterprise-grade support.

The Cost of Table Scans

A fundamental premise of Drill (and of Impala, Hive, and Presto) is that one can support performance-sensitive data warehouse use cases by scanning all of the data (typically with partition pruning to reduce scan scope.) The thing about big data is that it is big: we hear about multi-terabyte table scans to answer a single query.

Couple this with the desire to make full use of the hardware by running hundreds or thousands of concurrent queries. This means concurrent scans over your TBs of data. Each query may touch large numbers of rows that must be sorted, aggregated, and joined, all of which takes memory and network resources for exchanges.

Perhaps your 1000 queries need 1 GB of memory each per machine. That's 1 TB per node. Each query scans, say, 1 TB of data, so the 1000 queries together touch about 1 PB. You want "sub-second" response time: delivering 1 TB to a single query within a second takes on the order of 10K hard drives or, say, 1K SSDs worth of scan bandwidth, and that is before the other 999 queries get their share. This is a large and expensive cluster.

Realistically, you might run 100 concurrent queries over smaller data (say, 100 GB per query), so perhaps you need 100 GB of memory per node and on the order of 1K HDDs or 100 SSDs. Still a substantial investment.
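
A back-of-the-envelope check of those figures. The per-device throughputs (roughly 100 MB/s of sequential read per HDD, 1 GB/s per SSD) are my assumptions, not numbers from the text, and the second case reads "smaller data" as about 100 GB scanned per query in about a second.

```java
// Rough scan-bandwidth arithmetic for the scenarios described above.
public class ScanBandwidthEstimate {
  public static void main(String[] args) {
    double hddBytesPerSec = 100e6;   // assumed sequential read rate of one HDD
    double ssdBytesPerSec = 1e9;     // assumed read rate of one SSD

    double bigScan = 1e12;           // 1 TB scanned by one query in ~1 second
    System.out.printf("1 TB/s per query: ~%.0f HDDs or ~%.0f SSDs%n",
        bigScan / hddBytesPerSec, bigScan / ssdBytesPerSec);      // ~10,000 HDDs, ~1,000 SSDs

    double smallScan = 100e9;        // the "smaller data" case: ~100 GB per query
    System.out.printf("100 GB/s per query: ~%.0f HDDs or ~%.0f SSDs%n",
        smallScan / hddBytesPerSec, smallScan / ssdBytesPerSec);  // ~1,000 HDDs, ~100 SSDs
  }
}
```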

In the cloud, data will reside in S3 or GCS. Accessing such data is slow and costly when doing table scans. Snowflake has long pointed to the correct solution: cache blocks of data on scan nodes; send queries to the nodes with data, or create a new cached copy if none exists. This works because users often hit the same data multiple times.

The fundamental conclusion is that table scans over large on-prem data sets are not the way to achieve high query rates. It is not clear what the solution is (many projects and companies have ideas), but this is not it. (It seems in-memory databases are making a comeback. Snowflake works around the issue with local caching. Etc.) And, of course, rather than owning all that hardware, it makes more sense to spin up a cluster in the cloud when you need it.

Table scans are fine for scheduled large batch jobs (classic Hadoop use case), but are usable only in very limited cases for interactive queries (such as when data is actually small and concurrency is actually low.)

It will be interesting to see how the industry ultimately solves this very difficult and challenging problem. The original Drill developers, who founded Dremio and created Apache Arrow, are pursuing one vision and we wish them success.

Final Thoughts on the Data Warehouse Use Case

Drill was a response to the Google Dremel paper and to Apache Impala, and it assumed on-prem Hadoop (ideally powered by the MapR file system.) Those assumptions and foundations are no longer valid. You can use Dremel itself (now available as BigQuery.) Impala is faster (as long as you are a Cloudera customer.) The cloud vendors make distributed systems far easier than on-prem, so just use the query tools your vendor provides. "Pure plays" such as Snowflake and Databricks are far easier for the business user who just wants to get the job done and does not want to be a Hadoop admin.

Drill had a good run, but all good things must end and I believe that Drill's time is past. In summary:

  • Spark (perhaps from Databricks) for working with data in the wild.
  • Your vendor's query solution for high-performance queries.
  • ETL into the file format best optimized for your query tool.
  • A pre-defined schema to allow optimal run-time performance. (This means the Hive Metastore (HMS), unfortunately.)

Even as one era of big data appears to draw to a close, another is emerging based on specialized tools in the cloud. We hope Drill will find a role in this next step of the evolution, but if not, users will still be the winners as new solutions emerge.

A New Era: Query as an Application Component

What does this mean for Drill? Is it destined to fade away into the Apache "attic" the way that Apache Apex (another technology innovator) has done?

A number of us "Drill veterans" have been slowly realizing that Drill could have a second life: as a component within a data-intensive application for data science or just for handling large amounts of data. Think how Kafka has come to dominate the message queue space because it is a simple solution to a complex problem.

Many applications need to solve one of two problems:

  • Query a large number of data sources (data science)
  • SQL queries over large volumes of data in a custom format (cloud-based applications)

Today, people build one-off solutions: ad-hoc Python jobs for the first use case, home-grown query engines for the second. What if there were a single Apache component that a project could drop into its stack to immediately solve its query problem? Multiple times now, some of us have seen that Drill is that solution. Of the open source query engines, Drill is the most flexible, the most able to act as an "SDK" for applications.
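
To give a sense of what "drop-in query component" could mean in practice, here is a minimal sketch of an application issuing SQL over a raw JSON file through Drill's standard JDBC driver. The host, file path, and data are hypothetical, and the sketch assumes a Drillbit is already running with the dfs plugin pointing at the local file system.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class DrillAsComponent {
  public static void main(String[] args) throws Exception {
    // Connect to a running Drillbit through Drill's JDBC driver.
    try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
         Statement stmt = conn.createStatement();
         // Query a raw JSON file in place: no ETL, no pre-declared schema.
         ResultSet rs = stmt.executeQuery(
             "SELECT * FROM dfs.`/data/events/2019-12-01.json` LIMIT 10")) {
      ResultSetMetaData md = rs.getMetaData();
      while (rs.next()) {
        StringBuilder row = new StringBuilder();
        for (int i = 1; i <= md.getColumnCount(); i++) {
          row.append(md.getColumnName(i)).append('=').append(rs.getObject(i)).append(' ');
        }
        System.out.println(row);
      }
    }
  }
}
```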

Several times I've heard of companies that need to query their own data formats, or to combine data from files and other sources (such as Cassandra, DynamoDB, etc.) It seems that some projects end up writing their own limited, ad-hoc query engines because they can't find an open source tool to do the job.

Impala does one thing well: star-schema queries over Parquet. There has been noise that Cloudera/Hortonworks wants to merge Impala and Hive. That would tie Impala even more closely to the Hadoop stack.

Presto is a very good choice for projects that have a defined schema. The code to add an adapter (the Presto term for a "storage plugin") is very simple, primarily because the schema must be defined at plan time. Presto has huge traction: it is in use at many of the big internet companies (Facebook, Twitter), is sponsored by Teradata, and is the foundation for Amazon Athena. In the data warehouse space, Presto has clearly won.

But Presto cannot handle schema-on-read. I tried building an adapter for a proprietary query API, but could not even get started because those queries do not provide a schema until the first results arrive. Further, Presto trades efficiency for simplicity: the internal data format is rows of Java Objects (though it seems one can use a vector-like format.) The same project was a breeze with Drill, because Drill handles schema discovery at run time.

The cloud vendors provide query engines. But, for obvious reasons, teams cannot extend those engines with custom data sources, functions, planning rules or operators. Drill allows such extensions.

The Second Act

So, a question for the Drill community is whether there is a path forward to pivot Drill to be a versatile query engine that can be dropped into a larger application and immediately solve the query problem the way Kafka solves queueing or Kubernetes solves application deployment.
