Big Data: We're Not There Yet - Part II
In an attempt to show one of our customers a better way to connect a legacy Postgres system to a large data set, a team of AE experts turned the discussion into an internal competition. Contenders Phoenix and Cassandra already lost the challenge, as you can read in Part I. In Part II we'll explore whether Drill or Impala can rise to the occasion.
Next up: Apache Drill. Drill is a scalable SQL query engine that allows to execute queries against a variety of data sources. The idea is to easily extract information from remote systems like flat files, Hadoop, MongoDB, ... Someone with little knowledge of these technologies should be able to use good old SQL to investigate the data. Drill also enables attachment by BI tools like Tableau or Qlikview via a JDBC connection. Since Drill can operate on a cluster, it might be a valid solution for our problem.
Querying our logs with Drill was very easy. You configure your data source in the web console as a new scheme. There, you define the kind of data you want to query. After that, Drill will expose an abstraction layer on top of the data source. To query a CSV file for example, you can simple use:
select * from dfs.myscheme.`mydata.tbl` order by mycolumn desc limit 10;
The next step is connecting to Drill from within Postgres. This looked somewhat problematic... As seen in the previous part of this blog post, you can set up such a connection using a foreign data wrapper. Unfortunately, there isn't such a thing as a Drill FDW (yet). So we went for the alternative of connecting to it using ODBC/JDBC. To do this, you can set up an ODBC driver for Drill as described in this manual. You can verify the setup by checking if your DSN (data source name) shows up in the results of the command:
You can connect and query the DSN using a simple ODBC tool:
$ isql -v "Sample MapR Drill DSN 64"
We verified that Drill is accessible using ODBC, so the remaining thing to do is attaching Postgres to it in this manner. This step is really poorly documented, but we managed to compile and install the ODBC FDW. The process followed is very analoguous to the one followed in the previous blog post; you add the FDW as an extension, create a FDW server, and finally a FDW table.
Querying this table however, returned no results nor errors, so we were left in the blind about what went wrong. This is hardly suprising since the only available ODBC drivers for Postgres are marked untested or unsupported for our version of Postgres. Furthermore, they were aimed at connecting to a relational database since you have to provide database specific parameters which aren't applicable to Drill.
We've put way too much time into this without success, so maybe Drill isn't our silver bullet either...
Last man standing: Apache Impala. This Cloudera project aims to provide real-time random access to a dataset in HDFS or HBase using standard SQL queries. It's been designed bottom-up to deliver response times of seconds up to minutes, instead of hours. It runs as several instances of a daemon directly on the cluster, rather then making use of the YARN resource manager. Impala makes use of the Hive meta store for table definitions and other metadata, which makes it painless to add to an existing architecture. Sounds good, right?!
The test setup included a Docker container running Postgres and a container in Azure for the Cloudera Quickstart machine, including Impala. The first results were very promising: running queries on Impala was indeed lightning fast, even on reasonably large amounts of data. Connecting to Impala using CLI, ODBC and JDBC was very easy and without any surprises. We soon ran into trouble, however, when we tried to connect from Postgres to Impala over ODBC.
The good folks over at CartoDB worked to get the only available ODBC foreign data wrapper compatible with Postgres 9.5+. To get it to work on 9.6 however, an extra patch was required. Compilation succeeded, FDW installed!
Next up: installing the Cloudera-provided Impala ODBC driver in our Postgres container. After filling out a form, we were able to download a Debian installation package. Installation was quite simple:
dpkg -i clouderaimpalaodbc_220.127.116.116-2_amd64.deb
We defined our drivers and connections in the ODBC configuration files:
[ODBC Drivers] Cloudera ODBC Driver for Impala 32-bit=Installed Cloudera ODBC Driver for Impala 64-bit=Installed [Cloudera ODBC Driver for Impala 32-bit] Driver=/opt/cloudera/impalaodbc/lib/32/libclouderaimpalaodbc32.so [Cloudera ODBC Driver for Impala 64-bit] Driver=/opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so
[impala] Driver = /opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so HOST = 18.104.22.168 PORT = 21050 DATABASE = test
Some environment variables proved to be necessary in order to get the Impala driver to work:
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/odbc export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libodbcinst.so
Connecting using the ODBC Impala settings now seemed to work. The only remaining step was the configuration of the FDW in Postgres and to start querying! The people of CartoDB were so friendly to provide some documentation on their Github page. From the psql console we tried the following in order to set up the FDW:
CREATE EXTENSION fdbc_odw; CREATE SERVER impala FOREIGN DATA WRAPPER odbc_fdw OPTIONS (dsn 'impala'); CREATE FOREIGN TABLE t1 (id INTEGER, auth TEXT, message TEXT) SERVER impala OPTIONS (odbc_DATABASE 'test', sql_query 'select id,author,message from `t1`'); CREATE USER MAPPING FOR postgres SERVER impala;
And now the moment of truth... From the psql console we tried querying Impala and received the following:
ERROR: 58000: Connecting to driver LOCATION: check_return, odbc_fdw.c:710
A quick lookup for error code 58000 only told us that we were dealing with a system error. Not much to go on. At this point, I've spent the best part of a day getting this to work, with only a vague error from an unofficial driver. I could have tried using the ODBC FDW to connect to another ODBC source, just to investigate where our problem was really located, but as for a smooth integration, we've seen enough.
And so Impala was equally out of the race...
Of our 4 contestants only one reached the finish,be it with a limp. But isn't competing better than winning anyway? The problem isn't the technologies themselves being immature, because they all show promise in delivering better results.
What's rarely mentioned by vendors and evangelists, however, is that in the real world they have to integrate (seamlessly please) with legacy technologies. As soon as Postgres came into play, we were doomed.
There are large communities out there, but apparently we're not at a point yet were efficient drivers (such as Postgres's Foreign Data Wrappers) allow for production-grade quality use of technologies such as Drill, Phoenix, Cassandra or Impala.
Of course this is just a very small example of where it didn't work. More and more success stories are happening as well, but at times like these we are confronted again with how much cutting edge often actually is bleeding edge.
We would very much like to hear your successes, failures or opinions related to this topic, so don't hesitate to let us know in the comments section below.
A special thanks to co-authors Gert Nelissen, Vincent Robaye, Niels Nuyttens and Kevin Stobbelaar. This post also appeared here.