I'm sure I'm not the only person on earth facing a complex and expensive analytical processing task. The one I've been working on for the past couple of years, runs on the GHTorrent 98.5 GB data set of GitHub process data. It comprises 99 SQL queries (2599 lines of SQL code in total) and takes more than 20 hours to run on a hefty server. To make the job's parts run efficiently and reliably I implemented simple-rolap, a bare-bones relational online analytical processing tool suite. To ensure the queries produce correct results, I wrote RDBUnit, a unit testing framework for relational database queries. Here is a quick overview on how to use the two.
You work with simple-rolap by writing small self-contained queries that create intermediate tables with the results required at each stage of processing or the final reports. You create one file for each query, and name the file after the table or report created by it. Here's an example.
-- Projects that have been forked create table stratsel.forked_projects AS select distinct forked_from as id from projects where forked_from is not null;
With a suitable setup, invoking the Unix make command will execute the queries whose results are missing or outdated as well as all cascading dependencies.
$ make rm -f ./.depend sh ../..//mkdep.sh >./.depend mkdir -p tables sh ../..//run_sql.sh forked_projects.sql >tables/forked_projects
This setup ensures the efficient and reliable execution of your processing; no chance of using stale data by accident. Other queries can also create reports or import data from external sources. You can add in the Makefile the dependencies for the data importing steps so that these are also always up-to-date.
I admit I initially had an error in the example query and it produced an extra row. This is a frightening concept in a complex task with tens of queries. If one of them has an error, wrong results can cascade downstream tainting the final results in insidious and difficult to detect ways. Over the past years we've learned that we can reduce the chance of such problems through unit testing.
To verify that a query does what it is supposed to do, you use RDBUnit to create and run a unit test. This contains three parts.
Here's a unit test example for the preceding SQL query. The types of the table columns are automatically deduced from the data you provide.
BEGIN SETUP projects: id forked_from 1 15 2 15 3 10 4 NULL END INCLUDE CREATE forked_projects.sql BEGIN RESULT stratsel.forked_projects: id 15 10 END
When you run the test, RDBUnit verifies that the specified results match what the query produced.
$ make test rm -f ./.depend sh ../..//mkdep.sh >./.depend ../..//run_test.sh ok 1 - forked_projects.rdbu: test_stratsel.forked_projects 1..1
Both simple-rolap and RDBUnit are available as open source software on GitHub. You can find a complete tutorial on using simple-rolap with RDBUnit for mining Git repositories in a technical briefing I presented at the 2017 International Conference on Software Engineering.Read and post comments, or share through
Last modified: Sunday, August 5, 2018 2:01 pm
Unless otherwise expressly stated, all original material on this page created by Diomidis Spinellis is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.