How I slashed a SQL query runtime from 380 hours to 12 with two Unix commands


I was trying to run a simple join query on MariaDB (MySQL) and its performance was horrendous. Here’s how I cut down the query’s run time from over 380 hours to under 12 hours by executing part of it with two simple Unix commands.

Below is the query, which forms part of a more complex GHTorrent analysis I implemented using the simple-rolap relational online analytical processing framework.

select distinct
  project_commits.project_id,
  date_format(created_at, '%x%v1') as week_commit
  from project_commits
  left join commits
  on project_commits.commit_id = commits.id;

Both join fields are indexed. However, MariaDB implements the join with a full scan of project_commits and an index lookup on commits. This can be seen in the output of EXPLAIN.

+------+-------------+-----------------+--------+---------------+---------+
| id   | select_type | table           | type   | possible_keys | key     |
+------+-------------+-----------------+--------+---------------+---------+
|    1 | SIMPLE      | project_commits | ALL    | NULL          | NULL    |
|    1 | SIMPLE      | commits         | eq_ref | PRIMARY       | PRIMARY |
+------+-------------+-----------------+--------+---------------+---------+

+---------+-------------------------------------+------------+-----------------+
| key_len | ref                                 | rows       | Extra           |
+---------+-------------------------------------+------------+-----------------+
| NULL    | NULL                                | 5417294109 | Using temporary |
| 4       | ghtorrent.project_commits.commit_id |          1 |                 |
+---------+-------------------------------------+------------+-----------------+

The two tables are fairly large: project_commits contains 5 billion rows and commits 847 million rows. Also, the server's memory is relatively small (16GB). This probably means that the index lookups hit the (unfortunately magnetic) disk, and therefore performance took a heavy hit. According to the output of pmonitor run on the generated temporary table, the query, which at that point had already been running for more than half a day, would take another 373 hours to complete.

    /home/mysql/ghtorrent/project_commits#P#p0.MYD 6.68% ETA 373:38:11

To me this seemed excessive, because the I/O time required for a sort-merge join is orders of magnitude lower than the projected execution time. An answer to a question I posted on dba.stackexchange.com helpfully offered numerous things to try, but gave me little confidence that they would help. I tried the first suggestion, but the results weren't promising. As experimenting with each suggestion could easily take at least half a day, I proceeded with an approach I knew would work efficiently and reliably.
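
A back-of-envelope calculation makes the point (the throughput figure is an assumption; the file sizes are the ones listed further below).

# Back-of-envelope: hours of sequential I/O for the roughly
# (15 + 93 + 133) GB that flow through the pipeline, at an assumed 100 MB/s.
# Prints roughly .7, i.e. far below the projected 380 hours.
echo 'scale=2; (15 + 93 + 133) * 1024 / 100 / 3600' | bc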

I exported the two tables into text files, joined them with the Unix join command, piped the result through uniq to remove duplicate rows, and imported the result back into the database. The process started at 20:41, and the import (including index building) had finished by 9:53 the next day. Here are the precise steps I followed.

1. Export the database tables as text files

I first exported the fields of the two tables I wanted to join, sorted on the join field. To ensure the sort order was compatible with that used by the Unix tools, I cast the join field into a character type.

I saved the output of the following SQL query into the file commits_week.txt.

select cast(id as char) as cid,
  date_format(created_at, '%x%v1') as week_commit
  from commits
  order by cid;

I also saved the output of the following SQL query into the file project_commits.txt:

select cast(commit_id as char) as cid, project_id
  from project_commits
  order by cid;

This generated the following files.

-rw-r--r-- 1 dds dds  15G Aug  4 21:09 commits_week.txt
-rw-r--r-- 1 dds dds  93G Aug  5 00:36 project_commits.txt

Crucially, I ran the mysql client with the --quick option to avoid running out of memory: without it the client tries to gather all results before outputting them.
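
Such an export can be driven entirely from the shell; the following is a sketch of one possible invocation (not necessarily the exact one I used).

# --quick streams rows instead of buffering the whole result set in the
# client; --batch emits plain tab-separated rows, which join splits on
# just as readily as spaces.
mysql --quick --batch --skip-column-names -e "
  select cast(commit_id as char) as cid, project_id
  from project_commits
  order by cid;" ghtorrent >project_commits.txt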

2. Process the files with Unix command-line tools

Second, I joined the two text files using the Unix join command. This scans linearly through both files and combines the records whose first field matches. As the files were already sorted, this can be done very efficiently, at the speed of I/O. I also piped the output of join into uniq to eliminate duplicate records, which handled the distinct clause of the original SQL query. Again, on already sorted output this requires just a simple linear scan.
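
To see what the two commands do, here is a toy run on a few lines of made-up data: join outputs the join field, then the remaining fields of the first file, then those of the second, while uniq collapses adjacent duplicate lines.

$ cat toy_commits_week.txt
10 2017201
11 2017211
$ cat toy_project_commits.txt
10 7
10 7
11 7
11 9
$ join toy_commits_week.txt toy_project_commits.txt | uniq
10 2017201 7
11 2017211 7
11 2017211 9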

Here is the Unix shell command I ran.

join commits_week.txt project_commits.txt | uniq >joined_commits.txt
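
A caveat worth adding: join silently produces wrong results if its inputs are not sorted in the collating order it expects. With GNU coreutils one can force plain byte ordering and turn out-of-order input into a hard error, for example:

# Variant of the same pipeline: fail loudly on unsorted input and use the
# C locale's byte-wise ordering for comparisons.
LC_ALL=C join --check-order commits_week.txt project_commits.txt |
  uniq >joined_commits.txt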

After the processing, which took just an hour, I had the desired result in a file.

-rw-r--r-- 1 dds dds 133G Aug  5 01:40 joined_commits.txt

3. Import the text file back into the database

Finally, I imported the text file back into the database as a table.

create table half_life.week_commits_all (
  project_id INT(11) not null,
  week_commit CHAR(7)) ENGINE=MyISAM;

load data local infile 'joined_commits.txt'
  into table half_life.week_commits_all
  fields terminated by ' ';
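
A quick sanity check after such an import is to compare the table's row count against the file's line count; the two should match.

# Each input line should have produced exactly one table row.
wc -l joined_commits.txt
mysql --batch -e 'select count(*) from half_life.week_commits_all;'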

Moral

Ideally, MariaDB should support sort-merge joins and its optimizer should employ them when the runtime of alternative strategies is projected to be excessive. Until that time, using Unix shell commands designed in the 1970s can provide a huge performance boost.



Last modified: Sunday, August 5, 2018 8:20 pm


Unless otherwise expressly stated, all original material on this page created by Diomidis Spinellis is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.