Thursday, February 14, 2013

Get to know the Random Query Generator


In the MySQL QA teams in Oracle we have been using a tool called the Random Query Generator (or "RQG" for short) for some time now. The main RQG testing effort has been on new server development, including regression testing. The recent MySQL 5.6 GA is a result of a couple of years of hard work, including a lot of RQG testing and bugfixing as a result of that.

I can easily say that the RQG has helped making MySQL 5.6 a better release than it would otherwise be. It is of course not our only testing tool, but there are plenty of bugs this tool has uncovered that would likely not have been seen in our other testing. Such issues include:

  • crashes and asserts
  • memory management bugs (valgrind errors)
  • wrong results
The main clue to the power of the RQG is its ability to automatically generate SQL queries based on some grammar file. This makes it possible to produce SQL statements that you would not have been able to cook up manually even if you tried for years.

I plan to write a series of blog posts which go into some more details, including specific techniques and use of the tool. In the mean time, if you are interested I recommend that you spend some time reading more at the RQG's home at Launchpad,  or reading some getting started documentation on Github.

If you just want to get started right away, this is what you need:
  • Perl (version 5.10 or newer is recommended)
  • A MySQL Server installation (other databases are also supported to some degree, including Java DB, but MySQL has had the main focus)
  • Some Perl modules:
    • DBI
    • DBD::mysql
    • Digest::MD5
  • Note: For more than running just simple test runs, you need a few other Perl modules as well, see docs for details. Note that it pays off having a relatively recent version of Perl, as more modules are part of core Perl then, meaning you won't have to install that many yourself.
  • The Bazaar version control system (to get the most recent RQG code).

Once you think you have everything set up, branch the RQG code from launchpad, like this:

bzr branch lp:randgen

Then take the RQG for a test drive like this (unix style):

cd randgen

perl runall-new.pl \
  --grammar=conf/examples/example.yy \
  --queries=1000 \
  --threads=1 \
  --basedir=/path/to/your/mysql/installation \
  --vardir=$PWD/vardir

The test run should take less than a minute in total. You should see in the output that RQG starts a server, creates a test database, starts some validator (more on that later), starts running queries (the actual queries are not printed in this case), and at the end shuts down the server and reports the end result (should be STATUS_OK).

...
# 2013-02-14T08:18:57 Test completed successfully.
# 2013-02-14T08:18:57 GenTest exited with exit status STATUS_OK (0)
# 2013-02-14T08:18:57 Stopping server on port 10630
# 2013-02-14T08:18:57 [14332] runall-new.pl will exit with exit status STATUS_OK (0)

If you want to see what kind of queries the RQG actually produced, you can check out the file vardir/mysql.log (since the example grammar is relatively simple the SQL statements will be simple too).

6 Query     UPDATE `A` SET `pk` = 0 WHERE `col_varchar_key` < 9 LIMIT 6
6 Query     DELETE FROM `AA` WHERE `col_varchar_key` = 0 LIMIT 4
6 Query     UPDATE `C` SET `col_varchar_key` = 5 WHERE `col_varchar_key` < 8 LIMIT 0

You can also add the option --sqltrace=MarkErrors to the command line in order to make the RQG print all generated queries to standard output. This is what --sqltrace does, while the MarkErrors setting ensures that any failed statements (i.e. illegal statements that are produced from the grammar but refused by MySQL) are marked (prefixed) with "# [sqltrace] ERROR <errno>:". Example:

INSERT INTO `BB` ( `col_datetime_nokey` ) VALUES ( 1 );
# [sqltrace] ERROR 1062: UPDATE `DD` SET `pk` = 7 WHERE `col_int_key` < 7 LIMIT 4;

Error 1062 means "Duplicate entry for key", and is a perfectly valid error in this case.

Feel free to experiment, study the code, and try different settings. I hope to present some more specific details in this blog later.

18 comments:

  1. Any follow-ups coming? I'm new to RQG aqnd looking for anything to help me get started.

    ReplyDelete
    Replies
    1. Hi Tim,

      Yes, I was planning to post some follow-ups, but unfortunately I was delayed for a bit. Hoping to get around to it soon. If you have any particular sub-topics of interest, please let me know and I may prioritize it if I happen to know anything about it.

      By the way, great to see that someone has interest in blogs about the RQG :)

      Delete
  2. Very good for starter. Can I use RQG for my own data schema?

    ReplyDelete
    Replies
    1. Thank you.

      To some degree you can, and there are several tricks you can use to get as close to it as possible. Maybe this would be a good topic for a future blog post :) In short:

      - Try to use the "--gendata" option to specify table/data specs for random data generation. You can specify table names in the "names" attribute of the zz file. For more details see https://github.com/RQG/RQG-Documentation/wiki/RandomDataGenerator

      - If the above is not enough, you can define your own schema in the query grammar (.yy file) itself, in the "query_init" rule. Unlike the "query" rule, "query_init" will be used only once per test run, before the "query" rule (which is used once for each generated query). For a rather advanced example, see http://bazaar.launchpad.net/~randgen/randgen/rqg2/view/head:/conf/runtime/concurrency_1.yy

      - Alternatively you can define a schema independently of the RQG and either start/stop the server manually (use gentest.pl instead of runall-new.pl for running the RQG, as runall will also start a server), or use the "--start-dirty" option to run against an existing database/vardir.

      Delete
  3. Hi,
    I used the RQG for some extent after installing it on my PC.
    It generated 1000 queries, But not all queries are proper.

    I tried to get a inner join query as below,
    Say I have 2 tables t1 with 3 columns x,y,z and T2 with 3 columns a, b, c
    But the queries generated don't map the table column with it own columns.
    For Ex: It says T1.a=T2.x in where clause which is ambiguous to SQL engine!
    So this is just a random combination with existing table and all columns.

    I tried invariant too but no use. Any other way out to get proper SQL query from RQG?
    Or this work like this only?

    Since I am looking for creating complex queries say multilevel join, from TPCH tables, it may not help.
    Since TPCH has different data type columns and if it map randomly all column with all table it may not help me to get multi level join proper queries.

    The used grammar file looks like below,
    select:
    SELECT _table[invariant]._field
    FROM _table[invariant] AS X
    INNER JOIN _table AS Y
    ON _table[invariant]._field = Y._field ;


    ReplyDelete
  4. Hi Shiv,

    Good to see that you are trying out the RQG. Grammar development is a big and potentially complex topic, for which I may not always have the answers. I'm glad to give beginners some pointers, though.

    You are on to something when you are using aliases for your tables. The trick that is often used for JOINs is to use aliases to hide the real table names, and thus avoid the problem that the table name may be different each time you use the _table rule. This is mentioned in the docs, but it could perhaps be explained a bit better.

    Your issue seems to be that you use columns in the SELECT list (and elsewhere) from tables that are not in the FROM list. Once you assign an alias to a table, you must use the alias name when you refer to the same table in the SELECT list.

    Example:

    mysql> CREATE TABLE t1 (a INT, b INT);
    Query OK, 0 rows affected (0.66 sec)

    mysql> SELECT t1.a FROM t1 AS mytable;
    ERROR 1054 (42S22): Unknown column 't1.a' in 'field list'
    mysql> SELECT mytable.a FROM t1 AS mytable;
    Empty set (0.00 sec)


    The _table[invariant] trick would probably work if you avoid assigning the X alias to the first table. But I think using aliases for both tables is better, like this:

    select:
    SELECT X._field
    FROM _table AS X
    INNER JOIN _table AS Y
    ON X._field = Y._field ;

    ReplyDelete
  5. Hi John,
    Thanks for the reply.
    I tried many grammar options to get table to column mapping to get correct, but no success in that!

    The 2 table is as below,
    mysql> explain t1;
    +-------+---------+------+-----+---------+-------+
    | Field | Type | Null | Key | Default | Extra |
    +-------+---------+------+-----+---------+-------+
    | x | int(11) | YES | | NULL | |
    | y | int(11) | YES | | NULL | |
    | z | int(11) | YES | | NULL | |
    +-------+---------+------+-----+---------+-------+
    3 rows in set (0.11 sec)

    mysql> explain t2;
    +-------+---------+------+-----+---------+-------+
    | Field | Type | Null | Key | Default | Extra |
    +-------+---------+------+-----+---------+-------+
    | a | int(11) | YES | | NULL | |
    | b | int(11) | YES | | NULL | |
    | c | int(11) | YES | | NULL | |
    +-------+---------+------+-----+---------+-------+
    3 rows in set (0.08 sec)

    mysql>

    The grammar as you said is below,
    select:
    SELECT X._field
    FROM _table AS X
    INNER JOIN _table AS Y
    ON X._field = Y._field ;

    The output queries as below of no use!
    SELECT X.`y` FROM `t2` AS X INNER JOIN `t2` AS Y ON X.`b` = Y.`b`;
    SELECT X.`y` FROM `t1` AS X INNER JOIN `t2` AS Y ON X.`a` = Y.`b`;
    SELECT X.`y` FROM `t2` AS X INNER JOIN `t2` AS Y ON X.`b` = Y.`b`;
    SELECT X.`y` FROM `t1` AS X INNER JOIN `t2` AS Y ON X.`a` = Y.`b`;
    SELECT X.`y` FROM `t2` AS X INNER JOIN `t1` AS Y ON X.`x` = Y.`y`;
    SELECT X.`z` FROM `t1` AS X INNER JOIN `t2` AS Y ON X.`b` = Y.`c`;
    SELECT X.`x` FROM `t2` AS X INNER JOIN `t1` AS Y ON X.`z` = Y.`z`;
    SELECT X.`z` FROM `t1` AS X INNER JOIN `t1` AS Y ON X.`x` = Y.`z`;
    SELECT X.`y` FROM `t1` AS X INNER JOIN `t1` AS Y ON X.`z` = Y.`z`;
    SELECT X.`x` FROM `t1` AS X INNER JOIN `t1` AS Y ON X.`y` = Y.`x`;
    SELECT X.`y` FROM `t2` AS X INNER JOIN `t2` AS Y ON X.`a` = Y.`c`;
    SELECT X.`y` FROM `t2` AS X INNER JOIN `t1` AS Y ON X.`x` = Y.`x`;
    SELECT X.`x` FROM `t2` AS X INNER JOIN `t2` AS Y ON X.`c` = Y.`c`;
    SELECT X.`x` FROM `t1` AS X INNER JOIN `t2` AS Y ON X.`b` = Y.`b`;
    SELECT X.`x` FROM `t2` AS X INNER JOIN `t1` AS Y ON X.`y` = Y.`x`;
    SELECT X.`y` FROM `t1` AS X INNER JOIN `t1` AS Y ON X.`z` = Y.`y`;
    SELECT X.`x` FROM `t1` AS X INNER JOIN `t2` AS Y ON X.`a` = Y.`c`;
    SELECT X.`y` FROM `t1` AS X INNER JOIN `t1` AS Y ON X.`y` = Y.`x`;
    SELECT X.`y` FROM `t1` AS X INNER JOIN `t1` AS Y ON X.`z` = Y.`x`;
    SELECT X.`z` FROM `t1` AS X INNER JOIN `t2` AS Y ON X.`a` = Y.`b`;
    SELECT X.`y` FROM `t2` AS X INNER JOIN `t2` AS Y ON X.`c` = Y.`a`;
    SELECT X.`z` FROM `t1` AS X INNER JOIN `t1` AS Y ON X.`x` = Y.`x`;
    SELECT X.`x` FROM `t2` AS X INNER JOIN `t1` AS Y ON X.`z` = Y.`y`;
    SELECT X.`z` FROM `t2` AS X INNER JOIN `t2` AS Y ON X.`b` = Y.`c`;
    SELECT X.`x` FROM `t2` AS X INNER JOIN `t2` AS Y ON X.`b` = Y.`a`;
    SELECT X.`z` FROM `t1` AS X INNER JOIN `t1` AS Y ON X.`y` = Y.`y`;
    SELECT X.`z` FROM `t2` AS X INNER JOIN `t2` AS Y ON X.`c` = Y.`c`;
    SELECT X.`x` FROM `t1` AS X INNER JOIN `t1` AS Y ON X.`z` = Y.`y`;
    SELECT X.`z` FROM `t1` AS X INNER JOIN `t1` AS Y ON X.`y` = Y.`x`;
    SELECT X.`x` FROM `t1` AS X INNER JOIN `t1` AS Y ON X.`z` = Y.`z`;
    SELECT X.`z` FROM `t1` AS X INNER JOIN `t2` AS Y ON X.`c` = Y.`c`;
    SELECT X.`z` FROM `t2` AS X INNER JOIN `t1` AS Y ON X.`z` = Y.`x`;
    SELECT X.`y` FROM `t1` AS X INNER JOIN `t2` AS Y ON X.`a` = Y.`c`;
    SELECT X.`x` FROM `t2` AS X INNER JOIN `t1` AS Y ON X.`x` = Y.`x`;
    SELECT X.`z` FROM `t1` AS X INNER JOIN `t1` AS Y ON X.`x` = Y.`y`;
    SELECT X.`y` FROM `t2` AS X INNER JOIN `t2` AS Y ON X.`c` = Y.`b`;
    SELECT X.`y` FROM `t2` AS X INNER JOIN `t1` AS Y ON X.`z` = Y.`z`;
    SELECT X.`y` FROM `t2` AS X INNER JOIN `t1` AS Y ON X.`z` = Y.`z`;

    So not sure now how to go further in this?

    Do you have any pointer who can give proper ans for this problem?

    Regards,
    Shiv

    ReplyDelete
    Replies
    1. Hi Shiv,

      If by "of no use" you mean that it randomly picks fields that do not exist in some table, then I understand. If you mean something else, please specify.

      If you read the documentation for writing an RQG grammar, you will see:

      _field - returns the quoted name of a random field. If _table was previously used, the field will be from that table. Otherwise, it will be taken from the first table of the current database (or the database that was previously returned by _database).

      https://github.com/RQG/RQG-Documentation/wiki/RandomQueryGeneratorGrammar

      So you are probably affected by this logic.

      By default, when the RQG generates its own data, all tables have the same column names, only the number of rows and maybe some other things differ. So that avoids this problem. If you use your own data, you have to work around the problem somehow.

      One way is to rename columns, so that they have the same names.

      Another way is to hard-code the column names in different rules in the grammar, and maybe use some embedded Perl code to push such column names into some array, then pick randomly from that array in another rule, etc. Unfortunately I do not have the time to service you with a full solution at this time, but I recommend reading more in the RQG docs and possibly pick up some tips, looking at existing grammars to see how things are done, and/or asking on the mailing list.

      The best way to get help with use of the RQG is the mailing list on Launchpad. See

      https://launchpad.net/~randgen

      You must become a team member to send E-mails to the list. People will do their best to help if they are able to. However, "proper answers" is not necessarily a guarantee in open source communities ;)

      Delete
  6. Hi John,
    Yes, you are right. I meant same by no use, as logically generated queries are wrong.

    As you said I tried renaming the column to same name. But it only work when we have same data type for all column. And also multiple tables should have same number of columns!

    Basically I started this project to get complex multi level join queries from TPCH tables.
    Let me spend some more time on this to crack the grammar issues.
    Yes, will join the launchpad/randgen group as you said..to put my queries there.

    Thanks....

    Regards,
    Shiv

    ReplyDelete
  7. Hi!

    It's a really good thing I found your blog, since there are so few blogs out there about beginner tutorials on RQG.
    I tried the same commands on your post, but all I am getting is, "Segmentation fault (core dumped)" error.

    Here is the exact command I used:
    $ perl runall-new.pl --grammar=conf/examples/example.yy --queries=1000 --threads=1 --basedir=C:/Program Files/MySQL/MySQL Server 5.6 --vardir=$PWD/vardir
    Segmentation fault (core dumped)

    It would be great if you could let me know what I'm doing wrong. Is my MySQL path wrong?

    Thank you so much!
    Cherry

    ReplyDelete
  8. Hi Charissa,

    First, I am sorry it took a while to moderate and reply to your post - I have been offline for a while for various reasons.

    Thank you for reading my post. I'm sorry it does not work as well for you as you had hoped.

    The example command I used was for unix/linux. The RQG works best on this platform. If you are using the RQG on Windows, you are to a larger degree "on your own", and will likely need to deal with various quirks every now and then.

    First off, using "--vardir=$PWD/vardir" will not work as intended on Windows. The $VARIABLE syntax is valid in the unix shell, but on Windows the corresponding syntax is %VARIABLE%. Besides, the $PWD environment variable on unix points to the current working directory. On Windows, the equivalent seems to be %CD%. Try making these adjustments, and see if it helps.

    Secondly, it has been long since I last tried to run the RQG on Windows, so I am not sure how well spaces in paths in options are handled. If it still fails for you, you may want to try to put a copy of MySQL 5.6 in a folder with no paths, or surround the --basedir value with quotes.

    Last, but not least, the main reason I do not run the RQG directly on Windows anymore is that it has proven over time to be quite error prone. In my team we have seen mysterious issues, kernel panic in Perl, problems with concurrency and thread/process handling, etc on Windows. There are at least two well known Perl implementations on Windows, and neither of them work well with such advanced/heavy usage as what the RQG does. Perl is unfortunately not a platform independent scripting language.

    Instead, when we want to test MySQL on Windows using the RQG, we run the MySQL server on windows, and the RQG as a client on Unix/Linux. We use another framework called JET (https://kenai.com/projects/jet/pages/Home), written in JAVA, to facilitate that. However, I generally recommend RQG beginners to stick to Linux, to avoid potential pitfalls and to be able to get started quickly.

    ReplyDelete
    Replies
    1. Thank you so much for all the help! We were able to run RQG now. We used Windows Powershell though. It's okay since we won't be using it that extensively. ^^ Thanks so much again! :))

      Delete
  9. Hi John, this tool is definitely a must-have for testing database development. I'm wondering if this can be used as a general purpose tool to test against other database engines other than MySQL. If so, could you point out which components need to be modified? I'm sure the syntax part is one of them. Anywhere else? Thanks!

    ReplyDelete
  10. Hi!

    Thank you for your article, it was very helpful!
    I was wondering if there is any way to reproduce failures?

    ReplyDelete
    Replies
    1. Hi Sona,

      Indeed there is, but how easy it is depends on your test run.

      When you use a specific seed value (--seed=...), or use the default (--seed=1), then the random generator will generate the same queries and data in the same order when you repeat the same command, and most issues should be repeatable for that reason.

      Sometimes there may be internal concurrency or other non-determinism in the product you are testing which may complicate things, but mostly you will be fine...

      ... as long as you are using --threads=1, meaning a single client connecting against a server.

      If you are using threads > 1 (default is 10), then the concurrency aspects will make each run a little random, so you may not be able to reproduce immediately.

      Then my recommendation would be to try again, maybe if you repeat the run 10 times you will see the issue again.

      There are also some tools that may help you, for example tools for simplifying grammars (see "util" folder), or the "reducer" tool (which I have not used much, but is claimed to be helpful in this regard).

      You can also grab the vardir or datadir and manually start a server using it, and try to reproduce using SQL via a client application.

      Delete
  11. Hi, great article ! Is there a way to just generate the queries and not run it ?

    ReplyDelete
  12. Thanks you for the great article ! Is there a way to just produce queries and not run it ? I can see that gensql.pl can be used, but can it be done with runall.pl ?

    ReplyDelete
    Replies
    1. Hi सौरभ,

      As you correctly mention, it is possible to just generate queries, not execute them, by using the gensql.pl script.

      I don't think this can be done directly using runall.pl, but then again, why do you want to do that? Which part(s) of runall.pl do you need in addition to the queries?

      runall.pl has the --sqltrace feature which prints queries to the console when executing them.

      It may also be possible to tweak the executors to not actually execute or connect to a database, but I think that requires some modifications elsewhere in the code.

      Delete