SQL Disaggregation Techniques to “Structure” Unstructured Data

SQL Disaggregation with Clustrix

At Clustrix, we use Bugzilla as our bug database. Although it provides an xmlrpc API that claims a higher probability of compatibility across upgrades, we’re SQL people and can’t resist the temptation of querying its relational tables directly.

Our database consolidates data from various disparate sources where we use SQL to quickly develop complex reports spanning multiple departments. Our suite of tools and reports would take much longer if we had to develop JOINs across multiple different client APIs.

In Bugzilla, there is a table named “bugs,” that is clearly quite central to the schema. I’m going to talk here about two columns in that table: “keywords” and “whiteboard” (actually “status_whiteboard,” but we’ll abbreviate here).

How “Keywords” and “Whiteboard” Stack Up

The difference between these two terms is that “keywords” must be first created by a bugzilla admin, which stores them in the “keyword_defs” table. This is an example of structured data. When keywords are assigned to a bug (as comma-separated values in a text input field), the input is validated by the webui (i.e. that the keyword actually exists) and then the normalized relationship is stored in the “keywords” table. Additionally, Bugzilla re-materializes the denormalized keyword list as a comma-separated string into a “keyword” column in the “bugs” table, which was most likely done for performance, as it eliminates a correlated aggregate subquery to speed up its most frequently executed query: displaying bug(s).

The “whiteboard” field is located right above “keywords” in the Bugzilla WebUI, and is also sized for a single, short line of text. We use this field similarly to “keywords”: to hold a comma-separated list of tags, with the exception being that these have not yet reached the critical mass to promote them into formal “keywords.”

Bugzilla’s UI does not perform validation or provide a formal way to extract these tag-like elements for the purposes of joining and/or aggregating into reports, thus this data is effectively unstructured.

Clearly, the “keywords” table provides greater control via SQL as far as selecting bugs that have some tags but not others. But blocking human tasks on the availability of an admin to add new keywords can be a hassle; hence our “whiteboard” column gets used quite often.

Let’s consider the following example:

dogfood> CREATE TABLE bugs (id INT PRIMARY KEY AUTO_INCREMENT, whiteboard VARCHAR(100));

dogfood> INSERT INTO bugs (whiteboard) VALUES 
         ('one'),
         ('two '),
         (' three'),
         ('One, Two'),
         ('Two,  Three,'),
         (',One,   Three'),
         ('ONE,TWO,,THREE');

I could try to use the LIKE operator with wildcards to find bugs that do not have a specific tag:

dogfood> SELECT * FROM bugs WHERE LOWER(whiteboard) NOT LIKE '%three%' ORDER BY id;
+----+------------+
| id | whiteboard |
+----+------------+
|  1 | one        |
|  2 | two        |
|  4 | One, Two   |
+----+------------+

However that technique wouldn’t work as well if we had a tag called “twenty-three” which we did want to see. The REGEX operator can go a little further, but at the expense of the query becoming harder for humans to write/read and for the database to execute.

Now, suppose I want to strip out specific tags and return a count of the remaining tags – or perhaps reconstitute the csv string without them?

Introducing Disaggregation

By creating a table containing all the possible tags, we can do the opposite of aggregation, i.e. disaggregation. See the example below:

dogfood> CREATE TABLE whiteboard_tags (whiteboard_tag VARCHAR(20));

dogfood> INSERT INTO whiteboard_tags VALUES ('one'),('two'),('three');

dogfood> SELECT bugs.id,
                whiteboard_tags.whiteboard_tag
           FROM bugs
           JOIN whiteboard_tags
             ON (FIND_IN_SET(whiteboard_tags.whiteboard_tag,
                             REPLACE(LOWER(bugs.whiteboard),' ',''))
               > 0)
          ORDER BY bugs.id, whiteboard_tag;
+----+----------------+
| id | whiteboard_tag |
+----+----------------+
|  1 | one            |
|  2 | two            |
|  3 | three          |
|  4 | one            |
|  4 | two            | 
|  5 | three          |
|  5 | two            |
|  6 | one            |
|  6 | three          |
|  7 | one            |
|  7 | three          |
|  7 | two            |
+----+----------------+

Once we have disaggregated, we can filter and re-aggregate differently:

dogfood> SELECT bugs.id,
                COUNT(*),
                GROUP_CONCAT(whiteboard_tags.whiteboard_tag SEPARATOR ',')
           FROM bugs
           JOIN whiteboard_tags
             ON (FIND_IN_SET(whiteboard_tags.whiteboard_tag,
                            REPLACE(LOWER(bugs.whiteboard),' ',''))
               > 0)
          WHERE whiteboard_tags.whiteboard_tag != 'three'
          GROUP BY bugs.id
          ORDER BY bugs.id;
+----+----------+------------------------------------------------------------+
| id | COUNT(*) | GROUP_CONCAT(whiteboard_tags.whiteboard_tag SEPARATOR ',') |
+----+----------+------------------------------------------------------------+
|  1 |        1 | one                                                        |
|  2 |        1 | two                                                        |
|  4 |        2 | one,two                                                    |
|  5 |        1 | two                                                        |
|  6 |        1 | one                                                        |
|  7 |        2 | one,two                                                    |
+----+----------+------------------------------------------------------------+

Scrub that data!

As we saw above, mixed-case, stray whitespace, and blank elements can be addressed in the query, but misspellings can certainly happen when there’s no input validation. To somewhat mitigate that problem, we can periodically scrub the invalid data like so:

dogfood> INSERT INTO bugs (whiteboard) VALUES ('uno');

dogfood> SELECT id, whiteboard
           FROM bugs
           JOIN whiteboard_tags
          GROUP BY id, whiteboard
         HAVING SUM(FIND_IN_SET(whiteboard_tags.whiteboard_tag,
                    REPLACE(LOWER(bugs.whiteboard),' ','')))
                = 0
          ORDER BY id;
+----+------------+
| id | whiteboard |
+----+------------+
|  8 | uno        |
+----+------------+

Sounds like a Job for a Scalable Database

Obviously structured data is a big win, but sometimes it can make sense to quickly import data from unstructured sources. Or, perhaps, while evolving your applications, you could avoid prematurely over-designing by rolling a new feature out to beta and seeing how it is used, since some built-in SQL functions allow you to “structure” that data on-the-fly. Just be aware that by using those functions, you’re asking for more computation from the database for every row.