I recently wrote to bongo-devel about a problem in our store protocol: in the trunk version, some commands have a “query” parameter which is unspecified in our documentation. The reason it’s unspecified is that it turns out that the parameter just gets passed through to CLucene, which as well as being something of a security issue also makes it more difficult to tell people how to use it.
In the experimental Store (coming soon to a server near you!) the CLucene index has been removed, and the query interface to it therefore doesn’t exist any more. It’s quite possible that CLucene will make a come-back, but the query interface is something I want to get rid of: for no better reason than it requires us to duplicate information from the SQL database to the Lucene index.
Because we need a parser for these queries which is simple, safe and isn’t going to do nasty things with our memory allocator, I proposed a version of prefix notation / Polish notation, and have written some code which parses this into a tree structure stored in an array. The code is pretty straightforward, so I’ve documented the algorithm here – I’m sure it must be a known algorithm, but I’ve been unable to find any examples of anything similar.
In general, prefix notation isn’t used as commonly as postfix notation. This is mainly because it’s more efficient in terms of stack space to evaluate an expression in postfix notation, because the constants come before the operators and you can start calculating earlier. However, we’re not evaluating an expression – we’re going to turn it into SQL – so that isn’t a benefit, and having the operators come first means we can find places to later store the constants simply.
The next step is to transform this parsed expression into something we can throw at SQLite. This is actually much easier said than done: for example, we have to be able to handle arbitrary properties on documents, which are stored in a separate table with a one-to-many relationship between the document and its properties.
The problem with that is that it’s not that natural to say something like “give me rows from table A where there are rows in table B which relate to it and some of those rows contain X and Y”. So, this may turn out to be something quite hard or even impossible to provide a general solution to if the logic is quite deeply nested – but ideally, I want all of that nastiness hidden from the client.
Some readers may be asking how the current Store does this. Effectively, it runs the query against the index with the properties it has at hand, and retrieves a list of Store guids of the documents which have those properties set. It then queries the SQL database for each individual guid to see if any of the properties there match the query also, and if so it outputs the match.