This document presents a very straightforward parser designed for languages with pointer arithmetic, such as C. I'm sure this algorithm has been used before, but an internet search hasn't turned up anything so I've documented it here. The benefits of this algorithm are:
These benefits are basically why I wrote this code: I wanted code which would run iteratively without a working area, and where I can pre-allocate the memory for the result before starting and not need extra little allocations along the way so that the memory is easily and safely released at the end of the process.
My purposes for this code is to create arbitrary SQL queries from a simple expression language which is fully parsed by a server daemon, and which could handle such queries relatively safely without needing to trust clients to "behave".
A query in this language looks something like:
& | = object green = object red = type car
.. which is logically equivalent to ..
((object IS green) OR (object IS red)) AND (type IS car)
This algorithm relies on a basic structure which records the operator for any given sub-expression, and a left side and right side. Either side could be a pointer to a sub-expression, or to a constant. The whole expression is parsed by a tokeniser which delivers each token in order from left to right; in my C implementation I duplicate the original expression string and the tokeniser chops up this expression into zero-terminated tokens: pointers to these tokens are then stored as the constants.
There are four pointers needed: the start of an allocated array to store these structures, the end of that array, our current position within that array, and the furthest we have been in that array. To begin with, the current position is set to the start of the array.
The algorithm in pseudo-code:
struct expression {
op,
left,
right
}
expression start, end, current, last;
while token = next_token() {
if (current < start) error "Tokens remain but the expression is fully parsed"
if (is_operator(token)) {
if (current.op != NULL) {
expression next = last + 1
if (next > end) error "Allocated array not large enough"
if (current.left == NULL) {
current.left = next;
} else if (current.right == NULL) {
current.right = next;
} else {
error "Internal consistency error"
}
current = next
}
current.op = token
else {
if (current.op == NULL) error "Cannot set arguments before operator"
if (current.left == NULL) {
current.left = token
} else if (current.right == NULL) {
current.right = token
while current >= start & current.left != NULL & current.right != NULL {
// go back up the tree to find the next free slots
current--;
}
} else {
error "Too many arguments to operator"
}
}
}
This is a slightly simplified version of what I've posted elsewhere. The main addition needed is to be able to differentiate between constants and sub-expressions, as the algorithm above just sets pointers. In a practical version, each argument to an operator could be a pointer to a sub-expression, or a constant of some type such as a string or integer - however you wish to parse the token.
If anyone has a pointer to a better description of this algorithm, another implementation or (even better!) a name for it, I'd be very pleased to hear about it. I will later update this document to point to a practical implementation of this algorithm.