Implementing thread-safe scanners and parsers in PostgreSQL

I have recently been working on making various scanners and parsers in PostgreSQL thread-safe. This article is a bit of a brain dump that records what I did and what the different steps were, because all of that was pretty difficult to piece together, and it seems worth writing down what I found and learned.

Others have written about similar journeys before, such as here and here, and while those articles gave some useful hints, they didn’t contain all the context and details that I ultimately needed, so here is my own journey. This text is not specific to PostgreSQL, but it is informed by it.

Before we start, let’s sort out the adjectives. The reason for this work was to prepare the scanners and parsers for possibly using threads instead of processes in the PostgreSQL server in the future. Therefore, we want them to be thread-safe. By default, scanners created by Flex and parsers created by Bison use various global variables to store their internal state and to communicate with each other and with their callers. Global variables like that aren’t thread-safe.

One approach to fix that would be to mark all those global variables for thread-local storage. That would probably work, but unfortunately neither Bison nor Flex appears to provide an option to produce its output in such a way. (Also, thread-local storage is a relatively new C feature.) Another approach is to have the Bison and Flex outputs created in a way that they don’t use global variables. Such options exist. For Flex, this option is called “reentrant”; for Bison, it is called “pure”.

This difference is a bit annoying when you talk about it, but I suppose it is technically correct. (The Bison manual actually uses both terms, too.) A Bison parser produced with this option is a “pure function” in the sense that it only looks at its input to produce its output. It doesn’t keep any state across calls, and it doesn’t look at or modify any external state. A Flex scanner produced with the “reentrant” option is not a pure function, because it is passed a handle to state that it modifies. The difference arises because the scanner is used differently from the parser: calling the scanner returns one token at a time until it signals that the input is done, whereas the parser is called just once and parses the whole input.

For our goal of making thread-safe scanners and parsers, this is close enough, but it’s important to keep the difference in mind sometimes. For example, while the code generated by Bison and Flex will be pure and reentrant, respectively, the action code that you inject is up to you: it could be reentrant or not, and thread-safe or not. Also, you can make reentrant scanners without using the “reentrant” option. For example, the PostgreSQL configuration file parser (guc-file.l) was already reentrant before this work, because it needs to process configuration files included from another file, but it achieved this just by saving and restoring the global variables around the call to the scanner for the included file. That is reentrant just fine, but not thread-safe.
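To make the save-and-restore idea concrete, here is a minimal sketch in plain C. It is not the guc-file.l code: the global cur_input, the function count_words(), the "@N" include syntax, and the includes table are all invented for illustration, standing in for a scanner's global state and for included files.

```c
#include <stddef.h>

/* Hypothetical global scanner state, standing in for Flex's globals. */
static const char *cur_input;

/* Hypothetical "included files", selected by a single digit after '@'. */
static const char *const includes[] = { "x y", "p @0 q" };

/*
 * Count the space-separated words in str.  A word of the form "@0" or
 * "@1" means "include that entry" and is scanned by a nested call.
 */
int
count_words(const char *str)
{
    int count = 0;

    cur_input = str;
    while (*cur_input != '\0')
    {
        if (*cur_input == ' ')
        {
            cur_input++;
        }
        else if (cur_input[0] == '@' &&
                 (cur_input[1] == '0' || cur_input[1] == '1'))
        {
            /* Save the global state, scan the include, then restore it. */
            const char *saved = cur_input;

            count += count_words(includes[cur_input[1] - '0']);
            cur_input = saved + 2;
        }
        else
        {
            count++;
            while (*cur_input != '\0' && *cur_input != ' ')
                cur_input++;
        }
    }
    return count;
}
```

Nested calls work because each level restores cur_input before it continues. But two threads calling count_words() at the same time would both write to cur_input, so this is reentrant in the nesting sense without being thread-safe.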

The PostgreSQL scanners and parsers

PostgreSQL is an SQL database management system, so it has a scanner and parser for SQL. But it also has a number of others, and they’re all a bit different, which is what makes all of this extra complicated. As I’m writing this, the PostgreSQL source tree contains 13 *.l files and 10 *.y files. Here is a summary of what these do:

A scanner/parser pair for the main SQL language.
A scanner/parser pair for the SQL-like language used by the replication protocol.
A scanner/parser pair for processing the synchronous replication configuration language.
A scanner/parser pair for the special bootstrap language.
Three scanner/parser pairs that process the input syntax of data types (jsonpath, cube, seg).
A scanner (only) for processing server configuration files (postgresql.conf).
A scanner/parser pair for the expression language used in pgbench.
Two scanners for use by psql: one for scanning SQL syntax, one for processing backslash commands (the former also used by pgbench).
A scanner for ECPG (embedded SQL in C), which has to scan both SQL and C.
(There is also a parser for ECPG, which is assembled on the fly out of various pieces and which is not counted with the *.y files.)
A parser for PL/pgSQL.
(There is also a scanner for PL/pgSQL, but that’s implemented as a wrapper around the main SQL scanner, so it’s not counted here, but it also needed to be modified extensively by this project.)
A scanner/parser pair for the isolation tester custom test description language.

These are all used in different contexts and have different requirements. Some are in the server, some in client programs, some in test drivers. They differ in how they manage memory, how they produce error messages, what special cases they need to deal with, and where their input comes from. And they all had a different starting state: some already used some or all of the options discussed below, some none.

Starting setup

Let’s build something up from scratch and learn as we go.

The starting setup is that you have:

A scanner file, say foo_scanner.l:

%{
#include "foo.h"

/* some C declarations */
%}

/* Flex definitions */

%%

/* rules (patterns and actions) */

%%

/* other C code */

A parser definition file, say foo_parser.y:

%{
#include "foo.h"

/* some C declarations */
%}

/* Bison declarations */

%union
{
    ...
}

%%

/* grammar rules and actions */

%%

void
yyerror(char const *message)
{
    fprintf(stderr, "%s\n", message);
}

/* other C code */

Note that Bison requires that the user supplies a yyerror() function. (In PostgreSQL code, the yyerror() function is typically in the scanner file (foo_scanner.l). This is convenient because then you can also call the same error handling function from the scanner. You then also need to put the yyerror() declaration into some header file such as foo.h (see below) so that the parser can get at it. But keep in mind that the invocation of yyerror() is determined by Bison; Flex doesn’t know about it and Flex-generated code does not call it, unless the user code does. I’m going to ignore this idiosyncrasy in this article to keep it simple.)

A header file for your project, say foo.h:

#ifndef FOO_H
#define FOO_H

extern int yylex(void);
extern int yyparse(void);

#endif /* FOO_H */

And some main program, say foo.c:

#include "foo.h"

/* stuff */

int
main(void)
{
    yyparse();

    return 0;
}

(We’ll skip most error handling in these examples. You should check the return value of yyparse().)

Let’s look at the header file. It declares yylex() and yyparse(), which are the main entry points for the generated scanner and parser. These functions are generated by Flex and Bison, respectively. When you take the foo_scanner.l file and run it through Flex, it generates essentially

#include "foo.h"

/* some C declarations */

int
yylex(void)
{
    /* magic and actions */
}

/* other C code */

and similarly when you run foo_parser.y through Bison it generates something like

#include "foo.h"

/* some C declarations */

YYSTYPE yylval;

int
yyparse(void)
{
    /* magic and actions */
    /* calls yylex() somewhere here */
}

void
yyerror(char const *message)
{
    fprintf(stderr, "%s\n", message);
}

/* other C code */

If you have both a parser and a scanner, then your main program will call the parser by calling yyparse(), and that will internally call yylex() as needed, so you don’t see the latter explicitly in your code. You can also have programs that only have a scanner, in which case your code will call yylex() directly.

Since yylex() and yyparse() are defined in separate files, you need to declare them so that they can be called from other files. It is possible to have Flex and Bison generate header files that contain these declarations, and then your program could include those generated header files. But those header files contain a bunch of other stuff, too, which you might not want to leak into your other code (such as definitions for token types). Those headers are useful in certain cases for communicating between the scanner and the parser, but the rest of the program usually doesn’t need or want most of that stuff. So it is more typical to write the declarations by hand as is shown here.

But you do want Bison to generate its header file (option bison -H), which will be called foo_parser.h in our example. It will contain something like this:

union YYSTYPE
{
    ...
};
typedef union YYSTYPE YYSTYPE;

extern YYSTYPE yylval;

And then you need to include that header file in foo_scanner.l, because it provides the yylval variable that the parser and scanner use to exchange information about the semantic values of tokens.

In the example constructed so far, the input to the scanner comes from the standard input. If that’s what you want, then you’re set. But in the cases I’ve been dealing with, reading from some sort of string in memory is more common (for example, an SQL statement or a data type input value). To make the scanner read from a string, call the function yy_scan_string() (or an alternative). This function is available in foo_scanner.l. It has external linkage, so you could also declare it and use it from elsewhere, such as your main program, but then you also need to do extra work to make its return type available on the outside, so I wouldn’t recommend that. Instead, it’s better to write a small wrapper function like

void
foo_scanner_init(const char *str)
{
    yy_scan_string(str);
}

in foo_scanner.l and call that from the main program.

Actually, let’s add another tweak to this. We want to namespace the generated function names, because as it is, you can only have one scanner and one parser in your program. This doesn’t have anything to do with reentrancy or multithreading; it has to do with symbol names. There can only be one yylex and one yyparse symbol in a C program. It is good hygiene to avoid symbol clashes like that. Even if your program has only one language to parse, maybe it will support plugins that want to do their own parsing, or the program is actually a library and will be called from another program.

So what we’ll actually start with is:

foo_scanner.l:

%{
#include "foo.h"
#include "foo_parser.h"

/* some C declarations */
%}

%option prefix="foo_yy"

/* Flex definitions */

%%

/* rules (patterns and actions) */

%%

void
foo_scanner_init(const char *str)
{
    foo_yy_scan_string(str);
}

/* other C code */

foo_parser.y:

%{
#include "foo.h"
#include "foo_parser.h"

/* some C declarations */
%}

%name-prefix="foo_yy"

/* Bison declarations */

%union
{
    ...
}

%%

/* actions */

%%

void
foo_yyerror(char const *message)
{
    fprintf(stderr, "%s\n", message);
}

/* other C code */

foo.h:

#ifndef FOO_H
#define FOO_H

extern int foo_yylex(void);
extern int foo_yyparse(void);

extern void foo_scanner_init(const char *str);

#endif /* FOO_H */

foo.c:

#include "foo.h"

/* stuff */

int
main(void)
{
    foo_scanner_init("some string to parse");
    foo_yyparse();

    return 0;
}

The choice of the prefix is of course arbitrary; I’m stipulating here that our overall project is called “foo”. Note that if you want to keep the “yy” in the prefix, you need to specify that, otherwise you might end up with perhaps “foolex” and “fooparse”, which is fine, but not typical and perhaps a bit confusing. The “yy” is a good hint that it’s got something to do with Flex or Bison, so it’s good to keep that.

Below, I sometimes use yysomething and foo_yysomething interchangeably, to reduce the clutter in the text. Just keep in mind that in practice most (not all! see later) symbols should have a “foo_yy” prefix. Actually, inside foo_scanner.l and foo_parser.y you can use both interchangeably because there are macros that define one to the other. Only outside of those two files do you have to write the full prefixed names.

Now note in the main program: The call to foo_scanner_init() doesn’t return anything, and the call to yyparse() doesn’t take any arguments. All the information is kept in global variables. You couldn’t run two parsers like that concurrently. This is what we are trying to fix.

Reentrant scanner

Now we make the scanner reentrant. This is done with the Flex option %option reentrant. So the scanner source file now looks like this:

foo_scanner.l:

%{
#include "foo.h"
#include "foo_parser.h"

/* some C declarations */
%}

%option prefix="foo_yy"
%option reentrant

/* Flex definitions */

%%

/* rules (patterns and actions) */

%%

...

/* other C code */

When processing this file with Flex, the generated yylex() function now has an argument of type yyscan_t that represents a sort of handle for the scanner instance. So the generated file now notionally looks like this:

#include "foo.h"
#include "foo_parser.h"

/* some C declarations */

int
foo_yylex(yyscan_t yyscanner)
{
    /* magic and actions */
}

/* other C code */

Also, all the other utility functions that Flex provides, such as yy_scan_string(), will now take an additional yyscan_t argument where before they would just operate on the one global scanner instance (see example below).

This also means that the declaration of foo_yylex() in foo.h needs to be updated accordingly. But where does the type yyscan_t come from? Inside the generated foo_scanner.c, the type definition is provided by the code generated by Flex. Outside, we need to make it ourselves. yyscan_t is actually just void *, so that’s easy. So foo.h could look like this:

#ifndef FOO_H
#define FOO_H

typedef void *yyscan_t;

extern int foo_yylex(yyscan_t yyscanner);
extern int foo_yyparse(void);  /* XXX we'll also adjust this in a minute */

extern void foo_scanner_init(const char *str);  /* XXX ditto */

#endif /* FOO_H */

A small side problem here: Since foo.h is also included into foo_scanner.c, you will have multiple instances of typedef void *yyscan_t. In C11 and later, it is ok to have multiple definitions of a type (as long as they agree), but older compilers might complain. Then you should write

#ifndef YY_TYPEDEF_YY_SCANNER_T
#define YY_TYPEDEF_YY_SCANNER_T
typedef void *yyscan_t;
#endif

This is the same incantation that the Flex-generated code uses internally, and so then there will be only one definition.

Also note that the type name doesn’t have to be yyscan_t, as long as it’s typedef’ed to void *. yyscan_t is just what the Flex documentation calls it. (Also, maybe it should be foo_yyscan_t, but it seems people don’t use that.)

(You could also free-style it and just declare yylex as int foo_yylex(void *), but I wouldn’t recommend that, because it reduces readability.)

Inside the actions that you write in foo_scanner.l, you will typically use some variables provided by Flex, such as yytext (the text matched by the pattern) and yyleng (its length) and a few more. In a non-reentrant scanner, these are global variables, like

char *yytext;
yy_size_t yyleng;  /* usually same as size_t */

(and then there are preprocessor defines to turn these into the namespaced foo_yytext etc.). In a reentrant scanner, these get turned into preprocessor magic like this:

#define yyleng yyg->yyleng_r
#define yytext yyg->yytext_r

and in the generated yylex() there is a local variable definition like

struct yyguts_t * yyg = (struct yyguts_t*)yyscanner;

So inside yylex(), this will work transparently, and you don’t need to change any uses of yytext etc. But if you have a helper function that accesses yytext directly, this will not work; you’ll get a confusing compiler error about yyg not being known.

The official way to get access to these variables from outside yylex() is to use helper functions like yyget_text(), yyget_leng(). Alternatively, or if you want to avoid an extra function call for performance reasons, you could also just copy the above definition of yyg into your code. (Or maybe you could redefine yytext to something like yyget_text(yyscanner)? There might be various possibilities.)

Now with the reentrant scanner, we need to also initialize the handle before we can start the scanner. That is done by the function yylex_init(). And then you can also clean it up afterwards using yylex_destroy(). So the notional use looks something like this:

/* local variable, this is the scanner handle */
yyscan_t scanner;

yylex_init(&scanner);

...
yylex(scanner);
...

yylex_destroy(scanner);

You need to be really careful here that yylex_init() takes &scanner but the other functions take scanner. If you get this wrong, the compiler isn’t going to complain, since these are all void * pointers. (You might get a warning about scanner being used before being initialized.)

In our example, the call to yylex_init() is best put into our scanner initialization function:

void
foo_scanner_init(const char *str, yyscan_t *yyscannerp)
{
    yyscan_t yyscanner;

    yylex_init(yyscannerp);

    yyscanner = *yyscannerp;

    yy_scan_string(str, yyscanner);
}

(This assignment from yyscannerp to yyscanner is just my idea to help keep these straight. You might have different stylistic preferences.)

Also, let’s put the yylex_destroy() call into a corresponding cleanup function:

void
foo_scanner_finish(yyscan_t yyscanner)
{
    yylex_destroy(yyscanner);
}

We’ll need both of these functions later on to add more things to them, so it’s good to set them up now.

With all this, you can now initialize multiple scanners and run them in overlapping ways or in multiple threads and so on. Good.
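The shape of this API can be mimicked in plain C without involving Flex at all, which may help to see why per-instance handles make overlapping use possible. The following is only a sketch under that assumption: demo_scan_t, struct demo_guts, demo_lex_init(), demo_lex() and demo_lex_destroy() are hypothetical names, and the "scanner" merely returns the length of the next word.

```c
#include <stdlib.h>

/* Hypothetical opaque handle, playing the role of yyscan_t. */
typedef void *demo_scan_t;

/* Per-instance state that would otherwise live in global variables. */
struct demo_guts
{
    const char *pos;            /* current read position in the input */
};

/* Allocate a scanner instance; returns 0 on success, 1 on failure. */
int
demo_lex_init(const char *input, demo_scan_t *scannerp)
{
    struct demo_guts *g = malloc(sizeof(*g));

    if (g == NULL)
        return 1;
    g->pos = input;
    *scannerp = g;              /* hand the new handle back to the caller */
    return 0;
}

/* Return the length of the next word, or 0 at the end of the input. */
int
demo_lex(demo_scan_t scanner)
{
    struct demo_guts *g = scanner;
    int len = 0;

    while (*g->pos == ' ')
        g->pos++;
    while (*g->pos != '\0' && *g->pos != ' ')
    {
        g->pos++;
        len++;
    }
    return len;
}

/* Release a scanner instance. */
void
demo_lex_destroy(demo_scan_t scanner)
{
    free(scanner);
}
```

With two handles a and b initialized on "one three" and "four", alternating calls demo_lex(a), demo_lex(b), demo_lex(a) return 3, 4 and 5: each instance keeps its own position, so neither disturbs the other. As with the real functions, only the init function takes the address of the handle.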

In the notional yylex_init()/yylex_destroy() snippet further up, we call yylex() directly, but in our overall example we are calling the parser and the parser calls the scanner internally. So what we need to do is initialize the scanner, then pass the scanner handle to the parser, and then have the parser pass the scanner handle to the scanner. To do this, we need to tell Bison in the parser definition file that both yyparse() and yylex() have an additional argument. This is done with the following declarations:

%parse-param {yyscan_t yyscanner}
%lex-param   {yyscan_t yyscanner}

Note that %parse-param also affects the argument list of yyerror().

So in total foo_parser.y will be

%{
#include "foo.h"
#include "foo_parser.h"

/* some C declarations */
%}

%name-prefix="foo_yy"
%parse-param {yyscan_t yyscanner}
%lex-param   {yyscan_t yyscanner}

/* Bison declarations */

%%

/* actions */

%%

void
foo_yyerror(yyscan_t yyscanner, char const *message)
{
    fprintf(stderr, "%s\n", message);
}

/* other C code */

And the generated code will effectively be something like this:

#include "foo.h"
#include "foo_parser.h"

/* some C declarations */

...

int
foo_yyparse(yyscan_t yyscanner)
{
    /* magic and actions */
    /* calls foo_yylex(yyscanner) somewhere here */
}

void
foo_yyerror(yyscan_t yyscanner, char const *message)
{
    fprintf(stderr, "%s\n", message);
}

/* other C code */

Note that the argument list of yyerror() is yyerror(yyscan_t yyscanner, const char *message), not the other way around. The compiler won’t diagnose this, because of the void * pointers involved. In the case of yyparse(), the %parse-param arguments are added at the end, but for yyerror(), the message is always last. See the Bison manual for further details about that.

(%parse-param and %lex-param don’t actually know that they are passing down a scanner handle. This is a more general facility that allows you to pass down arbitrary data. We’ll see some more uses of them later.)

We also update the declaration of yyparse() in foo.h:

#ifndef FOO_H
#define FOO_H

#ifndef YY_TYPEDEF_YY_SCANNER_T
#define YY_TYPEDEF_YY_SCANNER_T
typedef void *yyscan_t;
#endif

extern int foo_yylex(yyscan_t yyscanner);
extern int foo_yyparse(yyscan_t yyscanner);

extern void foo_scanner_init(const char *str, yyscan_t *yyscannerp);
extern void foo_scanner_finish(yyscan_t yyscanner);

#endif /* FOO_H */

Putting it all together, the top-level invocation in foo.c is now:

#include "foo.h"

/* stuff */

int
main(void)
{
    yyscan_t scanner;

    foo_scanner_init("some string to parse", &scanner);
    foo_yyparse(scanner);
    foo_scanner_finish(scanner);

    return 0;
}

Pure parser

Now we make the parser pure. This is done with the Bison option %define api.pure full (or %pure-parser in older Bison versions). This doesn’t change anything about the yyparse() invocation, since a parser doesn’t need a state handle.

But something needs to be done about yylval, which has been a global variable to communicate between the Flex-generated and the Bison-generated code.

With the pure parser option, yylval becomes a local variable of the yyparse() function, and yyparse() expects to call yylex() as

yylex(&yylval, yyscanner);

to pass down the place where the scanner code should put its yylval information.

So the yylex() prototype should, as far as Bison is concerned, be

int foo_yylex(YYSTYPE *yylval, yyscan_t yyscanner);

Now we need to tell Flex about this. This is done with the Flex option %option bison-bridge. This option is in my opinion documented a bit confusingly. The option is not necessary for a plain, not reentrant, not pure scanner and parser, like the ones we started out with. (Maybe there is some scenario where %option bison-bridge could also be used there while keeping the scanner non-reentrant? Not sure.) But the option is effectively required when combining a pure Bison parser with a Flex scanner.

We also need to update the declaration of foo_yylex() in foo.h. The trick here is where to get the YYSTYPE type definition from. There is a definition of that in the generated foo_parser.h, but as explained earlier, we don’t want to include that into foo.h or some other header file that the whole program might want to use.

If you use a %union declaration in the Bison parser, as I have shown here, then you can just declare an incomplete union type, like this:

union YYSTYPE;
int foo_yylex(union YYSTYPE *yylval, yyscan_t yyscanner);

If you don’t use %union, then you can put a #define YYSTYPE typehere in the header file. But then you need to make the type typehere available. This can be a bit tricky to arrange in complex situations. Also note that YYSTYPE is not namespaced with foo_yy, so this arrangement might be problematic if you have multiple parsers in a program. Using unions even if you only need one semantic type is better in my view.

Another option is to move the whole yylex() declaration out of the foo.h header file and into the C declarations section of foo_parser.y. In that case, the generated foo_parser.c provides the definition of YYSTYPE and you don’t need to provide it yourself. This works if your parser is the only thing that calls yylex(), which is the normal case. But sometimes you want to call yylex() directly, maybe to implement some kind of look-ahead functionality. (Some parsers in PostgreSQL do that.) If you need to call yylex() directly and want to avoid YYSTYPE clashes, then you might need to rearrange your header files very carefully, and make a header file that is only used by your scanner and parser code, which can then include foo_parser.h. Overall, I have found this part to be very tricky in some cases. (In PostgreSQL code, src/backend/parser/gramparse.h is an example of such an internal header file.)

Finally, note that the type of yylval changes from being a union (or some other type if you don’t use %union) to being a pointer. Where before in your Flex actions you wrote perhaps:

    yylval.intval = atoi(...);

you now have to write:

    yylval->intval = atoi(...);

Also, you now need to write yylval, not foo_yylval, because the yylex() function argument is always called yylval now, not namespaced. This is slightly confusing.

(It’s also confusing that in the Bison-generated code, the local variable yylval is a union, but in the Flex-generated code, it is a pointer. But this is not so relevant in practice, since in Bison code you don’t access yylval directly, since that is done through $1, $2, etc.)
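The calling convention described here can be modelled in plain C, independent of Flex and Bison. In this sketch, which uses only hypothetical names (union demo_value, demo_lex(), demo_parse_sum()), the stand-in parser owns the semantic value as a local variable and passes its address down, and the stand-in scanner fills it in through the pointer. A pointer to the remaining input takes the place of the scanner handle. It also shows that a forward declaration of the union is enough to write the prototype.

```c
#include <stdlib.h>

/* A forward declaration suffices for prototypes that only use pointers. */
union demo_value;
int demo_lex(union demo_value *lvalp, const char **inputp);

/* Full definition, as the parser's generated header would provide it. */
union demo_value
{
    long intval;
};

enum { DEMO_EOF = 0, DEMO_NUMBER = 1 };

/* Stand-in for the scanner: read one integer and store it via lvalp. */
int
demo_lex(union demo_value *lvalp, const char **inputp)
{
    char *end;
    long v = strtol(*inputp, &end, 10);

    if (end == *inputp)
        return DEMO_EOF;        /* nothing more that looks like a number */
    lvalp->intval = v;          /* "->" because lvalp is a pointer */
    *inputp = end;
    return DEMO_NUMBER;
}

/* Stand-in for the parser: the semantic value is a local variable. */
long
demo_parse_sum(const char *input)
{
    union demo_value lval;
    long sum = 0;

    while (demo_lex(&lval, &input) == DEMO_NUMBER)
        sum += lval.intval;     /* "." because here it is the union itself */
    return sum;
}
```

Because lval lives on the stack of each demo_parse_sum() call, concurrent or nested calls do not share it, which is the property the pure parser option provides for yylval.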

Let’s review our code so far:

foo_scanner.l:

%{
#include "foo.h"
#include "foo_parser.h"

/* some C declarations */
%}

%option prefix="foo_yy"
%option reentrant
%option bison-bridge

/* Flex definitions */

%%

/* rules (patterns and actions) */

%%

/* foo_scanner_init() */
/* foo_scanner_finish() */

/* other C code */

foo_parser.y:

%{
#include "foo.h"
#include "foo_parser.h"

/* some C declarations */
%}

%name-prefix="foo_yy"
%parse-param {yyscan_t yyscanner}
%lex-param   {yyscan_t yyscanner}
%define api.pure full

/* Bison declarations */

%%

/* actions */

%%

void
foo_yyerror(yyscan_t yyscanner, char const *message)
{
    fprintf(stderr, "%s\n", message);
}

/* other C code */

foo.h:

#ifndef FOO_H
#define FOO_H

union YYSTYPE;
#ifndef YY_TYPEDEF_YY_SCANNER_T
#define YY_TYPEDEF_YY_SCANNER_T
typedef void *yyscan_t;
#endif

extern int foo_yylex(union YYSTYPE *yylval, yyscan_t yyscanner);
extern int foo_yyparse(yyscan_t yyscanner);

extern void foo_scanner_init(const char *str, yyscan_t *yyscannerp);
extern void foo_scanner_finish(yyscan_t yyscanner);

#endif /* FOO_H */

If your Bison parser uses the %locations option (to track token locations, perhaps for improved error messages), then a non-pure parser also has a global variable yylloc, which with a pure parser turns into a local variable that is passed to yylex(). To tell Flex about this, you additionally need the Flex option %option bison-locations, and then the effective yylex() prototype is like

extern int foo_yylex(union YYSTYPE *yylval,
                     YYLTYPE *yylloc,
                     yyscan_t yyscanner);

(And YYLTYPE similarly needs to be defined somewhere.)

The locations feature is independent of whether the scanner or parser are reentrant. But if you use it you have another value to pass around next to yylval.

Extra scanner state

What we have discussed so far might be enough, but in some cases there is additional global state lingering around scanners in particular. A typical situation is using a global variable to collect semantic data that is assembled across several rules using start conditions. That might look something like this:

foo_scanner.l

%{
static char *scanbuf;
%}

%x quoted

%%

{doublequote}           {
                            /* start quoted string */
                            BEGIN(quoted);
                            scanbuf = NULL;
                        }

<quoted>{doublequote}   {
                            /* end quoted string */
                            yylval->str = scanbuf;
                            BEGIN(INITIAL);
                            return STRING;
                        }

<quoted>{text}          {
                            /* collect quoted string content */
                            scanbuf = concat(scanbuf, yytext);
                        }

%%

...

Here, scanbuf is a global variable used internally by the scanner.
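The concat() helper used in these actions is not defined in this text. One possible shape for it, assuming that the buffer is a heap-allocated C string and that NULL stands for "nothing collected yet", is sketched below. This is an illustration under those assumptions, not code from any real scanner.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: append "add" to the heap-allocated string "buf"
 * (NULL means empty) and return the resulting string, which the caller
 * eventually frees.  On allocation failure, buf is freed and NULL is
 * returned; real code would want to report that as an error.
 */
char *
concat(char *buf, const char *add)
{
    size_t oldlen = (buf != NULL) ? strlen(buf) : 0;
    size_t addlen = strlen(add);
    char *result = realloc(buf, oldlen + addlen + 1);

    if (result == NULL)
    {
        free(buf);
        return NULL;
    }
    memcpy(result + oldlen, add, addlen + 1);   /* includes the terminator */
    return result;
}
```

Note that the helper itself holds no state; whatever pointer it returns has to be stored somewhere between rule invocations, and that place is what the rest of this section is about.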

To make this scanner thread-safe, we could just make scanbuf thread-local. But then it’s still not reentrant. The proper way to do this is to set up a struct that is allocated in a local variable that is then passed to the scanner. (It doesn’t have to be a struct, but even if you only need one variable I think it’s probably better to start with a struct so that it’s easier to add more variables later on.) Then the scanner uses that instance of the local variable instead of a global variable to store its extra state.

%{
struct foo_yy_extra_type
{
    char *scanbuf;
};
%}

...

%option extra-type="struct foo_yy_extra_type *"

%x quoted

%%

{doublequote}           {
                            /* start quoted string */
                            BEGIN(quoted);
                            yyextra->scanbuf = NULL;
                        }

<quoted>{doublequote}   {
                            /* end quoted string */
                            yylval->str = yyextra->scanbuf;
                            BEGIN(INITIAL);
                            return STRING;
                        }

<quoted>{text}          {
                            /* collect quoted string content */
                            yyextra->scanbuf = concat(yyextra->scanbuf, yytext);
                        }

%%

...

The variable yyextra is magically available inside the rule actions inside yylex(). It points to an instance of struct foo_yy_extra_type. Outside of yylex(), you can get at this area using the function foo_yyget_extra(). The yyextra pointer is internally stored inside what yyscan_t points to, so it can be available anywhere the scanner handle is available. (Actually, the yyextra variable works exactly like for example yytext, with the various ways of accessing it discussed earlier.)

And then you need to set up this yyextra area using the function yyset_extra() before calling the scanner. Our scanner initialization function is a good place to put this:

void
foo_scanner_init(const char *str, yyscan_t *scannerp)
{
    struct foo_yy_extra_type yyext;

    yylex_init(scannerp);
    yyset_extra(&yyext, *scannerp);
    yy_scan_string(str, *scannerp);
}

Haha, this doesn’t work! yyext is a local variable in foo_scanner_init(), and yyset_extra() will save its address internally in the yyscan_t value, but once foo_scanner_init() exits, that address is garbage. The compiler doesn’t warn about this, because it doesn’t know what yyset_extra() does internally. (Maybe a static analyzer would have a better chance.) Instead, the extra area needs to be allocated with malloc(), or perhaps some other means that survives outside the function, for example:

void
foo_scanner_init(const char *str, yyscan_t *scannerp)
{
    struct foo_yy_extra_type *yyextp = malloc(sizeof(struct foo_yy_extra_type));

    /* yylex_init() takes the address of the handle ... */
    yylex_init(scannerp);
    /* ... while the other functions take the handle itself */
    yyset_extra(yyextp, *scannerp);
    yy_scan_string(str, *scannerp);
}

void
foo_scanner_finish(yyscan_t scanner)
{
    struct foo_yy_extra_type *yyextp = yyget_extra(scanner);

    free(yyextp);
    yylex_destroy(scanner);
}

Returning the parse result

Another common source of non-reentrancy happens at the top of the call stack. How do you get the parse result out of yyparse()? The function yyparse() only returns whether parsing was successful or not; it doesn’t return what it actually parsed. In general, there might not be anything else to return: yyparse() could execute actions as it parses and be done. (The PostgreSQL bootstrap parser works like that. So it’s more like an interpreter.) But in practice, you’ll want to get back some kind of syntax tree or other structure with some information about what was parsed. By default, the signature of yyparse() doesn’t provide a place to return that. So it is common to stick that result into a global variable. For example:

foo.h

extern FooSomeType* foo_parse_result;

foo_parser.y

...

%%
statement: something    {
                            foo_parse_result = $1;
                        }
    ;

...
%%

...

foo.c

#include "foo.h"

FooSomeType* foo_parse_result;

/* stuff */

int
main(void)
{
    yyscan_t scanner;

    foo_scanner_init("some string to parse", &scanner);
    foo_yyparse(scanner);
    foo_scanner_finish(scanner);

    /* do something with foo_parse_result here */

    return 0;
}

To improve that, we can use the %parse-param directive that we saw earlier. That way, we can add additional parameters to the signature of the yyparse() function:

foo_parser.y

...

%parse-param {FooSomeType **foo_parse_result_p}

%%
statement: something    {
                            *foo_parse_result_p = $1;
                        }
    ;

...
%%

...

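The generated yyparse() function prototype then looks like this (the parameters appear in the order of the %parse-param declarations, so this assumes the new declaration is placed before the one for yyscanner):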

int foo_yyparse(FooSomeType **foo_parse_result_p, yyscan_t yyscanner);

The main program can then look like this:

foo.c

#include "foo.h"

/* stuff */

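int
main(void)
{
    yyscan_t scanner;
    FooSomeType *parse_result = NULL;
    int rc;

    foo_scanner_init("some string to parse", &scanner);
    rc = foo_yyparse(&parse_result, scanner);
    foo_scanner_finish(scanner);

    /* yyparse() returns nonzero on failure; parse_result may not be set then */
    if (rc != 0)
        return 1;

    /* do something with parse_result here */

    return 0;
}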

This way, the parse result is stored in a local variable whose address is passed directly to yyparse(). Note that here again the type of the “parse result” variable has changed to be a pointer to the thing it was before.
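To see the pattern in isolation, here is a minimal sketch in plain C, without Bison. The functions toy_parse() and toy_free() are hypothetical stand-ins invented for this illustration (a real foo_yyparse() is generated, and FooSomeType would be whatever tree type your grammar builds). The only point is that each caller owns the storage for its result.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the tree type that a real grammar would build. */
typedef struct FooSomeType
{
    char *text;
} FooSomeType;

/*
 * Hand-written stand-in for a generated foo_yyparse().  Like yyparse(), it
 * returns 0 on success and nonzero on failure, and it hands the result back
 * only through the caller-supplied pointer, never through a global.
 */
int
toy_parse(FooSomeType **foo_parse_result_p, const char *input)
{
    FooSomeType *node;
    size_t len;

    assert(foo_parse_result_p != NULL);

    if (input == NULL || input[0] == '\0')
        return 1;               /* "syntax error": result is left untouched */

    node = malloc(sizeof(*node));
    if (node == NULL)
        return 2;               /* out of memory */

    len = strlen(input) + 1;
    node->text = malloc(len);
    if (node->text == NULL)
    {
        free(node);
        return 2;
    }
    memcpy(node->text, input, len);

    /* corresponds to the grammar action "*foo_parse_result_p = $1;" */
    *foo_parse_result_p = node;
    return 0;
}

/* Release a result produced by toy_parse(); accepts NULL. */
void
toy_free(FooSomeType *node)
{
    if (node != NULL)
    {
        free(node->text);
        free(node);
    }
}
```

With this shape, two callers (or two threads) can each pass the address of their own variable, and the results cannot overwrite each other. A failed parse leaves the caller’s variable as it was, so it is worth initializing it before the call.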

Passing context information to scanner and parser

Sometimes you need to pass context information to both the scanner and the parser. This could be any kind of program state or a session handle or whatever. (In the PostgreSQL code, we sometimes need to pass down some context information for error handling.) Most easily, this could just be a global variable in your program that everything has access to. But if we want to make our code multi-thread-safe, then this might not work.

Let’s say as an example, we want to pass down a locale handle of type locale_t.

In the previous section, we saw how we can pass down additional data to the parser using the directive %parse-param. So we could do something like:

foo_parser.y

...

%parse-param {locale_t loc}

%%
bip: bop                {
                            /* make use of passed-in loc */
                            $$ = something($1, loc);
                        }
    ;

...
%%

...

And the main program could do something like:

foo.c

int
main(void)
{
    ...
    locale_t mylocale;

    ...
    mylocale = newlocale(...);
    ...
    foo_yyparse(mylocale, ...);
    ...
}

This gets the data into the parser. To have the parser pass it down to the scanner, we use the directive %lex-param that we also already saw earlier.

foo_parser.y

...

%parse-param {locale_t loc}
%lex-param   {locale_t loc}

%%
bip: bop                {
                            /* make use of passed-in loc */
                            $$ = something($1, loc);
                        }
    ;

...
%%

...

Note that the argument names in the parameter declarations have to be the same. That is because the generated code will end up something like this:

int
yyparse(..., locale_t loc, ...)  /* parse-param */
{
    ...
    yylex(..., loc, ...);  /* lex-param */
    ...
}

But this is only the Bison side: It tells the code generated by Bison to call yylex() with an additional argument loc. But it doesn’t actually alter the code generated by Flex to accept that argument. To achieve that, we need to change the declaration of yylex() in the scanner definition, too.
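What both sides have to agree on can be sketched by hand in plain C. Everything here is a made-up stand-in for illustration: ToyLocale plays the role of locale_t, ToyScanner the role of yyscan_t, and toy_yyparse() and toy_yylex() the roles of the generated functions. The parser receives the context as an argument and forwards it unchanged on every call to the scanner, whose signature has to accept it in the same position.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for locale_t: context passed per call, not global. */
typedef struct ToyLocale
{
    char separator;             /* character that separates numbers */
} ToyLocale;

/* Hypothetical stand-in for yyscan_t: scanner state owned by the caller. */
typedef struct ToyScanner
{
    const char *pos;            /* next unread character of the input */
} ToyScanner;

enum { TOK_EOF = 0, TOK_NUMBER = 1, TOK_ERROR = 2 };

/*
 * Stand-in for yylex(): semantic value first, then the extra argument,
 * then the scanner handle.
 */
int
toy_yylex(int *yylval_param, const ToyLocale *loc, ToyScanner *scanner)
{
    /* use the passed-in context to decide what counts as a separator */
    while (*scanner->pos == loc->separator)
        scanner->pos++;

    if (*scanner->pos == '\0')
        return TOK_EOF;
    if (!isdigit((unsigned char) *scanner->pos))
        return TOK_ERROR;

    *yylval_param = 0;
    while (isdigit((unsigned char) *scanner->pos))
    {
        *yylval_param = *yylval_param * 10 + (*scanner->pos - '0');
        scanner->pos++;
    }
    return TOK_NUMBER;
}

/*
 * Stand-in for yyparse(): gets loc as an argument (the parse-param side)
 * and passes the same loc on to the scanner (the lex-param side).
 * It adds up all numbers and returns 0 on success, 1 on a bad token.
 */
int
toy_yyparse(int *sum_p, const ToyLocale *loc, ToyScanner *scanner)
{
    int yylval;
    int token;

    *sum_p = 0;
    while ((token = toy_yylex(&yylval, loc, scanner)) == TOK_NUMBER)
        *sum_p += yylval;

    return (token == TOK_EOF) ? 0 : 1;
}
```

Because the context travels through the argument lists, two parses with different contexts do not interfere with each other. If toy_yylex() did not have the loc parameter, the call in toy_yyparse() would not compile, which is the same kind of mismatch you get when only the Bison side has been changed.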

The facility in Flex for that is different (arguably not as nice). You define the macro YY_DECL to be the complete declaration of yylex(). So for example

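#define YY_DECL int foo_yylex(union YYSTYPE *yylval_param, \
                              locale_t loc, \
                              yyscan_t yyscanner)

Note the order: yylval and yylloc (if used) are first, then the arguments added by %lex-param (in the order declared), with yyscan_t last. Also, there is no semicolon at the end. The parameter names matter here, because the function body generated by Flex refers to them: with bison-bridge, it expects the first parameter to be called yylval_param (yylval itself is a macro in a reentrant scanner), and it expects the scanner handle to be called yyscanner.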

The code generated from foo_scanner.l then uses that macro as the header of the actual definition of foo_yylex(), overriding the built-in default.
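The mechanism can be tried out without Flex. The following is only a sketch of the idea, not a copy of the generated code: the function name toy_yylex, its bias parameter, and the body are invented for this illustration. What it mirrors is that the generated file installs a default YY_DECL only if none is defined yet, and then starts the function definition with the macro.

```c
#include <assert.h>
#include <stddef.h>

/* What the user writes near the top of the scanner file. */
#define YY_DECL int toy_yylex(int *yylval_param, int bias, void *yyscanner)

/*
 * Roughly what the generated code does further down: fall back to a
 * default only if the user has not provided a definition.  Since YY_DECL
 * is already defined above, this default is skipped.
 */
#ifndef YY_DECL
#define YY_DECL int yylex(void)
#endif

/* A matching prototype, as one would put into a header file. */
int toy_yylex(int *yylval_param, int bias, void *yyscanner);

/*
 * The definition begins with the macro, followed directly by the body.
 * This is why the macro itself must not end in a semicolon.
 */
YY_DECL
{
    (void) yyscanner;           /* unused in this toy body */
    *yylval_param = 42 + bias;  /* pretend semantic value */
    return 0;                   /* pretend "end of input" */
}
```

If the prototype and the macro disagree, the compiler reports conflicting types at the point of the definition, which is usually the first sign that the two sides are out of sync.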

Where do you put this definition? I think Flex assumes that you put it into the scanner definition file, like this:

foo_scanner.l

%{
#include "foo.h"
#include "foo_parser.h"

#define YY_DECL int foo_yylex(union YYSTYPE *yylval_param, locale_t loc, yyscan_t yyscanner)

/* some C declarations */
%}

%option prefix="foo_yy"
%option reentrant
%option bison-bridge

/* Flex definitions */

%%

/* rules (patterns and actions) */

%%

...

/* other C code */

And then you need to update the declaration in foo.h to match:

foo.h

#ifndef FOO_H
#define FOO_H

...

extern int foo_yylex(union YYSTYPE *yylval, locale_t loc, yyscan_t yyscanner);
extern int foo_yyparse(locale_t loc, yyscan_t yyscanner);

...

#endif /* FOO_H */

(Since foo_scanner.l includes foo.h, the compiler will check that the two match.)

Another pattern that I have seen is to put the definition of YY_DECL into the header file such as foo.h, like this:

foo.h

#ifndef FOO_H
#define FOO_H

...

#define YY_DECL int foo_yylex(union YYSTYPE *yylval_param, locale_t loc, yyscan_t yyscanner)
YY_DECL;
extern int foo_yyparse(locale_t loc, yyscan_t yyscanner);

...

#endif /* FOO_H */

This way, you use the YY_DECL macro itself to create the declaration in the header file, and so you save having to write it twice.

But the big problem with this is that YY_DECL is not namespaced. There is no “FOO_YY_DECL”. So if you put this into a header file, the definition leaks into every other file that includes the header, including files that define their own scanners. (This has actually happened in my current work on PostgreSQL. The effects are very confusing.) So I don’t recommend this. An alternative would be to create a separate header file used only by the scanner and parser and put it there. See also above for the discussion about where to put the yylex() declaration.
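The leak is easy to reproduce in a single file. This is a simulation under simplifying assumptions: the two marked regions stand for the contents of foo.h and for a second, unrelated scanner (called bar here, a made-up name) that happens to include foo.h, and the default-if-undefined logic only approximates what generated scanner code does.

```c
#include <assert.h>
#include <stddef.h>

/* ---- simulated contents of foo.h (the pattern advised against) ---- */
#define YY_DECL int foo_yylex(void *yyscanner)
YY_DECL;                        /* prototype for foo's scanner */

/* ---- simulated bar scanner, which included foo.h for other reasons ---- */

/*
 * The bar scanner would install its own default here, but only if
 * YY_DECL is not defined yet.  Thanks to foo.h it already is, so the
 * default is silently skipped.
 */
#ifndef YY_DECL
#define YY_DECL int bar_yylex(void *yyscanner)
#endif

/* Meant to define bar_yylex(), but actually defines foo_yylex(). */
YY_DECL
{
    (void) yyscanner;
    return 7;                   /* arbitrary value to recognize this body */
}
```

Nothing here produces a compiler diagnostic. The function bar_yylex() simply never comes into existence, and the body written for it ends up under the name foo_yylex(). In a real build, this would show up later as a missing or duplicate symbol at link time, far away from the actual cause.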

Conclusion and tips

It all makes sense in retrospect! ;-)

To conclude, here are a few general tips that I learned along the way:

Obviously, consult the Bison manual and the Flex manual. They also have specific sections on pure parsers and reentrant scanners.
Keep straight what belongs to Bison and what belongs to Flex. These are separate tools. They both deal with “yystuff”, and there are some defined interface points such as yylex and yylval, but all the other things belong to only one or the other. In some cases, you need to do corresponding changes on both sides, but the way to do that or the option names will usually be different. When in doubt, check the respective manuals. The manuals also have indexes that list for example function or symbol names that the respective tool deals with. If something is not listed there, then it might actually belong to the other tool.
If you are doing such a conversion, approach it incrementally, step by step, as I’ve shown here. Each step leaves you with a working state. (Some of the other guides that I have shown mixed all of this together, and then it was hard to find where the fault was when something went wrong.)
If in doubt, read the output files generated from Flex or Bison. They are a bit complicated to read, and you don’t need to understand everything. But the authoritative information about what the code does is in there. For example, if you change an option and get a mismatch of a function declaration, you can find the expected declaration in the generated file and then make your own declaration or definition match that.
Some of the option names or option syntax I have talked about here have changed a few times over different versions of Bison or Flex. I think what I have written here is the most modern variant. In PostgreSQL, we try to support a fairly large range of Bison and Flex versions, so the actual code there might look different. If you also want to do that, you might need to do some additional research in the Bison and Flex change logs.