Implementing thread-safe scanners and parsers in PostgreSQL
I have recently been working on making various scanners and parsers in PostgreSQL thread-safe. This article is a bit of a brain dump to record what I did and what the different steps were, because all of that was pretty difficult to piece together, and it seems worth recording somewhere what was found and learned.
Others have written about similar journeys before, and while those articles gave some useful hints, they didn’t contain all the context and details that I ultimately needed, so here is my own journey. This text is not specific to PostgreSQL, but it is informed by it.
Before we start, let’s sort out the adjectives. The reason for this work was to prepare the scanners and parsers for possibly using threads instead of processes in the PostgreSQL server in the future. Therefore, we want them to be thread-safe. By default, scanners created by Flex and parsers created by Bison use various global variables to store their internal state and to communicate with each other and with their callers. Global variables like that aren’t thread-safe. One approach to fix that would be to mark all those global variables for thread-local storage. That would probably work, but unfortunately neither Bison nor Flex appears to provide an option to produce their output in such a way. (Also, thread-local storage is a relatively new C feature.)
Another approach is to have the Bison and Flex outputs created in a way that they don’t use global variables. Such options exist. For Flex, this option is called “reentrant”; for Bison, it is called “pure”. This difference is a bit annoying when you talk about it, but I suppose it is technically correct. (The Bison manual actually uses both terms, too.) A Bison parser produced with this option is a “pure function” in the sense that it only looks at its input to produce its output. It doesn’t keep any state across calls, nor does it look at or modify any external state. A Flex scanner produced with the “reentrant” option is not a pure function, because it is passed a handle to state that it modifies. The difference exists because the scanner is used differently from the parser: calling the scanner returns one token at a time until it signals that the input is done, whereas the parser is just called once and parses the whole input.
For our goal of making thread-safe scanners and parsers, this is close
enough, but it’s important to keep the difference in mind sometimes.
For example, while the code generated by Bison and Flex will be pure
and reentrant, respectively, the action code that you inject is up to
you; it could be reentrant or not, or thread-safe or not. Also, you
can make reentrant scanners without using the “reentrant” option. For
example, the PostgreSQL configuration file parser (guc-file.l) was
already reentrant before this, because it needs to process
configuration files included from another file, but it did this just
by saving and restoring the global variables around calling the
scanner for the included file. That is reentrant just fine, but not
thread-safe.
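To make the distinction concrete, here is a minimal sketch in plain C (not Flex output). The names toy_input, toy_pos, and toy_scan() are made up for illustration. The function keeps its state in global variables but saves and restores them around a nested call, which is the save-and-restore idea just described:

```c
#include <stddef.h>

/* Hypothetical global scanner state, standing in for a scanner's globals. */
static const char *toy_input;   /* text currently being scanned */
static size_t toy_pos;          /* current offset into toy_input */

/*
 * Count space-separated words in "text".  A word starting with '@' is
 * treated like an include directive: instead of counting it, the words
 * of "included" are counted by a nested call (if "included" is given).
 */
static int
toy_scan(const char *text, const char *included)
{
    /* Save the caller's state so that a nested call cannot clobber it. */
    const char *saved_input = toy_input;
    size_t      saved_pos = toy_pos;
    int         words = 0;

    toy_input = text;
    toy_pos = 0;
    while (toy_input[toy_pos] != '\0')
    {
        if (toy_input[toy_pos] == ' ')
        {
            toy_pos++;
            continue;
        }
        if (toy_input[toy_pos] == '@' && included != NULL)
            words += toy_scan(included, NULL);  /* nested scan */
        else
            words++;
        /* skip to the end of the current word */
        while (toy_input[toy_pos] != '\0' && toy_input[toy_pos] != ' ')
            toy_pos++;
    }

    /* Restore the caller's state. */
    toy_input = saved_input;
    toy_pos = saved_pos;
    return words;
}
```

The nested call works because the outer state is put back afterwards. Two threads calling toy_scan() at the same time would still overwrite each other’s toy_input and toy_pos, so this is reentrant but not thread-safe.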
The PostgreSQL scanners and parsers
PostgreSQL is an SQL database management system, so it has a scanner
and parser for SQL. But it also has a number of others, and they’re
all a bit different, which is what makes all of this extra
complicated. As I’m writing this, the PostgreSQL source tree contains
13 *.l files and 10 *.y files. Here is a summary of what these do:
- A scanner/parser pair for the main SQL language.
- A scanner/parser pair for the SQL-like language used by the replication protocol.
- A scanner/parser pair for processing the synchronous replication configuration language.
- A scanner/parser pair for the special bootstrap language.
- Three scanner/parser pairs that process the input syntax of data types (jsonpath, cube, seg).
- A scanner (only) for processing server configuration files (postgresql.conf).
- A scanner/parser pair for the expression language used in pgbench.
- Two scanners for use by psql: one for scanning SQL syntax, one for processing backslash commands (the former also used by pgbench).
- A scanner for ECPG (embedded SQL in C), which has to scan both SQL and C.
- (There is also a parser for ECPG, which is assembled on the fly out of various pieces and which is not counted with the *.y files.)
- A parser for PL/pgSQL.
- (There is also a scanner for PL/pgSQL, but that’s implemented as a wrapper around the main SQL scanner, so it’s not counted here, but it also needed to be modified extensively by this project.)
- A scanner/parser pair for the isolation tester custom test description language.
These are all used in different contexts and have different requirements. Some are in the server, some in client programs, some in test drivers. They differ in how they manage memory, how they produce error messages, what special cases they need to deal with, and where their input comes from. And they all had a different starting state; some had already used some or all of the options discussed below, some none.
Starting setup
Let’s build something up from scratch and learn as we go.
The starting setup is that you have:
A scanner file, say foo_scanner.l:
%{
#include "foo.h"
/* some C declarations */
%}
/* Flex definitions */
%%
/* rules (patterns and actions) */
%%
/* other C code */
A parser definition file, say foo_parser.y:
%{
#include "foo.h"
/* some C declarations */
%}
/* Bison declarations */
%union
{
...
}
%%
/* grammar rules and actions */
%%
void
yyerror(char const *message)
{
fprintf(stderr, "%s\n", message);
}
/* other C code */
Note that Bison requires that the user supplies a yyerror() function.
(In PostgreSQL code, the yyerror() function is typically in the
scanner file (foo_scanner.l). This is convenient because then you can
also call the same error handling function from the scanner. You then
also need to put the yyerror() declaration into some header file such
as foo.h (see below) so that the parser can get at it. But keep in
mind that the invocation of yyerror() is determined by Bison; Flex
doesn’t know about it and Flex-generated code does not call it, unless
the user code does. I’m going to ignore this idiosyncrasy in this
article to keep it simple.)
A header file for your project, say foo.h:
#ifndef FOO_H
#define FOO_H
extern int yylex(void);
extern int yyparse(void);
#endif /* FOO_H */
And some main program, say foo.c:
#include "foo.h"
/* stuff */
int
main(void)
{
yyparse();
return 0;
}
(We’ll skip most error handling in these examples. You should check
the return value of yyparse().)
Let’s look at the header file. It declares yylex() and yyparse(),
which are the main entry points for the generated scanner and parser.
These functions are generated by Flex and Bison, respectively. When
you take the foo_scanner.l file and run it through Flex, it generates
essentially
#include "foo.h"
/* some C declarations */
int
yylex(void)
{
/* magic and actions */
}
/* other C code */
and similarly when you run foo_parser.y through Bison it generates
something like
#include "foo.h"
/* some C declarations */
YYSTYPE yylval;
int
yyparse(void)
{
/* magic and actions */
/* calls yylex() somewhere here */
}
void
yyerror(char const *message)
{
fprintf(stderr, "%s\n", message);
}
/* other C code */
If you have both a parser and a scanner, then your main program will
call the parser by calling yyparse(), and that will internally call
yylex() as needed, so you don’t see the latter explicitly in your
code. You can also have programs that only have a scanner, in which
case your code will call yylex() directly.
Since yylex() and yyparse() are defined in separate files, you
need to declare them so that they can be called from other files. It
is possible to have Flex and Bison generate header files that contain
these declarations, and then your program could include those
generated header files. But those header files contain a bunch of
other stuff, too, which you might not want to leak into your other
code (such as definitions for token types). Those headers are useful
in certain cases for communicating between the scanner and the parser,
but the rest of the program usually doesn’t need or want most of that
stuff. So it is more typical to write the declarations by hand as is
shown here.
But you do want Bison to generate its header file (option bison -H),
which will be called foo_parser.h in our example. It will contain
something like this:
union YYSTYPE
{
...
};
typedef union YYSTYPE YYSTYPE;
extern YYSTYPE yylval;
And then you need to include that header file in foo_scanner.l,
because it provides the yylval variable that the parser and scanner
use to exchange information about the semantic values of tokens.
In the example constructed so far, the input to the scanner comes from
the standard input. If that’s what you want, then you’re set. But in
the cases I’ve been dealing with, reading from some sort of string in
memory is more common (for example, an SQL statement or a data type
input value). To make the scanner read from a string, call the
function yy_scan_string() (or an alternative such as yy_scan_bytes()
or yy_scan_buffer()).
This function is available in foo_scanner.l. It has external
linkage, so you could also declare it and use it from elsewhere, such
as your main program, but then you also need to do extra work to make
its return type available on the outside, so I wouldn’t recommend
that. Instead, it’s better to write a small wrapper function like
void
foo_scanner_init(const char *str)
{
yy_scan_string(str);
}
in foo_scanner.l and call that from the main program.
Actually, let’s add another tweak to this. We want to namespace the
generated function names, because as it is, you can only have one
scanner and one parser in your program. This doesn’t have anything to
do with reentrancy or multithreading; it has to do with symbol names.
There can only be one yylex and one yyparse symbol in a C program.
It is good hygiene to avoid symbol clashes like that. Even if your
program has only one language to parse, maybe it will support plugins
that want to do their own parsing, or the program is actually a
library and will be called from another program.
So what we’ll actually start with is:
foo_scanner.l:
%{
#include "foo.h"
#include "foo_parser.h"
/* some C declarations */
%}
%option prefix="foo_yy"
/* Flex definitions */
%%
/* rules (patterns and actions) */
%%
void
foo_scanner_init(const char *str)
{
foo_yy_scan_string(str);
}
/* other C code */
foo_parser.y:
%{
#include "foo.h"
#include "foo_parser.h"
/* some C declarations */
%}
%name-prefix="foo_yy"
/* Bison declarations */
%union
{
...
}
%%
/* actions */
%%
void
foo_yyerror(char const *message)
{
fprintf(stderr, "%s\n", message);
}
/* other C code */
foo.h:
#ifndef FOO_H
#define FOO_H
extern int foo_yylex(void);
extern int foo_yyparse(void);
extern void foo_scanner_init(const char *str);
#endif /* FOO_H */
foo.c:
#include "foo.h"
/* stuff */
int
main(void)
{
foo_scanner_init("some string to parse");
foo_yyparse();
return 0;
}
The choice of the prefix is of course arbitrary; I’m stipulating here that our overall project is called “foo”. Note that if you want to keep the “yy” in the prefix, you need to specify that, otherwise you might end up with perhaps “foolex” and “fooparse”, which is fine, but not typical and perhaps a bit confusing. The “yy” is a good hint that it’s got something to do with Flex or Bison, so it’s good to keep that.
Below, I sometimes use yysomething and foo_yysomething
interchangeably, to reduce the clutter in the text. Just keep in mind
that in practice most (not all! see later) symbols should have a
“foo_yy” prefix. Actually, inside foo_scanner.l and foo_parser.y
you can use both interchangeably because there are macros that define
one to the other. Only outside of those two files do you have to
write the full prefixed names.
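The renaming macros can be pictured with the following simplified sketch. It is a hand-written model, not actual generated code, and toy_caller() is a made-up name. The #undef lines stand in for the boundary between the generated file and the rest of the program:

```c
#include <string.h>

/* Roughly what a generated file does for the prefix "foo_yy" (simplified). */
#define yylex foo_yylex
#define yytext foo_yytext

/* Inside the "generated" file, the short names can be used ... */
const char *yytext = "";        /* really defines foo_yytext */

int
yylex(void)                     /* really defines foo_yylex() */
{
    yytext = "token";
    return 1;
}

/* End of the "generated" file: the macros are not visible elsewhere. */
#undef yylex
#undef yytext

/* ... but other code only sees the prefixed symbols. */
int
toy_caller(void)
{
    return foo_yylex() == 1 && strcmp(foo_yytext, "token") == 0;
}
```

Because the linker only ever sees foo_yylex and foo_yytext, a second scanner with a different prefix can live in the same program without a clash.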
Now note in the main program: The call to foo_scanner_init() doesn’t
return anything, and the call to yyparse() doesn’t take any
arguments. All the information is kept in global variables. You
couldn’t run two parsers like that concurrently. This is what we are
trying to fix.
Reentrant scanner
Now we make the scanner reentrant. This is done with the Flex option
%option reentrant. So the scanner source file now looks like this:
foo_scanner.l:
%{
#include "foo.h"
#include "foo_parser.h"
/* some C declarations */
%}
%option prefix="foo_yy"
%option reentrant
/* Flex definitions */
%%
/* rules (patterns and actions) */
%%
...
/* other C code */
When processing this file with Flex, the generated yylex() function
now has an argument of type yyscan_t that represents a sort of
handle for the scanner instance. So the generated file now notionally
looks like this:
#include "foo.h"
#include "foo_parser.h"
/* some C declarations */
int
foo_yylex(yyscan_t yyscanner)
{
/* magic and actions */
}
/* other C code */
Also, all the other utility functions that Flex provides, such as
yy_scan_string(), will now take an additional yyscan_t argument
where before they would just operate on the one global scanner
instance (see example below).
This also means that the declaration of foo_yylex() in foo.h needs
to be updated accordingly. But where does the type yyscan_t come
from? Inside the generated foo_scanner.c, the type definition is
provided by the code generated by Flex. Outside, we need to make it
ourselves. yyscan_t is actually just void *, so that’s easy. So
foo.h could look like this:
#ifndef FOO_H
#define FOO_H
typedef void *yyscan_t;
extern int foo_yylex(yyscan_t yyscanner);
extern int foo_yyparse(void); /* XXX we'll also adjust this in a minute */
extern void foo_scanner_init(const char *str); /* XXX ditto */
#endif /* FOO_H */
A small side problem here: Since foo.h is also included into
foo_scanner.c, you will have multiple instances of
typedef void *yyscan_t. In C11 and later, it is ok to have multiple
definitions of a type (as long as they agree), but older compilers
might complain. Then you should write
#ifndef YY_TYPEDEF_YY_SCANNER_T
#define YY_TYPEDEF_YY_SCANNER_T
typedef void *yyscan_t;
#endif
This is the same incantation that the Flex-generated code uses internally, and so then there will be only one definition.
Also note that the type name doesn’t have to be yyscan_t, as long as
it’s typedef’ed to void *. yyscan_t is just what the Flex
documentation calls it. (Also, maybe it should be foo_yyscan_t, but
it seems people don’t use that.)
(You could also free-style it and just declare yylex as
int foo_yylex(void *), but I wouldn’t recommend that, because it
reduces readability.)
Inside the actions that you write in foo_scanner.l, you will
typically use some variables provided by Flex, such as yytext (the
text matched by the pattern) and yyleng (its length) and a few more.
In a non-reentrant scanner, these are global variables, like
char *yytext;
yy_size_t yyleng; /* usually same as size_t */
(and then there are preprocessor defines to turn these into the
namespaced foo_yytext etc.). In a reentrant scanner, these get
turned into preprocessor magic like this:
#define yyleng yyg->yyleng_r
#define yytext yyg->yytext_r
and in the generated yylex() there is a local variable definition
like
struct yyguts_t * yyg = (struct yyguts_t*)yyscanner;
So inside yylex(), this will work transparently, and you don’t need
to change any uses of yytext etc. But if you have a helper function
that accesses yytext directly, this will not work; you’ll get a
confusing compiler error about yyg not being known.
The official way to get access to these variables from outside
yylex() is to use helper functions like yyget_text() and
yyget_leng(). Alternatively, if you want to avoid an extra
function call for performance reasons, you could also just copy the
above definition of yyg into your code. (Or maybe you could
redefine yytext to something like yyget_text(yyscanner)? There
might be various possibilities.)
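The following hand-written model may help to see why the macro form only works inside the scanning function. It is a sketch under assumptions, not Flex code: struct toy_guts, toy_text, toy_get_text() and the other toy_ names are invented stand-ins for yyguts_t, yytext, yyget_text() and so on.

```c
#include <stddef.h>
#include <string.h>

typedef void *toy_scan_t;       /* opaque handle, in the spirit of yyscan_t */

/* Hypothetical stand-in for the scanner's internal state struct. */
struct toy_guts
{
    const char *text_r;         /* text of the current token */
    size_t      leng_r;         /* its length */
};

/* Accessor functions, usable wherever the handle is available. */
const char *
toy_get_text(toy_scan_t scanner)
{
    return ((struct toy_guts *) scanner)->text_r;
}

size_t
toy_get_leng(toy_scan_t scanner)
{
    return ((struct toy_guts *) scanner)->leng_r;
}

/* These macros only compile where a variable named "tg" is in scope. */
#define toy_text (tg->text_r)
#define toy_leng (tg->leng_r)

/* A helper outside toy_lex() has no "tg", so it goes through the accessor. */
static int
toy_token_is_capitalized(toy_scan_t scanner)
{
    const char *t = toy_get_text(scanner);

    return t[0] >= 'A' && t[0] <= 'Z';
}

/* Stand-in for yylex(): records the token, then classifies it. */
int
toy_lex(toy_scan_t scanner, const char *token)
{
    struct toy_guts *tg = (struct toy_guts *) scanner;

    toy_text = token;           /* expands to tg->text_r = token */
    toy_leng = strlen(token);
    return toy_token_is_capitalized(scanner) ? 2 : 1;
}
```

Writing toy_text inside toy_token_is_capitalized() would fail to compile with a complaint about tg, which mirrors the confusing error about yyg mentioned above.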
Now with the reentrant scanner, we need to also initialize the handle
before we can start the scanner. That is done by the function
yylex_init(). And then you can also clean it up afterwards using
yylex_destroy(). So the notional use looks something like this:
/* local variable, this is the scanner handle */
yyscan_t scanner;
yylex_init(&scanner);
...
yylex(scanner);
...
yylex_destroy(scanner);
You need to be really careful here that yylex_init() takes
&scanner but the other functions take scanner. If you get this
wrong, the compiler isn’t going to complain, since these are all
void * pointers. (You might get a warning about scanner being used
before being initialized.)
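The life cycle of such a handle can be modeled without Flex. The sketch below is only an illustration of the calling convention; toy_lex_init(), toy_lex(), toy_lex_destroy() and struct toy_state are invented names, and the “scanner” simply returns one non-space character per call. It shows the init function taking the address of the handle, the other functions taking the handle itself, and two instances being used in an interleaved way:

```c
#include <stdlib.h>

typedef void *toy_scan_t;       /* opaque handle, in the spirit of yyscan_t */

/* Hypothetical per-instance state hidden behind the handle. */
struct toy_state
{
    const char *input;          /* borrowed; must outlive the scanner */
    size_t      pos;
};

/* Takes the ADDRESS of the handle and fills it in; returns 0 on success. */
int
toy_lex_init(const char *input, toy_scan_t *scannerp)
{
    struct toy_state *st = malloc(sizeof(*st));

    if (st == NULL)
        return 1;
    st->input = input;
    st->pos = 0;
    *scannerp = st;
    return 0;
}

/* Takes the handle itself; returns the next non-space character, or 0. */
int
toy_lex(toy_scan_t scanner)
{
    struct toy_state *st = scanner;

    while (st->input[st->pos] == ' ')
        st->pos++;
    if (st->input[st->pos] == '\0')
        return 0;
    return (unsigned char) st->input[st->pos++];
}

void
toy_lex_destroy(toy_scan_t scanner)
{
    free(scanner);
}

/* Run two scanners in an interleaved way; returns 1 if neither disturbed the other. */
int
toy_interleave_demo(void)
{
    toy_scan_t  a;
    toy_scan_t  b;
    int         ok;

    if (toy_lex_init("x y", &a) != 0)
        return 0;
    if (toy_lex_init("1 2", &b) != 0)
    {
        toy_lex_destroy(a);
        return 0;
    }
    ok = toy_lex(a) == 'x' && toy_lex(b) == '1' &&
         toy_lex(a) == 'y' && toy_lex(b) == '2' &&
         toy_lex(a) == 0 && toy_lex(b) == 0;
    toy_lex_destroy(a);
    toy_lex_destroy(b);
    return ok;
}
```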
In our example, the call to yylex_init() is best put into our
scanner initialization function:
void
foo_scanner_init(const char *str, yyscan_t *yyscannerp)
{
yyscan_t yyscanner;
yylex_init(yyscannerp);
yyscanner = *yyscannerp;
yy_scan_string(str, yyscanner);
}
(This assignment from yyscannerp to yyscanner is just my idea to
help keep these straight. You might have different stylistic
preferences.)
Also, let’s put the yylex_destroy() call into a corresponding
cleanup function:
void
foo_scanner_finish(yyscan_t yyscanner)
{
yylex_destroy(yyscanner);
}
We’ll need both of these functions later on to add more things to them, so it’s good to set them up now.
With all this, you can now initialize multiple scanners and run them in overlapping ways or in multiple threads and so on. Good.
In the small code snippet further above, we call yylex() directly,
but in our overall example we are calling the parser and the parser
calls the scanner internally. So what we need to do is initialize the
scanner, then pass the scanner handle to the parser, and then have the
parser pass the scanner handle to the scanner. To do this, we need to
tell Bison in the parser definition file that both yyparse() and
yylex() have an additional argument. This is done with the
following declarations:
%parse-param {yyscan_t yyscanner}
%lex-param {yyscan_t yyscanner}
Note that %parse-param also affects the argument list of
yyerror().
So in total foo_parser.y will be
%{
#include "foo.h"
#include "foo_parser.h"
/* some C declarations */
%}
%name-prefix="foo_yy"
%parse-param {yyscan_t yyscanner}
%lex-param {yyscan_t yyscanner}
/* Bison declarations */
%%
/* actions */
%%
void
foo_yyerror(yyscan_t yyscanner, char const *message)
{
fprintf(stderr, "%s\n", message);
}
/* other C code */
And the generated code will effectively be something like this:
#include "foo.h"
#include "foo_parser.h"
/* some C declarations */
...
int
foo_yyparse(yyscan_t yyscanner)
{
/* magic and actions */
/* calls foo_yylex(yyscanner) somewhere here */
}
void
foo_yyerror(yyscan_t yyscanner, char const *message)
{
fprintf(stderr, "%s\n", message);
}
/* other C code */
Note that the argument list of yyerror() is yyerror(yyscan_t
yyscanner, const char *message), not the other way around. The
compiler won’t diagnose this, because of the void * pointers
involved. In the case of yyparse(), the %parse-param arguments
are added at the end, but for yyerror(), the message is always last.
See the Bison manual for further details about that.
(%parse-param and %lex-param don’t actually know that they are
passing down a scanner handle. This is a more general facility that
allows you to pass down arbitrary data. We’ll see some more uses of
them later.)
We also update the declaration of yyparse() in foo.h:
#ifndef FOO_H
#define FOO_H
#ifndef YY_TYPEDEF_YY_SCANNER_T
#define YY_TYPEDEF_YY_SCANNER_T
typedef void *yyscan_t;
#endif
extern int foo_yylex(yyscan_t yyscanner);
extern int foo_yyparse(yyscan_t yyscanner);
extern void foo_scanner_init(const char *str, yyscan_t *yyscannerp);
extern void foo_scanner_finish(yyscan_t yyscanner);
#endif /* FOO_H */
Putting it all together, the top-level invocation in foo.c is now:
#include "foo.h"
/* stuff */
int
main(void)
{
yyscan_t scanner;
foo_scanner_init("some string to parse", &scanner);
foo_yyparse(scanner);
foo_scanner_finish(scanner);
return 0;
}
Pure parser
Now we make the parser pure. This is done with the Bison option
%define api.pure full (or %pure-parser in older Bison versions).
This doesn’t change anything about the yyparse() invocation, since a
parser doesn’t need a state handle.
But something needs to be done about yylval, which has been a global
variable to communicate between the Flex-generated and the
Bison-generated code.
With the pure parser option, yylval becomes a local variable of the
yyparse() function, and yyparse() expects to call yylex() as
yylex(&yylval, yyscanner);
to pass down the place where the scanner code should put its yylval
information.
So the yylex() prototype should, as far as Bison is concerned, be
int foo_yylex(YYSTYPE *yylval, yyscan_t yyscanner);
Now we need to tell Flex about this. This is done with the Flex
option %option bison-bridge. This option is in my opinion
documented a bit confusingly. The option is not necessary for a
plain, not reentrant, not pure scanner and parser, like the ones we
started out with. (Maybe there is some scenario where
%option bison-bridge could also be used there while keeping the
scanner non-reentrant? Not sure.) But the option is effectively
required when combining a pure Bison parser with a Flex scanner.
We also need to update the declaration of foo_yylex() in foo.h.
The trick here is where to get the YYSTYPE type definition from.
There is a definition of that in the generated foo_parser.h, but as
explained earlier, we don’t want to include that into foo.h or some
other header file that the whole program might want to use.
If you use a %union declaration in the Bison parser, as I have shown
here, then you can just declare an incomplete union type, like this:
union YYSTYPE;
int foo_yylex(union YYSTYPE *yylval, yyscan_t yyscanner);
If you don’t use %union, then you can put a #define YYSTYPE typehere
in the header file. But then you need to make the type
typehere available. This can be a bit tricky to arrange in complex
situations. Also note that YYSTYPE is not namespaced with foo_yy,
so this arrangement might be problematic if you have multiple parsers
in a program. Using unions even if you only need one semantic type is
better in my view.
Another option is to move the whole yylex() declaration out of the
foo.h header file and into the C declarations section of
foo_parser.y. In that case, the generated foo_parser.c provides
the definition of YYSTYPE and you don’t need to provide it yourself.
This works if your parser is the only thing that calls yylex(),
which is the normal case. But sometimes you want to call yylex()
directly, maybe to implement some kind of look-ahead functionality.
(Some parsers in PostgreSQL do that.) If you need to call yylex()
directly and want to avoid YYSTYPE clashes, then you might need to
rearrange your header files very carefully, and make a header file
that is only used by your scanner and parser code, which can then
include foo_parser.h. Overall, I have found this part to be very
tricky in some cases. (In PostgreSQL code,
src/backend/parser/gramparse.h is an example of such an internal
header file.)
Finally, note that the type of yylval changes from being a union (or
some other type if you don’t use %union) to being a pointer. Where
before in your Flex actions you wrote perhaps:
yylval.intval = atoi(...);
you now have to write:
yylval->intval = atoi(...);
Also, you now need to write yylval, not foo_yylval, because the
yylex() function argument is always called yylval now, not
namespaced. This is slightly confusing.
(It’s also confusing that in the Bison-generated code, the local
variable yylval is a union, but in the Flex-generated code, it is a
pointer. But this is not so relevant in practice, since in Bison code
you don’t access yylval directly; that is done through $1, $2, etc.)
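The change from a global union to a pointer argument can be modeled in a few lines of plain C. This is only a sketch of the calling convention, with invented names (union toy_stype, toy_lex(), toy_sum(), and arbitrary token codes); the “parser” here just adds up integer tokens. The caller owns the semantic value as a local variable and passes its address down, so the scanner side writes through -> while the caller reads with a dot:

```c
#include <stdlib.h>

/* Hypothetical semantic value type, standing in for YYSTYPE from %union. */
union toy_stype
{
    int         intval;
};

/* Arbitrary token codes for this sketch. */
enum { TOY_EOF = 0, TOY_INT = 258 };

/* Scanner side: receives a pointer to the value and fills it in. */
int
toy_lex(union toy_stype *yylval, const char **inputp)
{
    char       *end;
    long        v = strtol(*inputp, &end, 10);

    if (end == *inputp)
        return TOY_EOF;         /* nothing numeric left: treat as end of input */
    *inputp = end;
    yylval->intval = (int) v;   /* pointer, hence "->" */
    return TOY_INT;
}

/* Parser side: the value is a local variable, not a global one. */
int
toy_sum(const char *input)
{
    union toy_stype yylval;
    int         sum = 0;

    while (toy_lex(&yylval, &input) == TOY_INT)
        sum += yylval.intval;   /* union, hence "." */
    return sum;
}
```

Since nothing here is global, two calls to toy_sum() on different inputs cannot interfere with each other.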
Let’s review our code so far:
foo_scanner.l:
%{
#include "foo.h"
#include "foo_parser.h"
/* some C declarations */
%}
%option prefix="foo_yy"
%option reentrant
%option bison-bridge
/* Flex definitions */
%%
/* rules (patterns and actions) */
%%
/* foo_scanner_init() */
/* foo_scanner_finish() */
/* other C code */
foo_parser.y:
%{
#include "foo.h"
#include "foo_parser.h"
/* some C declarations */
%}
%name-prefix="foo_yy"
%parse-param {yyscan_t yyscanner}
%lex-param {yyscan_t yyscanner}
%define api.pure full
/* Bison declarations */
%%
/* actions */
%%
void
foo_yyerror(yyscan_t yyscanner, char const *message)
{
fprintf(stderr, "%s\n", message);
}
/* other C code */
foo.h:
#ifndef FOO_H
#define FOO_H
union YYSTYPE;
#ifndef YY_TYPEDEF_YY_SCANNER_T
#define YY_TYPEDEF_YY_SCANNER_T
typedef void *yyscan_t;
#endif
extern int foo_yylex(union YYSTYPE *yylval, yyscan_t yyscanner);
extern int foo_yyparse(yyscan_t yyscanner);
extern void foo_scanner_init(const char *str, yyscan_t *yyscannerp);
extern void foo_scanner_finish(yyscan_t yyscanner);
#endif /* FOO_H */
If your Bison parser uses the %locations option (to track token
locations, perhaps for improved error messages), then a non-pure
parser also has a global variable yylloc, which with a pure parser
turns into a local variable that is passed to yylex(). To tell Flex
about this, you additionally need the Flex option
%option bison-locations, and then the effective yylex() prototype is
like
extern int foo_yylex(union YYSTYPE *yylval,
YYLTYPE *yylloc,
yyscan_t yyscanner);
(And YYLTYPE similarly needs to be defined somewhere.)
The locations feature is independent of whether the scanner or parser
is reentrant. But if you use it you have another value to pass
around next to yylval.
Extra scanner state
What we have discussed so far might be enough, but in some cases there is additional global state lingering around scanners in particular. A typical situation is using a global variable to collect semantic data that is assembled across several rules using start conditions. That might look something like this:
foo_scanner.l
%{
static char *scanbuf;
%}
%x quoted
%%
{doublequote} {
/* start quoted string */
BEGIN(quoted);
scanbuf = NULL;
}
<quoted>{doublequote} {
/* end quoted string */
yylval->str = scanbuf;
BEGIN(INITIAL);
return STRING;
}
<quoted>{text} {
/* collect quoted string content */
scanbuf = concat(scanbuf, yytext);
}
%%
...
Here, scanbuf is a global variable used internally by the scanner.
To make this scanner thread-safe, we could just make that variable thread-local. But then it’s still not reentrant. The proper way to do this is to set up a struct that is allocated in a local variable that is then passed to the scanner. (It doesn’t have to be a struct, but even if you only need one variable I think it’s probably better to start with a struct so that it’s easier to add more variables later on.) Then the scanner uses that instance of the local variable instead of a global variable to store its extra state.
%{
struct foo_yy_extra_type
{
char *scanbuf;
};
%}
...
%option extra-type="struct foo_yy_extra_type *"
%x quoted
%%
{doublequote} {
/* start quoted string */
BEGIN(quoted);
yyextra->scanbuf = NULL;
}
<quoted>{doublequote} {
/* end quoted string */
yylval->str = yyextra->scanbuf;
BEGIN(INITIAL);
return STRING;
}
<quoted>{text} {
/* collect quoted string content */
yyextra->scanbuf = concat(yyextra->scanbuf, yytext);
}
%%
...
The variable yyextra is magically available inside the rule actions
inside yylex(). It points to an instance of struct
foo_yy_extra_type. Outside of yylex(), you can get at this area
using the function foo_yyget_extra(). The yyextra pointer is
internally stored inside what yyscan_t points to, so it can be
available anywhere the scanner handle is available. (Actually, the
yyextra variable works exactly like for example yytext, with the
various ways of accessing it discussed earlier.)
And then you need to set up this yyextra area using the function
yyset_extra() before calling the scanner. Our scanner
initialization function is a good place to put this:
void
foo_scanner_init(const char *str, yyscan_t *scannerp)
{
    struct foo_yy_extra_type yyext;
    yylex_init(scannerp);
    yyset_extra(&yyext, *scannerp);
    yy_scan_string(str, *scannerp);
}
Haha, this doesn’t work! yyext is a local variable in
foo_scanner_init(), and yyset_extra() will save its address
internally in the yyscan_t value, but once foo_scanner_init()
exits, that address is garbage. The compiler doesn’t warn about this,
because it doesn’t know what yyset_extra() does internally. (Maybe
a static analyzer would have a better chance.) Instead, the extra
area needs to be allocated with malloc(), or perhaps some other
means that survives outside the function, for example:
void
foo_scanner_init(const char *str, yyscan_t *scannerp)
{
    struct foo_yy_extra_type *yyextp = malloc(sizeof(struct foo_yy_extra_type));
    yylex_init(scannerp);
    yyset_extra(yyextp, *scannerp);
    yy_scan_string(str, *scannerp);
}
void
foo_scanner_finish(yyscan_t scanner)
{
    struct foo_yy_extra_type *yyextp = yyget_extra(scanner);
    free(yyextp);
    yylex_destroy(scanner);
}
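The ownership rules can be tried out in a self-contained model that does not need Flex. Everything named toy_ below is invented for this sketch, and toy_collect() is a simple stand-in for the concat() helper used in the rules above. The extra area lives on the heap, is reachable through the handle, and is released together with the scanner:

```c
#include <stdlib.h>
#include <string.h>

typedef void *toy_scan_t;       /* opaque handle, in the spirit of yyscan_t */

/* Hypothetical extra state, like struct foo_yy_extra_type. */
struct toy_extra
{
    char       *scanbuf;
};

/* Hypothetical internal scanner state that stores the extra pointer. */
struct toy_state
{
    const char *input;
    struct toy_extra *extra;
};

/* Allocates scanner and extra area on the heap; returns 0 on success. */
int
toy_scanner_init(const char *str, toy_scan_t *scannerp)
{
    struct toy_state *st = malloc(sizeof(*st));
    struct toy_extra *ext = malloc(sizeof(*ext));

    if (st == NULL || ext == NULL)
    {
        free(st);
        free(ext);
        return 1;
    }
    ext->scanbuf = NULL;
    st->input = str;
    st->extra = ext;
    *scannerp = st;
    return 0;
}

/* Counterpart of yyget_extra(): fetch the extra area from the handle. */
struct toy_extra *
toy_get_extra(toy_scan_t scanner)
{
    return ((struct toy_state *) scanner)->extra;
}

/* Append text to this scanner's private buffer; returns 0 on success. */
int
toy_collect(toy_scan_t scanner, const char *text)
{
    struct toy_extra *ext = toy_get_extra(scanner);
    size_t      oldlen = ext->scanbuf ? strlen(ext->scanbuf) : 0;
    char       *newbuf = realloc(ext->scanbuf, oldlen + strlen(text) + 1);

    if (newbuf == NULL)
        return 1;
    memcpy(newbuf + oldlen, text, strlen(text) + 1);
    ext->scanbuf = newbuf;
    return 0;
}

/* Frees the extra area first, then the scanner itself. */
void
toy_scanner_finish(toy_scan_t scanner)
{
    struct toy_extra *ext = toy_get_extra(scanner);

    free(ext->scanbuf);
    free(ext);
    free(scanner);
}
```

Note the order in the cleanup function: the extra pointer has to be fetched before the scanner is destroyed, because afterwards the handle can no longer be used to find it.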
Returning the parse result
Another common source of non-reentrancy happens at the top of the call
stack. How do you get the parse result out of yyparse()? The
function yyparse() only returns whether parsing was successful or
not; it doesn’t return what it actually parsed. In general, there
might not be anything else to return: yyparse() could execute
actions as it parses and be done. (The PostgreSQL bootstrap parser
works like that. So it’s more like an interpreter.) But in practice,
you’ll want to get back some kind of syntax tree or other structure
with some information about what was parsed. By default, the
signature of yyparse() doesn’t provide a place to return that. So
it is common to stick that result into a global variable. For
example:
foo.h
extern FooSomeType* foo_parse_result;
foo_parser.y
...
%%
statement: something {
foo_parse_result = $1;
}
;
...
%%
...
foo.c
#include "foo.h"
FooSomeType* foo_parse_result;
/* stuff */
int
main(void)
{
yyscan_t scanner;
foo_scanner_init("some string to parse", &scanner);
foo_yyparse(scanner);
foo_scanner_finish(scanner);
/* do something with foo_parse_result here */
return 0;
}
To improve that, we can use the %parse-param directive that we saw
earlier. That way, we can add additional parameters to the signature
of the yyparse() function:
foo_parser.y
...
%parse-param {FooSomeType **foo_parse_result_p}
%%
statement: something {
*foo_parse_result_p = $1;
}
;
...
%%
...
The generated yyparse() function prototype then looks like this:
int foo_yyparse(FooSomeType **foo_parse_result_p, yyscan_t yyscanner);
The main program can then look like this:
foo.c
#include "foo.h"
/* stuff */
int
main(void)
{
yyscan_t scanner;
FooSomeType *parse_result;
foo_scanner_init("some string to parse", &scanner);
foo_yyparse(&parse_result, scanner);
foo_scanner_finish(scanner);
/* do something with parse_result here */
return 0;
}
This way, the parse result is stored in a local variable whose address
is passed directly to yyparse(). Note that here again the type of
the “parse result” variable has changed to be a pointer to the thing
it was before.
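The same out-parameter pattern can be shown with a tiny hand-written “parser”. This is a sketch, not Bison output: struct toy_result and toy_parse() are invented, and the grammar is just a list of integers. The return codes follow the yyparse() convention of 0 for success, 1 for invalid input, and 2 for memory exhaustion:

```c
#include <stdlib.h>

/* Hypothetical parse result: how many numbers were seen and their sum. */
struct toy_result
{
    int         count;
    long        total;
};

/*
 * Parse a whitespace-separated list of integers.  The return value only
 * says whether parsing worked; the result itself comes back via *resultp.
 */
int
toy_parse(const char *input, struct toy_result **resultp)
{
    struct toy_result *res = malloc(sizeof(*res));
    char       *end;

    *resultp = NULL;
    if (res == NULL)
        return 2;               /* out of memory */
    res->count = 0;
    res->total = 0;
    for (;;)
    {
        long        v = strtol(input, &end, 10);

        if (end == input)
            break;              /* no further number found */
        res->count++;
        res->total += v;
        input = end;
    }
    while (*input == ' ')
        input++;
    if (*input != '\0')
    {
        free(res);
        return 1;               /* trailing garbage: syntax error */
    }
    *resultp = res;
    return 0;
}
```

A caller declares a local struct toy_result pointer, passes its address, checks the return value, and frees the result when done, so no global variable is involved.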
Passing context information to scanner and parser
Sometimes you need to pass context information to both the scanner and the parser. This could be any kind of program state or a session handle or whatever. (In the PostgreSQL code, we sometimes need to pass down some context information for error handling.) Most easily, this could just be a global variable in your program that everything has access to. But if we want to make our code multi-thread-safe, then this might not work.
Let’s say as an example, we want to pass down a locale handle of type
locale_t.
In the previous section, we saw how we can pass down additional data
to the parser using the directive %parse-param. So we could do
something like:
foo_parser.y
...
%parse-param {locale_t loc}
%%
bip: bop {
/* make use of passed-in loc */
$$ = something($1, loc);
}
;
...
%%
...
And the main program could do something like:
foo.c
int
main(void)
{
    ...
    locale_t mylocale;
    ...
    mylocale = newlocale(...);
    ...
    foo_yyparse(mylocale, ...);
    ...
}
This gets the data into the parser. To have the parser pass it down to the scanner, we use the directive %lex-param that we also already saw earlier.
foo_parser.y
...
%parse-param {locale_t loc}
%lex-param {locale_t loc}
%%
bip: bop {
    /* make use of passed-in loc */
    $$ = something($1, loc);
}
;
...
%%
...
Note that the argument names in the parameter declarations have to be the same. That is because the generated code will end up something like this:
int
yyparse(..., locale_t loc, ...) /* parse-param */
{
    ...
    yylex(..., loc, ...); /* lex-param */
    ...
}
But this is only the Bison side: It tells the code generated by Bison to call yylex() with an additional argument loc. But it doesn’t actually alter the code generated by Flex to accept that argument. To achieve that, we need to change the declaration of yylex() in the scanner definition, too.
The facility in Flex for that is different (arguably not as nice). You define the macro YY_DECL to be the complete declaration of yylex(). So for example:
#define YY_DECL int foo_yylex(union YYSTYPE *yylval, \
                              locale_t loc, \
                              yyscan_t yyscanner)
Note the order: yylval and yylloc (if used) are first, then arguments added by %lex-param (in the order declared), then yyscan_t last. Also, no semicolon at the end.
The code generated from foo_scanner.l then uses that macro to provide the actual prototype of foo_yylex(), overriding the built-in default.
Where do you put this definition? I think Flex assumes that you put it into the scanner definition file, like this:
foo_scanner.l
%{
#include "foo.h"
#include "foo_parser.h"
#define YY_DECL int foo_yylex(union YYSTYPE *yylval, locale_t loc, yyscan_t yyscanner)
/* some C declarations */
%}
%option prefix="foo_yy"
%option reentrant
%option bison-bridge
/* Flex definitions */
%%
/* rules (patterns and actions) */
%%
...
/* other C code */
And then you need to update the declaration in foo.h to match:
foo.h
#ifndef FOO_H
#define FOO_H
...
extern int foo_yylex(union YYSTYPE *yylval, locale_t loc, yyscan_t yyscanner);
extern int foo_yyparse(locale_t loc, yyscan_t yyscanner);
...
#endif /* FOO_H */
(Since foo_scanner.l includes foo.h, the compiler will check that the two match.)
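The interplay between YY_DECL and the hand-written prototype can be reproduced in a single C file without running Flex. The following is a minimal sketch under stated assumptions: the definitions of union YYSTYPE and yyscan_t are stand-ins for what Bison and Flex would provide, DemoContext is a hypothetical context type used in place of locale_t, and the function body is written by hand where Flex would put the generated scanner.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for what Bison and Flex would normally provide. */
union YYSTYPE
{
    int ival;
};
typedef void *yyscan_t;

/* Hypothetical context type; the example in the text passes a locale_t. */
typedef struct DemoContext
{
    int scale;
} DemoContext;

/* Hand-written prototype, as it would appear in foo.h. */
extern int foo_yylex(union YYSTYPE *yylval, DemoContext *ctx, yyscan_t yyscanner);

/* As in foo_scanner.l: the complete declaration, without a semicolon. */
#define YY_DECL int foo_yylex(union YYSTYPE *yylval, \
                              DemoContext *ctx, \
                              yyscan_t yyscanner)

/*
 * The scanner's function body follows the macro.  If the macro and the
 * prototype above disagreed, the compiler would reject this definition
 * because of conflicting types.
 */
YY_DECL
{
    /* In this sketch, the scanner state is just a cursor into the input. */
    const char **cursor = yyscanner;

    assert(yylval != NULL && ctx != NULL && cursor != NULL);

    if (**cursor < '0' || **cursor > '9')
        return 0;               /* no more digits: signal end of input */

    yylval->ival = (**cursor - '0') * ctx->scale;
    (*cursor)++;
    return 1;                   /* hypothetical token code for "digit" */
}
```

Changing a parameter type in only one of the two places makes the compilation fail, which is the check that including foo.h in the scanner file gives you.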
Another pattern that I have seen is to put the definition of YY_DECL into the header file such as foo.h, like this:
foo.h
#ifndef FOO_H
#define FOO_H
...
#define YY_DECL int foo_yylex(union YYSTYPE *yylval, locale_t loc, yyscan_t yyscanner)
YY_DECL;
extern int foo_yyparse(locale_t loc, yyscan_t yyscanner);
...
#endif /* FOO_H */
This way, you use the YY_DECL macro itself to create the declaration in the header file, and so you save having to write it twice.
But the big problem with this is that YY_DECL is not namespaced. There is no “FOO_YY_DECL”. So a definition placed in a header file could leak into any other file that includes that header and defines its own scanner. (This has actually happened in my current work on PostgreSQL. The effects are very confusing.) So I don’t recommend this. An alternative would be to create a separate header file used only by the scanner and parser and put it there. See also above for the discussion about where to put the yylex() declaration.
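One plausible way such a leak plays out can be sketched in plain C. This assumes that the generated scanner only installs its default declaration when YY_DECL is not already defined (an #ifndef guard), which is worth confirming in the generated file for your Flex version. The names foo_yylex, bar_yylex, BAR_USES_DEFAULT_DECL, and bar_got_default_decl are hypothetical, and the scanner body is a hand-written stand-in.

```c
#include <assert.h>
#include <stddef.h>

/* Pretend this came from foo.h, which was included for unrelated reasons. */
#define YY_DECL int foo_yylex(int *yylval)

/*
 * Pretend this is the generated code of a second scanner, "bar".
 * Because YY_DECL is already defined, its own default is skipped.
 */
#ifndef YY_DECL
#define BAR_USES_DEFAULT_DECL 1
#define YY_DECL int bar_yylex(int *yylval)
#endif

/* This expands to the leaked declaration, so it defines foo_yylex(). */
YY_DECL
{
    assert(yylval != NULL);
    *yylval = 42;
    return 0;
}

/* Reports whether the "bar" scanner ended up with its own declaration. */
int
bar_got_default_decl(void)
{
#ifdef BAR_USES_DEFAULT_DECL
    return 1;
#else
    return 0;
#endif
}
```

In this sketch, the second scanner silently gets the name and signature of the first one, and bar_yylex() is never defined. In a real build, that would show up as a confusing link error or a duplicate symbol, far away from the header that caused it.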
Conclusion and tips
It all makes sense in retrospect! ;-)
To conclude, here are a few general tips that I learned along the way:
- Obviously, consult the Bison manual and the Flex manual. They also have specific sections on pure parsers and reentrant scanners.
- Keep straight what belongs to Bison and what belongs to Flex. These are separate tools. They both deal with “yystuff”, and there are some defined interface points such as yylex and yylval, but all the other things belong to only one or the other. In some cases, you need to do corresponding changes on both sides, but the way to do that or the option names will usually be different. When in doubt, check the respective manuals. The manuals also have indexes that list for example function or symbol names that the respective tool deals with. If something is not listed there, then it might actually belong to the other tool.
- If you are doing such a conversion, approach it incrementally, step by step, as I’ve shown here. Each step leaves you with a working state. (Some of the other guides that I mentioned mix all of this together, which makes it hard to find the fault when something goes wrong.)
- If in doubt, read the output files generated from Flex or Bison. They are a bit complicated to read, and you don’t need to understand everything. But the authoritative information about what the code does is in there. For example, if you change an option and get a mismatch of a function declaration, you can find the expected declaration in the generated file and then make your own declaration or definition match that.
- Some of the option names or option syntax I have talked about here have changed a few times over different versions of Bison or Flex. I think what I have written here is the most modern variant. In PostgreSQL, we try to support a fairly large range of Bison and Flex versions, so the actual code there might look different. If you also want to do that, you might need to do some additional research in the Bison and Flex change logs.