(Yet Another Bayes’ Theorem Emulator)

Version 1.1

Internal Documentation

by Bill Seymour

Copyright Bill Seymour 2013, 2014.
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)


This paper provides some internal documentation and rationale for Bill Seymour’s open-source Yet Another Bayes’ Theorem Emulator.

The intent is that this code provide a library that can be used to write a complete Bayesian calculator with the addition only of a graphical user interface (GUI) front end.

Also supplied with the distribution is front-end code for a complete program that runs as a CGI on the author’s server. This is provided primarily as an example of the kinds of things that an off-line version would do. See http://www.cstdbill.com/bt/yabte.html for user documentation and http://www.cstdbill.com/cgi-bin/yabte for the current working model.

There’s a brief glossary in the user documentation.

The source code is available in yabte.tar.gz and yabte.zip; and there are links to individual source files below.



The library is written almost entirely in C++ and is intended to be compiled by any implementation that conforms to C++98. The only exceptions are the scanner and parser for probability expressions which are intended to be the input to flex and Bison, GNU’s versions of lex and yacc. It’s intended that the scanner generate plain old C89 code. A C binding is provided for front ends that aren’t written in C++.

The library reserves all identifiers that begin with “YABTE_” for use as macros.

The distribution comprises twenty-one files:

yabte.hpp:  This is the library’s principal header. [source]

yabte.cpp:  This is all the code for the author’s CGI front end except for the code that returns the HTML to the client. [source]

yabte_show.cpp:  This is the code in the author’s CGI front end that returns the HTML page to the client. [source]

yabte_globals.cpp:  This code maintains some globals that are used in various parts of the program. [source]

yabte_data.cpp:  This is the code for reading and writing the user’s data. [source]

yabte_prob.hpp and yabte_prob.cpp:  These two files define how a probability is represented. [source: .hpp, .cpp]

yabte_comp.cpp:  This is the code for performing the actual Bayes’ Theorem calculation. [source]

yabte_parse.ypp:  This is the Bison-compatible parser for probability expressions. [source]

yabte_lex.l:  This is the flex-compatible scanner for probability expressions. [source]

yabte_localizable_scanner.cpp:  This is an alternate probability-expression scanner that uses the current global locale. [source]

yabte_wrapper.h and yabte_wrapper.cpp:  These two files, not used in the CGI version, provide a C binding to various internals for use in front ends not written in C++. [source: .h, .cpp]

libyabte.a and libyabtei18n.a:  two static libraries containing all the object files except for the author’s CGI front end. (libyabte.a includes the flex scanner, yabte_lex.o; libyabtei18n.a includes yabte_localizable_scanner.o. There’s a separate library for each version of the scanner since they have the same entry points.)

Makefile:  for building the static libraries.

Makefile.cgi:  for building the author’s CGI using libyabte.a, GCC and make.

Archives:  a makefile for creating yabte.tar.gz and yabte.zip.

yabte.html and yabte_internals.html:  the user documentation and this file.

LICENSE_1_0.txt:  the Boost Software License.

It’s intended that yabte.cpp and yabte_show.cpp be the only files that need to be replaced to turn the existing CGI into a downloadable version for running off-line; and yabte_data.cpp is intended to contain all the code that needs to change if it’s desired to store users’ data in some form other than a flat file on the box on which the executable is running.

The current versions of yabte.cpp and yabte_show.cpp assume that they will never be multi-threaded and so lack any concurrency controls. The rest of the code is intended to be fully reentrant and multi-threadable although it might need a little help.


Aside from the unsurprising bad_alloc if we run out of memory, the parser can throw a runtime_error when parsing a probability entry fails; or a range_error can be thrown in yabte_comp.cpp when a computed probability is not between 0 and 1 (and NaNs aren’t).

In the CGI version, the only place an exception gets caught is in main(); and that just returns the exception’s what() to the client in a JavaScript alert() window.

yabte.hpp

struct hyp contains the user’s prior, consequent and description strings for an individual hypothesis.

typedef std::vector<hyp> hyps; declares the type of a complete hypothesis.

struct globvars contains several globals that different parts of the program need or that, for whatever reason, might need to be namespace-scope objects in a single TU.

See the yabte_globals.cpp section of this paper if you’d like to construct this object in thread-local storage.

Everything else in this header is just a (presumably unsurprising) forward reference to various externals. The error-reporting functions are extern "C" because they’re callbacks to the UI code which might not be written in C++.

Note that yabte.hpp #includes yabte_prob.hpp which defines what a struct probability is.

Also, there’s no declaration of show(globvars&) because that’s just an artifact of the author’s CGI version.

yabte.cpp

This file contains main() and functions that write error messages. This CGI version is aware that it receives the user’s entered data in a POSTed HTML form and that it’s running in its own process (i.e., it’s not multi-threaded), and so the form is just a singleton at namespace scope.

Note that all error messages are “fatal” in the sense that they abort whatever the program is doing and return the message to the user; but the argument to exit() is always EXIT_SUCCESS because some HTML containing that message was successfully sent back to the client.

The form data

An instance of class posted_form is a namespace-scope object. Its default constructor eagerly reads the HTML form data.

The main complication (aside from URI encoding which is well understood) is that the HTTP specification says that any spaces in query strings or posted form data be converted to plus signs. This is a serious problem for any HTTP-based application that wants to deal with arithmetic expressions since ' ' and '+' need to be distinct characters (unless you want to say that such expressions can’t have any spaces in them, which would be a pointlessly draconian limitation for the user).

Our workaround is to have on the client some JavaScript (described below) that changes plus signs to horizontal tabs just before the form is posted; and posted_form’s default constructor then changes those characters back. (It’s reasonable to consider the tab to be outside the domain of input characters since it can’t easily be entered into a text field…hitting the tab key just moves to the next form element.)

Note that any '\r' in the form data is simply ignored. This presumably will happen only if the user has entered actual newlines in the description <textarea>s, in which case we just keep them as '\n's internally.
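The server-side half of that round trip might look roughly like the following. This is only a sketch, not the actual posted_form code: the names decode_field and hex_value are made up for illustration, and it assumes the raw value arrives in the usual URL-encoded form (so the tab itself shows up as %09).

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: value of one hex digit (caller checks isxdigit first;
// assumes an ASCII-compatible character set).
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    return std::tolower(static_cast<unsigned char>(c)) - 'a' + 10;
}

// Hypothetical helper: decodes one URL-encoded form value, then undoes the
// client's '+' -> tab substitution and drops any '\r'.
std::string decode_field(const std::string& raw)
{
    std::string out;
    for (std::string::size_type i = 0; i < raw.size(); ++i)
    {
        char c = raw[i];
        if (c == '+')
        {
            c = ' ';                      // form encoding: '+' means space
        }
        else if (c == '%' && i + 2 < raw.size() &&
                 std::isxdigit(static_cast<unsigned char>(raw[i + 1])) &&
                 std::isxdigit(static_cast<unsigned char>(raw[i + 2])))
        {
            c = static_cast<char>(hex_value(raw[i + 1]) * 16 + hex_value(raw[i + 2]));
            i += 2;
        }

        if (c == '\r') continue;          // keep only '\n' internally
        out += (c == '\t') ? '+' : c;     // tab was standing in for '+'
    }
    return out;
}
```

For example, the raw value 1%2F2+%09+1%2F3 would decode to the expression 1/2 + 1/3.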

Getting the hypotheses out of the form

In main(), once we know that we have a posted form, that the hypothesis has a name, and that we haven’t just deleted it or refreshed the screen by hitting the Reset button, it’s time to load the hypothesis data into the hyps collection. This is done by get_hypotheses().

That function shouldn’t be difficult to understand by itself; but the Pathological Coupling Department has asked me to say loudly that the HTML form’s description, prior, and consequent fields all have names like Dn, Pn, and Cn, respectively, where n is 0 for the main hypothesis and x for alternate x.

yabte_show.cpp

This file contains the definition of show(const globvars&) which returns the next HTML page to the client. It’s in its own TU to remove from the larger program the complexities of programmatically generating the HTML.

Note that the hypothesis name’s <input> tag contains

which is what clicks the Reset button to reload the page when you move away from the hypothesis name field.

Also, the user name’s <input> tag contains

Setting that hidden form element to 'Y' tells code in main() that we have a new user and so might have to create a new directory.

Similarly, the description <textarea> tags, and the prior and consequent <input> tags, contain

Setting that hidden form element to 'Y' is what tells the code in main() that handles the Compute button that something has changed since the last save.

And I’ll bore you by pointing out yet again that the description, prior, and consequent elements all have names like Dn, Pn, and Cn, respectively, where n is 0 for the main hypothesis and x for alternate x.

Note that, when writing the text for the description <textarea>s, we need to put the '\r's back in front of any '\n's we find in the string.
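Putting the '\r's back is a one-pass copy. A minimal sketch, assuming the internal string contains only bare '\n's as described earlier; the function name to_crlf is invented here and isn’t the name used in yabte_show.cpp:

```cpp
#include <string>

// Hypothetical helper: returns a copy of s with "\r\n" wherever s has '\n'.
std::string to_crlf(const std::string& s)
{
    std::string out;
    out.reserve(s.size());
    for (std::string::size_type i = 0; i < s.size(); ++i)
    {
        if (s[i] == '\n') out += '\r';   // restore the carriage return
        out += s[i];
    }
    return out;
}
```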

Also, the <body>’s onload event sets the initial focus to either the user name (if the hypothesis name is blank) or the main hypothesis’ description (if we have a hypothesis name); and it might also pop up an alert() window if we have an error message for the user.

Some JavaScript

Reformatted to make it more readable by H. sapiens:
    function nonplus()
    {
        var elems = document.forms[0].elements;
        var len = elems.length;

        for (var i = 0; i < len; ++i)
        {
            var elem = elems[i];
            var nam = elem.name;

            if (nam != null && nam.length > 0)
            {
                var n0 = nam.substring(0,1);
                if (n0 == 'D' || n0 == 'P' || n0 == 'C')
                {
                    elem.value = elem.value.replace(/[+]/g, '\t');
                }
            }
        }

        return true;
    }
That function, called by the form’s onsubmit event, is what changes the plus signs to tabs. (Do I need to repeat the bit about the names of the description, prior and consequent elements?)
    function verify_delete()
    {
        var val = document.forms[0].elements.hypname.value;
        return confirm('Are you sure you want to delete ' + val + '?');
    }
That function is called by the Delete button’s onclick event.
    function init()
    {
        // and might contain:
        alert(/*an error message*/);
    }
That function is called for the <body>’s onload event. It initializes the focus() to either the user name (if the hypothesis name is blank…typically the first time the page loads), or the main hypothesis’ description (if we have a hypothesis name). It will also pop up the modal alert() window if !globals().errmsg.empty().

yabte_globals.cpp

This file contains the definitions of globvars::globvars(), globvars::cleanup() and globals(). It’s in its own TU to keep OS-specific code relating to thread-local storage separate from the rest of the library.

In complete programs that are intended to be multi-threaded, the globvars object can be constructed in thread-local storage by defining one of the macros:

These macros affect yabte_globals.cpp only, and so changing which is defined won’t result in violating the ODR. If you define none of them, the globvars object will just use the canonical block-scope-static implementation of the singleton pattern. In any event, the globvars object will be constructed on the first call of yabte::globals() (in each thread or the only one) using the default values specified in the user documentation.

If you define YABTE_GCC_THREADS or YABTE_MS_THREADS, you must call globvars::cleanup() on thread exit, otherwise you’ll get a memory leak (because, in these two cases, thread-local objects are required to be PODs, so they’re just pointers to globvars objects constructed on the heap). With any other of those macros defined, or none of them, cleanup() does nothing at all; so it’s safe to call it in any event.
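The difference between the two flavors can be sketched as follows. This is an illustration only: globvars_sketch and its members are stand-ins rather than the real globvars, the function names are invented, and an ordinary static pointer stands in for the thread-local slot (with a real compiler extension, it would presumably be declared with something like GCC’s __thread or Microsoft’s __declspec(thread)).

```cpp
#include <string>

// Stand-in for the real globvars; the members are illustrative only.
struct globvars_sketch
{
    std::string errmsg;
    int precision;
    globvars_sketch() : precision(6) { }   // 6 is an arbitrary placeholder
};

// No macro defined: the block-scope-static singleton. The object is
// constructed on the first call and destroyed at program exit.
globvars_sketch& globals_static()
{
    static globvars_sketch g;
    return g;
}

// POD-only thread-local storage: the slot can hold just a pointer,
// so the object itself lives on the heap.
static globvars_sketch* slot = 0;

globvars_sketch& globals_heap()
{
    if (slot == 0) slot = new globvars_sketch;   // first call in this thread
    return *slot;
}

// Must run on thread exit in the heap flavor, or the object leaks.
// Calling it when nothing was constructed is harmless.
void cleanup_heap()
{
    delete slot;
    slot = 0;
}
```

The heap flavor is why cleanup() exists at all: nothing destroys the object behind a bare pointer automatically.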

The code has not been compiled or tested with YABTE_STD_THREADS defined because the author lacks a compiler that can handle it.

The code has been compiled with YABTE_BOOST_THREADS defined using a boost/thread/tss.hpp header that this author wrote based on the definition in the Boost documentation; but it has not been tested because the author has no interest in downloading the whole .1G Boost library just for that. If anyone who already has the Boost library installed wishes to test this, please do so and let was at pobox dot com know of any bugs that are found.

yabte_data.cpp

This file contains the definitions of:
    bool read(const string& hypname, hyps& hypdata, bool compute_only);
    void save(const string& hypname);
    void save(const string& hypname, const hyps& hypdata);
    void delete_hypothesis(const string& hypname);
along with the miscellaneous functions:
    bool data_saved(const string& hypname);
    void create_user(const string& username);
The intent is that these be all the functions that deal with reading and writing user data.

The file format

The flat file in which the user data is saved looks generally like:
    last computed value
    main prior expression
    main consequent expression
    alternate 1 prior expression
    alternate 1 consequent expression
    alternate n prior expression (as needed)
    alternate n consequent expression (as needed)
    record separator
    main description (as many lines as needed)
    record separator
    alternate 1 description (as many lines as needed)
    record separator
    alternate n description (as needed)
    record separator        (as needed)
The first line is always the most recently computed value, or is a blank line if no value has been computed yet. If it’s not blank, it always has the form min|max even if the two values are the same. For example, if the computed value is 1/2, then the first line will be
The vertical bar was chosen as the separator because it matches the meaning of the '|' operator in the probability expressions and so is easy to remember. The value will always be written in the "C" locale, so the decimal point will always be the period. This is unrelated to the permission given to users to use commas in their input.
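Reading that first line back could be done along these lines. It’s a sketch under assumptions: parse_first_line is an invented name, and the stream is imbued with the classic locale so that the period is the decimal point no matter what the global locale is.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: splits a "min|max" line into two doubles.
// Returns false for a blank line (nothing computed yet) or a malformed one.
bool parse_first_line(const std::string& line, double& lo, double& hi)
{
    if (line.empty()) return false;

    std::istringstream is(line);
    is.imbue(std::locale::classic());   // always the "C" locale
    char bar = 0;
    if (!(is >> lo >> bar >> hi)) return false;
    return bar == '|';
}
```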

The last computed value is followed by pairs of lines:  prior, consequent, …, prior, consequent … as many pairs as are needed. Note that the file contains the text of the probability expressions that the user entered. These expressions don’t get parsed until the Compute button is pressed.

Following the probability expressions are as many descriptions as are needed. Since descriptions can go on for multiple lines of text, they’re delimited by the ASCII record separator, U+001E, on its own line.

Note that, if a prior or consequent hasn’t been entered, there will be a blank line for it. If there’s a blank description, on the other hand, there will just be two consecutive U+001E delimiters.
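Here’s one way the layout just described could be written out. It’s a sketch, not the code in yabte_data.cpp: hyp_sketch stands in for struct hyp, the function names are invented, and the first line is passed in already formatted.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for yabte::hyp: the three strings the user entered.
struct hyp_sketch
{
    std::string prior, consequent, description;
};

const char record_separator = '\x1E';

// first_line is either empty (nothing computed yet) or "min|max".
void write_flat_file(std::ostream& os, const std::string& first_line,
                     const std::vector<hyp_sketch>& h)
{
    typedef std::vector<hyp_sketch>::size_type index;

    os << first_line << '\n';

    // Expression pairs; an expression not yet entered becomes a blank line.
    for (index i = 0; i < h.size(); ++i)
        os << h[i].prior << '\n' << h[i].consequent << '\n';

    // Descriptions; a blank one leaves two consecutive separators.
    for (index i = 0; i < h.size(); ++i)
    {
        os << record_separator << '\n';
        if (!h[i].description.empty())
            os << h[i].description << '\n';
    }
    os << record_separator << '\n';
}

// Convenience wrapper for looking at the layout in memory.
std::string flat_file_text(const std::string& first_line,
                           const std::vector<hyp_sketch>& h)
{
    std::ostringstream os;
    write_flat_file(os, first_line, h);
    return os.str();
}
```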

The read function

    bool read(const string& hypname, hyps& hypdata, bool compute_only);
This function loads the data from the file hypname into hypdata, assigns the last computed value (if any) to globals().posterior, and returns whether it actually loaded anything into hypdata.

Here’s the logic:

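    File    compute_only  hypname has been computed            Loads into                Assigns to           Returns
    exists                (first line of file is non-blank)    hypdata                   globals().posterior
    ------  ------------  -----------------------------------  ------------------------  -------------------  -------
    No      Don’t care    Not applicable                       Blank main and alternate  −1.0                 true
    Yes     false         No                                   All                       −1.0                 true
    Yes     false         Yes                                  All                       Computed             true
    Yes     true          No                                   Probabilities             −1.0                 true
    Yes     true          Yes                                  Nothing                   Computed             false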

The idea is that, if compute_only is true, the only reason for reading the file is to do the Bayes’ Theorem calculation…we’re not displaying anything on the user’s screen. This can happen only if some prior or consequent expression that we’re parsing contains the name of some other hypothesis. In this case, if that other hypothesis’ posterior has already been computed, there’s no reason to read its file past the first line; and even if it hasn’t been computed yet, we still need only the prior and consequent expressions…we needn’t bother reading any of the (potentially lengthy) descriptions.

The save functions

    void save(const string& hypname);
    void save(const string& hypname, const hyps& hypdata);
These functions save the user’s hypothesis data for later.

The two-argument version is the one normally called. If the hypothesis has already been saved, the new data is written to a temporary file; then if the output was successful, the existing file is deleted and the temporary file is renamed.
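The write-then-rename idea can be sketched with nothing but the standard library. The function name replace_file and the ".tmp" suffix are assumptions made for this example; the real code’s temporary file name and error reporting aren’t shown here.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical helper: replaces the contents of path by way of a temporary
// file, so a failed write never damages the existing file.
bool replace_file(const std::string& path, const std::string& contents)
{
    const std::string tmp = path + ".tmp";   // assumed temporary name
    {
        std::ofstream os(tmp.c_str(), std::ios::out | std::ios::trunc);
        os << contents;
        os.close();
        if (!os)                             // open, write or close failed
        {
            std::remove(tmp.c_str());
            return false;                    // the old file is untouched
        }
    }
    std::remove(path.c_str());               // harmless if path doesn't exist yet
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}
```

Removing the old file before renaming matters on systems where rename() refuses to overwrite an existing file.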

The one-argument version is called only by the do_compute function in yabte_prob.cpp which, in turn, is called only from the parser when a prior or consequent string contains the name of another hypothesis. In this case, the hyps object won’t have any descriptions in it, as explained above; and so after writing the newly computed posterior to the first line of the file, we just copy the rest from the existing file.

It’s in the save functions where we make sure that we’re writing the first line in the "C" locale.

The delete_hypothesis function

    void delete_hypothesis(const string& hypname);
This function permanently deletes the hypothesis hypname. Note that it’s not an error if no such file exists…the function quietly does nothing when there’s nothing to do.

The miscellaneous functions

    bool data_saved(const string& hypname);
This function returns whether the file hypname exists; or it displays an error message if hypname exists and either isn’t a regular file or lacks read/write permissions.
    void create_user(const string& username);
This function is called for every new username. It creates a directory named username if such a directory doesn’t already exist; or it can display an error message instead if username exists but either isn’t a directory or lacks read/write/execute permissions.

yabte_prob.hpp and yabte_prob.cpp

These two files define how a probability is represented, which is actually a pair of doubles since we want to allow for a range like “20% to 30%”.

The probability class needs to be the type of the parser’s non-terminals, which means that it needs to be a member of the Bison parser’s %union, which in turn means that it has to be a POD. (There’s a way around that in Bison, but we choose not to use it for reasons explained later.)

So probability may not have any non-trivial constructors, which is why we have, instead of constructors, a bunch of static member functions that return probability objects created from the functions’ arguments. There’s also a requirement that all members have the same visibility, which is why we make it a struct and just leave everything public.

We’d like probability to be almost a numeric type, at least to the point of our being able to add, subtract, multiply and divide them, and to mix them with doubles in such expressions; but we shouldn’t need any other numerical capabilities.
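To make the POD constraint concrete, here’s a cut-down sketch. Everything in it is illustrative: prob_sketch, its member names, the factory names and the union are not the library’s actual interface, and the range arithmetic shown (multiplying the low ends and the high ends) is just the obvious rule for non-negative values, not necessarily what yabte_prob.cpp does.

```cpp
// Illustrative stand-in for yabte::probability.
struct prob_sketch
{
    double lo, hi;   // everything public, no constructors: still a POD

    // Static member functions take the place of constructors.
    static prob_sketch from_value(double v)
    {
        prob_sketch p;
        p.lo = v;
        p.hi = v;
        return p;
    }
    static prob_sketch from_range(double lo, double hi)
    {
        prob_sketch p;
        p.lo = lo;
        p.hi = hi;
        return p;
    }
};

inline prob_sketch operator*(const prob_sketch& a, const prob_sketch& b)
{
    return prob_sketch::from_range(a.lo * b.lo, a.hi * b.hi);
}

// Mixing with plain doubles in either order.
inline prob_sketch operator*(const prob_sketch& a, double d)
{
    return a * prob_sketch::from_value(d);
}
inline prob_sketch operator*(double d, const prob_sketch& a)
{
    return a * d;
}

inline bool operator==(const prob_sketch& a, const prob_sketch& b)
{
    return a.lo == b.lo && a.hi == b.hi;
}

// Because prob_sketch is a POD, it can sit in a C++98 union, which is
// what a Bison %union boils down to.
union semantic_value_sketch
{
    int token;
    prob_sketch prob;
};
```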

The user can enter a probability expression that contains a unary “complement” operator (1.0 minus the operand), and so we could overload either ~ or ! for that; but either would seem to violate the principle of least astonishment:  ~ is expected to be a bitwise operation, and ! is expected to yield a bool.

Note that probability::compute(const string&, const string&) uses RAII to handle pushing and popping the user name because the parser can throw an exception.

The do_compute function at the end of the anonymous namespace in yabte_prob.cpp is where we call the one-argument save function to just save a newly computed value.

Note that probabilities are equality comparable; but because they’re pairs of values, they can’t be ordered for the same reason that complex numbers can’t be:  there’s more than one thing that “less than” could mean.

Three member functions generate a text version of the value for use in other parts of the program:

yabte_comp.cpp

This file contains the definition of
    void compute(const hyps& hypdata)
which computes the posterior probability given hypdata and assigns that value to globals().posterior.

As stated in the user documentation, the Bayes’ Theorem calculation is always done twice, once to create a minimum and again to create a maximum, even if there’s really only one value being computed. (That’s not an efficiency concern:  the calculation is very fast compared to all the other stuff that the program does.)

For each of the two runs, we’ll have a std::vector<term> where a term is a pair of doubles representing a particular prior and its consequent. There’s a one-to-one correspondence between the elements in the vector<term> and the elements in hypdata.

Pushing a term onto the vector calls value(const string&); and it’s in that function where we finally get around to parsing a probability expression that the user entered.

Note that this function can throw any of the exceptions mentioned above:  either it can throw a range_error directly, or a runtime_error can pass through from the parser.
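A single run of the calculation might look something like this. It’s a sketch under assumptions: it uses the textbook form of Bayes’ Theorem with the main hypothesis as element 0 and every hypothesis contributing to the denominator, the names term_sketch and posterior_sketch are invented, and how the real code picks the inputs for the minimum and maximum runs isn’t shown.

```cpp
#include <stdexcept>
#include <utility>
#include <vector>

typedef std::pair<double, double> term_sketch;   // (prior, consequent)

// terms[0] is the main hypothesis; the rest are the alternates.
double posterior_sketch(const std::vector<term_sketch>& terms)
{
    if (terms.empty())
        throw std::range_error("no hypotheses");

    double denominator = 0.0;
    for (std::vector<term_sketch>::size_type i = 0; i < terms.size(); ++i)
        denominator += terms[i].first * terms[i].second;

    const double result = terms[0].first * terms[0].second / denominator;

    // Written as a negated test so that a NaN (0/0, say) is rejected too:
    // every comparison involving a NaN is false.
    if (!(result >= 0.0 && result <= 1.0))
        throw std::range_error("posterior is not between 0 and 1");

    return result;
}
```

With a main hypothesis of (0.5, 0.5) and one alternate of (0.5, 0.25), this gives 0.25 / 0.375, or two thirds.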

yabte_lex.l and yabte_parse.ypp

These two files make up the code that translates the user’s prior and consequent strings into numeric values. The files are intended to be run through the precompilers, flex and Bison.

One noteworthy thing about the scanner is that, since we’re creating a reentrant parser, the scanner’s entry point is yylex(YYSTYPE*); and yylval is the pointer passed to that function, which is why we use -> instead of . to select a particular member of the union.

It should probably also be mentioned that the scanner allows both the period and the comma as the decimal point, which should take care of most Occidental locales; and alternate operators, which are Latin-1 characters greater than '\x7F', are allowed to arrive in either straight Latin-1 or UTF-8.

Two interesting complications for the parser are, first, that we want our probability class to be the type of the parser’s non-terminals, and second, that because any term in a probability expression can be the name of another hypothesis, the parser will sometimes be called recursively.

There are two ways in Bison to assign types to symbols using %token<> and %type<>. One is to put the type itself in the angle brackets; but that turns out to be viral in the sense that it infects the scanner which we want to keep simple. We choose the traditional mechanism of putting the %union member names in the angle brackets. But that means that the %union has to have a probability member, so not only does the probability type need to be a POD as mentioned above, but also has to somehow be taken out of the %union before it gets to the scanner which doesn’t know what a probability is. We accomplish the latter with the YABTE_INCLUDE_PROB_IN_UNION macro which is defined inside yabte_parse.ypp but doesn’t get put into y.tab.h, the file in which the parser’s %token and %union declarations get communicated to the scanner.

Making the parser reentrant is easy: that’s just Bison’s %define api.pure full directive; but we still have to use the scanner recursively somehow. As it happens, flex already provides several functions for that, two of which are of interest to this parser:

    extern "C"  // stuff defined in flex
    {
        struct yy_buffer_state;

        yy_buffer_state* yy_scan_string(const char*);
        void yy_delete_buffer(yy_buffer_state*);
    }
The yy_scan_string function tells the scanner to scan the string passed as an argument instead of reading from the standard input. Cool! That’s just what we want to do in any event! The yy_scan_string function returns a pointer to something called a yy_buffer_state which needs to be freed when we’re done with it by calling the yy_delete_buffer function. Note that the parser never tries to do any arithmetic with that pointer, nor to dereference it, so it’s OK for yy_buffer_state to be an incomplete type in the parser.

In yabte.hpp, we have

    typedef std::vector<void*> state_stack;
and have two such things, parse_states and lex_states, in the globvars object (so that they’ll automatically be put into thread-local storage if the other globals are).

The former’s elements are actually pointers to a

    struct parse_state
    {
        yy_buffer_state* lexbuf;    // passed back to yy_delete_buffer()
        yabte::probability* result; // where we put the parsed value
        // ...
    };
which holds state information for the current, possibly recursive, parse.

An instance of struct new_parse (at the end of the anonymous namespace in yabte_parse.ypp) is a poltergeist in the parser’s entry point,

    parse(const char*, probability*);
to handle the pushing and popping cleanly using RAII. This is important because the parser reports all errors it detects by throwing an exception.
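The push-and-pop part of that idea, stripped of everything parser-specific, might be sketched like this. The class stack_guard and the function parse_sketch are invented for the example; the real new_parse presumably also deals with the scanner buffer, which is left out here.

```cpp
#include <stdexcept>
#include <vector>

typedef std::vector<void*> state_stack;   // same typedef as in yabte.hpp

// Hypothetical guard: pushes on construction, pops on destruction, so the
// stack is restored even when an exception unwinds through the parse.
class stack_guard
{
    state_stack& stack_;
    stack_guard(const stack_guard&);             // not copyable
    stack_guard& operator=(const stack_guard&);
public:
    stack_guard(state_stack& s, void* state) : stack_(s)
    {
        stack_.push_back(state);
    }
    ~stack_guard() { stack_.pop_back(); }
};

// Stand-in for a parse that may fail part-way through.
void parse_sketch(state_stack& s, bool fail)
{
    int dummy_state = 0;
    stack_guard guard(s, &dummy_state);
    if (fail)
        throw std::runtime_error("parse error");
    // normal return: the guard's destructor pops here as well
}
```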

yabte_localizable_scanner.cpp

This file, not used in the author’s CGI version, is a drop-in replacement for the flex scanner. It’s intended for use in front ends that wish to interpret the user’s prior and consequent expressions in accordance with the current global locale.

Three macros allow customizing the scanner for various internal and external character types:

If neither character type is specified, the scanner assumes that the external encoding is UTF-8; but it uses straight Latin-1 internally and so creates a codecvt<char,char,mbstate_t>-like facet of its own. If either character type is specified, the scanner just uses a
and so the current global locale had better have such a beast along with all of

If you want the encoding throughout your program to be straight Latin-1, just define YABTE_LEX_CHAR_TYPE to be char and leave it at that: you’ll get a std::codecvt<char,char,mbstate_t> which doesn’t actually do any conversion.

The yy_scan_string function will throw a runtime_error if codecvt<>::in() returns error or partial. The return value from out() isn’t checked since it’s called only to copy a user or hypothesis name to the memory allocated for yylval->s, and we know that in() has already run successfully.

Except for the decimal point (now limited to the decimal point character that we get from the numpunct facet, although you’ve probably already figured that out), the allowable prior and consequent expressions are strict supersets of what’s stated in the user documentation: decimal numbers may have any form that the current global locale’s num_get facet understands. (The allowable operators are still just the documented Latin-1 characters.)

In principle, the scanner will also allow user and hypothesis names to begin with any alphabetic character, with subsequent characters being any alphanumeric character or the underscore; but note that, using the supplied yabte_data.cpp, those strings must be directory and file names that are acceptable to the user’s operating system. A version of yabte_data.cpp that stores the user’s data in an SQL database is coming Real Soon Now.

As of this writing, this file has been successfully compiled with various internal and external character types with both MSVC on Windows and GCC on a Unix-like box; but it has been tested only in the "C" and Danish locales with neither character-type macro defined.

yabte_wrapper.h and yabte_wrapper.cpp

These two files, not used in the CGI version, provide a C binding to the library for use in front ends written in a language other than C++. (For example, a GUI that runs on OSX would probably be written in Objective-C.)

The header declares three opaque “handle” types (all actually just void*s):

GLOB_HANDLE yabte_globals(void) returns the handle to the globvars object.

void free_globals(void) is a thin wrapper around globvars::cleanup().

There are also getters and setters for individual globals:

const char* get_user_name(GLOB_HANDLE);
void set_user_name(GLOB_HANDLE, const char*);

const char* get_hyp_name(GLOB_HANDLE);
void set_hyp_name(GLOB_HANDLE, const char*);

const char* get_output_format(GLOB_HANDLE);  /* viz. "dec", "pct", "odds" */
void set_output_format(GLOB_HANDLE, const char*);

int get_output_precision(GLOB_HANDLE);
void set_output_precision(GLOB_HANDLE, int);

int get_validation_accuracy(GLOB_HANDLE);
void set_validation_accuracy(GLOB_HANDLE, int);

double get_epsilon(GLOB_HANDLE);

double get_minimum_posterior(GLOB_HANDLE);
double get_maximum_posterior(GLOB_HANDLE);

const char* get_error_message(GLOB_HANDLE);
void set_error_message(GLOB_HANDLE, const char*);
Note that there’s no setter for globvars::epsilon nor for the posterior probability. The former is computed by the set_validation_accuracy function; and the latter is set internally by the library. The posterior’s minimum and maximum values are broken out because code that would use this C binding presumably won’t know what a yabte::probability is.

HYPS_HANDLE get_complete_hypothesis(GLOB_HANDLE) returns a handle to the hyps object; and that handle can be used in:

The handle returned by the get_term function can be used by the getters and setters:
const char* get_prior(TERM_HANDLE);
void set_prior(TERM_HANDLE, const char*);

const char* get_consequent(TERM_HANDLE);
void set_consequent(TERM_HANDLE, const char*);

const char* get_description(TERM_HANDLE);
void set_description(TERM_HANDLE, const char*);

void read_hypothesis(const char*, HYPS_HANDLE) is a fairly thin wrapper around read(const std::string&,hyps&,bool). It always passes false as the third argument and returns no value. (Reading only for the purpose of computing the posterior is something done only internally by the expression parser.)

void save_hypothesis(const char*, HYPS_HANDLE) is a thin wrapper around save(const std::string&,const hyps&).

void delete_hypothesis(const char*) is a thin wrapper around delete_hypothesis(const std::string&).

void create_user(const char*) is a thin wrapper around create_user(const std::string&).

int data_saved(const char*) is a thin wrapper around data_saved(const std::string&). It returns 0 if the latter returns false, 1 if the latter returns true.

void compute_hypothesis(HYPS_HANDLE) is a wrapper around compute(const hyps&). It will never throw an exception as the function it wraps can; rather, if some error occurs, it will call a new error-reporting function declared in the C binding, compute_error(const char*,int). That function will be passed the exception’s what() as the first argument and one of YABTE_NOMEM, YABTE_RANGE_ERROR, YABTE_PARSE_ERROR, or YABTE_MISC_ERROR as the second.
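The general shape of such a wrapper can be sketched as follows. All the names are hypothetical: the SKETCH_* codes stand in for the YABTE_* macros (whose actual values, other than YABTE_NOMEM, aren’t assumed here), and the callback and the throwing functions exist only to exercise the example.

```cpp
#include <exception>
#include <new>
#include <stdexcept>

// Hypothetical codes standing in for the YABTE_* macros.
enum sketch_error
{
    SKETCH_OK, SKETCH_NOMEM, SKETCH_RANGE_ERROR,
    SKETCH_PARSE_ERROR, SKETCH_MISC_ERROR
};

typedef void (*compute_fn)();
typedef void (*error_callback)(const char*, int);

// Calls f and turns any exception into a callback plus a return code.
// range_error derives from runtime_error, so it has to be caught first.
int call_without_throwing(compute_fn f, error_callback report)
{
    try
    {
        f();
        return SKETCH_OK;
    }
    catch (const std::bad_alloc& e)
    {
        report(e.what(), SKETCH_NOMEM);
        return SKETCH_NOMEM;
    }
    catch (const std::range_error& e)
    {
        report(e.what(), SKETCH_RANGE_ERROR);
        return SKETCH_RANGE_ERROR;
    }
    catch (const std::runtime_error& e)
    {
        report(e.what(), SKETCH_PARSE_ERROR);
        return SKETCH_PARSE_ERROR;
    }
    catch (const std::exception& e)
    {
        report(e.what(), SKETCH_MISC_ERROR);
        return SKETCH_MISC_ERROR;
    }
    catch (...)
    {
        report("unknown error", SKETCH_MISC_ERROR);
        return SKETCH_MISC_ERROR;
    }
}

// Stand-ins used only to try the wrapper out.
static int last_code = -1;
void record_error(const char*, int code) { last_code = code; }
void throws_range() { throw std::range_error("posterior out of range"); }
void throws_parse() { throw std::runtime_error("syntax error"); }
void succeeds() { }
```

Keeping exceptions on the C++ side of the boundary matters because a front end written in C or Objective-C has no way to catch them.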

Finally, the compute_error function is declared along with the possible values for its second argument:

void compute_error(const char*, int);

#define YABTE_NOMEM       0
and the declarations of the fatal_error and io_error functions are repeated since code that includes yabte_wrapper.h probably won’t also include yabte.hpp.

You should never get YABTE_MISC_ERROR…if you do, there’s almost certainly something seriously wrong with this library itself.

All suggestions and corrections will be welcome; all flames will be amusing.
Mail to was at pobox dot com.