A Simple Statistics Library in C++

Bill Seymour
2023-09-17


Introduction

This paper documents an open-source C++ template library for computing a few really simple statistics, just the minimum, maximum, median, mean, variance and standard deviation of elements in a C++ container of some kind, the container elements having some numeric type.

These functions are trivial to write; but the author found himself writing them repeatedly, so he decided to write them just once.  The fun was keeping the library simple for use with built-in types while making it usable for user-defined types with only a little bit of extra work.

A precondition is that the container be non-empty and already sorted ascending.

The code is distributed under the Boost Software License.  (This is not part of Boost.  The author just likes their open-source license.)

The library is just a single header file.  That header, this document, and the Boost license are available here.

My goal was to write a library that could be compiled with any C++ implementation that conforms to C++11 or later; but that goal was not entirely achieved.


The basic idea for built-in types

If the container is a std::vector or std::deque (or behaves like one), and the data points are some built-in numeric type (or behave like one), then the library is simple and obvious.
namespace simpstat {

template<typename ContainerT> auto minimum(const ContainerT&);
template<typename ContainerT> auto maximum(const ContainerT&);
template<typename ContainerT> auto median(const ContainerT&);
template<typename ContainerT> auto mean(const ContainerT&);
template<typename ContainerT> auto variance(const ContainerT&);
template<typename ContainerT> auto stddev(const ContainerT&);

template<typename ContainerT> auto meanvardev(const ContainerT&);

} // namespace simpstat
All but minimum() and maximum() have additional template parameters shown below, but they all default, and ContainerT is deduced.

The first six templates all return ContainerT::value_type; meanvardev() returns a std::tuple of three ContainerT::value_types.

variance() calls mean(), and stddev() calls variance(); so to keep from doing the same work multiple times when you want more that one of those, the meanvardev() template is provided to calculate all three in a single call.  It doesn’t try to calculate the mean and the variance in a single loop; but it calls mean() only once; and the standard deviation is just the square root of the variance.

The actual work when dealing with built-in types is done in a long double in the hope of avoiding overflow in the variance calculation; and the result is cast to ContainerT::value_type, first rounding to nearest using std::round() if the value type is an integer.


With user-defined containers

The library will work with any container that has front(), back(), size(), begin(), and end() member functions, overloads the [] operator, and has a member type called value_type.

The library doesn’t call begin() or end() or use the container’s iterators explicitly; but it has a couple of range-based for loops, which is why begin() and end() are mentioned.  (And note that you’ll probably need begin() and end() returning random access iterators anyway since you’ll likely need to sort the container.  The library assumes that the container is already sorted.)

It’s not clear why you’d want to write your own container, but you can do it if you really need to provided that it works like a vector or a deque in the few ways specified above.


With user-defined value types

If the value type isn’t one of the built-in types, the library is more complicated; but the complications are intended to make it reasonably easy to use the library as long as your type overloads the usual arithmetic operators (including unary -) and the < operator.  It will also need to be both copyable and moveable and have an explicit conversion from int.

The complications are all in subnamespaces that users don’t really need to care about (but see the actual source file if you need to do anything fancier than what’s described below).

The general structure of the header is:

namespace simpstat {

namespace dflt
{
    // defaults, mostly for built-in types
}

namespace frac
{
    // the function templates that do the actual work using some
    // “fractional type” which will be long double for built-in
    // value types, which we hope is big enough to mitigate the
    // overflow problem given the sum-of-squares in the variance
    // calculation, and which, for integer value types, can hold
    // values between two adjacent value type values
}

// the function templates described above

} // namespace simpstat
Let’s say that you have a rational number with at least the following public API:
class rational
{
public:
    explicit rational(int);

    // the canonical special member functions

    rational& operator+=(const rational&);
    rational& operator-=(const rational&);
    rational& operator*=(const rational&);
    rational& operator/=(const rational&);

    bool operator<(const rational&) const;

    // ...
};

rational operator-(const rational&);

rational operator+(const rational&, const rational&);
rational operator-(const rational&, const rational&);
rational operator*(const rational&, const rational&);
rational operator/(const rational&, const rational&);
You might also need a “fractional type” if you don’t want it to default to long double; and if your value type and fractional type are different types, you’ll need explicit conversions between the value type and the fractional type.  (In this example, we’ll just use rational as both the value type and the fractional type, and so no conversions are required.)

And if you want a standard deviation, you’ll also need to extract the square root of the variance while it still has your fractional type.  If that’s a built-in floating point type, the square root operation will happily default to std::sqrt(); otherwise, the library provides in the simpstat::dflt namespace:

template<typename FracT, int ErrV = 1000, double FixedErrV = 1.0>
class NewtonSqrt
{
    //
    // Note that we require an explicit conversion from int to FracT.
    // We hope that it’s a constant expression but can’t count on it.
    //
    static const FracT zero = static_cast<FracT>(0);
    static const FracT one  = static_cast<FracT>(1);
    static const FracT two  = static_cast<FracT>(2);
    static const FracT err  = static_cast<FracT>(ErrV);

    static FracT absval(FracT val)
    {
        return val < zero ? -val : val;
    }

public:
    static FracT sqrt(FracT val)
    {
        FracT error = INT_MIN == ErrV ? FracT(FixedErrV) : val / err;
        FracT root = one;
        while (error < absval(val - root * root))
        {
            // Newton’s method:
            root = (val / root + root) / two;
        }
        return root;
    }
};
which is what you’ll get, with the second and third template arguments defaulting as shown, if you don’t do anything else.

It computes a square root by successive approximation using Newton’s method and terminates when the difference between the value and the square of the trial root is not greater than some error.  The allowable error will default to the passed value divided by the second template parameter, which parameter in turn defaults to 1000, thus giving an answer within ±0.1% of the passed value.  You can specify other values for the second template argument if you want some other accuracy; but note that the template argument will be a divisor, not a multiplier (so a bigger argument yields a smaller error); and since it’s a non-type template argument, it must be a constant expression.

Note that the NewtonSqrt template has a non-type template parameter of type double which requires at least C++20.  If that’s a problem, you can make the FixedErrV parameter be an int whose value is the reciprocal of the desired error (like I did for the ErrV parameter), and change the expression after the ? in the first line of the sqrt function to FracT(1.0/double(FixedErrV)).  In any event, a user-defined fractional type will need an explicit conversion from double in addition to the conversion from int that’s needed for both the value type and the fractional type.

If that’s not good enough for you, you’ll need to write your own; but that might not be as daunting as it sounds.  Let’s say that you’re happy with Newton’s method but you want one additional decimal digit of accuracy:

struct ratsqrt
{
    static rational sqrt(const rational& val)
    {
        // Just specify a 2nd template argument 10x the default:
        return simpstat::dflt:NewtonSqrt<rational,10000>::sqrt(val);
    }
};

Or maybe you want some fixed error that doesn’t depend on what the passed value is, in which case you can use INT_MIN as the second template argument and specify the fixed error as the third.  Let’s say that your ultimate value type is an integer, so you’ll be happy with an error of ±½:

struct my_sqrt
{
    static my_frac sqrt(const my_frac& val)
    {
        return simpstat::dflt:NewtonSqrt<my_frac,INT_MIN,0.5>::sqrt(val);
    }
};

Other than the square root (needed only for standard deviations) and the fractional type (needed for medians and standard deviations), using the library should be only a matter of specifying the actual template arguments.  The most “interesting” function to call is:

template<typename ContainerT,
         typename FracT = dflt::DefaultFrac,
         typename RootT = dflt::DefaultRoot<FracT>,
         typename CastT =
             dflt::DefaultCaster<typename ContainerT::value_type,FracT>>
/*
std::tuple<typename ContainerT::value_type,
           typename ContainerT::value_type,
           typename ContainerT::value_type>
*/
auto meanvardev(const ContainerT&);
Given:
std::vector<rational> data;
//load data with at least one value and sort it
you can use the library with only a little more work:
using namespace simpstat; // for simplicity herein

rational min = minimum(data);
rational max = maximum(data);

//explicitly specify <     ContainerT      ,  FracT >
rational med = median<std::vector<rational>,rational>(data);

                           // <     ContainerT      ,  FracT , RootT >
const auto others = meanvardev<std::vector<rational>,rational,ratsqrt>(data);
rational avg = std::get<0>(others);
rational var = std::get<1>(others);
rational dev = std::get<2>(others);
The minimum() and maximum() templates have only a single template parameter which can be deduced.  For all the others, you’ll need to specify most or all of your template arguments explicitly.  In our example, since the value type and fractional type are the same type (both rational), the CastT parameter can default.  (It will assume that it’s dealing with a user-defined type and default to just a copy or a move as appropriate.)

Only the stddev() and meanvardev() templates have the RootT parameter.

Note that the RootT and CasT parameters are expected to be class templates with static sqrt(FracT) and cast(FracT) members, respectively.
They’re not just function templates since they’d need to be non-type template parameters with pointer-to-function type.  That’s doable, but ugh!


All the template parameters

namespace simpstat {

template<typename ContainerT>
auto minimum(const ContainerT&);

template<typename ContainerT>
auto maximum(const ContainerT&);

template<typename ContainerT,
         typename FracT = dflt::DefaultFrac,
         typename CastT =
             dflt::DefaultCaster<typename ContainerT::value_type,FracT>>
auto median(const ContainerT&);

template<typename ContainerT,
         typename FracT = dflt::DefaultFrac,
         typename CastT =
             dflt::DefaultCaster<typename ContainerT::value_type,FracT>>
auto mean(const ContainerT&);

template<typename ContainerT,
         typename FracT = dflt::DefaultFrac,
         typename CastT =
             dflt::DefaultCaster<typename ContainerT::value_type,FracT>>
auto variance(const ContainerT&);

template<typename ContainerT,
         typename FracT = dflt::DefaultFrac,
         typename RootT = dflt::DefaultRoot<FracT>,
         typename CastT =
             dflt::DefaultCaster<typename ContainerT::value_type,FracT>>
auto stddev(const ContainerT&);

template<typename ContainerT,
         typename FracT = dflt::DefaultFrac,
         typename RootT = dflt::DefaultRoot<FracT>,
         typename CastT =
             dflt::DefaultCaster<typename ContainerT::value_type,FracT>>
auto meanvardev(const ContainerT&);

} // namespace simpstat


All suggestions and corrections will be welcome; all flames will be amusing.
Mail to was at pobox dot com.