Category: general

f32, u32, and const

Some time ago, I wrote “floats, bits, and constant expressions” about converting floating point number into its representative ones and zeros as a C++ constant expression – constructing the IEEE 754 representation without being able to examine the bits directly.

I’ve been playing around with Rust recently, and rewrote that conversion code as a bit of a learning exercise for myself, with a thoroughly contrived set of constraints: using integer and single-precision floating point math, at compile time, without unsafe blocks, while using as few unstable features as possible.

I’ve included the listing below, for your bemusement and/or head-shaking, and you can play with the code in the Rust Playground and rust.godbolt.org

Modern C++ Randomness

This thread happened…

Choose:

— Elan Ruskin (@despair) May 24, 2018

What is the type of x?

— Jonathan Adamczewski (@twoscomplement) May 24, 2018

std::optional<std::any>

— Andreas Fredriksson (@deplinenoise) May 24, 2018

I’ll see your C++17, and raise you C89:#define x rand()

— Jonathan Adamczewski (@twoscomplement) May 24, 2018

auto x = []() { return rand(); }

— Andreas Fredriksson (@deplinenoise) May 24, 2018

tsk tsk… that’s not mordern enough: auto x = []() { std::random_device rd; std::mt19937 gen(rd()); std::uniform_int_distribution<> dis(0, 32767); return dis(gen); } <— FIXED IT FOR YOU!

— Bobby Anguelov (@BobbyAnguelov) May 24, 2018

So, rand() assembles to ‘jmp rand’. That abomination on the other hand.. My eyes! https://t.co/ujNkLrzavs

— Andreas Fredriksson (@deplinenoise) May 24, 2018

So I did a little digging to satisfy my own curiosity about the “modern C++” version, and have learned a few things that I didn’t know previously…

(this is a manual unrolled twitter thread that starts here, with slight modifications)

Nearly all of this I gleaned from the invaluable http://cppreference.com and http://godbolt.org . Comments about implementation refer specifically to the gcc-8.1 C++ standard library, examined using Compiler Explorer and the -E command line option.

std::random_device is a platform-specific source of entropy.

std: mt19937 is a parameterized typedef of std::mersenne_twister_engine

specifically:
std::mersenne_twister_engine<uint_fast32_t, 32, 624, 397, 31, 0x9908b0df, 11, 0xffffffff, 7, 0x9d2c5680, 15, 0xefc60000, 18, 1812433253>
(What do those number mean? I don’t know.)

And std::uniform_int_distribution produces uniformly distributed random numbers over a specified range, from a provided generator.

The default constructor for std::random_device takes an implementation-defined argument, with a default value.

The meaning of the argument is implementation-defined – but the type is not: std::string. (I’m not sure why a dynamically modifiable string object was the right choice to be the configuration parameter for an entropy generator.)

There are out-of-line private functions for much of this implementation of std::random_device. The constructor that calls the out-of-line init function is itself inline – so the construction and destruction of the default std::string param is also generated inline.

Also, peeking inside std::random_generator, there is a union with two members:

void* _M_file, which I guess would be used to store a file handle for /dev/urandom or similar.

std::mt19937 _M_mt, which is a … parameterized std::mersenne_twister_engine object.

So it seems reasonable to me that if you can’t get entropy* from outside your program, generate your own approximation. It looks like it is possible that the entropy for the std::mersenne_twister_engine will be provided by a std::mersenne_twister_engine.

Unlike std::random_device, which has its implementation out of line, std::mersenne_twister_engine‘s implementation seems to be all inline. It is unclear what benefits this brings, but it results in a few hundred additional instructions generated.

And then there’s std::uniform_int_distribution, which seems mostly unsurprising. It is again fully inline, which (from a cursory eyeballing) may allow a sufficiently insightful compiler to avoid a couple of branches and function calls.

The code that got me started on this was presented in jest – but (std::random_device + std::mt19937 + std::uniform_int_distribution) is a commonly recommended pattern for generating random numbers using these modern C++ library features.

My takeaways:
std::random_device is potentially very expensive to use – and doesn’t provide strong cross-platform guarantees about the randomness it provides. It is configured with an std::string – the meaning of which is platform dependent. I am not compelled to use this type.

std::mt19937 adds a sizeable chunk of codegen via its inline implementation – and there are better options than Mersenne Twister.

Bottom line: I’m probably going to stick with rand(), and if I need something a little fancier, http://www.pcg-random.org/ or one of the other suggestions provided as replies to the twitter thread.

Addition: the code I was able to gather, representing some relevant parts

Watch as the OS rewrites my buggy program.

I didn’t know that SetErrorMode(SEM_NOALIGNMENTFAULTEXCEPT) was a thing, until I wrote a bad test that wouldn’t crash.

Digging into it, I found that a movaps instruction was being rewritten as movups, which was a thoroughly confusing thing to see.

The one clue I had was that a fault due to an unaligned load had been observed in non-test code, but did not reproduce when written as a test using the google-test framework. A short hunt later (including a failed attempt at writing a small repro case), I found an explanation: google test suppresses this class of failure.

The code below will successfully demonstrate the behavior, printing out the SIMD load instruction before and after calling the function with an unaligned pointer.

[Gist]

TransientFunction: A light-weight alternative to std::function [C++11]

[Gist]

Priorities for my team

(unthreaded from here)

During the day, I’m a Lead of a group of programmers. We’re responsible for a range of tools and tech used by others at the company for making games.

I have a list of the my priorities (and some related questions) of things that I think are important for us to be able to do well as individuals, and as a team:

Treat people with respect. Value their time, place high value on their well-being, and start with the assumption that they have good intentions
(“People” includes yourself: respect yourself, value your own time and well-being, and have confidence in your good intentions.)
When solving a problem, know the user and understand their needs.
- Do you understand the problem(s) that need to be solved? (it’s easy to make assumptions)
- Have you spoken to the user and listened to their perspective? (it’s easy to solve the wrong problem)
- Have you explored the specific constraints of the problem by asking questions like:
  - Is this part needed? (it’s easy to over-reach)
  - Is there a satisfactory simpler alternative? (actively pursue simplicity)
  - What else will be needed? (it’s easy to overlook details)
- Have your discussed your proposed solution with users, and do they understand what you intend to do? (verify, and pursue buy-in)
- Do you continue to meet regularly with users? Do they know you? Do they believe that you’re working for their benefit? (don’t under-estimate the value of trust)
Have a clear understanding of what you are doing.
- Do you understand the system you’re working in? (it’s easy to make assumptions)
- Have you read the documentation and/or code? (set yourself up to succeed with whatever is available)
- For code:
  - Have you tried to modify the code? (pull a thread; see what breaks)
  - Can you explain how the code works to another programmer in a convincing way? (test your confidence)
  - Can you explain how the code works to a non-programmer?
When trying to solve a problem, debug aggressively and efficiently.
- Does the bug need to be fixed? (see 1)
- Do you understand how the system works? (see 2)
- Is there a faster way to debug the problem? Can you change code or data to cause the problem to occur more quickly and reliably? (iterate as quickly as you can, fix the bug, and move on)
- Do you trust your own judgement? (debug boldly, have confidence in what you have observed, make hypotheses and test them)
Pursue excellence in your work.
- How are you working to be better understood? (good communication takes time and effort)
- How are you working to better understand others? (don’t assume that others will pursue you with insights)
- Are you responding to feedback with enthusiasm to improve your work? (pursue professionalism)
- Are you writing high quality, easy to understand, easy to maintain code? How do you know? (continue to develop your technical skills)
- How are you working to become an expert and industry leader with the technologies and techniques you use every day? (pursue excellence in your field)
- Are you eager to improve (and fix) systems you have worked on previously? (take responsibility for your work)

The list was created for discussion with the group, and as an effort to articulate my own expectations in a way that will help my team understand me.

Composing this has been useful exercise for me as a lead, and definitely worthwhile for the group. If you’ve never tried writing down your own priorities, values, and/or assumptions, I encourage you to try it :)

A little bit of floating point in a memory allocator — Part 2: The floating point

[Previously]

This post contains the same material as this thread of tweets, with a few minor edits.

In IEEE754, floating point numbers are represented like this:

±2ⁿⁿⁿ×1.sss…

nnn is the exponent, which is floor(log2(size)) — which happens to be the fl value computed by TLSF.

sss… is the significand fraction: the part that follows the decimal point, which happens to be sl.

And so to calculate fl and sl, all we need to do is convert size to a floating point value (on recent x86 hardware, that’s a single instruction). Then we can extract the exponent, and the upper bits of the fractional part, and we’re all done :D

That can be implemented like this:

double sf = (int64_t)size;
uint64_t sfi;
memcpy(&sfi, &sf, 8);
fl = (sfi >> 52) - (1023 + 7);
sl = (sfi >> 47) & 31;

There’s some subtleties (there always is). I’ll break it down…

double sf = (int64_t)size;

Convert size to a double, with an explicit cast. size has type size_t, but using TLSF from github.com/mattconte/tlsf, the largest supported allocation on 64bit architecture is 2^32 bytes – comfortably less than the precision provided by the double type. If you need your TLSF allocator to allocate chunks bigger than 2^53, this isn’t the technique for you :)

I first tried using float (not double), which can provide correct results — but only if the rounding mode happens to be set correctly. double is easier.

The cast to (int64_t) results in better codegen on x86: without it, the compiler will generate a full 64bit unsigned conversion, and there is no single instruction for that.

The cast tells the compiler to (in effect) consider the bits of size as if they were a two’s complement signed value — and there is an SSE instruction to handle that case (cvtsi2sdq or similar). Again, with the implementation we’re using size can’t be that big, so this will do the Right Thing.

uint64_t sfi;
memcpy(&sfi, &sf, 8);

Copy the 8 bytes of the double into an unsigned integer variable. There are a lot of ways that C/C++ programmers copy bits from floating point to integer – some of them are well defined :) memcpy() does what we want, and any moderately respectable compiler knows how to select decent instructions to implement it.

Now we have floating point bits in an integer register, consisting of one sign bit (always zero for this, because size is always positive), eleven exponent bits (offset by 1023), and 52 bits of significant fraction. All we need to do is extract those, and we’re done :)

fl = (sfi >> 52) - (1023 + 7);

Extract the exponent: shift it down (ignoring the always-zero sign bit), subtract the offset (1023), and that 7 we saw earlier, at the same time.

sl = (sfi >> 47) & 31;

Extract the five most significant bits of the fraction – we do need to mask out the exponent.

And, just like that*, we have mapping_insert(), implemented in terms of integer -> floating point conversion.

* Actual code (rather than fragments) may be included in a later post…

A little bit of floating point in a memory allocator — Part 1: Background

This post contains the same material as this thread of tweets, with a few minor edits.

Over my holiday break at the end of 2017, I took a look into the TLSF (Two Level Segregated Fit) memory allocator to better understand how it works. I’ve made use of this allocator and have been impressed by its real world performance, but never really done a deep dive to properly understand it.

The mapping_insert() function is a key part of the allocator implementation, and caught my eye. Here’s how that function is described in the paper A constant-time dynamic storage allocator for real-time systems:

I’ll be honest: from that description, I never developed a clear picture in my mind of what that function does.

(Reading it now, it seems reasonably clear – but I can say that only after I spent quite a bit of time using other methods to develop my understanding)

Something that helped me a lot was by looking at the implementation of that function from github.com/mattconte/tlsf/. There’s a bunch of long-named macro constants in there, and a few extra implementation details. If you collapse those it looks something like this:

void mapping_insert(size_t size, int* fli, int* sli)
{ 
  int fl, sl;
  if (size < 256)
  {
    fl = 0;
    sl = (int)size / 8;
  }
  else
  {
    fl = fls(size);
    sl = (int)(size >> (fl - 5)) ^ 0x20;
    fl -= 7;
  }
  *fli = fl;
  *sli = sl;
}

It’s a pretty simple function (it really is). But I still failed to *see* the pattern of results that would be produced in my mind’s eye.

I went so far as to make a giant spreadsheet of all the intermediate values for a range of inputs, to paint myself a picture of the effect of each step :) That helped immensely.

Breaking it down…

There are two cases handled in the function: one for when size is below a certain threshold, and on for when it is larger. The first is straightforward, and accounts for a small number of possible input values. The large size case is more interesting.

The function computes two values: fl and sl, the first and second level indices for a lookup table. For the large case, fl (where fl is “first level”) is computed via fls(size) (where fls is short for “find last set” – similar names, just to keep you on your toes).

fls() returns the index of the largest bit set, counting from the least significant slbit, which is the index of the largest power of two. In the words of the paper:

“the instruction fls can be used to compute the ⌊log2(x)⌋ function”

Which is, in C-like syntax: floor(log2(x))

And there’s that “fl -= 7” at the end. That will show up again later.

For the large case, the computation of sl has a few steps:

sl = (size >> (fl – 5)) ^ 0x20;

Depending on shift down size by some amount (based on fl), and mask out the sixth bit?

(Aside: The CellBE programmer in me is flinching at that variable shift)

It took me a while (longer than I would have liked…) to realize that this
size >> (fl – 5) is shifting size to generate a number that has exactly six significant bits, at the least significant end of the register (bits 5 thru 0).

Because fl is the index of the most significant bit, after this shift, bit 5 will always be 1 – and that “^ 0x20” will unset it, leaving the result as a value between 0 and 31 (inclusive).

So here’s where floating point comes into it, and the cute thing I saw: another way to compute fl and sl is to convert size into an IEEE754 floating point number, and extract the exponent, and most significant bits of the mantissa. I’ll cover that in the next part, here.

Aside: Over-engineered Min() [C++, variadic templates, constexpr, fold left]

Q: Given a function constexpr int Min(int a, int b), construct a function constexpr int Min(Args... args) that returns the minimum of all the provided args. Fail to justify your over-engineering.

A: Rename Min(int, int) as MinImpl(int, int) or stick it in a namespace. Overloading the function is not only unnecessary, it gets in the way of the implementation.

constexpr int MinImpl(int a, int b)
{
  return a < b ? a : b;
}

Implement a constexpr fold left function. If we can use it for Min(), we should be able to do the same for Max(), and other similar functions. Should we be able to find any (#prematuregeneralization).

template<typename ArgA, typename ArgB, typename Func>
constexpr auto foldl(Func func, ArgA a, ArgB b)
{
  return func(a, b);
}

template<typename ArgA, typename ArgB,
typename Func, typename ...Args>
constexpr auto foldl(Func func, ArgA a, ArgB b, Args... args)
{
  return foldl(func, func(a, b), args...);
}

Combine the two.

template<typename ...Args>
constexpr auto Min(Args... args)
{
  return foldl(MinImpl, args...);
}

Add the bare minimum amount of testing for a constexpr function: slap a static_assert() on it.

static_assert(Min(6, 4, 5, 3, 9) == 3), "Nope");

I did so with Visual Studio 2015 Update 2. It did not object.

Addendum: Some discussion with @nlguillemot and @DrPizza led to this attempt to do something similar with a C++17/C++1z fold-expression:

#include <limits.h>

constexpr int MinImpl1(int a, int b) { return a < b ? a : b; }

constexpr void MinImpl2(int* m, int a, int b) { *m = a < b ? a : b; }

template<typename ...Args>
constexpr int Min(Args... args)
{
  int m = INT_MAX;
  // a binary expression in an operand of a fold-expression 
  // is not allowed, so this won't compile:
  //((m = MinImpl1(m, args), ...);

  // But this does:
  (MinImpl2(/*out*/&m, m, args), ...);
  return m;
}

int main()
{
  static_assert(Min(3,4,5) == 3, "nope");
}

This compiles with a gcc-6 pre-release snapshot.

Update: Here’s a further updated version, based on a refinement by @dotstdy.

floats, bits, and constant expressions

Can you access the bits that represent an IEEE754 single precision float in a C++14 constant expression (constexpr)?

(Why would you want to do that? Maybe you want to run a fast inverse square root at compile time. Or maybe you want to do something that is actually useful. I wanted to know if it could be done.)

For context: this article is based on experiences using gcc-5.3.0 and clang-3.7.1 with -std=c++14 -march=native on a Sandy Bridge Intel i7. Where I reference sections from the C++ standard, I’m referring to the November 2014 draft.

Before going further, I’ll quote 5.20.6 from the standard:

Since this International Standard imposes no restrictions on the accuracy of floating-point operations, it is unspecified whether the evaluation of a floating-point expression during translation yields the same result as the evaluation of the same expression (or the same operations on the same values) during program execution.⁸⁸

88) Nonetheless, implementations are encouraged to provide consistent results, irrespective of whether the evaluation was performed during translation and/or during program execution.

In this post, I document things that worked (and didn’t work) for me. You may have a different experience.

Methods of conversion that won’t work

(Error text from g++-5.3.0)

You can’t access the bits of a float via a typecast pointer [which is undefined behavior, and covered by 5.20.2.5]:

  constexpr uint32_t bits_cast(float f)
  {
    return *(uint32_t*)&f; // [2]
  }

error:
  accessing value of 'f' through a 'uint32_t {aka unsigned int}' glvalue
  in a constant expression

You can’t convert it via a reinterpret cast [5.20.2.13]

constexpr uint32_t bits_reinterpret_cast(float f) 
{ 
  const unsigned char* cf = reinterpret_cast<const unsigned char*>(&f); 
  // endianness notwithstanding
  return (cf[3] << 24) | (cf[2] << 16) | (cf[1] << 8) | cf[0]; 
}

error:
  '*(cf + 3u)' is not a constant expression

(gcc reports an error with the memory access, but does not object to the reinterpret_cast. clang produces a specific error for the cast.)

You can’t convert it through a union [gcc, for example, permits this for non-constant expressions, but the standard forbids it in 5.20.2.8]:

constexpr uint32_t bits_union(float f) 
{ 
  union Convert { 
    uint32_t u;
    float f;
    constexpr Convert(float f_) : f(f_) {}
  };
  return Convert(f).u;
}

error:
  accessing 'bits_union(float)::Convert::u' member instead of 
  initialized 'bits_union(float)::Convert::f' member 
  in constant expression

You can’t use memcpy() [5.20.2.2]:

constexpr uint32_t bits_memcpy(float f) 
{
  uint32_t u = 0;
  memcpy(&u, &f, sizeof f);
  return u;
}

error:
  'memcpy(((void*)(&u)), ((const void*)(&f)), 4ul)' 
  is not a constant expression

And you can’t define a constexpr memcpy()-like function that is capable of the task [5.20.2.11]:

constexpr void* memcpy(void* dest, const void* src, size_t n)
{
  char* d = (char*)dest;
  const char* s = (const char*)src;
  while(n-- > 0)
    *d++ = *s++;
  return dest;
}

constexpr uint32_t bits_memcpy(float f)
{
  uint32_t u = 0;
  memcpy(&u, &f, sizeof f);
  return u;
}

error:
  accessing value of 'u' through a 'char' glvalue
  in a constant expression

So what can you do?

Floating point operations in constant expressions

For constexpr float f = 2.0f, g = 2.0f the following operations are available [as they are not ruled out by anything I can see in 5.20]:

Comparison of floating point values e.g.
static_assert(f == g, "not equal");
Floating point arithmetic operations e.g.
static_assert(f * 2.0f == 4.0f, "arithmetic failed");
Casts from float to integral value, often with well-defined semantics e.g.
constexpr int i = (int)2.0f; static_assert(i == 2, "conversion failed");

So I wrote a function (uint32_t bits(float)) that will return the binary representation of an IEEE754 single precision float. The full function is at the end of this post. I’ll go through the various steps required to produce (my best approximation of) the desired result.

1. Zero

When bits() is passed the value zero, we want this behavior:

static_assert(bits(0.0f) == 0x00000000);

And we can have it:

  if (f == 0.0f)
    return 0;

Nothing difficult about that.

2. Negative zero

In IEEE754 land, negative zero is a thing. Ideally, we’d like this behavior:

static_assert(bits(-0.0f) == 0x80000000)

But the check for zero also matches negative zero. Negative zero is not something that the C++ standard has anything to say about, given that IEEE754 is an implementation choice [3.9.1.8: “The value representation of floating-point types is implementation defined”]. My compilers treat negative zero the same as zero for all comparisons and arithmetic operations. As such, bits() returns the wrong value when considering negative zero, returning 0x00000000 rather than the desired 0x80000000.

I did look into other methods for detecting negative zero in C++, without finding something that would work in a constant expression. I have seen divide by zero used as a way to detect negative zero (resulting in ±infinity, depending on the sign of the zero), but that doesn’t compile in a constant expression:

constexpr float r = 1.0f / -0.0f;

error: '(1.0e+0f / -0.0f)' is not a constant expression

and divide by zero is explicitly named as undefined behavior in 5.6.4, and so by 5.20.2.5 is unusable in a constant expression.

Result: negative zero becomes positive zero.

3. Infinity

We want this:

static_assert(bits(INFINITY) == 0x7f800000);

And this:

  else if (f == INFINITY)
    return 0x7f800000;

works as expected.

4. Negative Infinity

Same idea, different sign:

static_assert(bits(-INFINITY) == 0xff800000);

  else if (f == -INFINITY)
    return 0xff800000;

Also works.

5. NaNs

There’s no way to generate arbitrary NaN constants in a constant expression that I can see (not least because casting bits to floats isn’t possible in a constant expression, either), so it seems impossible to get this right in general.

In practice, maybe this is good enough:

static_assert(bits(NAN) == 0x7fc00000);

NaN values can be anywhere in the range of 0x7f800001 -- 0x7fffffff and 0xff800001 -- 0xffffffff. I have no idea as to the specific values that are seen in practice, nor what they mean. 0x7fc00000 shows up in /usr/include/bits/nan.h on the system I’m using to write this, so — right or wrong — I’ve chosen that as the reference value.

It is possible to detect a NaN value in a constant expression, but not its payload. (At least that I’ve been able to find). So there’s this:

  else if (f != f) // NaN
    return 0x7fc00000; // This is my NaN...

Which means that of the 2*(2²³-1) possible NaNs, one will be handled correctly (in this case, 0x7fc00000). For the other 16,777,213 values, the wrong value will be returned (in this case, 0x7fc00000).

So… partial success? NaNs are correctly detected, but the bits for only one NaN value will be returned correctly.

(On the other hand, the probability that it will ever matter could be stored as a denormalized float)

6. Normalized Values

// pseudo-code
static_assert(bits({  0x1p-126f, ...,  0x1.ffff7p127})
                == { 0x00800000, ...,  0x7f7fffff});
static_assert(bits({ -0x1p-126f, ..., -0x1.ffff7p127})
                == { 0x80800000, ...,  0xff7fffff});

[That 0x1pnnnf format happens to be a convenient way to represent exact values that can be stored as binary floating point numbers]

It is possible to detect and correctly construct bits for every normalized value. It does requires a little care to avoid truncation and undefined behavior. I wrote a few different implementations — the one that I describe here requires relatively little code, and doesn’t perform terribly [0].

The first step is to find and clear the sign bit. This simplifies subsequent steps.

 bool sign = f < 0.0f; 
  float abs_f = sign ? -f : f;

Now we have abs_f — it’s positive, non-zero, non-infinite, and not a NaN.

What happens when a float is cast to an integral type?

uint64_t i = (uint64_t)f;

The value of f will be stored in i, according to the following rules:

The value will be rounded towards zero which, for positive values, means truncation of any fractional part.
If the value in f is too large to be represented as a uint64_t (i.e. f > 2⁶⁴-1) the result is undefined.

If truncation takes place, data is lost. If the number is too large, the result is (probably) meaningless.

For our conversion function, if we can scale abs_f into a range where it is not larger than (2⁶⁴-1), and it has no fractional part, we have access to an exact representation of the bits that make up the float. We just need to keep track of the amount of scaling being done.

Single precision IEEE 754 floating point numbers have, at most, (23+1) bits of precision (23 in the significand, 1 implicit). This means that we can scale down large numbers and scale up small numbers into the required range.

Multiplying by powers of two change only the exponent of the float, and leave the significand unmodified. As such, we can arbitrarily scale a float by a power of two and — so long as we don’t over- or under-flow the float — we will not lose any of the bits in the significand.

For the sake of simplicity (believe it or not [1]), my approach is to scale abs_f in steps of 2⁴¹ so that (abs_f ≥ 2⁸⁷) like so:

  int exponent = 254; 

  while(abs_f < 0x1p87f) 
  { 
    abs_f *= 0x1p41f; 
    exponent -= 41; 
  }

If abs_f ≥ 2⁸⁷, the least significant bit of abs_f, if set, is 2^(87-23)==2^64.

Next, abs_f is scaled back down by 2⁶⁴ (which adds no fractional part as the least significant bit is 2⁶⁴) and converted to an unsigned 64 bit integer.

  uint64_t a = (uint64_t)(abs_f * 0x1p-64f);

All of the bits of abs_f are now present in a, without overflow or truncation. All that is needed now is to determine where they are:

  int lz = count_leading_zeroes(a);

adjust the exponent accordingly:

  exponent -= lz;

and construct the result:

  uint32_t significand = (a << (lz + 1)) >> (64 - 23); // [3]
  return (sign << 31) | (exponent << 23) | significand;

With this, we have correct results for every normalized float.

7. Denormalized Values

// pseudo-code
static_assert(bits({  0x1.0p-149f, ...,  0x1.ffff7p-127f})
                == {  0x00000001,  ...,  0x007fffff});
static_assert(bits({ -0x1.0p-149f, ..., -0x1.ffff7p-127f})
                == {  0x80000001,  ...,  0x807fffff});

The final detail is denormalized values. Handling of normalized values as presented so far fails because denormals will have additional leading zeroes. They are fairly easy to account for:

   if (exponent <= 0) 
  { 
    exponent = 0; 
    lz = 8 - 1; 
  }

To attempt to demystify that lz = 8 - 1 a little: there are 8 leading bits that aren’t part of the significand of a denormalized single precision float after the repeated 2^-41 scaling that has taken place. There is also no leading 1 bit that is present in all normalized numbers (which is accounted for in the calculation of significand above as (lz + 1)). So the leading zero count (lz) is set to account for the 8 bits of offset to the start of the denormalized significand, minus the one that the subsequent calculation assumes it needs to skip over.

And that’s it. All the possible values of a float are accounted for.

(Side note: If you’re compiling with -ffast-math, passing denormalized numbers to bits() will return invalid results. That’s -ffast-math for you. With gcc or clang, you could add an #ifdef __FAST_MATH__ around the test for negative exponent.)

Conclusion

You can indeed obtain the bit representation of a floating point number at compile time. Mostly. Negative zero is wrong, NaNs are detected but otherwise not accurately converted.

Enjoy your compile-time bit-twiddling!

The whole deal:

// Based on code from 
// https://graphics.stanford.edu/~seander/bithacks.html
constexpr int count_leading_zeroes(uint64_t v) 
{ 
  constexpr char bit_position[64] = {  
     0,  1,  2,  7,  3, 13,  8, 19,  4, 25, 14, 28,  9, 34, 20, 40, 
     5, 17, 26, 38, 15, 46, 29, 48, 10, 31, 35, 54, 21, 50, 41, 57, 
    63,  6, 12, 18, 24, 27, 33, 39, 16, 37, 45, 47, 30, 53, 49, 56, 
    62, 11, 23, 32, 36, 44, 52, 55, 61, 22, 43, 51, 60, 42, 59, 58 }; 
   
  v |= v >> 1; // first round down to one less than a power of 2  
  v |= v >> 2; 
  v |= v >> 4; 
  v |= v >> 8; 
  v |= v >> 16; 
  v |= v >> 32; 
  v = (v >> 1) + 1; 
   
  return 63 - bit_position[(v * 0x0218a392cd3d5dbf)>>58]; // [3]
}
 
constexpr uint32_t bits(float f) 
{ 
  if (f == 0.0f) 
    return 0; // also matches -0.0f and gives wrong result 
  else if (f == INFINITY) 
    return 0x7f800000; 
  else if (f == -INFINITY) 
    return 0xff800000; 
  else if (f != f) // NaN 
    return 0x7fc00000; // This is my NaN...
 
  bool sign = f < 0.0f; 
  float abs_f = sign ? -f : f; 
 
  int exponent = 254; 
 
  while(abs_f < 0x1p87f) 
  { 
    abs_f *= 0x1p41f; 
    exponent -= 41; 
  } 
 
  uint64_t a = (uint64_t)(abs_f * 0x1p-64f); 
  int lz = count_leading_zeroes(a);
  exponent -= lz;
 
  if (exponent <= 0) 
  { 
    exponent = 0; 
    lz = 8 - 1;
  } 
 
  uint32_t significand = (a << (lz + 1)) >> (64 - 23); // [3]
  return (sign << 31) | (exponent << 23) | significand; 
}

[0] Why does runtime performance matter? Because that’s how I tested the conversion function while implementing it. I was applying Bruce Dawson’s advice for testing floats and the quicker I found out that I’d broken the conversion the better. For the implementation described in this post, it takes about 97 seconds to test all four billion float values on my laptop — half that time if I wasn’t testing negative numbers (which are unlikely to cause problems due to the way I handle the sign bit). The implementation I’ve described in this post is not the fastest solution to the problem, but it is relatively compact, and well behaved in the face of -ffast-math.

Admission buried in a footnote: I have not validated correct behavior of this code for every floating point number in actual compile-time constant expressions. Compile-time evaluation of four billion invocations of bits() takes more time than I’ve been willing to invest so far.

[1] It is conceptually simpler to multiply abs_f by two (or one half) until the result is exactly positioned so that no leading zero count is required after the cast — at least, that was what I did in my first attempt. The approach described here was found to be significantly faster. I have no doubt that better-performing constant-expression-friendly approaches exist.

[2] Update 2016-03-28: Thanks to satbyy for pointing out the missing ampersand — it was lost sometime after copying the code into the article.

[3] Update 2016-03-28: Thanks to louiswins for pointing out additional code errors.

Another another C++11 ‘countof’

My earlier post received this comment which is a pretty neat little improvement over the one from g-truc.net.

Here it is, with one further tweak:

template<typename T, std::size_t N>
constexpr 
std::integral_constant<std::size_t, N> countof(T const (&)[N]) noexcept
{
  return {};
}

#define COUNTOF(...) decltype(countof(__VA_ARGS__))::value

The change I’ve made to pfultz2’s version is to use ::value rather than {} after decltype in the macro.

This makes the type of the result std::size_t not std::integral_constant, so it can be used in va_arg settings without triggering compiler or static analysis warnings.

It also has the advantage of not triggering extra warnings in VS2015U1 (this issue).