I’ve been playing around with Rust recently, and rewrote that conversion code as a bit of a learning exercise for myself, with a thoroughly contrived set of constraints: using integer and single-precision floating point math, at compile time, without unsafe blocks, while using as few unstable features as possible.

I’ve included the listing below, for your bemusement and/or head-shaking, and you can play with the code in the Rust Playground and on rust.godbolt.org.

```rust
// Jonathan Adamczewski 2020-05-12
//
// Constructing the bit-representation of an IEEE 754 single precision floating
// point number, using integer and single-precision floating point math, at
// compile time, in rust, without unsafe blocks, while using as few unstable
// features as I can.
//
// or "What if this silly C++ thing http://brnz.org/hbr/?p=1518 but in Rust?"

// Q. Why? What is this good for?
// A. To the best of my knowledge, this code serves no useful purpose.
//    But I did learn a thing or two while writing it :)

// This is needed to be able to perform floating point operations in a const
// function:
#![feature(const_fn)]

// bits_transmute(): Returns the bits representing a floating point value, by
// way of std::mem::transmute()
//
// For completeness (and validation), and to make clear the fundamentally
// unnecessary nature of the exercise :D - here's a short, straightforward,
// library-based version. But it needs the const_transmute flag and an unsafe
// block.
#![feature(const_transmute)]
const fn bits_transmute(f: f32) -> u32 {
    unsafe { std::mem::transmute::<f32, u32>(f) }
}

// get_if_u32(predicate: bool, if_true: u32, if_false: u32):
//   Returns if_true if predicate is true, else if_false
//
// If and match are not able to be used in const functions (at least, not
// without #![feature(const_if_match)]) - so here's a branch-free select
// function for u32s
const fn get_if_u32(predicate: bool, if_true: u32, if_false: u32) -> u32 {
    let pred_mask = (-1 * (predicate as i32)) as u32;
    let true_val = if_true & pred_mask;
    let false_val = if_false & !pred_mask;
    true_val | false_val
}

// get_if_f32(predicate, if_true, if_false):
//   Returns if_true if predicate is true, else if_false
//
// A branch-free select function for f32s.
//
// If either if_true or if_false is NaN or an infinity, the result will be NaN,
// which is not ideal. I don't know of a better way to implement this function
// within the arbitrary limitations of this silly little side quest.
const fn get_if_f32(predicate: bool, if_true: f32, if_false: f32) -> f32 {
    // can't convert bool to f32 - but can convert bool to i32 to f32
    let pred_sel = (predicate as i32) as f32;
    let pred_not_sel = ((!predicate) as i32) as f32;
    let true_val = if_true * pred_sel;
    let false_val = if_false * pred_not_sel;
    true_val + false_val
}

// bits(): Returns the bits representing a floating point value.
const fn bits(f: f32) -> u32 {
    // the result value, initialized to a NaN value that will otherwise not be
    // produced by this function.
    let mut r = 0xffff_ffff;

    // These floating point operations (and others) cause the following error:
    //   only int, `bool` and `char` operations are stable in const fn
    // hence #![feature(const_fn)] at the top of the file

    // Identify special cases
    let is_zero    = f == 0_f32;
    let is_inf     = f == f32::INFINITY;
    let is_neg_inf = f == f32::NEG_INFINITY;
    let is_nan     = f != f;

    // Writing this as !(is_zero || is_inf || ...) causes the following error:
    //   Loops and conditional expressions are not stable in const fn
    // so instead write this as type conversions, and bitwise operations
    //
    // "normalish" here means that f is a normal or subnormal value
    let is_normalish = 0 == ((is_zero as u32) | (is_inf as u32) |
                             (is_neg_inf as u32) | (is_nan as u32));

    // set the result value for each of the special cases
    r = get_if_u32(is_zero,    0,           r); // if (is_zero)    { r = 0; }
    r = get_if_u32(is_inf,     0x7f80_0000, r); // if (is_inf)     { r = 0x7f80_0000; }
    r = get_if_u32(is_neg_inf, 0xff80_0000, r); // if (is_neg_inf) { r = 0xff80_0000; }
    r = get_if_u32(is_nan,     0x7fc0_0000, r); // if (is_nan)     { r = 0x7fc0_0000; }

    // It was tempting at this point to try setting f to a "normalish"
    // placeholder value so that special cases do not have to be handled in the
    // code that follows, like so:
    //   f = get_if_f32(is_normal, f, 1_f32);
    //
    // Unfortunately, get_if_f32() returns NaN if either input is NaN or
    // infinite. Instead of switching the value, we work around the
    // non-normalish cases later.
    //
    // (This whole function is branch-free, so all of it is executed regardless
    // of the input value)

    // extract the sign bit
    let sign_bit = get_if_u32(f < 0_f32, 1, 0);

    // compute the absolute value of f
    let mut abs_f = get_if_f32(f < 0_f32, -f, f);

    // This part is a little complicated. The algorithm is functionally the
    // same as the C++ version linked from the top of the file.
    //
    // Because of the various contrived constraints on this problem, we compute
    // the exponent and significand, rather than extract the bits directly.
    //
    // The idea is this:
    // Every finite single precision floating point number can be represented
    // as a series of (at most) 24 significant digits as a 128.149 fixed point
    // number
    // (128: 126 exponent values >= 0, plus one for the implicit leading 1,
    // plus one more so that the decimal point falls on a power-of-two
    // boundary :)
    // 149: 126 negative exponent values, plus 23 for the bits of precision in
    // the significand.)
    //
    // If we are able to scale the number such that all of the precision bits
    // fall in the upper-most 64 bits of that fixed-point representation (while
    // tracking our effective manipulation of the exponent), we can then
    // predictably and simply scale that computed value back to a range that
    // can be converted safely to a u64, count the leading zeros to determine
    // the exact exponent, and then shift the result into position for the
    // final u32 representation.

    // Start with the largest possible exponent - subsequent steps will reduce
    // this number as appropriate
    let mut exponent: u32 = 254;
    {
        // Hex float literals are really nice. I miss them.

        // The threshold is 2^87 (think: 64+23 bits) to ensure that the number
        // will be large enough that, when scaled down by 2^64, all the
        // precision will fit nicely in a u64
        const THRESHOLD: f32 = 154742504910672534362390528_f32; // 0x1p87f == 2^87

        // The scaling factor is 2^41 (think: 64-23 bits) to ensure that a
        // number between 2^87 and 2^64 will not overflow in a single scaling
        // step.
        const SCALE_UP: f32 = 2199023255552_f32; // 0x1p41f == 2^41

        // Because loops are not available (no #![feature(const_loops)]), and
        // 'if' is not available (no #![feature(const_if_match)]), perform
        // repeated branch-free conditional multiplication of abs_f.

        // use a macro, because why not :D It's the most compact, simplest
        // option I could find.
        macro_rules! maybe_scale {
            () => {{
                // care is needed: if abs_f is above the threshold, multiplying
                // by 2^41 will cause it to overflow (INFINITY) which will
                // cause get_if_f32() to return NaN, which will destroy the
                // value in abs_f. So compute a safe scaling factor for each
                // iteration.
                //
                // Roughly equivalent to :
                //   if (abs_f < THRESHOLD) {
                //     exponent -= 41;
                //     abs_f *= SCALE_UP;
                //   }
                let scale = get_if_f32(abs_f < THRESHOLD, SCALE_UP, 1_f32);
                exponent  = get_if_u32(abs_f < THRESHOLD, exponent - 41, exponent);
                abs_f     = get_if_f32(abs_f < THRESHOLD, abs_f * scale, abs_f);
            }}
        }
        // 41 bits per iteration means up to 246 bits shifted.
        // Even the smallest subnormal value will end up in the desired range.
        maybe_scale!();  maybe_scale!();  maybe_scale!();
        maybe_scale!();  maybe_scale!();  maybe_scale!();
    }

    // Now that we know that abs_f is in the desired range (2^87 <= abs_f < 2^128)
    // scale it down to be in the range (2^23 <= _ < 2^64), and convert without
    // loss of precision to u64.
    const INV_2_64: f32 = 5.42101086242752217003726400434970855712890625e-20_f32; // 0x1p-64f == 2^-64
    let a = (abs_f * INV_2_64) as u64;

    // Count the leading zeros.
    // (C++ doesn't provide a compile-time constant function for this. It's
    // nice that rust does :)
    let mut lz = a.leading_zeros();

    // if the number isn't normalish, lz is meaningless: we stomp it with
    // something that will not cause problems in the computation that follows -
    // the result of which is meaningless, and will be ignored in the end for
    // non-normalish values.
    lz = get_if_u32(!is_normalish, 0, lz); // if (!is_normalish) { lz = 0; }

    {
        // This step accounts for subnormal numbers, where there are more
        // leading zeros than can be accounted for in a valid exponent value,
        // and leading zeros that must remain in the final significand.
        //
        // If lz < exponent, reduce exponent to its final correct value - lz
        // will be used to remove all of the leading zeros.
        //
        // Otherwise, clamp exponent to zero, and adjust lz to ensure that the
        // correct number of bits will remain (after multiplying by 2^41 six
        // times - 2^246 - there are 7 leading zeros ahead of the original
        // subnormal's computed significand of 0.sss...)
        //
        // The following is roughly equivalent to:
        //   if (lz < exponent) {
        //     exponent = exponent - lz;
        //   } else {
        //     exponent = 0;
        //     lz = 7;
        //   }

        // we're about to mess with lz and exponent - compute and store the
        // relative value of the two
        let lz_is_less_than_exponent = lz < exponent;

        lz       = get_if_u32(!lz_is_less_than_exponent, 7,             lz);
        exponent = get_if_u32( lz_is_less_than_exponent, exponent - lz, 0);
    }

    // compute the final significand.
    // + 1 shifts away a leading 1-bit for normal, and 0-bit for subnormal
    // values.
    // Shifts are done in u64 (that leading bit is shifted into the void), then
    // the resulting bits are shifted back to their final resting place.
    let significand = ((a << (lz + 1)) >> (64 - 23)) as u32;

    // combine the bits
    let computed_bits = (sign_bit << 31) | (exponent << 23) | significand;

    // return the normalish result, or the non-normalish result, as appropriate
    get_if_u32(is_normalish, computed_bits, r)
}

// Compile-time validation - able to be examined in rust.godbolt.org output
pub static BITS_BIGNUM: u32 = bits(std::f32::MAX);
pub static TBITS_BIGNUM: u32 = bits_transmute(std::f32::MAX);
pub static BITS_LOWER_THAN_MIN: u32 = bits(7.0064923217e-46_f32);
pub static TBITS_LOWER_THAN_MIN: u32 = bits_transmute(7.0064923217e-46_f32);
pub static BITS_ZERO: u32 = bits(0.0f32);
pub static TBITS_ZERO: u32 = bits_transmute(0.0f32);
pub static BITS_ONE: u32 = bits(1.0f32);
pub static TBITS_ONE: u32 = bits_transmute(1.0f32);
pub static BITS_NEG_ONE: u32 = bits(-1.0f32);
pub static TBITS_NEG_ONE: u32 = bits_transmute(-1.0f32);
pub static BITS_INF: u32 = bits(std::f32::INFINITY);
pub static TBITS_INF: u32 = bits_transmute(std::f32::INFINITY);
pub static BITS_NEG_INF: u32 = bits(std::f32::NEG_INFINITY);
pub static TBITS_NEG_INF: u32 = bits_transmute(std::f32::NEG_INFINITY);
pub static BITS_NAN: u32 = bits(std::f32::NAN);
pub static TBITS_NAN: u32 = bits_transmute(std::f32::NAN);
pub static BITS_COMPUTED_NAN: u32 = bits(std::f32::INFINITY/std::f32::INFINITY);
pub static TBITS_COMPUTED_NAN: u32 = bits_transmute(std::f32::INFINITY/std::f32::INFINITY);

// Run-time validation of many more values
fn main() {
    let end: usize = 0xffff_ffff;
    let count = 9_876_543; // number of values to test
    let step = end / count;
    for u in (0..=end).step_by(step) {
        let v = u as u32;

        // reference
        let f = unsafe { std::mem::transmute::<u32, f32>(v) };

        // compute
        let c = bits(f);

        // validation
        if c != v &&
           !(f.is_nan() && c == 0x7fc0_0000) && // nans
           !(v == 0x8000_0000 && c == 0) {      // negative 0
            println!("{:x?} {:x?}", v, c);
        }
    }
}
```

Choose:

— Elan Ruskin (@despair) May 24, 2018

What is the type of x?

— Jonathan Adamczewski (@twoscomplement) May 24, 2018

std::optional<std::any>

— Andreas Fredriksson (@deplinenoise) May 24, 2018

I’ll see your C++17, and raise you C89:#define x rand()

— Jonathan Adamczewski (@twoscomplement) May 24, 2018

auto x = []() { return rand(); }

— Andreas Fredriksson (@deplinenoise) May 24, 2018

tsk tsk… that’s not mordern enough: auto x = []() { std::random_device rd; std::mt19937 gen(rd()); std::uniform_int_distribution<> dis(0, 32767); return dis(gen); } <— FIXED IT FOR YOU!

— Bobby Anguelov (@BobbyAnguelov) May 24, 2018

So, rand() assembles to ‘jmp rand’. That abomination on the other hand.. My eyes! https://t.co/ujNkLrzavs

— Andreas Fredriksson (@deplinenoise) May 24, 2018

So I did a little digging to satisfy my own curiosity about the “modern C++” version, and have learned a few things that I didn’t know previously…

(this is a manually unrolled twitter thread that starts here, with slight modifications)

Nearly all of this I gleaned from the invaluable http://cppreference.com and http://godbolt.org . Comments about implementation refer specifically to the gcc-8.1 C++ standard library, examined using Compiler Explorer and the -E command line option.

std::random_device is a platform-specific source of entropy.

std::mt19937 is a parameterized typedef of std::mersenne_twister_engine

specifically:

std::mersenne_twister_engine<uint_fast32_t, 32, 624, 397, 31, 0x9908b0df, 11, 0xffffffff, 7, 0x9d2c5680, 15, 0xefc60000, 18, 1812433253>

(What do those numbers mean? I don’t know.)

And std::uniform_int_distribution produces uniformly distributed random numbers over a specified range, from a provided generator.

The default constructor for std::random_device takes an implementation-defined argument, with a default value.

The meaning of the argument is implementation-defined – but the type is not: std::string. (I’m not sure why a dynamically modifiable string object was the right choice to be the configuration parameter for an entropy generator.)

There are out-of-line private functions for much of this implementation of std::random_device. The constructor that calls the out-of-line init function is itself inline – so the construction and destruction of the default std::string param is also generated inline.

Also, peeking inside std::random_device, there is a union with two members:

void* _M_file, which I guess would be used to store a file handle for /dev/urandom or similar.

std::mt19937 _M_mt, which is a … parameterized std::mersenne_twister_engine object.

So it seems reasonable to me that if you can’t get entropy* from outside your program, generate your own approximation. It looks like it is possible that the entropy for the std::mersenne_twister_engine will be provided by a std::mersenne_twister_engine.

Unlike std::random_device, which has its implementation out of line, std::mersenne_twister_engine‘s implementation seems to be all inline. It is unclear what benefits this brings, but it results in a few hundred additional instructions generated.

And then there’s std::uniform_int_distribution, which seems mostly unsurprising. It is again fully inline, which (from a cursory eyeballing) may allow a sufficiently insightful compiler to avoid a couple of branches and function calls.

The code that got me started on this was presented in jest – but (std::random_device + std::mt19937 + std::uniform_int_distribution) is a commonly recommended pattern for generating random numbers using these modern C++ library features.

My takeaways:

std::random_device is potentially very expensive to use – and doesn’t provide strong cross-platform guarantees about the randomness it provides. It is configured with an std::string – the meaning of which is platform dependent. I am not compelled to use this type.

std::mt19937 adds a sizeable chunk of codegen via its inline implementation – and there are better options than Mersenne Twister.

Bottom line: I’m probably going to stick with rand(), and if I need something a little fancier, http://www.pcg-random.org/ or one of the other suggestions provided as replies to the twitter thread.

Addition: the code I was able to gather, representing some relevant parts

Digging into it, I found that a movaps instruction was being rewritten as movups, which was a *thoroughly* confusing thing to see.

The one clue I had was that a fault due to an unaligned load had been observed in non-test code, but did not reproduce when written as a test using the google-test framework. A short hunt later (including a failed attempt at writing a small repro case), I found an explanation: google test suppresses this class of failure.

The code below will successfully demonstrate the behavior, printing out the SIMD load instruction before and after calling the function with an unaligned pointer.

View the code on Gist.

View the code on Gist.


During the day, I’m a Lead of a group of programmers. We’re responsible for a range of tools and tech used by others at the company for making games.

I have a list of my priorities, and some related questions, covering things that I think are important for us to be able to do well as individuals, and as a team:

**Treat people with respect. Value their time, place high value on their well-being, and start with the assumption that they have good intentions.** (“People” includes yourself: respect yourself, value your own time and well-being, and have confidence in your good intentions.)

**When solving a problem, know the user and understand their needs.**
- Do you understand the problem(s) that need to be solved? (it’s easy to make assumptions)
- Have you spoken to the user and listened to their perspective? (it’s easy to solve the wrong problem)
- Have you explored the specific constraints of the problem by asking questions like:
- Is this part needed? (it’s easy to over-reach)
- Is there a satisfactory simpler alternative? (actively pursue simplicity)
- What else will be needed? (it’s easy to overlook details)

- Have you discussed your proposed solution with users, and do they understand what you intend to do? (verify, and pursue buy-in)
- Do you continue to meet regularly with users? Do they know you? Do they believe that you’re working for their benefit? (don’t under-estimate the value of trust)

**Have a clear understanding of what you are doing.**
- Do you understand the system you’re working in? (it’s easy to make assumptions)
- Have you read the documentation and/or code? (set yourself up to succeed with whatever is available)
- For code:
- Have you tried to modify the code? (pull a thread; see what breaks)
- Can you explain how the code works to another programmer in a convincing way? (test your confidence)
- Can you explain how the code works to a non-programmer?

**When trying to solve a problem, debug aggressively and efficiently.**
- Does the bug need to be fixed? (see 1)
- Do you understand how the system works? (see 2)
- Is there a faster way to debug the problem? Can you change code or data to cause the problem to occur more quickly and reliably? (iterate as quickly as you can, fix the bug, and move on)
- Do you trust your own judgement? (debug boldly, have confidence in what you have observed, make hypotheses and test them)

**Pursue excellence in your work.**
- How are you working to be better understood? (good communication takes time and effort)
- How are you working to better understand others? (don’t assume that others will pursue you with insights)
- Are you responding to feedback with enthusiasm to improve your work? (pursue professionalism)
- Are you writing high quality, easy to understand, easy to maintain code? How do you know? (continue to develop your technical skills)
- How are you working to become an expert and industry leader with the technologies and techniques you use every day? (pursue excellence in your field)
- Are you eager to improve (and fix) systems you have worked on previously? (take responsibility for your work)

The list was created for discussion with the group, and as an effort to articulate my own expectations in a way that will help my team understand me.

Composing this has been a useful exercise for me as a lead, and definitely worthwhile for the group. If you’ve never tried writing down your own priorities, values, and/or assumptions, I encourage you to try it :)

This post contains the same material as this thread of tweets, with a few minor edits.

In IEEE754, floating point numbers are represented like this:

±2ⁿⁿⁿ×1.sss…

nnn is the exponent, which is floor(log2(size)) — which happens to be the fl value computed by TLSF.

sss… is the significand fraction: the part that follows the decimal point, which happens to be sl.

And so to calculate fl and sl, all we need to do is convert size to a floating point value (on recent x86 hardware, that’s a single instruction). Then we can extract the exponent, and the upper bits of the fractional part, and we’re all done :D

That can be implemented like this:

```c
double sf = (int64_t)size;
uint64_t sfi;
memcpy(&sfi, &sf, 8);
fl = (sfi >> 52) - (1023 + 7);
sl = (sfi >> 47) & 31;
```

There are some subtleties (there always are). I’ll break it down…

double sf = (int64_t)size;

Convert size to a double, with an explicit cast. size has type size_t, but using TLSF from github.com/mattconte/tlsf, the largest supported allocation on 64bit architecture is 2^32 bytes – comfortably less than the precision provided by the double type. If you need your TLSF allocator to allocate chunks bigger than 2^53, this isn’t the technique for you :)

I first tried using float (not double), which can provide correct results — but only if the rounding mode happens to be set correctly. double is easier.

The cast to (int64_t) results in better codegen on x86: without it, the compiler will generate a full 64bit unsigned conversion, and there is no single instruction for that.

The cast tells the compiler to (in effect) consider the bits of size as if they were a two’s complement signed value — and there is an SSE instruction to handle that case (cvtsi2sdq or similar). Again, with the implementation we’re using size can’t be that big, so this will do the Right Thing.

```c
uint64_t sfi;
memcpy(&sfi, &sf, 8);
```

Copy the 8 bytes of the double into an unsigned integer variable. There are a lot of ways that C/C++ programmers copy bits from floating point to integer – some of them are well defined :) memcpy() does what we want, and any moderately respectable compiler knows how to select decent instructions to implement it.

Now we have floating point bits in an integer register, consisting of one sign bit (always zero for this, because size is always positive), eleven exponent bits (offset by 1023), and 52 bits of significant fraction. All we need to do is extract those, and we’re done :)

fl = (sfi >> 52) - (1023 + 7);

Extract the exponent: shift it down (ignoring the always-zero sign bit), subtract the offset (1023), and that 7 we saw earlier, at the same time.

sl = (sfi >> 47) & 31;

Extract the five most significant bits of the fraction – we do need to mask out the exponent.

And, just like that*, we have mapping_insert(), implemented in terms of integer -> floating point conversion.

* Actual code (rather than fragments) may be included in a later post…

Over my holiday break at the end of 2017, I took a look into the TLSF (Two Level Segregated Fit) memory allocator to better understand how it works. I’ve made use of this allocator and have been impressed by its real world performance, but never really done a deep dive to properly understand it.

The mapping_insert() function is a key part of the allocator implementation, and caught my eye. Here’s how that function is described in the paper A constant-time dynamic storage allocator for real-time systems:

I’ll be honest: from that description, I never developed a clear picture in my mind of what that function does.

(Reading it now, it seems reasonably clear – but I can say that only after I spent quite a bit of time using other methods to develop my understanding)

Something that helped me a lot was looking at the implementation of that function from github.com/mattconte/tlsf/. There’s a bunch of long-named macro constants in there, and a few extra implementation details. If you collapse those it looks something like this:

```c
void mapping_insert(size_t size, int* fli, int* sli)
{
    int fl, sl;
    if (size < 256) {
        fl = 0;
        sl = (int)size / 8;
    } else {
        fl = fls(size);
        sl = (int)(size >> (fl - 5)) ^ 0x20;
        fl -= 7;
    }
    *fli = fl;
    *sli = sl;
}
```

It’s a pretty simple function (it really is). But I still failed to *see* the pattern of results that would be produced in my mind’s eye.

I went so far as to make a giant spreadsheet of all the intermediate values for a range of inputs, to paint myself a picture of the effect of each step :) That helped immensely.

Breaking it down…

There are two cases handled in the function: one for when size is below a certain threshold, and one for when it is larger. The first is straightforward, and accounts for a small number of possible input values. The large size case is more interesting.

The function computes two values: fl and sl, the first and second level indices for a lookup table. For the large case, fl (where fl is “first level”) is computed via fls(size) (where fls is short for “find last set” – similar names, just to keep you on your toes).

fls() returns the index of the largest bit set, counting from the least significant bit, which is the index of the largest power of two. In the words of the paper:

“the instruction fls can be used to compute the ⌊log2(x)⌋ function”

Which is, in C-like syntax: floor(log2(x))

And there’s that “fl -= 7” at the end. That will show up again later.

For the large case, the computation of sl has a few steps:

sl = (size >> (fl - 5)) ^ 0x20;

Shift size down by some amount (based on fl), and clear the sixth bit?

(Aside: The CellBE programmer in me is flinching at that variable shift)

It took me a while (longer than I would have liked…) to realize that this

size >> (fl - 5) is shifting size to generate a number that has exactly six significant bits, at the least significant end of the register (bits 5 thru 0).

Because fl is the index of the most significant bit, after this shift, bit 5 will always be 1 – and that “^ 0x20” will unset it, leaving the result as a value between 0 and 31 (inclusive).

So here’s where floating point comes into it, and the cute thing I saw: another way to compute fl and sl is to convert size into an IEEE754 floating point number, and extract the exponent, and most significant bits of the mantissa. I’ll cover that in the next part, here.

**Q:** Given `constexpr int Min(int a, int b)`, construct a function `constexpr int Min(Args... args)` that returns the minimum of all the provided args. Fail to justify your over-engineering.

**A:** Rename `Min(int, int)` as `MinImpl(int, int)`, or stick it in a namespace. Overloading the function is not only unnecessary, it gets in the way of the implementation.

```cpp
constexpr int MinImpl(int a, int b) { return a < b ? a : b; }
```

Implement a `constexpr` fold left function. If we can use it for `Min()`, we should be able to do the same for `Max()`, and other similar functions. Should we be able to find any (#prematuregeneralization).

```cpp
template<typename ArgA, typename ArgB, typename Func>
constexpr auto foldl(Func func, ArgA a, ArgB b)
{
    return func(a, b);
}

template<typename ArgA, typename ArgB, typename Func, typename ...Args>
constexpr auto foldl(Func func, ArgA a, ArgB b, Args... args)
{
    return foldl(func, func(a, b), args...);
}
```

Combine the two.

```cpp
template<typename ...Args>
constexpr auto Min(Args... args)
{
    return foldl(MinImpl, args...);
}
```

Add the bare minimum amount of testing for a constexpr function: slap a `static_assert()` on it.

```cpp
static_assert(Min(6, 4, 5, 3, 9) == 3, "Nope");
```

I did so with Visual Studio 2015 Update 2. It did not object.

**Addendum:** Some discussion with @nlguillemot and @DrPizza led to this attempt to do something similar with a C++17/C++1z fold-expression:

```cpp
#include <limits.h>

constexpr int MinImpl1(int a, int b) { return a < b ? a : b; }

constexpr void MinImpl2(int* m, int a, int b) { *m = a < b ? a : b; }

template<typename ...Args>
constexpr int Min(Args... args)
{
    int m = INT_MAX;
    // a binary expression in an operand of a fold-expression
    // is not allowed, so this won't compile:
    //((m = MinImpl1(m, args)), ...);

    // But this does:
    (MinImpl2(/*out*/&m, m, args), ...);
    return m;
}

int main()
{
    static_assert(Min(3,4,5) == 3, "nope");
}
```

This compiles with a gcc-6 pre-release snapshot.

**Update:** Here’s a further updated version, based on a refinement by @dotstdy.

(Why would you want to do that? Maybe you want to run a fast inverse square root at compile time. Or maybe you want to do something that is actually useful. I wanted to know if it could be done.)

For context: this article is based on experiences using gcc-5.3.0 and clang-3.7.1 with -std=c++14 -march=native on a Sandy Bridge Intel i7. Where I reference sections from the C++ standard, I’m referring to the November 2014 draft.

Before going further, I’ll quote 5.20.6 from the standard:

Since this International Standard imposes no restrictions on the accuracy of floating-point operations, it is unspecified whether the evaluation of a floating-point expression during translation yields the same result as the evaluation of the same expression (or the same operations on the same values) during program execution.

88) Nonetheless, implementations are encouraged to provide consistent results, irrespective of whether the evaluation was performed during translation and/or during program execution.

In this post, I document things that worked (and didn’t work) for me. You may have a different experience.

(Error text from g++-5.3.0)

You can’t access the bits of a float via a typecast pointer [which is undefined behavior, and covered by 5.20.2.5]:

```cpp
constexpr uint32_t bits_cast(float f)
{
    return *(uint32_t*)&f; // [2]
}
```

error: accessing value of 'f' through a 'uint32_t {aka unsigned int}' glvalue in a constant expression

You can’t convert it via a reinterpret cast [5.20.2.13]

```cpp
constexpr uint32_t bits_reinterpret_cast(float f)
{
    const unsigned char* cf = reinterpret_cast<const unsigned char*>(&f);
    // endianness notwithstanding
    return (cf[3] << 24) | (cf[2] << 16) | (cf[1] << 8) | cf[0];
}
```

error: '*(cf + 3u)' is not a constant expression

(gcc reports an error with the memory access, but does not object to the `reinterpret_cast`. clang produces a specific error for the cast.)

You can’t convert it through a union [gcc, for example, permits this for non-constant expressions, but the standard forbids it in 5.20.2.8]:

```cpp
constexpr uint32_t bits_union(float f)
{
    union Convert {
        uint32_t u;
        float f;
        constexpr Convert(float f_) : f(f_) {}
    };
    return Convert(f).u;
}
```

error: accessing 'bits_union(float)::Convert::u' member instead of initialized 'bits_union(float)::Convert::f' member in constant expression

You can’t use `memcpy()` [5.20.2.2]:

```cpp
constexpr uint32_t bits_memcpy(float f)
{
    uint32_t u = 0;
    memcpy(&u, &f, sizeof f);
    return u;
}
```

error: 'memcpy(((void*)(&u)), ((const void*)(&f)), 4ul)' is not a constant expression

And you can’t define a constexpr `memcpy()`-like function that is capable of the task [5.20.2.11]:

```cpp
constexpr void* memcpy(void* dest, const void* src, size_t n)
{
    char* d = (char*)dest;
    const char* s = (const char*)src;
    while (n-- > 0)
        *d++ = *s++;
    return dest;
}

constexpr uint32_t bits_memcpy(float f)
{
    uint32_t u = 0;
    memcpy(&u, &f, sizeof f);
    return u;
}
```

error: accessing value of 'u' through a 'char' glvalue in a constant expression

So what can you do?

For `constexpr float f = 2.0f, g = 2.0f` the following operations are available [as they are not ruled out by anything I can see in 5.20]:

- Comparison of floating point values, e.g. `static_assert(f == g, "not equal");`
- Floating point arithmetic operations, e.g. `static_assert(f * 2.0f == 4.0f, "arithmetic failed");`
- Casts from float to integral value, often with well-defined semantics, e.g. `constexpr int i = (int)2.0f; static_assert(i == 2, "conversion failed");`

So I wrote a function (`uint32_t bits(float)`) that will return the binary representation of an IEEE754 single precision float. The full function is at the end of this post. I’ll go through the various steps required to produce (my best approximation of) the desired result.

When `bits()` is passed the value zero, we want this behavior:

static_assert(bits(0.0f) == 0x00000000);

And we can have it:

if (f == 0.0f) return 0;

Nothing difficult about that.

In IEEE754 land, negative zero is a thing. Ideally, we’d like this behavior:

static_assert(bits(-0.0f) == 0x80000000)

But the check for zero also matches negative zero. Negative zero is not something that the C++ standard has anything to say about, given that IEEE754 is an implementation choice [3.9.1.8: “The value representation of floating-point types is implementation defined”]. My compilers treat negative zero the same as zero for all comparisons and arithmetic operations. As such, `bits()` returns the wrong value when considering negative zero, returning `0x00000000` rather than the desired `0x80000000`.

I did look into other methods for detecting negative zero in C++, without finding something that would work in a constant expression. I have seen divide by zero used as a way to detect negative zero (resulting in ±infinity, depending on the sign of the zero), but that doesn’t compile in a constant expression:

constexpr float r = 1.0f / -0.0f;

error: '(1.0e+0f / -0.0f)' is not a constant expression

and divide by zero is explicitly named as undefined behavior in 5.6.4, and so by 5.20.2.5 is unusable in a constant expression.

Result: negative zero becomes positive zero.

We want this:

static_assert(bits(INFINITY) == 0x7f800000);

And this:

else if (f == INFINITY) return 0x7f800000;

works as expected.

Same idea, different sign:

static_assert(bits(-INFINITY) == 0xff800000);

else if (f == -INFINITY) return 0xff800000;

Also works.

There’s no way to generate arbitrary NaN constants in a constant expression that I can see (not least because casting bits to floats isn’t possible in a constant expression, either), so it seems impossible to get this right in general.

In practice, maybe this is good enough:

static_assert(bits(NAN) == 0x7fc00000);

NaN values can be anywhere in the range of `0x7f800001`–`0x7fffffff` and `0xff800001`–`0xffffffff`. I have no idea as to the specific values that are seen in practice, nor what they mean. `0x7fc00000` shows up in `/usr/include/bits/nan.h` on the system I’m using to write this, so — right or wrong — I’ve chosen that as the reference value.

It is possible to detect a NaN value in a constant expression, but not to inspect its payload (at least, not that I’ve been able to find). So there’s this:

```
else if (f != f) // NaN
    return 0x7fc00000; // This is my NaN...
```

Which means that of the 2×(2^23−1) possible NaNs, one will be handled correctly (in this case, `0x7fc00000`). For the other 16,777,213 values, the wrong value will be returned (in this case, `0x7fc00000`).

So… partial success? NaNs are correctly detected, but the bits for only one NaN value will be returned correctly.

(On the other hand, the probability that it will ever matter could be stored as a denormalized float)

```
// pseudo-code
static_assert(bits({ 0x1p-126f, ...,  0x1.fffffep127f}) == { 0x00800000, ..., 0x7f7fffff});
static_assert(bits({-0x1p-126f, ..., -0x1.fffffep127f}) == { 0x80800000, ..., 0xff7fffff});
```

[That `0x1p`*nnn*`f` format happens to be a convenient way to represent exact values that can be stored as binary floating point numbers.]

It is possible to detect and correctly construct bits for every normalized value. It does require a little care to avoid truncation and undefined behavior. I wrote a few different implementations — the one that I describe here requires relatively little code, and doesn’t perform terribly [0].

The first step is to find and clear the sign bit. This simplifies subsequent steps.

```
bool sign = f < 0.0f;
float abs_f = sign ? -f : f;
```

Now we have `abs_f` — it’s positive, non-zero, non-infinite, and not a NaN.

What happens when a float is cast to an integral type?

uint64_t i = (uint64_t)f;

The value of `f` will be stored in `i`, according to the following rules:

- The value will be rounded towards zero which, for positive values, means truncation of any fractional part.
- If the value in `f` is too large to be represented as a `uint64_t` (i.e. `f` > 2^64−1), the result is undefined.

If truncation takes place, data is lost. If the number is too large, the result is (probably) meaningless.

For our conversion function, if we can scale `abs_f` into a range where it is not larger than 2^64−1, and it has no fractional part, we have access to an exact representation of the bits that make up the float. We just need to keep track of the amount of scaling being done.

Single precision IEEE 754 floating point numbers have, at most, (23+1) bits of precision (23 in the significand, 1 implicit). This means that we can scale down large numbers and scale up small numbers into the required range.

Multiplying by a power of two changes only the exponent of the float, and leaves the significand unmodified. As such, we can arbitrarily scale a float by a power of two and — so long as we don’t over- or under-flow the float — we will not lose any of the bits in the significand.

For the sake of simplicity (believe it or not [1]), my approach is to scale `abs_f` in steps of 2^41 so that `abs_f` ≥ 2^87, like so:

```
int exponent = 254;
while (abs_f < 0x1p87f) {
    abs_f *= 0x1p41f;
    exponent -= 41;
}
```

If `abs_f` ≥ 2^87, the least significant bit of `abs_f`, if set, is at least 2^(87−23) = 2^64.

Next, `abs_f` is scaled back down by 2^64 (which adds no fractional part, as the least significant bit is at least 2^64) and converted to an unsigned 64 bit integer.

uint64_t a = (uint64_t)(abs_f * 0x1p-64f);

All of the bits of `abs_f` are now present in `a`, without overflow or truncation. All that is needed now is to determine where they are:

int lz = count_leading_zeroes(a);

adjust the exponent accordingly:

exponent -= lz;

and construct the result:

```
uint32_t significand = (a << (lz + 1)) >> (64 - 23); // [3]
return (sign << 31) | (exponent << 23) | significand;
```

With this, we have correct results for every normalized float.

```
// pseudo-code
static_assert(bits({ 0x1.0p-149f, ...,  0x1.fffffcp-127f}) == { 0x00000001, ..., 0x007fffff});
static_assert(bits({-0x1.0p-149f, ..., -0x1.fffffcp-127f}) == { 0x80000001, ..., 0x807fffff});
```

The final detail is denormalized values. Handling of normalized values as presented so far fails because denormals will have additional leading zeroes. They are fairly easy to account for:

```
if (exponent <= 0) {
    exponent = 0;
    lz = 8 - 1;
}
```

To attempt to demystify that `lz = 8 - 1` a little: there are 8 leading bits that aren’t part of the significand of a denormalized single precision float after the repeated 2^41 scaling that has taken place. There is also no leading 1 bit of the kind present in all normalized numbers (which is accounted for in the calculation of `significand` above as `(lz + 1)`). So the leading zero count (`lz`) is set to account for the 8 bits of offset to the start of the denormalized significand, minus the one that the subsequent calculation assumes it needs to skip over.

And that’s it. All the possible values of a float are accounted for.

(Side note: If you’re compiling with -ffast-math, passing denormalized numbers to `bits()` will return invalid results. That’s -ffast-math for you. With gcc or clang, you could add an `#ifdef __FAST_MATH__` around the test for negative exponent.)

You can indeed obtain the bit representation of a floating point number at compile time. Mostly. Negative zero is wrong, NaNs are detected but otherwise not accurately converted.

Enjoy your compile-time bit-twiddling!

The whole deal:

```
// Based on code from
// https://graphics.stanford.edu/~seander/bithacks.html
constexpr int count_leading_zeroes(uint64_t v)
{
    constexpr char bit_position[64] = {
         0,  1,  2,  7,  3, 13,  8, 19,
         4, 25, 14, 28,  9, 34, 20, 40,
         5, 17, 26, 38, 15, 46, 29, 48,
        10, 31, 35, 54, 21, 50, 41, 57,
        63,  6, 12, 18, 24, 27, 33, 39,
        16, 37, 45, 47, 30, 53, 49, 56,
        62, 11, 23, 32, 36, 44, 52, 55,
        61, 22, 43, 51, 60, 42, 59, 58 };

    v |= v >> 1; // first round down to one less than a power of 2
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v |= v >> 32;
    v = (v >> 1) + 1;

    return 63 - bit_position[(v * 0x0218a392cd3d5dbf) >> 58]; // [3]
}

constexpr uint32_t bits(float f)
{
    if (f == 0.0f)
        return 0; // also matches -0.0f and gives wrong result
    else if (f == INFINITY)
        return 0x7f800000;
    else if (f == -INFINITY)
        return 0xff800000;
    else if (f != f) // NaN
        return 0x7fc00000; // This is my NaN...

    bool sign = f < 0.0f;
    float abs_f = sign ? -f : f;

    int exponent = 254;
    while (abs_f < 0x1p87f) {
        abs_f *= 0x1p41f;
        exponent -= 41;
    }

    uint64_t a = (uint64_t)(abs_f * 0x1p-64f);
    int lz = count_leading_zeroes(a);
    exponent -= lz;

    if (exponent <= 0) {
        exponent = 0;
        lz = 8 - 1;
    }

    uint32_t significand = (a << (lz + 1)) >> (64 - 23); // [3]
    return (sign << 31) | (exponent << 23) | significand;
}
```

[0] Why does runtime performance matter? Because that’s how I tested the conversion function while implementing it. I was applying Bruce Dawson’s advice for testing floats and the quicker I found out that I’d broken the conversion the better. For the implementation described in this post, it takes about 97 seconds to test all four billion float values on my laptop — half that time if I wasn’t testing negative numbers (which are unlikely to cause problems due to the way I handle the sign bit). The implementation I’ve described in this post is not the fastest solution to the problem, but it is relatively compact, and well behaved in the face of `-ffast-math`.

Admission buried in a footnote: I have not validated correct behavior of this code for every floating point number in actual compile-time constant expressions. Compile-time evaluation of four billion invocations of `bits()` takes more time than I’ve been willing to invest so far.

[1] It is conceptually simpler to multiply `abs_f` by two (or one half) until the result is exactly positioned so that no leading zero count is required after the cast — at least, that was what I did in my first attempt. The approach described here was found to be significantly faster. I have no doubt that better-performing constant-expression-friendly approaches exist.

[2] Update 2016-03-28: Thanks to satbyy for pointing out the missing ampersand — it was lost sometime after copying the code into the article.

[3] Update 2016-03-28: Thanks to louiswins for pointing out additional code errors.

Here it is, with one further tweak:

```
template<typename T, std::size_t N>
constexpr std::integral_constant<std::size_t, N> countof(T const (&)[N]) noexcept
{
    return {};
}

#define COUNTOF(...) decltype(countof(__VA_ARGS__))::value
```

The change I’ve made to pfultz2’s version is to use `::value` rather than `{}` after `decltype` in the macro. This makes the type of the result `std::size_t`, not `std::integral_constant`, so it can be used in va_arg settings without triggering compiler or static analysis warnings.

It also has the advantage of not triggering extra warnings in VS2015U1 (this issue).
