Bug of the day, Tuesday 2013-04-02

Implicitly finding bugs in other people’s code

A crash was reported with a call stack leading to some code I’d been making changes to a week ago (code that previously featured in Unexpected use of class and Initialized, but not in the right order).

The code I’d changed now looks something like this:

struct Foo : public Baz {
  Bar* m_Bar;
  Foo() : m_Bar(NULL) {}
  void SetBar(Bar* bar) { m_Bar = bar; }

  virtual void Destroy() override;
};

The cause of the crash was something trying to make use of a null m_Bar member of a Foo object.

With regard to the previous crashes I’d encountered, my first suspicion that this was an object that was somehow being created in a way that had not correctly initialized the m_Bar. Checking the circumstances of the crash (the reporter was able to find a specific cause, which was a great help), I was unable to see a way that this object could have been created with m_Bar remaining null.

Sidebar: A way in which my debugging skills have improved in the last year

I have identified the following tendency in my thinking: see a bug, take a guess at the cause, and then look for that cause in the following way:

1. Examine some part of the codebase.
2. If expected cause not found, goto 1.

So if I guessed the wrong cause, I get stuck in a loop: “It must be X, but I haven’t yet found how X could happen — I must look harder!”

Part of this came, I think, from a degree of insecurity about my own ability to read and understand code: “the bug is in here, but I must have missed something”

What I’ve realized in the past year is that my code reading skills aren’t that bad — if I haven’t found the cause I’m looking for, it’s more likely that I’m looking for the wrong cause. Particularly, I am now able to approach a problem with a hypothesis about the cause, and can more clearly identify when I’ve found some information that invalidates that hypothesis and I should be looking for a different cause.

In the context of this bug, validating how the object was being created and seeing that the codepath would always validly initialize m_Bar meant that the bug was not in the initialization — it must be somewhere else.

The other nuller

So something else was setting m_Bar to null after the initialization. There is one other place where that happens…

  virtual void Destroy() {
    if( m_Bar != NULL) {
      m_Bar->Destroy();
      m_Bar = NULL;
    } else
      HARD_TO_MISS_NOTIFICATION( USEFUL_WARNING_37 );
  }

So, it looks like there’s a good chance that this object has had its Destroy() called, and looking through the code, it becomes clear that this is indeed what has happened.

  • There was a Validate() function called on our object that checked certain things about that object’s state, calling Destroy() if it did not meet certain conditions.
  • Validate() was called before attempting to make use of m_Bar.
  • The bool returned by Validate() is ignored by the calling code.
  • The circumstances that lead to the crash are such that Validate() will always call Destroy().

One if(!Validate()) return; later, the crash no longer occurs.

This bug came from a combination of circumstances:

  • existing code that ignored the logical (? I guess that’s the right word…) validity of the object — it had been destroyed, but still contained valid data, and
  • my change with caused the logical state to be reflected in actual state.

I did examine all the other callers of the Validate() function — none of these exhibited the same type of problem.

 

Bugs of the day, Friday 2013-03-29

This wasn’t a runtime bug, but two(!!) compile-time bugs that I checked in for others to find :\

Not all code is compiled with the same warning levels everywhere, so something that compiles without warning in one application with a lower warning level may not compile in another with a higher warning level (assuming warnings-as-errors).

After finding no-longer-needed uses of a particular macro, I did not delete my finder code. But taking the address of a class’ member function and not doing anything with it can be warning-inducing. Bah.

And one of the former users of the macro was left with a no-longer-used variable after my changes,  This was code I had failed to delete on the first pass. I am disappointed with myself to have not deleted more code when I first had the chance.

How should I avoid such problems in the future? Compile code I change for all targets/configurations/applications that matter.

Sometimes one can be too lazy.

Compiler-assisted code cleaning, Thursday 2013-03-28

This was another day of no runtime bugs. Instead of a picture, I’ll describe a specific(ish) case of using the compiler to catch all uses of a particular pattern in the code.

class Foo {};

// Bar is one of many classes that inherit from Foo
class Bar : public Foo {};

// Baz is one of many classes that inherit from Bar
class Baz : public Bar {};

DECLARE_EXTRA_STUFF_FOR_CLASS(Baz);

So we have something like the above. The DECLARE_EXTRA_STUFF_FOR_CLASS() macro is used for some but not all classes that inherit directly or indirectly from Foo.

There have recently been changes within a system that means that DECLARE_EXTRA_STUFF_FOR_CLASS() is now never needed for any class that inherits from Bar — but it is not invalid for it to remain used in the codebase. Anything that makes use of the EXTRA_STUFF that the macro provides will still work, it’s just extra unneeded code which I’d like to clean up.

I could audit this by hand — find every class inheriting from Bar, check to see if there’s a macro definition for that class somewhere and perform the subsequent clean up, but I’m lazier than that.

I could write a script that finds classes that inherit from Bar, and then search for that class name with the macro, but that also seems like to much work. (A simple regex sweep through the codebase won’t trivially identify class Asd : public Baz {} as inheriting from Bar). I’m lazier than this.

What I can do is break the code expanded by the macro if it is passed a subclass of Bar. To do that, I can do something like the following:

class Foo              { public:  static void Foo(){} };
class Bar : public Foo { private: static void Foo(); };
// Add a line to the macro...
DECLARE_EXTRA_STUFF_FOR_CLASS(class_name) \
    extra_stuff; \
    class_name::Foo(); \
    more_extra_stuff

Done. Anywhere where the macro is invoked for a class inheriting from Bar will trigger a compile error. Walk through the errors, delete calls to that macro, rewrite (i.e. delete) code as necessary.

End result: less code. I count that as a win.

(In my specific case, I made an already existing public method declared in Foo private in Bar because there is no need to call it from or for classes in that part of the hierarchy. This has the added advantage of not polluting the list of Foo‘s members with no functionality of the class. Additionally, I didn’t call the function within the macro but instead took it’s address.)

Can C++11’s final qualifier affect codegen?

To satisfy my curiosity, I compiled this

struct A { void f(); };
struct B { virtual void f(); };
struct C { virtual void f() final; };
void a(A& a) { a.f(); }
void b(B& b) { b.f(); }
void c(C& c) { c.f(); }

like this

g++ call.cpp -std=c++11 -O3 -S -o- -masm=intel # gcc-4.7.2 i686-pc-cygwin

which produced [output that included] this

__Z1aR1A:
    jmp __ZN1A1fEv

__Z1bR1B:
    mov eax, DWORD PTR [esp+4]
    mov edx, DWORD PTR [eax]
    mov eax, DWORD PTR [edx]
    jmp eax

__Z1cR1C:
    jmp __ZN1C1fEv

What does it mean? Because C::f() is marked final, calls to it can be devirtualized by the compiler, so there is no extra indirection when calling it in the above case.

(Trying the same code with VS2012’s cl.exe, the call to C::f() does not appear to be devirtualized)

struct C does still appear to have a virtual function table pointer: (sizeof(C) == sizeof(B) == 4), whereas (sizeof(A) == 1)It seems to me that the vtable pointer could be omitted in this case…

Update (2015-01-24): As Tom points out below, anyone with a B* or B& that is actually pointing to an object of type C will still require the vtable pointer to find the correct implementation of f().

Bug review, 2013-03-26

Last week, I started writing about runtime bugs that I’ve encountered in my own code, for my own interest and potential benefit. This post contains a summary of the issues I noticed, and a brief self-review of the week.

Summary of the week

Tuesday

  • Changed virtual function prototype results in correct function not being called
  • memcpy() into an uninitialized pointer variable
  • Bad loop counting
  • Failing to store updated state

Wednesday

  • Uninitialized class member
  • Sliced inheritance
  • Undesired printf() output

Thursday

  • Failed to initialize instance-specific member variable correctly because I was calling the wrong constructor
  • Use of uninitialized system
  • Yet another uninitialized member variable crash

Friday

  • Unexpected use of class
  • Hard-coded array size, no checking
  • General disaster

Monday

  • Initialized, but not in the right order (fallout from Unexpected use of class)

Tuesday

  • Local static state is local, static
  • Redux: Changed virtual function prototype results in correct function not being called

Trends

Looking back over the week, failure to initialize, failure to init at the right time or in the right order stands out as a primary source of bugs. Maybe I need to be asking myself questions like “where did this variable’s state come from” to try and prompt more consideration of this.

Broadly, there is also what appears to be insufficient familiarity with various of C++’s OO features, resulting in silly mistakes. I don’t generally write much OO-style code — this week has been unusual in that way. It is a problem if I can’t use them reliably.

I’m pretty happy with the solution to the “Changed version prototype” situation. Finding ways to prevent possible runtime bugs at compile-time always feels like a worthy achievement.

Meta

Thus far, I’ve found the process of logging these bugs to be worthwhile — attempting to explain the bugs helps me to understand them more clearly. I have found myself thinking about the bugs: the solutions I’ve attempted, and how to try to avoid writing the same bugs in the future.

My plan for now is to continue to write these posts. Hopefully, it leads to fewer bugs.

Bugs of the day, Tuesday 2013-03-26

One bug, and one better solution to a previous bug.

Local static state is local, static

int Foo() {
  static int timeout = kTimeoutLimit;
  bool polled_condition = CheckExternalCondition();
  if( !polled_condition )
    --timeout;
  if( timeout == 0 ) { 
    printf("condition not set before timeout");
    return kTaskFailed;
  }
  if( polled_condition )
    return kTaskComplete;
  else
    return kTaskIncomplete;
}
void Bar() { 
  while( Foo() == kTaskIncomplete );
}

Sometimes a local static variable is a useful thing — conveniently quick to add to any piece of code, and localized to a particular scope. You know everything about what can happen to it during its entire lifetime — assuming you know about how the scope it is contained within operates. (and I’m disregarding the returning of references to local static variables from a function. That’s a whole different bag of fun…)

Quickly rewriting some code today, I added a local static counter — counting down each time a desired condition was not met, and triggering a printf() should the count ever reach zero. I was using this to diagnose that something unexpected had happened, and that some event had not taken place.

The above code should work adequately well — if Bar() is only ever run once (and only from one thread at a time — but that’s not the issue here). timeout is never reset, so unless CheckExternalCondition() returns true, first time, every time, the value in timeout will continue to decrease with each run of Bar(). If it drops below zero, there will no longer be a limit on the on how long Foo() can loop.

A fix here is to make sure timeout is reset each time Foo() “completes” i.e. when it returns kTaskComplete or kTaskFailed.

More generally, this does not seem like a good use of a static local variable — relevant context for the function is not readily apparent within the function, so it seems better to store the timeout value outside of Foo().

(The actual code is more convoluted than the example above, so it’s not quite as simple as adding a local timeout var in Bar(), but only just)

Redux: Changed virtual function prototype results in correct function not being called

This was a bug where I changed the prototype for a virtual function, but that classes with the old prototype would not be detected at compile time and would silently do nothing should the virtual function be called. Read the details via the link above.

I had added an assert to catch any classes that didn’t have the new version of the function, but that only worked at runtime, and only when the function was actually called. What I really wanted for this was a compile-time check that was more reliable than the inheriting programmer remembering to use override. I discovered that there is a way to do this with C++11: final.

class Base {
public:
  virtual void Foo(int some_param, WhatsitType some_other_param)  {  }
private:
  virtual void Foo(int) final { }
};

With this, anything that inherits from Base and attempts to define its own void Foo(int) method will result in a compile error. I’m much happier with this, and making this change revealed some extra instances of the old function that I’d missed in my previous sweep. Huzzah.

Bugs of the day, Monday 2013-03-25

Initialized, but not in the right order

One runtime bug today, being fallout from an attempted fix for Friday’s Unexpected use of class bug.

To recap/elaborate: I changed something like this

struct Foo : public Baz {
  Bar* m_Bar;
  virtual void Destroy() override { }
};

to be something more like this

struct Foo : public Baz {
  Bar* m_Bar;
  virtual void Destroy() override { m_Bar->Destroy(); }
};

and I had assumed this would be OK because I [thought that I] knew the only place that Foos would be created, which would always safely initialize m_Bar.

Unfortunately, there is another way to construct these objects with some unexpected (bad) input data that would not initialize m_Bar. In such a case, Foo::Destroy() would crash because m_Bar pointed at who-knows-what.

So I implemented something like

struct Foo : public Baz {
  Bar* m_Bar;
  virtual void Init() override { m_Bar = NULL; }
  virtual void Destroy() override {
    if( m_Bar != NULL)
      m_Bar->Destroy();
    else
      HARD_TO_MISS_NOTIFICATION( USEFUL_WARNING_37 );
  }
};

“Hooray,” I thought. “That will catch those pesky bad datas.”

In practice what I got was a crash from the first thing that tried to use m_Bar.

It turns out that the construction path for Foos sets up m_Bar before calling Foo::Init(), so now our correctly constructed Foo object has it’s valid data overwritten by an overzealous initialization function.

Foo::Init() is called from both the preferred and not-preferred construction paths, so my hope had been to fix it there and catch use of the not-preferred path, but this didn’t work. My next thought was to reorder the setup of m_Bar and the call to Init(), but that seemed far too risky — there are a lot of subclasses of Foo, and many of them have non-trivial Init() overrides of their own, so there is a high probability that something else will break if m_Bar is not set up before the call to Init().

The winner this time seems to be to initialize the member to zero using: a constructor. Yes, it’s probably the obvious option, but for much of this codebase explicit Init() functions are used instead of constructors. Fortunately, for Foo and its friends (sub classes, parent class) we call the constructor (via placement new) to set up the vtable for us, so we can use it to also initialize the m_Bar pointer and consequently help catch occurrences of bad data in a less fatal fashion.

 

Bugs of the day, Friday 2013-03-22

Terrible day — between an unfamiliar system, the unshelving of the remains of half a changelist and a whole lot of carelessness, I made a lot of mistakes and got to do a lot of full recompiles. A day to learn from.

So before I summarize that disaster, I’ll start with a could of other bugs.

Unexpected use of class

A couple of days ago, I checked in some changes to a class whereby one of its member classes could dynamically accrue allocations during its lifespan — it keeps track of certain memory allocations to reduce the number of allocations needed during its lifespan.

I subsequently added a Destroy() method for that member so that it could be cleaned up properly at the end of its life, expecting that the member would be correctly initialized so that Destroy() would Do The Right Thing.

One crash callstack later…

There exists two ways to create objects of the class type. One — used to construct objects in a particular family — will perform the correct initialization. The other — a more general construction path used to construct a wide variety of other more-or-less-related object types — will not correctly initialize this member type.

I was of the understanding that, for various architectural reasons, nothing would attempt to construct an object of this type via the more general approach. It should be no surprise that I was wrong.

I am of the opinion that anyone using the “general” construction path is Doing Something Wrong, and so I think it will be appropriate to add an assert to catch this kind of thing [a little awkwardly] in future. I do need confirmation from from other users of this system that we agree that what has happened here is undesired/unexpected, and that it can be assumed to be Wrong in future.

The “general” construction path is used to dynamically construct object based on class data passed into the program. It is likely possible to prevent this kind of “bad” data from entering the program by more strongly protecting it at the data-construction stage, rather than trying to prevent it at runtime. I’m not sure what protection is in place presently and need to investigate that.

The absolute minimum fix is to prevent this code from crashing should it get the wrong/unexpected data…

One of the issues leading to this problem is that the type being used here derives from a common base class. This does offer a great deal of convenience in practice within this codebase, but it also leads to this kind of complexity (source of bugs) where Bar isMostlyButNotAlwaysA Foo.

Hard-coded array size, no checking

This one arrived as a crash callstack in some code I had a lot to do with a while back, that had remained largely nearly touched for a number of months. The code is heavily optimized and is reasonably complex — and quite awkward to modify. As a result, having to locate and fix bugs in this code is a somewhat daunting task, not least because of how much use it has received (every frame…), and has been running without problems for a long time.

Fortunately (HAHAHAHA), it seems the problem was really quite simple: this code uses some fixed-size static arrays for storage of intermediate data. At no point is there a check that the data being processed does not exceed the size of these arrays. As is the way of these things, the size of the data being processed has grown over time, resulting in a silent writing past the end of one of these arrays. Crashilarity ensues.

Fix: increase the size of the arrays, add an assert to make sure we have enough space.

(Removed the comment on one of the arrays asking why it was so small in the first place :\ )

General disaster

  • Double-init of system — whereas yesterday I was failing to init the system I was working with, today I managed to call the init function twice. Added a flag and check to detect this.
  • Double-shutdown of system — Later on in the day (after I’d got past most of my other bugs), I discovered I was calling the shutdown function for the system twice, in exactly the same way as I’d been double-calling the init function. I should have checked for this when I fixed the double-init bug (where there’s smoke…)
  • Failure to flush — started writing a prefix token to a buffer with an incorrect initialization order. Data didn’t get flushed, record of memory blocks was confused, things failed.
  • Need to reset state after flush — once the buffer is flushed to disk, reset all the variables that tracked the un-flushed state.
  • Flush [too] early and flush [too] often — put calls to various Flush() methods from a range of class destructors. Got a little carried away. Let it mellow.
  • Calling functions on the right objects — sometimes it’s nice to be able to construct things from other things, and to be able to rely on object lifetimes to encode certain behaviors. Even so, it will usually matter which object’s method gets called…
  • Case matters. Sometimes. — When interacting with an external unfamiliar system, it’s worth making sure you know whether it is case sensitive. In this instance, it was, and I wasn’t. I worked this out almost by accident, which is regrettable.

Post-disaster-mortem: working through this code was highly reactive — I spent far too little time reading the code and working out what it was/should be doing, and far too much compiling, running the app, eyeballing output, manually feeding it into the next system, and trying to work out what I’d done wrong each time. I have no doubt that manually tracing through [a relatively small amount of] code, and validating its function up front would have saved me a lot of time.

Note to self: know when you’ve fallen into the reactive compile/test/modify cycle and step back to look at the bigger picture.

Bugs of the day, Thursday 2013-03-21

Failed to initialize instance-specific member variable correctly because I was calling the wrong constructor

struct Foo {
  bool m_B;
  Foo(Bar& bar) : m_B(true) { /*do something else with bar*/ }
};
struct Baz : public Foo {
  Baz(Foo& foo):Foo(foo) {}
};

Thinking that Baz::Baz(Foo&) was calling the defined construct-from-reference-to-bar constructor that would initialize m_B to a particular state when in fact I was calling the compiler-generated copy constructor that would not.

Yet another case of pay attention-to-what-you’re-doing. The thing that got me this time was that this codebase doesn’t have a lot of construct-from-reference, so led to the trap of if it compiles, it’s probably OK.

Quite possibly this was fallout from yesterday’s the slicing-inheritance bug where I changed inheritance to a pointer-member, but didn’t change the construction call.

Use of uninitialized system

Calling functions in a particular system without having called the appropriate System::Init() function. The results aren’t pretty.

This was caused by carelessness when un-shelving a partial rework of the init and use of the particular system.

Keeping track of where particular code is being called requires extra effort when managing multiple versions of the same files. Clearly, I’m too easy to confuse and need to be more careful, and to work with fewer variations.

In general, checks/asserts in the external interfaces to a system are probably worthwhile to catch this kind of thing a little sooner.

Yet another uninitialized member variable crash

In a rework of a rework of a rework of some code I lost a couple of struct member initializers.

Init to zero, all works as expected. Sigh.

End of day reflection:

Making changes to a previously-working system, it would have been profitable to spend some time to capture the known-good output (and a record of the known-extant bugs) at the start of the process. I spent some time reverting all my changes and re-building to confirm my analysis of these things, which was not the most efficient approach.

Bugs of the day, Wednesday 2013-03-20

Uninitialized class member

The problem:

struct Foo {
  int m_A;
  int m_B;
  int m_C;
  Foo():m_A(0), m_B(0) {}
  void Bar() {
    DoSomethingWith(m_C); // fail it
  }
};

My defense is that this was spread across a header and a .cpp file, which is to say I was careless. (My preference really is to put all code inline in one place — .h and .cpp is repetitive and costs time and effort to keep in sync)

This was some existing code that I reworked a couple of weeks ago and hadn’t got to the point of running/testing until today. I’m really not sure how I managed to remove the initializer here — it existed in the original version.

The fix:

Initialize to zero, which was not just known-value, but correct-starting-value.

Prevention:

Always initialize everything. Keep class definitions in sync with constructors and Init() methods.

Sliced inheritance

The problem:

class Bar {
  virtual void Baz() {/*nothing interesting*/}
};

class BarExtender: public Bar {
  virtual void Baz() {DoSomethingInteresting();}
};

class Foo : public Bar {
  Foo(Bar& bar): Bar(bar) {}
  void Yolo() { Baz(); }
};

BarExtender b;
Foo f(b);
f.Yolo(); // intended that this would DoSomethingInteresting(). Actually does nothing interesting

The fix:

class Foo : public Bar {
  Foo(Bar& bar): m_Bar(bar) {}
  void Yolo() { m_Bar.Baz(); }
  Bar& m_Bar;
};

Prevention:

Know the language well enough to know when you’re writing code that doesn’t say what you mean.

Undesired printf output

The problem:

This is the first “functional” bug — something that didn’t work quite the way I’d intended.

For a data serialization class, I had hoped to replace several functions of the form:

WriteInt32(int32_t i) { snprintf(..., "%d", i); }
WriteDouble(double d) { snprintf(..., "%f", d); }

to a single function:

WriteNumber(double d) { snprintf(..., "%f", d);  }

Unfortunately, the %f format specifier will, by default, always write a decimal point and 6 figures thereafter, which is undesirable for integers-types-cast-to-double written by this function — for particular use cases it doubles (hah!) the size of the data written.

The fix:

I’ve not tested this yet, but it looks like %g (or probably %.ng will provide the desired behaviour by not writing the surplus zeroes. I will need to determine if the output for floating point input and large integers will be acceptable.

Prevention:

I still have a lot to learn about printf format specifiers, though I’ve found http://www.cplusplus.com/reference/cstdio/printf/ useful when considering/trying to to understand the available options. This does make me wonder if a printf-format-string constructor exists analogous to http://cdecl.org

Update:

Rather than trying to find a single format that works for all input or testing the number before printing it, I’ve opted to overload the function by type. This way we can use knowledge of the type to do something appropriate and more efficient.

I have overloads for int, unsigned int, double, long and unsigned long, and a comment as to why there’s no implementation for unsigned long long (the loader treats all numbers as doubles, which means we could end up writing out numbers that we can’t read back without loss of precision along the way). I’ve also added static asserts in case this code gets compiled on a system where sizeof(long) > sizeof(int). There are separate functions for writing out bool and pointer values.