Terrible day — between an unfamiliar system, the unshelving of the remains of half a changelist and a whole lot of carelessness, I made a lot of mistakes and got to do a lot of full recompiles. A day to learn from.
So before I summarize that disaster, I’ll start with a could of other bugs.
Unexpected use of class
A couple of days ago, I checked in some changes to a class whereby one of its member classes could dynamically accrue allocations during its lifespan — it keeps track of certain memory allocations to reduce the number of allocations needed during its lifespan.
I subsequently added a Destroy() method for that member so that it could be cleaned up properly at the end of its life, expecting that the member would be correctly initialized so that Destroy() would Do The Right Thing.
One crash callstack later…
There exists two ways to create objects of the class type. One — used to construct objects in a particular family — will perform the correct initialization. The other — a more general construction path used to construct a wide variety of other more-or-less-related object types — will not correctly initialize this member type.
I was of the understanding that, for various architectural reasons, nothing would attempt to construct an object of this type via the more general approach. It should be no surprise that I was wrong.
I am of the opinion that anyone using the “general” construction path is Doing Something Wrong, and so I think it will be appropriate to add an assert to catch this kind of thing [a little awkwardly] in future. I do need confirmation from from other users of this system that we agree that what has happened here is undesired/unexpected, and that it can be assumed to be Wrong in future.
The “general” construction path is used to dynamically construct object based on class data passed into the program. It is likely possible to prevent this kind of “bad” data from entering the program by more strongly protecting it at the data-construction stage, rather than trying to prevent it at runtime. I’m not sure what protection is in place presently and need to investigate that.
The absolute minimum fix is to prevent this code from crashing should it get the wrong/unexpected data…
One of the issues leading to this problem is that the type being used here derives from a common base class. This does offer a great deal of convenience in practice within this codebase, but it also leads to this kind of complexity (source of bugs) where Bar isMostlyButNotAlwaysA Foo.
Hard-coded array size, no checking
This one arrived as a crash callstack in some code I had a lot to do with a while back, that had remained largely nearly touched for a number of months. The code is heavily optimized and is reasonably complex — and quite awkward to modify. As a result, having to locate and fix bugs in this code is a somewhat daunting task, not least because of how much use it has received (every frame…), and has been running without problems for a long time.
Fortunately (HAHAHAHA), it seems the problem was really quite simple: this code uses some fixed-size static arrays for storage of intermediate data. At no point is there a check that the data being processed does not exceed the size of these arrays. As is the way of these things, the size of the data being processed has grown over time, resulting in a silent writing past the end of one of these arrays. Crashilarity ensues.
Fix: increase the size of the arrays, add an assert to make sure we have enough space.
(Removed the comment on one of the arrays asking why it was so small in the first place :\ )
- Double-init of system — whereas yesterday I was failing to init the system I was working with, today I managed to call the init function twice. Added a flag and check to detect this.
- Double-shutdown of system — Later on in the day (after I’d got past most of my other bugs), I discovered I was calling the shutdown function for the system twice, in exactly the same way as I’d been double-calling the init function. I should have checked for this when I fixed the double-init bug (where there’s smoke…)
- Failure to flush — started writing a prefix token to a buffer with an incorrect initialization order. Data didn’t get flushed, record of memory blocks was confused, things failed.
- Need to reset state after flush — once the buffer is flushed to disk, reset all the variables that tracked the un-flushed state.
- Flush [too] early and flush [too] often — put calls to various Flush() methods from a range of class destructors. Got a little carried away. Let it mellow.
- Calling functions on the right objects — sometimes it’s nice to be able to construct things from other things, and to be able to rely on object lifetimes to encode certain behaviors. Even so, it will usually matter which object’s method gets called…
- Case matters. Sometimes. — When interacting with an external unfamiliar system, it’s worth making sure you know whether it is case sensitive. In this instance, it was, and I wasn’t. I worked this out almost by accident, which is regrettable.
Post-disaster-mortem: working through this code was highly reactive — I spent far too little time reading the code and working out what it was/should be doing, and far too much compiling, running the app, eyeballing output, manually feeding it into the next system, and trying to work out what I’d done wrong each time. I have no doubt that manually tracing through [a relatively small amount of] code, and validating its function up front would have saved me a lot of time.
Note to self: know when you’ve fallen into the reactive compile/test/modify cycle and step back to look at the bigger picture.