Hardware, testing and time
This week I fixed a bug that dated back to last May. It was in a piece of hardware I assembled, running firmware I wrote most of. And it had been in operation since May without me noticing the issue.
What was the trigger that led to me discovering the bug’s existence? The colder temperatures. See, the device in question is a Digispark/433MHz receiver/USB serial dongle combo that listens for broadcasts from a Digoo DG-R8H wireless temperature/humidity weather station monitor. This is placed outside, giving me external temperature data to feed into my home automation setup.
The thing is, while Belfast is often cold and wet, it’s rarely really cold. So until recently I could put the lack of sub-zero readings down to the sensor sitting on a window sill, sheltered and probably warmed by enough residual heat from the house that it never actually got below zero. Then there were a few days when it obviously did, the readings didn’t reflect it, and so I scratched my head and dug out the code.
It was obvious when I looked what the issue was: I made no attempt to deal with negative temperatures. My excuse for this is that my DS18B20 1-Wire temperature sensor code didn’t make any attempt to deal with negative temperatures either - it didn’t need to, as those are all deployed inside my home and if the temperature gets towards zero the heating is turned on. So, first mistake: not thinking about the fact the external sensor was going to have a different set of requirements/limits than the internal ones.
Secondly, when I looked at the code closely, it wasn’t clear to me how I’d ever been getting the right value. The temperature is a 12 bit value in the middle of a 36 bit data stream, so there’s a shift and mask to extract it for printing. I strip out 4 bits which are always one, which makes things fit nicely into a 32 bit variable. Except I failed to take that into account for the temperature shift - all the other pieces of information were correctly handled, just not the temperature.
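To make the shift-and-mask problem concrete, here is a minimal sketch of extracting a signed 12 bit field from a packed 32 bit word. It is not the actual firmware. The bit layout (ID, flags, temperature, humidity), the name `extract_temp_tenths` and the assumption that the temperature is a two's complement count of tenths of a degree are all illustrative guesses, so check them against real captures before relying on them.

```c
#include <stdint.h>

/*
 * Hypothetical layout of the 32-bit word left after dropping the 4
 * constant bits from the 36-bit frame (an assumption, not taken from
 * the original firmware):
 *
 *   bits 31..24  sensor ID
 *   bits 23..20  flags (battery, channel)
 *   bits 19..8   temperature, 12-bit two's complement, tenths of a degree C
 *   bits  7..0   humidity
 *
 * Because the constant bits are already gone, the temperature shift
 * is 8, not 12.
 */
#define TEMP_SHIFT 8
#define TEMP_MASK  0xFFFu

/* Returns the temperature in tenths of a degree C, e.g. -52 for -5.2C. */
static int16_t extract_temp_tenths(uint32_t packet)
{
    uint16_t raw = (uint16_t)((packet >> TEMP_SHIFT) & TEMP_MASK);

    /* Bit 11 is the sign bit of the 12-bit field: sign-extend by hand. */
    if (raw & 0x800u)
        return (int16_t)((int)raw - 0x1000);

    return (int16_t)raw;
}
```

Under these assumptions a word of `0xA580EB37` decodes to 235 (23.5C) and `0xA58FCC37` decodes to -52 (-5.2C). Without the sign-extension step the second word would read as 4044, i.e. a nonsensical 404.4C.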
Now, when I mentioned this on IRC, Lars Wirzenius helpfully piped up with something along the lines of “Of course your test suite caught this for you”. I’ve got a bunch of excuses here. Part of it is that once you involve hardware, doing a full end-to-end test becomes a lot harder. Part of it is that this is a personal project, and writing the code is more fun than writing an exhaustive test suite. And part of it is that I obviously just never wrote the negative temperature support, so most probably I’d not have written a test for it either.
Also, I did actually perform some testing before putting it into service. Values looked sane both with it sitting inside on my desk and with it sitting outside on the window sill. They varied over time as I’d expect, with overnight temperatures being lower than during the day. I even had a weird issue where I was seeing a daily peak at around 7am, which I investigated and realised was a result of that being the point where sunlight would bounce off another window and shine directly on the Digoo device.
So what additional testing did I do this time, to make sure I’d fixed the issue properly? I put the Digoo in the freezer…
Anyway, my attempt to take some generally useful lessons from this experience is as follows:
- Not implementing functionality is fine, but at least make a note of the fact (file a ticket if it’s a formal project). I’m a big fan of low priority feature bugs to track these things.
- Developer driven testing will miss things. A second pair of eyes / a suitably devious test mind would have thought of this sort of thing very quickly. Unit tests are great, but I’m a big believer that QA is a distinctly different skill to development and you need both in your team.
- 20+ years of experience with bit shifting doesn’t mean you won’t mess it up in a non-obvious way.
- Testing that involves hardware can provide trickier problems than pure software testing.
(Oh, and if you really care, I fixed the DS18B20 negative temperature support for completeness.)
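For reference, negative temperature handling for a DS18B20 can look something like the sketch below. This is not the code that was actually fixed. It assumes the sensor is at its default 12 bit resolution, where the two temperature bytes of the scratchpad form a 16 bit two's complement value in units of 1/16 of a degree C. The function name `ds18b20_millidegrees` is made up for illustration.

```c
#include <stdint.h>

/*
 * Convert the two temperature bytes from a DS18B20 scratchpad into
 * thousandths of a degree C. Assumes 12-bit resolution, where the
 * value is 16-bit two's complement in steps of 1/16 degree C.
 */
static int32_t ds18b20_millidegrees(uint8_t lsb, uint8_t msb)
{
    uint16_t bits = (uint16_t)(((uint16_t)msb << 8) | lsb);
    int32_t raw = bits;

    /* Bit 15 set means a negative reading: undo the two's complement. */
    if (bits & 0x8000u)
        raw -= 0x10000;

    /* 1/16 degree C per count; integer division truncates towards zero. */
    return (raw * 1000) / 16;
}
```

With these assumptions the bytes `0x5E, 0xFF` give -10125 (-10.125C), and `0x50, 0x05` give 85000 (85C). Doing the arithmetic in integers avoids pulling floating point support into a small microcontroller build.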