Shit happens…

Pay-as-you-go cloud services are a dream for any agile maker: they allow you to hit the ground running, and with their elasticity and scalability, the sky is the limit. However, your wallet definitely has a limit, and thus:

“Fail fast, learn fast” with Cloud is a bad idea (We Burnt $72K testing Firebase - Cloud Run and almost went Bankrupt)

Unfortunately, these “bill shock” stories have become quite common:

I spent 25k euros in five days in BigQuery, more than our usual monthly cloud costs at the time. Fortunately, the change causing this was under a feature flag, and I could deactivate it as soon as somebody asked whether we knew something about this cost spike. It didn’t take me long to realize that it was my change, routing some timing-out queries from PostgreSQL to BigQuery. I was lucky this time because the costs were not big enough to have dire consequences (e.g., the kind that could sink a small startup), but I could have lost way more money.

…but why did I make shit happen?

📖 The Design of Everyday Things has a marvelous chapter on human errors involving technology, “Human Error? No, Bad Design”:

Why does the root cause analysis stop as soon as a human error is found? (p. 164)


People do tend to blame themselves when they do something that, after the fact, seems inexcusable. “I knew better”, is a common comment by those who have erred. But when someone says, “It was my fault, I knew better”, this is not a valid analysis of the problem. That doesn’t help prevent its recurrence (p. 167).

Of course, there is a ton to say about the very poor state of cost observability and monitoring in cloud platforms, and about the awful blast radius of cost-related errors; that is usually the focus of bill-shock posts.

And rightfully so. Ideally, the most productive use of our energy would be to redesign the cost model and its observability so that the blast radius of these errors is reduced. But unfortunately, while we demand better systems, we still need to work with them as they are. So it still pays off to take a deeper look at the human side of this.

And I needed to go through this analysis. In those days I was told several times (with the best intentions) that “shit happens”, but I couldn’t let it go that easily. To me, it looked like a rookie mistake. I needed to at least see if I could learn something from it.

Join me on a tour of the chain of events that led to this. A very interesting metaphor for studying an incident is:

(Transcluded quote from 📖 The Design of Everyday Things: the Swiss cheese model of accident causation.)

Holes in the cheese: a superficial look

First hole in the cheese: I didn’t have a reliable estimate of the costs of cloud services

(For the non-technical reader: you may skip this. The tl;dr is that things didn’t work as I thought, but I could have tested those assumptions.)

I was aware of BigQuery costs: at $5.00 per TB, I was pretty confident these queries were not going to make a dent in our bill. I was also aware of the counterintuitive cost model of BigQuery: you are charged per TB processed by the query, not per TB returned. Again, I was aware of this and (over)confident: the query had a couple of WHERE conditions, and a very quick back-of-the-envelope estimate reassured me in my decision.
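To show how badly that kind of back-of-the-envelope estimate can miss when the “bytes processed” assumption is wrong, here is a small sketch with made-up numbers (the table size, query volume and scan sizes below are illustrative, not the real figures):

```python
# Back-of-the-envelope estimate with made-up numbers (none of these are the real figures).
PRICE_PER_TB = 5.00        # BigQuery on-demand price at the time, USD per TB scanned
QUERIES_PER_DAY = 20_000   # hypothetical query volume

# What I assumed: the WHERE clauses prune the scan down to a couple of GB per query.
assumed_gb_scanned = 2
assumed_daily_cost = QUERIES_PER_DAY * (assumed_gb_scanned / 1024) * PRICE_PER_TB

# What actually happens on an unpartitioned, unclustered table: a full scan per query.
table_tb = 1.5             # hypothetical table size
actual_daily_cost = QUERIES_PER_DAY * table_tb * PRICE_PER_TB

print(f"assumed: ${assumed_daily_cost:,.0f}/day vs. actual: ${actual_daily_cost:,.0f}/day")
# assumed: $195/day vs. actual: $150,000/day -- roughly three orders of magnitude apart
```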

What I should have done: it was very simple to run that query through BigQuery’s query validator (or as a dry run via the API) and get a real estimate from the service of how much data was going to be processed. It turned out that the whole table was going to be scanned on every query: WHERE clauses only limit the amount of data scanned if the queried table is clustered or partitioned on the filtered columns.
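For reference, this is roughly what that check looks like with the google-cloud-bigquery Python client; the project, dataset and query below are placeholders, not the actual ones:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder query standing in for the real one.
sql = """
    SELECT request_id, latency_ms
    FROM `my-project.my_dataset.requests`
    WHERE status = 'timeout' AND day = '2021-06-01'
"""

# A dry run asks BigQuery to plan the query without executing (or billing) it,
# and reports how many bytes it would actually process.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

tb = job.total_bytes_processed / 1024**4
print(f"This query would scan {tb:.2f} TB (~${tb * 5:.2f} at $5/TB on-demand)")
```

The same estimate shows up in the BigQuery console’s query validator before you run a query, so there was really no excuse not to look at it.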

Second hole in the cheese: someone else had a hunch but I did not double-check

My manager at the time also expressed his concern regarding the counterintuitive nature of BigQuery costs. Since I was aware of the “bytes processed” thing, and I was going to put the change under a feature flag, I was still confident in the change: it was meant to be an experiment, a quick fix until we built a proper solution. Thus, if anything went wrong, I would simply disable it.

Third hole in the cheese: I did not monitor the costs after deploying

And this is the most painful to admit. Over the next few days, I did not pay attention to our billing. If I had done so, I could have saved 75% to 95% of the costs by reverting the change earlier. It is not that I was not tracking the new change: I was carefully following the logs in Kibana, making sure the new queries completed successfully and that the number of errors we experienced went down to zero. But as for costs, I remember a passing thought: “somebody is probably already tracking costs company-wide. If there is a spike, they will let me know”. It turns out that there was no proper tracking and alerting set up for cloud costs. And I didn’t notify anybody that I was making a change that could potentially go wrong.
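For what it’s worth, even a very crude daily check would have caught the spike within a day or two. As a sketch, assuming Cloud Billing export to BigQuery is enabled (the table name and budget threshold below are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Standard Cloud Billing export table; the name here is a placeholder.
BILLING_TABLE = "my-project.billing.gcp_billing_export_v1_XXXXXX_XXXXXX_XXXXXX"
DAILY_BUDGET_USD = 500  # hypothetical per-service daily threshold

sql = f"""
    SELECT DATE(usage_start_time) AS day,
           service.description AS service,
           SUM(cost) AS cost
    FROM `{BILLING_TABLE}`
    WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 DAY)
    GROUP BY day, service
"""

for row in client.query(sql).result():
    if row.cost > DAILY_BUDGET_USD:
        # In a real setup this would page someone; printing stands in for alerting.
        print(f"Cost spike: {row.day} {row.service} ${row.cost:,.0f} "
              f"(daily budget ${DAILY_BUDGET_USD})")
```

Cloud Billing budget alerts would be the more robust option, but even a script like this, run from a daily cron, beats a passing thought.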

The important lessons to learn

There were more cheese layers with holes that were not my responsibility and that contributed to a greater blast radius than there should have been (e.g., I was not responsible at the time for controlling the costs of cloud services at the organisation level), but I want to focus on the root causes of why I made some of the poor decisions I made.

It is very, very easy in an introspection like this to fall for a superficial explanation, adopt an easy “solution”, and call it a day. The most immediate root cause one could think of would be this:

I didn’t really know the inner workings of BigQuery with regard to pricing, so I was in a “this cannot happen” scenario. Solution: already solved, now I know.

Ok, this is a valid piece of knowledge that I got out of this. However, how many technologies are there (and will there be), each with its own quirks and dark corners, and ever changing? If I simply keep the “I didn’t know that” mentality, I am bound to make another mistake like this as soon as I start using another new technology. What about a more general idea? Let’s try this:

I didn’t know that cloud costs can be surprising. Solution: be prudent, have a protocol for cloud services.

This would be a much more valuable piece of knowledge… that I already had. Maybe if this happens to you when you are starting out in software engineering, there are a couple of lessons to learn: that you should always take full ownership of the work you do, and that you should practice pragmatic paranoia. From 📖 The Pragmatic Programmer, 20th Anniversary Edition:

No one in the brief history of computing has ever written a piece of perfect software. But Pragmatic Programmers take this a step further. They don’t trust themselves, either. Knowing that no one writes perfect code, including themselves, Pragmatic Programmers build in defenses against their own mistakes.


If I already knew this… what happened? One could argue that what happened to me is precisely the difference between knowledge and wisdom, and there was still wisdom to build up. And once again, I link to an article that I link to again and again, 🗞️ How to read self-help:

But this is a hallmark of wisdom: it’s trivial to read but nearly impossible to put into practice (…) we need lots of examples to drive this wisdom home. We should be more forgiving of self-help (the genre) and more forgiving of ourselves. Putting wisdom into practice requires reading, reflection, and practice.


However, regarding “being paranoid”, I’m usually at the complete opposite end of the spectrum, and I have to work hard not to fall into its worst form (e.g., micromanagement).

Further complicating the analysis, there was no lack of ownership either: I followed how the feature was being used every day after development, making sure it worked as intended, checking the logs… even though I did not follow the cost. Worse than that, I did not really make a fully conscious choice to delegate cost control: the thought that “somebody would be tracking our cloud costs” was a passing thought… not a full-fledged decision. What would be the explanation for this, then?

Stop clearing the decks

“Mistake” and “error” are very broad terms, so you never really know what people are thinking when they tell you that “everybody makes mistakes”. What kind of error was this? Again drawing from 📖 The Design of Everyday Things:

I’d say my error qualified as a memory-lapse at the plan level. I forgot to consider the cost issue deeply (the passing thought), so I did not specify a proper plan (e.g., a trial run and cost tracking).

And why did I have a memory-lapse? A classic: distraction and self-pressure. I did not want to spend much time on this thing because I had a myriad of other things that I also “needed” to do. This is, as we can see, a dangerous mode of operation in some areas. The right way to work is to prioritize ruthlessly and devote full attention to what you have at hand. Divided attention and hurry don’t work:

You can’t force yourself to think faster. If you try, you’re likely to end up making much worse decisions.


I needed to step out of this “clearing the decks” mode (which is a lie anyway, decks never get cleared). From the wonderful 📖 Four Thousand Weeks:

Why You Should Stop Clearing the Decks

Not only that, but you end up doing shit:

The more efficient you get, the more you become “a limitless reservoir for other people’s expectations,” in the words of the management expert Jim Benson (…) it grew painfully clear that the things I got done most diligently were the unimportant ones, while the important ones got postponed—either forever or until an imminent deadline forced me to complete them, to a mediocre standard and in a stressful rush. (p. 34)

So, stop clearing the decks:

declining to clear the decks, focusing instead on what’s truly of greatest consequence while tolerating the discomfort of knowing that, as you do so, the decks will be filling up further.


And, as easy as this sounds, it is terribly hard to do in practice. It requires embracing your finitude, and a strong “patience muscle”:

The second principle is to embrace radical incrementalism (…) willing to stop when your daily time is up, even when you’re bursting with energy and feel as though you could get much more done (…) Because as [Robert Boice, psychology professor] explained, the urge to push onward beyond that point “includes a big component of impatience about not being finished, about not being productive enough, about never again finding such an ideal time” for work. Stopping helps strengthen the muscle of patience that will permit you to return to the project again and again, and thus to sustain your productivity over an entire career (p. 115).


Before embarking on this analysis, I had the suspicion that there was something deeper that I needed to confront, and I’m glad that I did the work. What came out of it, however, was not something that I could dispatch easily, but a life-long exercise that runs through my professional life and into the core experience of how to live.