Everything will be broken

I've been writing software for a while, and it's reached the point where I don't think I can trust software to work reliably in production. I'm no longer surprised to hear that something is broken; it's become an expected part of the experience.

We are in an industry that considers it acceptable, at a certain scale, to have hundreds if not thousands of crashes in production. We dedicate servers (and teams) just to capture these anomalies and report them back to engineers, tracking them only to prioritize the “blocker” issues, the ones we can't ignore because they are just too damn costly.

Mainstream software is unpredictable; companies that were once admired as examples of “it just works” can now barely do simple things consistently. Somehow we slowly convinced ourselves that we should just “restart the app”, wait for “the next version”, or “give it a few seconds”. The joke that started as “just restart your computer” ended up being something we now put in actual release notes and manuals.

All of this, I believe, is getting worse with the culture and practices we have recently added to development: things like Agile, where the process and craft of engineering is reduced to a collection of small weekly “sprints”, with no big picture and no long-term design. You just iterate, fail fast, and break things, hacking your way to something that embarrasses you, because otherwise you probably launched your product too late. (That's an actual quote from VCs.)

On the engineering front, it feels like every week there is a new “framework”, tool, or practice that will make writing software a bliss, the “weekly panacea”. Those very rarely last. We download hundreds of gigabytes worth of development tools that barely work, so that we can write our software in text files, just like our ancestors started doing forty or fifty years ago.

Three short stories:

It works on my machine

I was developing a feature on a mobile client, and the server was returning exceptions (500) on all my calls. When I asked the backend developer, he replied “it works for me”. We both looked at the logs to figure out why his HTTP request succeeded while mine failed; we made sure everything was the same: the HTTP headers, the body, even the user agent. It was the exact same HTTP request!

After some debugging, it turned out the exception was only thrown if you connected from the corporate VPN (I was). We were doing reverse geolocation on the IP address, and it failed because I was physically connecting from a different place. The failure had nothing to do with my code; I was just in the wrong place.

Time is hard

All of a sudden there was a huge spike in failing requests, from 0 to 100%. Clients were unable to authenticate. People checked the logs and all seemed normal; there hadn't been a release in a while, and things had been pretty stable. What happened was that the clients were using “%G” as the year format instead of “yyyy”, and because it was a leap year, the server date and the client date differed, making all authentication requests fail.

This bug would not have been caught with a unit test (I mean, who writes tests against every combination of dates?). It would only have been caught if the person doing the code review knew the date format specification from memory.
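For the curious, here's a minimal sketch of that class of bug (in Python, which is not what we were running, so treat it purely as an illustration): “%G” is the ISO week-based year, and around year boundaries it quietly disagrees with the calendar year.

```python
from datetime import date

# "%G" is the ISO 8601 week-based year; "%Y" is the calendar year.
# They agree for most of the year, but not around the year boundary.
d = date(2021, 1, 1)  # January 1st, 2021 falls in ISO week 53 of 2020

print(d.strftime("%Y-%m-%d"))  # 2021-01-01
print(d.strftime("%G-%m-%d"))  # 2020-01-01 -- same day, different "year"
```

Both formats produce identical output almost all year long, which is exactly why nobody notices until the calendar rolls over.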

That's weird

I once added a simple text field to a mobile app, and when it was released, hundreds of crashes appeared on the dashboard. I was puzzled because the exception had no meaningful backtrace; it was all UIKit crashing for some reason. But I knew the problem started when I rolled out the feature, so I had definitely triggered it.

After hours trying to reproduce the issue, I found it was only caused by using “dictation” and altering the text while the OS requested the dictation transcript: if you removed the text, it would cause an unexpected crash. It turns out this was not only an issue in the code I had pushed; I was able to reproduce it in every application out there. Kind of cool, and scary.

Some bugs like this one are obscured behind layers of abstraction that your code has no control over. You can only work around them as the underlying code that lives below yours changes, hidden behind those lying abstractions we call APIs.

Most of these issues would not be discovered or solved with “documentation”, “readability”, “frameworks”, “agile”, “unit testing”, “integration testing”, “manual QA”, “verification tests”, “bug bashes”, “continuous integration”, “A/B testing”, “conditional rollouts”, ...

All of these techniques and processes are nothing but palliatives for an underlying issue we're not solving. As an industry, we can't say we're 100% sure of the stability and correctness of our systems. And I'm starting to believe we never will be.