Sometimes, a "bug" can be caused by nasty architecture with intertwined hacks. Particularly on games, where you can easily have event A that triggers B unless C is in X state...
What I want to say is that I've seen what happens in a team with a history of quick fixes and an architecture inadequate for the complex features it has to support. In that case, a proper bugfix can mean significant rework and QA.
> Sometimes, a "bug" can be caused by nasty architecture with intertwined hacks
The joys of enterprise software. When searching for the cause of a bug leads you to discover multiple "forgotten" servers, ETL jobs, and crons all interacting together. And no one knows why they do what they do, or how, because the people who built them left many years ago.
> searching for the cause of a bug leads you to discover multiple "forgotten" servers, ETL jobs, and crons all interacting together. And no one knows why they do [..]
And then comes the "beginner's" mistake. They don't seem to be doing anything. Let's remove them, what could possibly go wrong?
If you follow the prescribed procedure and involve all required management, it stops being a beginner's mistake; and given reasonable rollback provisions it stops being a mistake at all, because if nobody knows what the thing is, it cannot be very important, and a removal attempt is the most effective and cost-efficient way to find out whether the thing can be removed.
that's a management/cultural problem. If no one knows why it's there, the right answer is to remove it and see what breaks. If you're too afraid to do anything, for nebulous cultural reasons, you're paralyzed by fear and no one's operating with any efficiency. It hits different when it's the senior expert everyone reveres, who invented everything the company depends on, doing it, vs a summer intern, vs Elon Musk buying your company (Twitter). Hate the man for doing it messily and ungraciously, but you can't argue with the fact that it gets results.
This does depend on a certain level of testing (automated or otherwise) for you to even be able to identify what breaks in the first place. The effect might be indirect several times over and you don't see what has changed until it lands in front of a customer and they notice it right away.
Move fast and break things is also a managerial/cultural problem in certain contexts.
I do this frequently. But sometimes identifying and/or fixing takes more than 2 days.
But you hit on a point that seems to come up a lot. When a user story takes longer than the allotted points, I encourage my junior engineers to split it into two bugs. Exactly like what you say... One bug (or issue or story) describing what you did to typify the problem and another with a suggestion for what to do to fix it.
There doesn't seem to be a lot of industry best practice about how to manage this, so we just do whatever seems best to communicate to other teams (and to ourselves later in time after we've forgotten about the bug) what happened and why.
Bug fix times are probably a Pareto distribution. The overwhelming majority will be identifiable within a fixed time box, but not all. So in addition to saying "no bug should take more than 2 days" I would add "if the bug takes more than 2 days, you really need to tell someone, something's going on." And one of the things I work VERY HARD to create is a sense of psychological safety, so devs know they're not going to lose their bonus if they randomly picked a bug that was much more wicked than anyone thought.
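To make that concrete, here's a minimal sketch (Python; the Pareto shape and parameters are my own illustrative assumptions, not measured data) of what a heavy tail does to a 2-day box:

```python
# Minimal sketch: if bug-fix times follow a heavy-tailed (Pareto) distribution,
# most bugs fit a 2-day box, but the tail dominates the total time.
# The shape/scale values are illustrative assumptions, not measured data.
import random

random.seed(42)
ALPHA = 1.5          # tail heaviness (smaller = heavier tail)
SCALE_HOURS = 2.0    # minimum fix time in hours

def sample_fix_time():
    # Inverse-CDF sampling of a Pareto(ALPHA, SCALE_HOURS) variable.
    u = random.random()
    return SCALE_HOURS / (1.0 - u) ** (1.0 / ALPHA)

times = [sample_fix_time() for _ in range(10_000)]
box = 16.0  # a "2 day" box of 2 x 8 working hours

within = [t for t in times if t <= box]
print(f"bugs finished within the 2-day box: {len(within) / len(times):.1%}")
print(f"share of total hours eaten by the remaining bugs: "
      f"{1 - sum(within) / sum(times):.1%}")
```

The exact numbers don't matter; the point is that a few percent of bugs can eat a third of the total hours, which is why the "tell someone" rule matters.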
At Amazon we had a bug that was the result of a compiler bug and the behaviour of Intel cores being mis-documented. It was intermittent and related to one core occasionally being allowed to access stale data in the cache. We debugged it with a logic analyzer, the commented nginx source and a copy of the C++11 spec.
The hardware team had some semi-custom thing from Intel that spat out (no surprise) gigabytes of trace data per second. I remember much of the pain was in constructing a lab where we could drive a test system at reasonable loads to get the buggy behavior to emerge. It was intermittent, so it took us a couple weeks to come up with theories, another couple days for testing and a week of analysis before we came up with triggers that allowed us to capture the data that showed the bug. It was a bit of a production.
I think in general, bugs go unfixed in two scenarios:
1. The cause isn't immediately obvious. In this case, finding the problem is usually 90% of the work. Here it can't be known beforehand how long finding the problem will take, though I don't think bailing because it's taking too long is a good idea. If anything, it's in those really deep rabbit holes that the real gremlins hide.
2. The cause is immediately obvious, but it's an architecture mistake, the fix is a shit-ton of work, breaks workflows, requires involving stakeholders, etc. Even in this case it can be hard to say how long it will take, especially if other people are involved and have to sign off on decisions.
I suppose it can also happen in low-trust sweatshops where developers are held on such a tight leash that they aren't able to fix trivial bugs they find without first going through a bunch of Jira rigmarole, which is sort of low key the vibe I got from the post.
I had a job that required estimation on bug tickets. It's honestly amazing how they didn't realize that I'd take my actual estimate, then multiply it by 4, then use the extra time to work on my other bug tickets that the 4x multiplier wasn't good enough for.
That's just you hedging, and they don't really need to know that. As long as you are hedging accurately in the big picture, that's all that matters. They need estimates to be able to make decisions on what should be done and what shouldn't.
You could tell them there's a 25% chance it's going to take 2 hours or less, a 50% chance 4 hours or less, a 75% chance 8 hours or less, and a 99% chance 16 hours or less, to be accurate, but communication-wise you'll win out if you just intuitively call items like those 10 hours or so. Intuitively, 10 hours feels safe with those probabilities (which are themselves intuitive and experience-based). So you'd probably say 10 hours, unless something really unexpected (the 1%) happens.
Btw, in reality, with the above probabilities the actual average would be 5-6 hours, with 1% of tasks potentially failing, but even your intuitive probability estimates could be off, so you likely want to say 10 hours.
But anyhow, that's also why story points are mostly used: if you say hours, people will naturally read it as a fixed estimate. Hours would be fine if everyone naturally understood that they imply a statistical average plus a reasonable buffer over a large number of similar tasks.
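As a rough back-of-the-envelope check of those numbers (Python; the uniform-within-band assumption and the 24-hour cap on the worst 1% are mine, purely for illustration):

```python
# Rough sanity check of the numbers above: given the quantile estimates
# (25% <= 2h, 50% <= 4h, 75% <= 8h, 99% <= 16h), approximate the expected
# time by assuming the duration is uniform within each band. The uniform
# assumption and the 24h cap on the worst 1% are invented for the sketch.
bands = [
    (0.25, 0.0, 2.0),    # 25% of tasks land between 0h and 2h
    (0.25, 2.0, 4.0),    # next 25% between 2h and 4h
    (0.25, 4.0, 8.0),    # next 25% between 4h and 8h
    (0.24, 8.0, 16.0),   # next 24% between 8h and 16h
    (0.01, 16.0, 24.0),  # the unlucky 1%, capped at 24h here
]

expected = sum(p * (lo + hi) / 2 for p, lo, hi in bands)
print(f"approximate expected time: {expected:.1f}h")   # ~5.6h
print("quoted estimate with buffer: 10h")              # what you'd actually say
```

That lands near the 5-6 hour average mentioned above, while the 10-hour quote is the buffered number you'd actually communicate.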
Sometimes you find the cause of the bug in 5 minutes because it's precisely where you thought it was; sometimes it's not there, and you end up writing some extra logging to hopefully expose the cause in production after the next release, because you can't reproduce it since it's transient. I don't know how to predict how long a bug will take to reproduce and track down, and only once it's understood do we know how long it will take to fix.
> unless fixing a bug requires a significant refactor/rewrite, I can’t imagine spending more than a day on one
Race conditions in 3rd party services during / affected by very long builds, with poor metrics and almost no documentation. They only show up sometimes, and you have to wait for them to reoccur. Add to this a domain you’re not familiar with, and your ability to debug needs to be established first.
Stack two or three of these on top of each other and you have days of figuring out what’s going on, mostly waiting for builds, speculating how to improve debug output.
After resolving, don’t write any integration tests that might catch regressions, because you already spent enough time fixing it, and this needs to get replaced soon anyway (timeline: unknown).
LLMs have helped me here the most. Adding copious detailed logging across the app on demand, then inspecting the logs to figure out the bug and even how to reproduce it.
For me the longer I work, the worse the bugs I work with become.
Nowadays, after some 17 years in the business, it's pretty much always intermittent, rarely occurring race conditions of different flavors. They might result in different behaviors (crashes, missing or wrong data, ...), but at the core of it, it's almost always race conditions.
The easy and quick to fix bugs never end up with me.
Yep. Non-determinism. Back in the day it was memory corruption caused by some race condition. By the time things have gone pop, you’re too far from the proximate cause to have useful logs or dumps.
“Happens only once every 100k runs? Won’t fix”. That works until it doesn’t, then they come looking for the poor bastard that never fixes a bug in 2 days.
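A quick back-of-the-envelope (Python; the traffic number is invented) for why "once every 100k runs" stops being rare at any real scale:

```python
# Back-of-the-envelope for "happens only once every 100k runs": at any real
# traffic level that stops being rare. The daily run count is invented.
failure_rate = 1 / 100_000
runs_per_day = 2_000_000          # hypothetical daily executions of the code path
expected_failures_per_day = failure_rate * runs_per_day
print(f"expected failures per day: {expected_failures_per_day:.0f}")   # ~20
```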
I think the worst case I encountered was something like two years from first customer report to even fully confirming the bug, followed by about a month of increasingly detailed investigations, a robot, and an oscilloscope.
The initial description? "Touchscreen sometimes misses button presses".
I'm no Raymond Chen, but sometimes I wish I'd kept notes on interesting bugs that I could take with me when I moved jobs. I've often been the go-to guy for weird shit that is happening that nobody else understands and requires cross-disciplinary insight.
Other favourites include "Microsoft Structured Exception Handling sometimes doesn't catch segfaults", and "any two of these network devices work together but all three combined freak out".
It's odd at first, but it springs from economic principles, mainly the sunk cost fallacy.
If you invest 2 days of work and did not find the root cause of a bug, then you have the human desire to keep investing more work, because you already invested so much. At that point, however, it's best to re-evaluate and do something different instead, because it might have a bigger impact.
The likelihood that, after 2 days of not finding the problem, you won't find it in another 2 days is higher than if you start over with another bug, where on average you'll find the problem sooner.
> It’s virtually impossible for me to estimate how long it will take to fix a bug, until the job is done.
In my experience there are two types of low-priority bugs (high-priority bugs just have to be fixed immediately no matter how easy or hard they are).
1. The kind where I facepalm and go “yup, I know exactly what that is”, though sometimes it’s too low of a priority to do it right now, and it ends up sitting on the backlog forever. This is the kind of bug the author wants to sweep for; they can often be wiped out in big batches by temporarily making bug-hunting the priority every once in a while.
2. The kind where I go “Hmm, that’s weird, that really shouldn’t happen.” These can be easy and turn into a facepalm after an hour of searching, or they can turn out to be brain-broiling heisenbugs that eat up tons of time, and it’s difficult to figure out which. If you wipe out a ton of category 1 bugs then trying to sift through this category for easy wins can be a good use of time.
And yeah, sometimes a category 1 bug turns out to be category 2, but that’s pretty unusual. This is definitely an area where the perfect is the enemy of the good, and I find this mental model to be pretty good.
I believe the idea is to pick small items that you'd likely be able to solve quickly. You don't know for sure but you can usually take a good guess at which tasks are quick.
> It’s virtually impossible for me to estimate how long it will take to fix a bug, until the job is done.
This is explained later in the post. The 2 day hard limit is applied not to the estimate but rather to the actual work: "If something is ballooning, cut your losses. File a proper bug, move it to the backlog, pick something else."
Most of the work in finding/fixing bugs is reproducing them reliably enough to determine the root cause.
Once I find a bug, the fix is often negligible.
But I can get into a rabbithole, tracking down the root cause. I don’t know if I’ve ever spent more than a day, trying to pin down a bug, but I have walked away from rabbitholes, a couple of times. I hate doing that. Leaves an unscratchable itch.
you cannot know. that’s why the post elaborates saying (paraphrasing) “if you realize it’s taking longer, cut your losses and move on to something else”
I don’t. I worked on firmware stuff where unexplainable behavior occurs; digging around the code, you start to feel like it’s going to take some serious work to even start to comprehend the root cause; and suddenly you find the one line of code that sets the wrong byte somewhere as a side effect, and what you thought would fill up your week ended up taking 2 hours.
I just find it so oversimplified that I can't believe you're sincere. Like you have entirely no internal heuristic for even a coarse estimation of a few minutes, hours, or days? I would say you're not being very introspective or are just exaggerating.
Working on drivers, a relatively recent example is when we started looking at a "small" image corruption issue in some really specific cases, which slowly spidered out into what was fundamentally a hardware bug affecting an entire class of possible situations; it was just that this one case happened to be noticed first.
There was even talk about a hardware ECO at points during this, though an acceptable workaround was eventually found.
I could never have predicted that when I started working on it, and it seemed that every time we thought we had a decent idea about what was happening, even more was revealed.
And then there have been many other issues where you land on the cause pretty much instantly and a trivial fix is written and in testing faster than you could update the bug tracker with an estimate.
True, there's probably a decent share, maybe even 50%, where you can have a decent guess after putting in some length of time and be correct within a factor of 2 or so, but I always felt the "long tail" was large enough to make that pretty damn inaccurate.
My team once encountered a bug that was due to a supplier misstating the delay timing needed for a memory chip.
The timings we had in place worked for most chips, but they failed for a small % of chips in the field. The failure was always exactly identical, the same memory address got corrupted, so it looked exactly like an invalid pointer access.
It took multiple engineers months of investigating to finally track down the root cause.
But what was the original estimate? And even so, I'm not saying it must always be completely correct. I'm saying it seems wild to have no starting point at all, to simply give up.
Have you ever fixed random memory corruption in an OS without memory protection?
Best case you trap on memory access to an address if your debugger supports it (ours didn't). Worst case you go through every pointer that is known to access nearby memory and go over the code very very carefully.
Of course it doesn't have to be a nearby pointer; it can be any pointer anywhere in the code base causing the problem. You just hope it is a nearby pointer, because the alternative is a needle in a haystack.
I forget how we did find the root cause. I think someone may have just guessed a bit flip in a pointer (vs an overrun), then un-bit-flipped every one of the possible bits one by one (not that many; with only a few MB of memory there aren't many active bits in a pointer), looked at what was nearby (to figure out the originally intended address of the pointer) and started investigating which pointer it was originally supposed to be.
Then after confirming it was a bit flip you have to figure out why the hell a subset of your devices are reliably seeing the exact same bit flipped, once every few days.
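The "un-bit-flip every bit and see where it lands" step is mechanical enough to script. A toy sketch (Python; every address and region here is invented, this is not the actual device map):

```python
# Toy version of the "un-bit-flip it and see where it lands" step described
# above: take the corrupted pointer value, flip each bit in turn, and keep
# the candidates that fall inside a known-valid memory region. All addresses
# and regions here are invented for illustration.
POINTER_BITS = 32

# Hypothetical map of valid regions on a small embedded target (start, end, name).
REGIONS = [
    (0x2000_0000, 0x2004_0000, "SRAM"),
    (0x0800_0000, 0x0810_0000, "flash"),
]

def region_of(addr):
    for start, end, name in REGIONS:
        if start <= addr < end:
            return name
    return None

def unflip_candidates(corrupted):
    """Yield (bit, candidate_address, region) for every single-bit flip
    that lands in a known-valid region."""
    for bit in range(POINTER_BITS):
        candidate = corrupted ^ (1 << bit)
        name = region_of(candidate)
        if name is not None:
            yield bit, candidate, name

# Example: a pointer that crashed while dereferencing a bogus address.
corrupted_ptr = 0x2002_1340 ^ (1 << 27)   # simulate a flipped bit 27
for bit, addr, region in unflip_candidates(corrupted_ptr):
    print(f"bit {bit:2d} -> 0x{addr:08X} ({region})")
```

You typically get more than one plausible candidate, and each one becomes a lead to investigate.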
So to answer your question, you get a bug (memory is being corrupted), you do an initial investigation, and then provide an estimate. That estimate can very well be "no way to tell".
The principal engineer on this particular project (Microsoft Band) had a strict 0 user-impacting bugs rule. Accordingly, after one of my guys spent a couple weeks investigating, the principal engineer assigned one of the top firmware engineers in the world to track down this one bug and fix it. It took over a month.
This is why a test suite and mock application running on the host is so important. Tools like valgrind can be used to validate that you won't have any memory errors once you deploy to the platform that doesn't have protections against invalid accesses.
It wouldn't have caught your issue in this case. But it would have eliminated a huge part of the search space your embedded engineers had to explore while hunting down the bug.
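As a sketch of the kind of host-side gate I mean (Python; the test binary path and CI wiring are assumptions, the Valgrind flags are the standard ones):

```python
# Minimal sketch of a host-side gate: run the unit-test binary under Valgrind
# and fail the build on any invalid access. The binary path and CI wiring are
# assumptions; the Valgrind flags are standard ones.
import subprocess
import sys

TEST_BINARY = "./build/host/unit_tests"   # hypothetical host build of the firmware logic

def run_memcheck():
    cmd = [
        "valgrind",
        "--error-exitcode=1",     # make memory errors fail the process
        "--leak-check=full",
        "--track-origins=yes",    # report where uninitialised values came from
        TEST_BINARY,
    ]
    result = subprocess.run(cmd)
    return result.returncode

if __name__ == "__main__":
    sys.exit(run_memcheck())   # non-zero exit blocks the merge in CI
```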
There is a divide in this job between people who can always provide an estimate but accept that it is sometimes wrong, and people who would prefer not to give an estimate because they know it’s more guess than analysis.
You seem to be in the first club, and the other poster in the second.
It rather depends on the environment in which you are working - if estimates are, well, estimates, then there is probably little harm in guessing how long something might take to fix. However, some places treat "estimates" as binding commitments, and then it could be risky to make any kind of guess because someone will hold you to it.
I can explain it to you. A bug description at the beginning is some observed behaviour that seems to be wrong. Now the process of UNDERSTANDING the bug starts. Once that process has concluded, it will be possible to make a rough guess of how long fixing it will take. Very often, the answer then is a minute or two, unless major rewrites are necessary. So, the problem is you cannot put an upfront bound on how long you need to understand the bug. Understanding can be a long-winded process that includes trying to fix the bug along the way.
> A bug description at the beginning is some observed behaviour that seems to be wrong.
Or not. A bug description can also be a ticket from a fellow engineer who knows the problem space deeply and has an initial understanding of the bug, its likely cause and possible problems. As always, it depends, and IME the kind of bugs that end up in those "bugathons" are the annoying "yeah, I know about it, we need to fix it at some point because it's a PITA".
Ex-Meta employee here. I worked at reality labs, perhaps in other orgs the situation is different.
At Meta we did "fix-it weeks", more or less every quarter. At the beginning I was thrilled: leadership that actually cares about fixing bugs!
Then reality hit: it's the worst possible decision for code and software quality. Basically this turned into: you are allowed to land all the possible crap you want, and then you have one week to "fix all the bugs". Guess what: most of the time we couldn't even fix a single bug because we were drowning in tech debt.
> That’s not to say we don’t fix important bugs during regular work; we absolutely do. But fixits recognize that there should be a place for handling the “this is slightly annoying but never quite urgent enough” class of problems.
So in their case, fixit week is mostly about smaller bugs, quality of life improvements and developer experience.
I've had to inform leadership that stability is a feature, just like anything else, and that you can't just expect it to happen without giving it time.
One leader kind of listened. Sort of. I'm pretty sure I was lucky.
Ask them if they're into pro sports. If so (and most men outside of tech are in some way), they'll probably know the phrase "availability is the best ability".
Where have you worked where this was practiced if you don’t mind sharing?
I’ve seen very close to bug free backends (more early on in development). But every frontend code base ever just always seems to have a long list of low impact bugs. Weird devices, a11y things, unanticipated screen widths, weird iOS safari quirks and so on.
Also I feel like if this was official policy, many managers would then just start classifying whatever they wanted done as a bug (and the line can be somewhat blurry anyway). So curious if that was an issue that needed dealing with.
I'm not going to share my employer, but this is exactly how we operate. Bugs first, they show up on the Jira board at the top of the list. If managers would abuse that (they don't), we'd just convert them to stories, lol.
I do agree that it's rare, this is my first workplace where they actually work like that.
Frontend bugs mostly stem from the use of overblown frontend frameworks that try to abstract away from the basics of the web too much. When relying on browser defaults and web standards, proper semantic HTML and sane CSS usage, the scope of things that can go wrong is limited.
Bugs have priorities associated with them, too. It's reasonable for a new feature to be more important than fixing a lower-priority bug. For example, if reading the second "page" of results for an API isn't working correctly, but nobody is actually using that functionality, then it might not be that important to fix it.
>For example, if reading the second "page" of results for an API isn't working correctly; but nobody is actually using that functionality; then it might not be that important to fix it.
I've seen that very argument several times, it was even in the requirements on one occasion. In each instance it was incorrect, there were times when a second page was reached.
I'd love to see an actual bug-free codebase. People who state their codebase is bug-free probably just lack awareness. Even stating we 'have only x bugs' is likely not true.
> The type that claims they're going to achieve zero known and unknown bugs is also going to be the type to get mad at people for finding bugs.
This is usually EMs in my experience.
At my last job, I remember reading a codebase that had recently been written by another developer to implement something in another project, and found a thread safety issue. When I brought this up, along with how we’d push the fix as part of the next release, he went on a little tirade about how proper processes weren’t being followed, etc., although it was a mistake anyone could have made.
Many of the bugs have very low severity or appear for a small minority of users under very specific conditions. Fixing these first might be quite a bad use of your capacity. Like misaligned UI elements, etc.
Critical bugs should of course be fixed immediately as a hotfix.
Any modern system with a sizeable userbase has thousands of bugs. Not all bugs are severe; some might be mere inconveniences affecting only a small % of customers. You usually have to balance feature work and bug fixes, and leadership almost always favours new features if the bugs aren't critical to address.
In your experience, is there a lot of contention over whether a given issue counts as a bug fix or a feature/improvement? In the article, some of the examples were saving people a few clicks in a frequent process, or updating documentation. Naively, I expect that in an environment where bug fixes get infinite priority, those wouldn't count as bugs, so they would potentially stick around forever too.
This is the 'Zero Defects'[1] mode of development. A Microsoft department adopted it in 1989 after their product quality dropped. (Ballmer is cc'd on the memo.)
In my experience, having a fixit week on the calendar encourages teams to just defer what otherwise could be done relatively easily at first report. ("ah we'll get to it in fixit week"). Sometimes it's a PM justifying putting their feature ahead of product quality, other times it's because a dev thinks they're lining up work for an anticipated new hire's onboarding. It's even hinted at in the article ('All year round, we encourage everyone to tag bugs as “good fixit candidates” as they encounter them.')
My preferred approach is to explicitly plan 'keep the lights on' capacity into the quarter/sprint/etc in much the same way that oncall/incident handling is budgeted for. With the right guidelines, it gives the air cover for an engineer to justify spending the time to fix it right away and builds a culture of constantly making small tweaks.
That said, I totally resonate with the culture aspect - I think I'd just expand the scope of the week-long event to include enhancements and POCs like a quasi hackathon
I’m a strong believer in “fix bugs first” - especially in the modern age of “always be deploying” web apps.
(I run a small SaaS product - a micro-SaaS as some call it.)
We’ll stop work on a new feature to fix a newly reported bug, even if it is a minor problem affecting just one person.
Once you have been following a “fix bugs first” approach for a while, the newly discovered bugs tend to be few, and straightforward to reproduce and fix.
This is not necessarily the best approach from a business perspective.
But from the perspective of being proud of what we do, of making high quality software, and treating our customers well, it is a great approach.
Oh, and customers love it when the bug they reported is fixed within hours or days.
Would love to work on a project with this as a rule, but I am working on a project that was built before me, with 1.2 million lines of code, 15 years old, really old frameworks; I don't think we could add features if we did this.
Same. The legacy project that powers all of our revenue-making projects at work is a gargantuan hulking php monster of the worst code I’ve ever seen.
A lot of the internal behaviors ARE bugs that have been worked around, and become part of the arbitrary piles of logic that somehow serve customer needs. My own understanding of bugs in general has definitely changed.
About stopping and fixing problems, has anybody had this kind of experience?
1. Working on Feature A, stopped by management or by the customer because we need Feature B as soon as possible.
2. Working on Feature B, stopped because there is Emergency C in production due to something that you warned the customer about months ago but there was no time to stop, analyze and fix.
3. Deployed a workaround and created issue D to fix it properly.
4. Postponed issue D because the workaround is deemed to be enough, resumed Feature B.
5. Stopped Feature B again because either Emergency E or new higher priority Feature F. At this point you can't remember what that original Feature A was about and you get a feeling that you're about to forget Feature B too.
6. Working on whatever the new thing is, you are interrupted by Emergency G that happened because that workaround at step 3 was only a workaround, as you correctly assessed, but again, no time to implement the proper fix D so you hack a new workaround.
Maybe add another couple of iterations, but by this time every party is angry, or at least unhappy, with every other party.
You have a feeling that the work of the last two or three months on every single feature has been wasted because you could not deliver any one of them. That means that the customer wasted the money they paid you. Their problem, but it can't be good for their business, so it's your problem too.
The current state of the production system is "buggy and full of workarounds" and it's going to get worse. So you think that the customer would have been wiser to pause and fix all the nastier bugs before starting Feature A. We could have had a system running smoothly, no emergencies, and everybody happier. But no, so one starts thinking that maybe the best course of action is changing company or customer.
Yes, usually not worth it to spend too much time on proper engineering if the company is still trying to find a product-market fit and you will be working on something else or deleting the code in a few months.
This is not uncommon but I've mostly managed to avoid it, because it's a management failure. There is a delicate process of "managing the customer" so that they get a result they will eventually be satisfied with, rather than just saying yes to whatever the last phone call was.
We do this too sometimes and I love it. When I work on my own projects I always stop and refactor/fix problems before adding any new features. I wish companies would see the value in doing this
Also love the humble brag. "I've just closed my 12th bug" and later "12 was maximum number of bugs closed by one person"
It's fairly telling of the state of the software industry that the exotic craft of 'fixing bugs' is apparently worth a LinkedIn-style self-promotional blog post.
I don't mean to be too harsh on the author. They mean well. But I am saddened by the wider context, where a dev posts 'we fix bugs occasionally' and everyone is thrilled, because the idea of ensuring software continues to work well over time is now as alien to software dev as the idea of fair dealing is to used car salesmen.
> But I am saddened by the wider context, where a dev posts 'we fix bugs occasionally' and everyone is thrilled, because the idea of ensuring software continues to work well over time is now as alien to software dev as the idea of fair dealing is to used car salesmen
This is not the vibe I got from the post at all. I am sure they fix plenty of bugs throughout the rest of the year, but this will be balanced with other work on new features and the like and is going to be guided by wider business priorities. It seems the point of the exercise is focusing solely on bugs to the exclusion of everything else, with a lot of latitude to just pick whatever has been annoying you personally.
The name is just an indication that you can do it any day, but the idea is that on a Friday, when you're not at a point where you'd start something big, you pick some small thing you personally want to fix. Maybe a bug in the product, maybe the local dev setup.
That is why I stand on the side of better laws on company responsibilities.
We as an industry have taught people that broken products are acceptable.
In any other industry, unless people are getting something they know from the start is broken or low quality - flea market, 1 euro shop, or similar - they will return the product, ask for their money back, sue the company, whatever.
There should be better regulation of course, but I want to point out that the comparison with other industries doesn't quite work, because these days software is often given away at no financial cost. Often it costs one's data. But once that data is released into their data flows, you can never unrelease it. It has already been processed in LLM training or used somehow to target you with ads or whatever other purpose. So people can't do what they usually would do when the product is broken.
"Free" software resulting in your data being sold is the software working as intended, it's orthogonal to the question of software robustness.
Software isn't uniquely high stakes relative to other industries. Sure, if there's a data breach your data can't be un-leaked, but you can't be un-killed when a building collapses over your head or your car fails on the highway. The comparison with other industries works just fine - if we have high stakes, we should be shipping working products.
Imagining that the software will be shipped with hardware that has no internet access, and therefore cumbersome firmware upgrades, might be helpful. Avoiding shipping critical bugs becomes actually critical, since bricking the hardware is undesirable.
This type of testing is incredibly expensive and you'll have a startup run circles around you, assuming a startup could even exist when the YC investment needs to stretch 4x as far for the same product.
The real solution is to have individual software developers be licensed and personally liable for the damage their work does. Write horrible bugs? A licensing board will review your work. Make a calculated risk that damages someone? Company sued by the user, developer sued by the company. This correctly balances incentives between software quality and productivity, and has the added benefit of culling low-quality workers.
We did this ages ago at our company (back then we were making silly Facebook games, remember those?)
It was by far the most fun, productive, and fulfilling week.
It went on to shape the course of our development strategy when I started my own company. We regularly work on tech debt, and I actively applaud it when others do it too.
I firmly believe that this sort of fixit week is as much of an anti-pattern as all-features-all-the-time. Ensuring engineers have the agency and the space to fix things and refactor as part of the normal process pays serious dividends in the long run.
eg: My last company's system was layer after layer built on top of the semi-technical founder's MVP. The total focus on features meant engineers worked solo most of the time and gave them few opportunities to coordinate and standardize. The result was a mess. Logic smeared across every layer, modules or microservices with overlapping responsibilities writing to the same tables and columns. Mass logging all at the error or info level. It was difficult to understand, harder to trace, and nearly every new feature started off with "well first we need to get out of this corner we find ourselves painted into".
When I compare that experience with some other environments I've been in where engineering had more autonomy at the day-to-day level, it's clear to me that this company should have been able to move at least as quickly with half the engineers if they were given the space to coordinate ahead of a new feature and occasionally take the time to refactor things that got spaghettified over time.
As I pointed out in the "criticisms" section, I don't see fixit weeks as a replacement for good technical hygiene.
To be clear, engineers have a lot of autonomy in my team to do what they want. People can and do fix things as they come up and are encouraged to refactor and pay down technical debt as part of their day to day work.
It's more that even with this autonomy, fixit bugs are underappreciated by everyone, even engineers. Having a week where we can redress the balance does wonders.
It is good to fix bugs, but in my team we need neither the "points system” for bugs nor the leaderboard showing how many points people have. We are against quantifying.
A company I worked at also did this, though there were no limits. Some folks would choose to spend the whole week working on a larger refactor; for example, I unified all of our Redis usage onto a single modern library, compared to the mess of 3 libraries of various ages across our codebase. This was relatively easy, but tedious, and required some new tests/etc.
Overall, I think this kind of thing is very positive for the health of building software, and morale to show that it is a priority to actually address these things.
I like the idea of this, but why not just have some time per week/sprint for bugs? At my company we prioritise features, but we also take some bug tickets every sprint (sometimes loads of bug tickets if there aren't many new features ready for dev), and generally one engineer is on "prod support" which means tackling bugs as they get reported
Because marginal work is only marginally rewarded. Spending one week and coming back to whoever with a nice piece of paper saying we fixed 60 bugs will earn a lot more rope from non-technical folk than fixing 3 bugs per week - the latter just looks like cleaning up your incompetence.
From the report, it sounds like a good thing, for the product and the team morale.
Strangely, the math looks such that they could hire nearly 1 FTE engineer who works full time only on "little issues" (40 weeks; given that people have vacations, public holidays and sick time, that's a full year's work at 100%), and then the small issues could be addressed immediately, modulo the good vibes created by dedicating the whole group to one cause for one week. Of course nobody would approve that role...
The unkind world we live in would see this role being abused quickly and a person not lasting long in this role. For one, in the wrong team, it might lead to devs just doing 80% of the work and leaving the rest to the janitor. And the janitor might get fed up with having to fix the buggy code of their colleagues.
I wonder if the janitor role could be rotated weekly or so? Then everyone could also reap the benefits of this role; I can imagine it being a good thing for anyone in terms of motivation. Fixing stuff triggers a different positive response than building stuff.
I've never understood why bugs get treated differently from new features. If there was a bug, the old feature was never completed. The time cost and benefits should be considered equally.
Yet engineers are pushed to give unknowable estimates in points, and then things either take "longer" (did you notice that shift right there?), meaning they are overdue or taking too long, or they don't. And saying "It takes as long as it takes" is not accepted by middle management.
Because the goal of most businesses is not to create complete features. There are only actions in response to the repeated question "which next action do we think will lead us to the most money?"
I introduced this to my old company years ago and called it Big Block of Cheese Day, after the West Wing episode [1]. We mostly focused on very minor bugs that affected a tiny bit of our user base in edgy edge cases but littered our error logs. (This was years ago at a, back then, relatively immature tech company.)
We once did this for a massive product with 3 releases per year: took a whole cycle to do zero features and just fix bugs. Internal customers, who usually fell over themselves to get their latest feature into the program, accepted it. But we had to announce it early. Otherwise the usual consensus is that customers would rather take 1 feature together with 10 new bugs than -5 bugs and no new features.
I just had a majorly fun time addressing tech debt, deleting about 15k lines-of-code from a codebase that now has ~45k lines of implementation, and 50k lines of tests. This was made possible by moving from a homegrown auth system to Clerk, as well as consolidating some Cloudflare workers, and other basic stuff. Not as fun as creating the tech debt in the first place, but much more satisfying. Open source repo if you like to read this sort of thing: https://github.com/VibesDIY/vibes.diy/pull/582
> closed a feature request from 2021!
> It’s a classic fixit issue: a small improvement that never bubbled to the priority list. It took me one day to implement. One day for something that sat there for four years
> The benefits of fixits
> For the product: craftsmanship and care
sorry, but this is not care when the priority system is so broken that it requires a full suspension, but only once a quarter
> A hallmark of any good product is attention to detail:
That's precisely the issue, taking 4 years to bring attention to detail, and only outside the main priority system.
Now, don't get me wrong, a fixit is better than nothing and better than having 4-year bugs turn into 40-year ones; it's just that this is not a testament to craftsmanship/care/attention to detail.
I wanted to take a look at some of these bug fixes, and one of the linked ones [1] seems more like a feature to me. So maybe it should be the week of "low priority" issues, or something like that.
I don't mean to sound negative, I think it's a great idea. I do something like this at home from time to time. Just spend a day repairing and fixing things. Everything that has accumulated.
Confused about the meaning of "bug" used in this article. It seems to be more about feature requests, nice-to-haves and polish rather than actual errors in edge cases.
Also explains the casual mention of "estimation" on fixes. A real bug fix is even harder to estimate than already brittle feature estimates.
> We also have a “points system” for bugs and a leaderboard showing how many points people have. [...] It’s a simple structure, but it works surprisingly well.
What good and bad experiences have people had with software development metrics leaderboards?
I've never had a good experience with individual metrics leaderboards. On one team we had a JIRA story point tracker shown on a TV by a clueless exec. Devs did everything they could to game the system, and tasks that involved uncertainty (hard tasks) went undone. I believe it contributed to the cog culture that caused an exodus of developers.
However, I love the idea of an occasional team based leaderboard for an event. I've held bug and security hackathons with teams of 3-5 and have had no problem with them.
One nice thing if you work on the B2B software side - end of year is generally slow in terms of new deals. Definitely a good idea to schedule bug bashes, refactors, and general tech debt payments with greater buy in from the business
Focused bug-fixing weeks like this really help improve product quality and team morale. It’s impressive to see the impact when everyone pitches in on these smaller but important issues that often get overlooked.
We’ve done little mini competitions like this at my company, and it’s always great for morale. Celebrating tiny wins in a light, semi-competitive way goes a long way for elevating camaraderie. Love it!
I'm a bit torn on Fix-it weeks. They are nice but many bugs simply aren't worth fixing. Generally, if they were worth fixing - they would have been fixed.
I do appreciate though that certain people, often very good detail oriented engineers, find large backlogs incredibly frustrating so I support fix-it weeks even if there isn't clear business ROI.
> Generally, if they were worth fixing - they would have been fixed.
???
Basically any major software product accumulates a few issues over time. There's always a "we can fix that later" mindset and it all piles up. MacOS and Windows are both buggy messes. I think I speak for the vast majority of people when I say that I'd prefer they have a fix-it year and just get rid of all the issues instead of trying to rush new features out the door.
Maybe rushing out features is good for more money now, but someday there'll be a straw that breaks the camel's back and they'll need to devote a lot of time to fix things or their products will be so bad that people will move to other options.
>For iOS 27 and next year’s other major operating system updates — including macOS 27 — the company is focused on improving the software’s quality and underlying performance.
how will the poor engineers get promotions if they cannot write "Launch feature X" (broken, half baked) on their promotion requests? Nobody ever got promoted for fixing bugs or keeping software usable.
A greedy algorithm (in the academic sense, although I suppose also in the colloquial sense) isn't the optimal solution to every problem. Sometimes doing the next most valuable thing at a given step can still lead you down a path where you're stuck at a local optimum, and the only way to get somewhere better is to do something that might not be the most valuable thing measured at the current moment only; fixing bugs is the exact type of thing that sometimes has a low initial return but can pay dividends down the line.
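A toy model of that (Python; every number is invented), where a greedy planner that always takes the highest immediate payoff loses to one that "wastes" a sprint on a low-value bug fix that speeds up everything afterwards:

```python
# Toy model of the point above: a greedy planner that always takes the
# highest immediate payoff can get stuck below a plan that "wastes" a sprint
# on a bug fix which speeds up everything afterwards. All numbers invented.
SPRINTS = 8
FEATURE_VALUE = 10          # value of shipping a feature in one sprint
BUGFIX_VALUE = 2            # direct value of the fix itself (low!)
VELOCITY_BOOST = 1.5        # features ship 50% faster once the bug is gone

def plan_value(plan):
    """plan is a sequence of 'feature' / 'bugfix' choices, one per sprint."""
    total, multiplier = 0.0, 1.0
    for choice in plan:
        if choice == "bugfix":
            total += BUGFIX_VALUE
            multiplier = VELOCITY_BOOST
        else:
            total += FEATURE_VALUE * multiplier
    return total

greedy = ["feature"] * SPRINTS                        # 10 beats 2 every single sprint
farsighted = ["bugfix"] + ["feature"] * (SPRINTS - 1)

print("greedy:    ", plan_value(greedy))       # 80.0
print("farsighted:", plan_value(farsighted))   # 2 + 7 * 15 = 107.0
```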
ROI is in reduced backlog, reduced duplicate reports and, most importantly, mitigation of the risk of a phase transition between good enough and crap. This transition is not linear; it's a step function: when the number of individually small, at worst mildly annoying issues gets big enough, the experience of using the whole product becomes frustrating. I'm sure you can think of very popular examples of such software.
I did this for my entire employment at a company I worked with. Or rather, I should say I made it a point to ignore the roadmap and do what was right for the company by optimizing for value for customers and the team.
Fixit weeks are a band-aid, and we also tried them. The real fix is being a good boss and trusting your coworkers to do their jobs.
I feel odd about "bug fixing" to be a special occasion than being the work. Features need to be added, so do bugs need to be fixed. Making it a special occasion makes it feel like some very low priority "grunt work" that requires a hard push to be looked at.
So much of tech debt work scheduling feels like a coordination or cover problem. We’re overdue for a federal “Tech Debt Week” holiday once a year, to just save people all the hand-wringing over how, when, or how much. If big tech brands can keep affording to celebrate April Fools’ jokes, they can afford to celebrate this.
They said they only pick bugs that take at most 2 days to fix.
Places where you can move fast and actually do things are far better places to work. I mean the ones where you can show up, do 5 hours of really good work, and then slack off/leave a little early.
I can find more of these that I've run into if I look. I've had tricky bugs in my team's code too, but those don't result in public artifacts, and I'm responsible for all the code that runs on my server, regardless of who wrote it... And I also can't crash client code, regardless of who wrote it, even if my code just follows the RFC.
Oh. Well, I've done easy fixes too. There's plenty of things that just need a couple minutes, like a copy error somewhere.
Or just an hour or two. I can't find it anymore, but I've run into libraries where simple things with months didn't work, because, like, May only has three letters, or July and June both start with Ju. That can turn into a big deal, but often it's easy, once someone notices it.
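A hypothetical reconstruction (Python; not the actual library I ran into) of how that kind of month bug happens and how subtle it looks:

```python
# Hypothetical reconstruction of the kind of month bug described above:
# a parser that matches months by their first two letters silently confuses
# June and July (both "Ju") and March and May (both "Ma"). Not the actual
# library, just an illustration of how easy the mistake is to miss.
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def buggy_month_number(name):
    prefix = name[:2].lower()
    for i, month in enumerate(MONTHS, start=1):
        if month.lower().startswith(prefix):
            return i          # first match wins: "July" resolves to June (6)
    raise ValueError(f"unknown month: {name}")

def fixed_month_number(name):
    # Match on the conventional unambiguous 3-letter abbreviation instead.
    prefix = name[:3].lower()
    for i, month in enumerate(MONTHS, start=1):
        if month.lower().startswith(prefix):
            return i
    raise ValueError(f"unknown month: {name}")

print(buggy_month_number("July"))   # 6 -- wrong, collides with June
print(fixed_month_number("July"))   # 7
```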
You criticize the initiative because you judge it doesn't have impact on the product or business. I would challenge that assumption with the claim that a sense of accomplishment, of decision-making and of completion are strong retention and productivity enhancers. Therefore, they're absolutely, albeit indirectly, impacting product and business.
> 1) no bug should take over 2 days
Is odd. It’s virtually impossible for me to estimate how long it will take to fix a bug, until the job is done.
That said, unless fixing a bug requires a significant refactor/rewrite, I can’t imagine spending more than a day on one.
Also, I tend to attack bugs by priority/severity, as opposed to difficulty.
Some of the most serious bugs are often quite easy to find.
Once I find the cause of a bug, the fix is usually just around the corner.
Of course you can get lost on the way, but the worst case is you learn the architecture.
and you just look at this and think: one day, all of this is going to crash and it will never, ever boot again.
Wish there were more like you, out there.
It took longer than 2 days to fix.
(apart from the ones in the firmware, and the hardware glitches...)
They were damn cool. I seriously doubt that something like that exists outside of a TSMC or Intel lab these days.
Good result == LLM + Experience.
The LLM just reduces the overhead.
That’s really what every “new paradigm” has ever done.
The longer I work as a software engineer, the rarer it is that I get to work with bugs that take only a day to fix.
It was all about fixing bugs; often, terrifying ones.
That background came in handy, once I got into software.
I tend to mostly work alone, these days (Chief Cook & Bottle-Washer).
All bugs are mine.
I love hearing stories like this.
The fact that something is high priority doesn't make it less work.
I often find the nastiest bugs are the quickest fixes.
I have a "zero-crash" policy. Crashes are never acceptable.
It's easy to enforce, because crashes are usually easy to find and fix.
$> Threading problems have entered the chat...
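A toy Python example of the kind of check-then-act race that makes those crashes intermittent rather than reliable:

```python
# Toy illustration of why threading problems are the exception to "crashes
# are easy to find": a check-then-act race that only blows up occasionally.
# Each worker checks that the list is non-empty and then pops, but another
# thread can empty it in between, so the IndexError shows up intermittently.
import threading

errors = []

def worker(shared):
    try:
        for _ in range(100_000):
            if shared:            # check ...
                shared.pop()      # ... then act: another thread may empty it in between
            else:
                shared.append(1)
    except IndexError as exc:
        errors.append(exc)        # the intermittent "crash"

items = [1] * 1000
threads = [threading.Thread(target=worker, args=(items,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"{len(errors)} worker(s) hit the race this run; rerun and the count changes "
      "(it may well be zero, which is exactly the problem)")
```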
I have encountered areas where the basic design was wrong (often comes from rushing in, before taking the time to think things through, all the way).
In these cases, we can either kludge a patch, or go back and make sure the design is fixed.
The longer I've been working, the less often I need to go back and fix a busted design.
Now I find that odd.
And sometimes, the exact opposite happens.
Working on drivers, a relatively recent example is when we started looking at a "small" image corruption issue in some really specific cases, that slowly spidered out to what was fundamentally a hardware bug affecting an entire class of possible situations, it was just this one case happened to be noticed first.
There was even talk about a hardware ECO at points during this, though an acceptable workaround was eventually found.
I could never have predicted that when I started working on it, and it seemed every time we thought we'd got a decent idea about what was happening even more was revealed.
And then there's been many other issues when you fall onto the cause pretty much instantly and a trivial fix can be completed and in testing faster than updating the bugtracker with an estimate.
True there's probably a decent amount, maybe even 50%, where you can probably have a decent guess after putting in some length of time and be correct within a factor of 2 or so, but I always felt the "long tail" was large enough to make that pretty damn inaccurate.
The timings we had in place worked, for most chips, but they failed for a small % of chips in the field. The failure was always exactly identical, the same memory address for corrupted, so it looked exactly like an invalid pointer access.
It took multiple engineers months of investigating to finally track down the root cause.
Best case you trap on memory access to an address if your debugger supports it (ours didn't). Worst case you go through every pointer that is known to access nearby memory and go over the code very very carefully.
Of course it doesn't have to be a nearby pointer, it can be any pointer anywhere in the code base causing the problem, you just hope it is a nearby pointer because the alternative is a needle in a haystack.
I forget exactly how we found the root cause. I think someone guessed it was a bit flip in a pointer (rather than an overrun), un-flipped each of the possible bits one by one (not that many; with only a few MB of memory there aren't many active bits in a pointer), looked at what was near each candidate address to figure out what the pointer was originally meant to point at, and started investigating which pointer it was supposed to be.
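For what it's worth, here's a rough sketch of how I read that un-bit-flipping sweep (my reconstruction, with made-up addresses and regions, not their actual tooling): flip each bit of the corrupted pointer value in turn and keep the candidates that land in a plausible memory region; those are the addresses the pointer was probably meant to hold.

    // Hypothetical reconstruction of the "un-flip each bit" sweep described
    // above. The corrupted value and the region map are invented for the
    // example; on a real device they would come from the crash dump and the
    // linker map.
    #include <cstdint>
    #include <cstdio>

    struct Region { uint32_t start, end; const char* name; };

    // Known-valid address ranges on the (imaginary) device.
    const Region kRegions[] = {
        {0x08000000, 0x08100000, "flash"},
        {0x20000000, 0x20040000, "SRAM"},
    };

    const char* region_of(uint32_t addr) {
        for (const Region& r : kRegions)
            if (addr >= r.start && addr < r.end) return r.name;
        return nullptr;
    }

    int main() {
        const uint32_t corrupted = 0xA0011D30; // made-up faulting pointer value

        // Flip one bit at a time; candidates inside a valid region are the
        // likely "original" addresses worth cross-referencing against known
        // objects and pointers.
        for (int bit = 0; bit < 32; ++bit) {
            uint32_t candidate = corrupted ^ (1u << bit);
            if (const char* name = region_of(candidate))
                std::printf("bit %2d -> 0x%08X (%s)\n", bit, candidate, name);
        }
        return 0;
    }

With only a few MB of address space in use, a sweep like this typically leaves only a handful of plausible candidates to chase.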
Then after confirming it was a bit flip you have to figure out why the hell a subset of your devices are reliably seeing the exact same bit flipped, once every few days.
So to answer your question, you get a bug (memory is being corrupted), you do an initial investigation, and then provide an estimate. That estimate can very well be "no way to tell".
The principal engineer on this particular project (Microsoft Band) had a strict zero user-impacting bugs rule. Accordingly, after one of my guys spent a couple of weeks investigating, the principal engineer assigned one of the top firmware engineers in the world to track down this one bug and fix it. It took over a month.
It wouldn't have caught your issue in this case. But it would have eliminated a huge part of the search space your embedded engineers had to explore while hunting down the bug.
You seem to be in the first club, and the other poster in the second.
Or not. A bug description can also be a ticket from a fellow engineer who knows the problem space deeply and has an initial understanding of the bug, its likely cause, and possible problems. As always, it depends, and IME the kind of bugs that end up in those "bugathons" are the annoying "yeah, I know about it, we need to fix it at some point because it's a PITA" ones.
At Meta we did "fix-it weeks", more or less every quarter. At the beginning I was thrilled: leadership that actually cares about fixing bugs!
Then reality hit: it's the worst possible decision for code and software quality. Basically it turned into: you are allowed to land all the crap you want, and then you have one week to "fix all the bugs". Guess what: most of the time we couldn't even fix a single bug because we were drowning in tech debt.
> That’s not to say we don’t fix important bugs during regular work; we absolutely do. But fixits recognize that there should be a place for handling the “this is slightly annoying but never quite urgent enough” class of problems.
So in their case, fixit week is mostly about smaller bugs, quality of life improvements and developer experience.
The way I learned the trade, and usually worked, is that bug fixing always comes first!
You don't work on new features until the old ones work as they should.
This worked well for the teams I was on. Having a (AFAYK) bug free code base is incredibly useful!!
One leader kind of listened. Sort of. I'm pretty sure I was lucky.
I've had some mix of luck and skill in finding these jobs. Working with people you've worked with before helps with knowing what you're in for.
I also don't really ask anyone, I just fix any bugs I find. That may not work in all organizations :)
Yes, a ticket takes 2 seconds. It also puts me off my focus :P But I guess measuring is more important than achieving.
code reviewing coworker: "This shouldn't be done on this branch!" (OK, at least this is easy to fix by doing it on a separate branch.)
I’ve seen very close to bug free backends (more early on in development). But every frontend code base ever just always seems to have a long list of low impact bugs. Weird devices, a11y things, unanticipated screen widths, weird iOS safari quirks and so on.
Also I feel like if this was official policy, many managers would then just start classifying whatever they wanted done as a bug (and the line can be somewhat blurry anyway). So curious if that was an issue that needed dealing with.
I do agree that it's rare, this is my first workplace where they actually work like that.
I've seen that very argument several times, it was even in the requirements on one occasion. In each instance it was incorrect, there were times when a second page was reached.
The type that claims they're going to achieve zero known and unknown bugs is also going to be the type to get mad at people for finding bugs.
This is usually EMs in my experience.
At my last job, I remember reading a codebase recently written by another developer to implement something in another project and finding a thread safety issue. When I brought it up and said we’d push the fix as part of the next release, he went on a little tirade about how proper processes weren’t being followed, etc., although it was a mistake anyone could have made.
There are also always bugs detected after shipping (usually in beta), which need to be accounted for.
cat /dev/null .
Assuming it works as intended.
1. https://sriramk.com/memos/zerodef.pdf
My preferred approach is to explicitly plan 'keep the lights on' capacity into the quarter/sprint/etc., in much the same way that oncall/incident handling is budgeted for. With the right guidelines, it gives an engineer the air cover to justify fixing something right away and builds a culture of constantly making small tweaks.
That said, I totally resonate with the culture aspect - I think I'd just expand the scope of the week-long event to include enhancements and POCs like a quasi hackathon
(I run a small SaaS product - a micro-SaaS as some call it.)
We’ll stop work on a new feature to fix a newly reported bug, even if it is a minor problem affecting just one person.
Once you have been following a “fix bugs first” approach for a while, the newly discovered bugs tend to be few, and straightforward to reproduce and fix.
This is not necessarily the best approach from a business perspective.
But from the perspective of being proud of what we do, of making high quality software, and treating our customers well, it is a great approach.
Oh, and customers love it when the bug they reported is fixed within hours or days.
A lot of the internal behaviors ARE bugs that have been worked around, and become part of the arbitrary piles of logic that somehow serve customer needs. My own understanding of bugs in general has definitely changed.
1. Working on Feature A, stopped by management or by the customer because we need Feature B as soon as possible.
2. Working on Feature B, stopped because there is Emergency C in production due to something that you warned the customer about months ago but there was no time to stop, analyze and fix.
3. Deployed a workaround and created issue D to fix it properly.
4. Postponed issue D because the workaround is deemed to be enough, resumed Feature B.
5. Stopped Feature B again because either Emergency E or new higher priority Feature F. At this point you can't remember what that original Feature A was about and you get a feeling that you're about to forget Feature B too.
6. Working on whatever the new thing is, you are interrupted by Emergency G that happened because that workaround at step 3 was only a workaround, as you correctly assessed, but again, no time to implement the proper fix D so you hack a new workaround.
Maybe add another couple of iterations, but by this point every party is angry with, or at least unhappy about, every other party.
You have a feeling that the work of the last two or three months on every single feature has been wasted because you could not deliver any one of them. That means that the customer wasted the money they paid you. Their problem, but it can't be good for their business so your problem too.
The current state of the production system is "buggy and full of workarounds" and it's going to get worse. So you think that the customer would have been wiser to pause and fix all the nastier bugs before starting Feature A. We could have had a system running smoothly, no emergencies, and everybody happier. But no, so one starts thinking that maybe the best course of action is changing company or customer.
Yes, the issue is not you, it's a toxic workplace. Leave as soon as you can.
Also love the humble brag. "I've just closed my 12th bug" and later "12 was maximum number of bugs closed by one person"
I don't mean to be too harsh on the author. They mean well. But I am saddened by the wider context, where a dev posts 'we fix bugs occasionally' and everyone is thrilled, because the idea of ensuring software continues to work well over time is now as alien to software dev as the idea of fair dealing is to used car salesmen.
This is not the vibe I got from the post at all. I am sure they fix plenty of bugs throughout the rest of the year, but this will be balanced with other work on new features and the like and is going to be guided by wider businesses priorities. It seems the point in the exercise is focusing solely on bugs to the exclusion of everything else, and a lot of latitude to just pick whatever has been annoying you personally.
The name is just an indication that you can do it on any day, but the idea is that on a Friday, when you're not about to start anything big, you pick some small thing you personally want to fix. Maybe a bug in the product, maybe your local dev setup.
We as an industry have taught people that broken products are acceptable.
In any other industry, unless people know from the start that they're getting something broken or low quality (flea market, 1-euro shop, or similar), they will return the product, ask for their money back, sue the company, whatever.
Software isn't uniquely high stakes relative to other industries. Sure, if there's a data breach your data can't be un-leaked, but you can't be un-killed when a building collapses over your head or your car fails on the highway. The comparison with other industries works just fine - if we have high stakes, we should be shipping working products.
Example: (aftermarket) car headunit.
The real solution is to have individual software developers be licensed and personally liable for the damage their work does. Write horrible bugs? A licensing board will review your work. Make a calculated risk that damages someone? Company sued by the user, developer sued by the company. This correctly balances incentives between software quality and productivity, and has the added benefit of culling low-quality workers.
It was by far the most fun, productive, and fulfilling week.
It went on to shape the course of our development strategy when I started my own company. Regularly work on tech debt and actively applaud it when others do it too.
eg: My last company's system was layer after layer built on top of the semi-technical founder's MVP. The total focus on features meant engineers worked solo most of the time and gave them few opportunities to coordinate and standardize. The result was a mess. Logic smeared across every layer, modules or microservices with overlapping responsibilities writing to the same tables and columns. Mass logging all at the error or info level. It was difficult to understand, harder to trace, and nearly every new feature started off with "well first we need to get out of this corner we find ourselves painted into".
When I compare that experience with some other environments I've been in where engineering had more autonomy at the day-to-day level, it's clear to me that this company should have been able to move at least as quickly with half the engineers if they were given the space to coordinate ahead of a new feature and occasionally take the time to refactor things that got spaghettified over time.
To be clear, engineers have a lot of autonomy in my team to do what they want. People can and do fix things as they come up and are encouraged to refactor and pay down technical debt as part of their day to day work.
It's more that even with this autonomy, the kind of bugs fixits target are underappreciated by everyone, even engineers. Having a week where we can redress the balance does wonders.
Overall, I think this kind of thing is very positive for the health of building software, and morale to show that it is a priority to actually address these things.
Strangely, the math works out such that they could hire nearly one FTE engineer who works full time only on "little issues" (40 engineer-weeks; given that people have vacations, public holidays, and sick time, that's roughly a full year's work at 100%), and then the small issues could be addressed immediately, modulo the good vibes created by dedicating the whole group to one cause for one week. Of course nobody would approve that role...
I wonder if the janitor role could be rotated, weekly or so? Then everyone could reap the benefits of the role too; I can imagine that being good for anyone's motivation. Fixing stuff triggers a different positive response than building stuff.
It's not binary.
It had the same spirit as a hackathon.
[1] https://westwing.fandom.com/wiki/Big_Block_of_Cheese_Day
> The benefits of fixits
> For the product: craftsmanship and care
sorry, but this is not care when the priority system is so broken that it requires a full suspension, but only once a quarter
> A hallmark of any good product is attention to detail:
That's precisely the issue, taking 4 years to bring attention to detail, and only outside the main priority system.
Now, don't get me wrong, a fixit is better than nothing and better than having 4-year bugs turn into 40-year ones; it's just that this is not a testament to craftsmanship/care/attention to detail.
I don't mean to sound negative, I think it's a great idea. I do something like this at home from time to time. Just spend a day repairing and fixing things. Everything that has accumulated.
1: https://github.com/google/perfetto/issues/154
Also explains the casual mention of "estimation" on fixes. A real bug fix is even harder to estimate than already-brittle feature estimates.
What good and bad experiences have people had with software development metrics leaderboards?
However, I love the idea of an occasional team based leaderboard for an event. I've held bug and security hackathons with teams of 3-5 and have had no problem with them.
I do appreciate though that certain people, often very good detail oriented engineers, find large backlogs incredibly frustrating so I support fix-it weeks even if there isn't clear business ROI.
???
Basically any major software product accumulates a few issues over time. There's always a "we can fix that later" mindset and it all piles up. MacOS and Windows are both buggy messes. I think I speak for the vast majority of people when I say that I'd prefer they have a fix-it year and just get rid of all the issues instead of trying to rush new features out the door.
Maybe rushing out features is good for more money now, but someday there'll be a straw that breaks the camel's back and they'll need to devote a lot of time to fix things or their products will be so bad that people will move to other options.
>For iOS 27 and next year’s other major operating system updates — including macOS 27 — the company is focused on improving the software’s quality and underlying performance.
-via Bloomberg today
Fixit weeks are a band-aid, and we also tried them. The real fix is being a good boss and trusting your coworkers to do their jobs.
Places where you can move fast and actually get things done are far better places to work. I mean the ones where you can show up, do 5 hours of really good work, and then slack off/leave a little early.
This kind of thing takes more than 2 days to fix, unless you're really good.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=217637
Or this one
https://security.stackexchange.com/questions/104845/dhe-rsa-...
I can find more of these that I've run into if I look. I've had tricky bugs in my team's code too, but those don't result in public artifacts, and I'm responsible for all the code that runs on my server, regardless of who wrote it... And I also can't crash client code, regardless of who wrote it, even if my code just follows the RFC.
Or just an hour or two. I can't find it anymore, but I've run into libraries where simple things with months didn't work, because, like, May only has three letters, or July and June both start with Ju. That can turn into a big deal, but often it's easy once someone notices it.
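For a flavour of that class of bug, here's a hypothetical sketch (not the actual library I hit): match month names by a too-short prefix and June/July and March/May silently collide.

    // Hypothetical sketch of the month-parsing bug class described above:
    // comparing only the first two characters means "July" matches "June"
    // first, and "May" matches "March" first.
    #include <cstdio>
    #include <cstring>

    static const char* kMonths[12] = {
        "January", "February", "March", "April", "May", "June",
        "July", "August", "September", "October", "November", "December"
    };

    // Buggy: the prefix length is too short to distinguish all months.
    int month_number(const char* name) {
        for (int i = 0; i < 12; ++i)
            if (std::strncmp(name, kMonths[i], 2) == 0)
                return i + 1;
        return -1; // not found
    }

    int main() {
        std::printf("July -> %d\n", month_number("July")); // prints 6, not 7
        std::printf("May  -> %d\n", month_number("May"));  // prints 3, not 5
    }

Comparing at least three characters (or the full name) fixes it, which matches the "easy once someone notices it" shape described above.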
Doing what you want to do instead of what you should be doing (hint: you should be busy making money).
Inability to triage and live with imperfections.
Not prioritizing business and democratizing decision making.