Hacker News

We deleted the production database by accident(keepthescore.co)

432 points | caspii posted 2 days ago
442 Comments:
skrebbel said 2 days ago:

I'm appalled at the way some people here receive an honest postmortem of a human fuck-up. The top 3 comments, as I write this, can be summarized as "no, it's your fault and you're stupid for making the fault".

This is not good! We don't want to scare people into writing less of these. We want to encourage people to write more of them. An MBA style "due to a human error, we lost a day of your data, we're tremendously sorry, we're doing everything in our power yadayada" isn't going to help anybody.

Yes, there's all kinds of things they could have done to prevent this from happening. Yes, some of the things they did (not) do were clearly mistakes that a seasoned DBA or sysadmin would not make. Possibly they aren't seasoned DBAs or sysadmins. Or they are but they still made a mistake.

This stuff happens. It sucks, but it still does. Get over yourselves and wish these people some luck.

t0mas88 said 2 days ago:

The software sector needs a bit of aviation safety culture: 50 years ago the conclusion "pilot error" as the main cause was virtually banned from accident investigation. The new mindset is that any system or procedure where a single human error can cause an incident is a broken system. So the blame isn't on the human pressing the button, the problem is the button or procedure design being unsuitable. The result was a huge improvement in safety across the whole industry.

In software there is still a certain arrogance of quickly calling the user (or other software professional) stupid, thinking it can't happen to you. But in reality given enough time, everyone makes at least one stupid mistake, it's how humans work.

janoc said 2 days ago:

It is not only that but also realizing that there is never a single cause to an accident or incident.

Even when it was a suicidal pilot flying the plane into a mountain on purpose. Someone had to supervise him (there are two crew members in the cockpit for a reason), someone gave him a medical, there is automation in the cockpit that could have at least caused an alarm, etc.

So even when the accident is ultimately caused by a pilot's actions, there is always a chain of events where if any of the segments were broken the accident wouldn't have happened.

While we can't prevent a bonkers pilot from crashing a plane, we could perhaps prevent a bonkers crew member from flying the plane in the first place.

Aka the Swiss cheese model. You don't want to let the holes align.

This approach is widely used in accident investigations and not only in aviation. Most industrial accidents are investigated like this, trying to understand the entire chain of events in order that processes could be improved and the problem prevented in the future.

Oh and there is one more key part in aviation that isn't elsewhere. The goal of an accident or incident investigation IS NOT TO APPORTION BLAME. It is to learn from it. That's why pilots in airlines with a healthy safety culture are encouraged to report problems, unsafe practices, etc. and this is used to fix the process instead of firing people. Once you start to play the blame game, people won't report problems - and you are flying blind into a disaster sooner or later.

jonplackett said a day ago:

It’s interesting that this is the exact opposite of how we think about crime and punishment. All criminals are like the pilot, just the person who did the action. But the reasons for them becoming criminals are seldom taken into account. The emphasis is on blaming and punishing them rather than figuring out the cause and stopping it happening again.

fireant said 4 hours ago:

The difference is the intent. The criminal wants to do bad things while the pilot does not want anyone to get harmed.

Talinx said 13 hours ago:

To stop the cause from happening is not always feasible. It might also be against human rights.

globular-toast said 2 days ago:

There is sometimes a single cause, but as the parent comment pointed out, that should never be the case and is a flaw in the system. We are gradually working towards single errors being correctable, but we're not there yet.

On the railways in Britain the failures were extensively documented. Years ago it was possible for a single failure to cause a loss. But over the years the systems have been patched and if you look at more recent incidents it is always a multitude of factors aligning that cause the loss. Sometimes it's amazing how precisely these individual elements have to align, but it's just probability.

As demonstrated by the article here, we are still in the stage where single failures can cause a loss. But it's a bit different because there is no single universal body regulating every computer system.

janoc said a day ago:

There is almost never a single cause. If a single cause can trigger a disaster, then there is another cause by definition - poor system design.

E.g. in the article's case it is clear that there is some sort of procedural deficiency there that allows the configuration variables to be set wrong and thus cause a connection to the wrong database.

Another one is that the function that has directly caused the data loss DOES NOT CHECK for this.

Yet another WTF is that if that code is meant to ever run on a development system, why is it in a production codebase in the first place?

And the worst bit? They throw arms up in the air, unable to identify the reason why this has happened. So they are leaving the possibility open to another similar mistake happening in the future, even though they have removed the offending code.

Oh and the fact that they don't have backups except for those of the hosting provider (which really shouldn't be relied on except as the last hail Mary solution!) is telling.

That's not a robust system design, especially if they are hosting customers' data.

JackFr said a day ago:

This should be a teachable moment with respect to their culture. Throwing up their hands without an understanding of what happened is unacceptable — if something that is believed impossible happens, it is important to know where your mental model failed. Otherwise you may make things worse by ‘remediating’ the wrong thing.

And while this sounds overly simplistic the simplest way this could have been avoided is enforcing production hygiene. No developers on production boxes. Ever.

quietbritishjim said a day ago:

> Even when it was a suicidal pilot flying the plane into a mountain on purpose. Someone had to supervise him (there are two crew members in the cockpit for a reason), someone gave him a medical, there is automation in the cockpit that could have at least caused an alarm, etc.

There was indeed a suicidal pilot that flew into a mountain; I'm not sure if you were deliberately referencing that specific case. In that case he was alone in the cockpit – he was only supposed to be alone briefly, but he was able to lock the cockpit door before anyone re-entered, and the lock cannot be opened by anyone from the other side, in order to avoid September 11th-type situations. It only locks for a brief period, but it can be reapplied from the pilot's side before it expires, an indefinite number of times.

I'm not saying that we can put that one down purely to human action, just that (to be pedantic) he wasn't being supervised by anyone, and there were already any number of alarms going off (and the frantic crew on the other side of the door were well aware of them).

t0mas88 said a day ago:

And as a result of that incident the procedures have changed, now a cabin crew member (or relief pilot in long haul ops) joins the other pilot in the cockpit if one has to go to the bathroom.

A similar procedure already exists for controlled rest in oceanic cruise flight at certain times, using the cabin crew to check every 20 minutes that the remaining pilot is awake.

janoc said a day ago:

I was referring specifically to the Germanwings incident.

That pilot shouldn't have been in the cockpit to begin with - his eyesight was failing, he had mental problems (he had been medically treated for suicidal tendencies), etc. None of this was discovered or identified, due to deficiencies in the system (doctors didn't have the duty to report this, he withheld the information from his employer, etc.)

The issue with the door was only the last element of the chain.

There were changes as the result of this incident - the cabin crew member has to be in the cockpit whenever one of the pilots steps out, there were changes to how the doors operate, etc.

confidantlake said a day ago:

The change to require a cabin crew member in the cockpit is a good one.

Not really sure what you can do about the suicidal tendencies. If you make pilots report medical treatment for suicidal tendencies, they aren't going to seek treatment for suicidal tendencies.

janoc said 12 hours ago:

That should have been reported by the doctor. Lubitz (the pilot) was denied an American license for this before - and somehow it wasn't caught/discovered when he got the Lufthansa/Germanwings job. Or nobody has followed up on it.

On the day of the crash he was not supposed to be on the plane at all - a paper from the doctors was found at his place after the crash declaring him unfit for duty. He kept it from his employer and it wasn't reported by the doctors either (they didn't have the duty to do so), so the airline had no idea. Making a few of the holes in the cheese align nicely.

Pilots already have the obligation to report when they are unfit for duty, no matter what the reason (being treated for a psychiatric problem certainly applies, though).

What was/is missing is the obligation of doctors to report such an important issue to the employer when the crewman is unfit. It could be argued that it would be an invasion of privacy, but there are precedents for this - e.g. failed medicals are routinely reported to the authorities (not just for pilots - also for car drivers, gun holders, etc., where the corresponding licenses are then suspended), as are discoveries of e.g. child abuse.

vlovich123 said a day ago:

Any examinations of whether or not the job itself has properties that might cause the medical issues?

odyssey7 said a day ago:

My impression of the Swiss cheese model is that it's used to take liability from the software vendor and (optionally) put it back on the software purchaser. Sure, there was a software error, but really, Mr. Customer, if this was so important, then you really should have been paying more attention and noticed the data issues sooner.

janoc said 11 hours ago:

Nonsense.

A software vendor cannot be held responsible for errors committed by the user.

That would be blaming a parachute maker for the death of the guy who jumped out of a plane without a parachute or with one rigged wrong despite the explicit instructions (or industrial best practices) telling him not to do so.

Certainly vendors need to make sure that their product is fit for the purpose and doesn't contain glaring design problems (e.g. the infamous Therac-25 scandal) but that alone is not enough to prevent a disaster.

For example, in the cited article there was no "software error". The data weren't lost because of a bug in some 3rd party code.

Data security and safety is always a process, there is no magic bullet you can buy and be done with it, with no effort of your own.

The Swiss cheese model shows this - some of the cheese layers are safeguards put in place by the vendor, the others are there for you to put in place (e.g. the various best practices, safe work procedures, backups, etc.). If you don't, well, you are making the holes easier to align, because there are now fewer safety layers between you and the disaster. By your own choice.

gonzo41 said a day ago:

You can't outsource risk.

oblio said a day ago:

The user? Start a discussion about using a better programming language and you'll see people, even here, blaming the developer.

The common example is C: "C is a sharp tool, but with a sufficiently smart, careful and experienced developer it does what you want" ("you're holding it wrong").

Developers still do this to each other.

m463 said a day ago:

That reminds me of the time during the rise of the PC when Windows would do something wrong, from a confusing interface all the way up to a blue screen of death.

What happened is that users started blaming themselves for what was going wrong, or started thinking they needed a new PC because problems would become more frequent.

From the perspective of a software guy, it was obvious that Windows was the culprit, but people would assign blame elsewhere and frequently point the finger at themselves.

So yes - an FAA investigation would end up unraveling the nonsense and point to Windows.

That said, aviation-level safety is reliable and dependable with few single points of failure and... there are no private kit jets, darnit!

There is a continuum from nothing changes & everything works to everything changes & nothing works. You have to choose the appropriate place on the dial for the task. Sounds like this is a one-man band.

neillyons said 2 days ago:

This sounds quite interesting. Any books you could recommend on the "pilot error" topic?

jsmith45 said 7 hours ago:

Not sure about books, but the NTSB generally seems to adopt the philosophy of not trying to assign blame, but instead to figure out what happened, and try to determine what can be changed to prevent this same issue from happening again.

Of course trying to assign blame is human nature, so the reports are not always completely neutral. When I read the actual NTSB report for Sullenberger's "Miracle on the Hudson", I was forced to conclude that while there were some things that the pilots could in theory have done better, given the pilots' training and documented procedures, they honestly did better than could reasonably be expected. I am nearly certain that some of the wording in the report was carefully chosen to lead one to this conclusion, despite still pointing out the places where the pilots' actions were suboptimal (and thus appearing facially neutral).

The "what can we do to avoid this ever happing again?" attitude applies to real air transit accident reports. Sadly many general aviation accident reports really do just become "pilot error".

miketery said a day ago:

When I was getting my pilots license I used to read accident reports from Canada's Transportation Safety Board [1]. I'm sure the NTSB (America's version) has similar calibre reports [2].

There is also Cockpit Resource Management [3] which addresses the human factor in great detail (how people work with each other, and how prepared are people).

In general what you learn from reading these things is that it's rarely one big error or issue - but many small things leading to the failure event.

1 - https://www.tsb.gc.ca/eng/rapports-reports/aviation/index.ht...

2 - https://www.ntsb.gov/investigations/AccidentReports/Pages/Ac...

3 - https://en.wikipedia.org/wiki/Crew_resource_management

masklinn said a day ago:

The old "they write the right stuff" essay on the On-Board Shuttle Group also talked about this mindset of errors getting through the process as being first and foremost a problem with the process to be examined in detail and fixed.

Jugurtha said a day ago:

"The Checklist Manifesto", by Atul Gawande, dives into how they looked at other sectors such as aviation to improve healthcare systems, reduce infections, etc. Interesting book.

neillyons said a day ago:

Just bought the audiobook. About to give it a listen now. Thanks.

permarad said 2 days ago:

The Design of Everyday Things by Donald A. Norman. He covers pilot error a lot in this book in how it falls back on design and usability. Very interesting read.

jnsaff2 said a day ago:

Anything by Sidney Dekker. https://sidneydekker.com/books/ I would start with The Field Guide to Understanding 'Human Error'. It's very approachable and gives you a solid understanding of the field.

benkelaar said a day ago:

This is also one of the core tenets of SRE. The chapter on blameless postmortems is quite nice: https://landing.google.com/sre/sre-book/chapters/postmortem-...

janoc said 2 days ago:

Not sure about books, but look up the Swiss cheese model. It is a widely used approach, and not only in aviation. Most industrial accidents and incidents are investigated with this in mind.

t0mas88 said a day ago:

It's part of the Human Performance subject in getting an ATPL (airline license), it was one of the subjects that I didn't hate as much when studying. You can probably just buy the book on Amazon, they're quite accessible.

AdrianB1 said a day ago:

As a GA pilot I know people who had accidents with planes, and I know that in most cases what is in the official report and what really happened are not the same, so any book would have to rely on inaccurate or unreal data. For airliners it is easy because there are flight recorders; for GA it is still a bit of the Wild West.

ENOTTY said a day ago:

The idea that multiple failures must occur for catastrophic failure is found in certain parts of the computing community. https://en.wikipedia.org/wiki/Defense_in_depth_(computing)

watwut said 2 days ago:

Yeah, but "pilot was drinking alcohol" would be considerate issue, would lead to fired pilot and would lead to more alcohol testing.

I understand what you are talking about, but aviation also has strong expectations of pilots.

janoc said 2 days ago:

Of course it would. But then there should be a process that identifies such a pilot before they even get on the plane, and there are two crew in the cockpit, so if one crewman does something unsafe or inappropriate, the other person is there to notice it, call it out and, in the extreme case, take control of the plane.

Also, if the guy or gal has alcohol problems, it would likely be visible in their flying performance over time, it should be noticed during the periodic medicals, etc.

So while a drunk pilot could be the immediate cause of a crash, it is not the only one. If any of those other things I have mentioned functioned as designed (or were in place to start with - not all flying is airline flying!), the accident wouldn't have happened.

If you focus only on the "drunk pilot, case closed", you will never identify deficiencies you may have elsewhere and which have contributed to the problem.

watwut said 2 days ago:

Note how none of those is in the article. The article is like "Could we blame alcohol? Oh no, surely not.".

t0mas88 said a day ago:

Believe it or not, even "pilot is an alcoholic" is still part of the no blame culture in aviation. As long as the pilot reports himself he'll not be fired for that. Look up the HIMS program to read more details.

user5994461 said a day ago:

You can google and find cases of US pilots getting fired and sentenced to a year in prison for flying intoxicated.

Maybe they don't get fired if they report themselves unable to fly beforehand but I wouldn't quite call that a no blame culture.

suzakus said 2 days ago:

It's a piece of software for scoreboards. Not the Therac-25, nor an airplane.

qz2 said 2 days ago:

This time.

Some days it’s just an on line community that gets burned to the ground.

Other days it’s just a service tied into hundreds of small businesses that gets burned to the ground.

Other days it’s a massive financial platform getting burned to the ground.

I’m responsible for the latter but the former two have had a much larger impact for many people when they occur. Trivialising the lax administrative discipline because a product isn’t deemed important is a slippery slope.

We need to start building competence in to what we do regardless of what it is rather than run on apologies because it’s cheaper.

wizzwizz4 said a day ago:

Prismo is an example of the first: https://fediverse.blog/~/Prismo/on-prismo-data-loss

The project never recovered.

ozim said 2 days ago:

The parent is not advocating going as strict with procedures as when operating an airplane. The post talks about "a bit of aviation safety culture" and then highlights a specific part that would be useful.

The safety culture element highlighted is: not blaming a single person, but finding out how to prevent the accident that happened from happening again. Which is reasonable, because you don't want to impose strict rules that are expensive up front. This way you just introduce measures to prevent the same thing in the future, in the context of your project.

suzakus said a day ago:

Ah, I misread. That's what I get for commenting late at night :(

Thanks for clarifying!

t0mas88 said 2 days ago:

It isn't about the importance of this one database, it's about the cultural issue in most of the sector that the parent comment was pointing out: we far too often blame the user/operator, calling them stupid, while every human makes mistakes; it's inevitable.

dmitriid said 2 days ago:

The Russian joke about investigating is "Punish the innocent, award the uninvolved"

ganafagol said a day ago:

It's good to have a post mortem. But this was not actually a post mortem. They still don't know how it could happen. Essentially, how can they write "We’re too tired to figure it out right now." and right after attempt to answer "What have we learned? Why won’t this happen again?" Well obviously you have not learned the key lesson yet since you don't know what it is! And how can you even dream of claiming to guarantee that it won't happen again before you know the root cause?

Get some sleep, do a thorough investigation, and the results of that are the post mortem that we would like to see published and that you can learn from.

Publishing some premature thoughts without actual insight is not helping anybody. It will just invite the hate that you are seeing in this thread.

ordu said a day ago:

> I'm appalled at the way some people here receive an honest postmortem of a human fuck-up. The top 3 comments, as I write this, can be summarized as "no, it's your fault and you're stupid for making the fault".

It seems that people are mostly annoyed by "complexity gremlins". They are so annoyed that they miss the previous sentence, "we're too tired to figure it out right now." The guys fucked up their system, they restored it the best they could, they tried to figure out what happened, but failed. So they decided to do PR right now, to explain what they know, and to continue the investigation later.

But people see just "complexity gremlins". The lesson learned is do not try any humor in a postmortem. Be as serious, grave, and dull as you can.

rawgabbit said a day ago:

For me, this is an example of DevOps being carried too far.

What is to stop developers from checking into GitHub "drop database; drop table; alter index; create table; create database; alter permission;"? They are automating environment builds, and so that is more efficient, right? In my career, I have seen a Fortune 100 company's core system down and out for a week because of hubris like this. In large companies, data flows downstream from a core system. When you have to restore from backup, that cascades into restores in all the child systems.

Similarly, I once had to convince a Microsoft Evangelist who was hired into my company not to redeploy our production database every time we had a production deployment. He was a pure developer and did not see any problem with dropping the database, recreating the database, and re-inserting all the data. I argued that a) this would take 10+ hours, and b) the production database has data going back many years and the schema/keys/rules/triggers have evolved during that time -- meaning that many of the inserts would fail because they didn't meet the current schema. He was unconvinced, but luckily my bosses overruled him.

My bosses were business types and understood accounting. In accounting, once you "post" a transaction to the ledger that becomes permanent. If you need to correct that transaction, then you create a new one that "credits" or corrects the entry. You don't take out the eraser.
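
To illustrate the append-only idea in SQL terms (table and column names here are invented, not from the article):

    -- postings are only ever inserted, never updated or deleted
    CREATE TABLE ledger (
        id        bigserial   PRIMARY KEY,
        account   text        NOT NULL,
        amount    numeric     NOT NULL,
        reverses  bigint      REFERENCES ledger(id),  -- set when this row corrects an earlier one
        posted_at timestamptz NOT NULL DEFAULT now()
    );

    INSERT INTO ledger (account, amount) VALUES ('acme', 100.00);               -- wrong posting
    INSERT INTO ledger (account, amount, reverses) VALUES ('acme', -100.00, 1); -- the correction (assuming the first row got id 1)

The current balance is just a sum over the rows, and the erroneous entry stays visible in the history instead of being erased.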

bromuro said a day ago:

I think you should wait 10+ hours to read different kinds of comments on HN.

For example, if I open the comments on a "14 hours ago" post, I usually see a top comment about other comments (like yours).

I then feel so out of the loop because I don't see the "commenters" you are referring to - so the thread that follows seems off topic to me.

caspii said 2 days ago:

Thanks

qz2 said 2 days ago:

I disagree.

Culturally speaking we like to pat people on the back when they do something stupid and comfort them. But most of the time this isn't productive, because it doesn't instil the requisite fear when working out what decision to make.

What happens is we have growing complacency and disassociation from consequences.

Do you press the button on something potentially destructive because you are confident it is ok through analysis, good design and testing, or confident it is ok through trite complacency?

The industry is mostly the latter and it has to stop. And the first thing is calling bad processes, bad software and stupidity out for what it is.

Honestly these guys did good but most will try and hide this sort of fuck up or explain it away with weasel words.

jurre said 2 days ago:

You should have zero fear instilled when pressing any button. The system or process has failed if a single person pressing a button can bring something down unintentionally. Fix the system/process, don't "instill fear" in the person; it's toxic, plus now you have to make sure any new person onboarded has "fear instilled", and that's just hugely unproductive.

qz2 said 2 days ago:

That’s precisely my point. A lot of people have no fear because they’re complacent or ignorant rather than the button is well engineered.

But to get there you need to fear the bad outcomes.

ddelt said 2 days ago:

I’m sorry, but this really hasn’t been my experience at all in web technology or managing on-prem systems either.

I used to be extremely fearful of making mistakes, and used to work in a zero-tolerance fear culture. My experience and the experience of my teammates on the DevOps team? We did everything manually and slowly because we couldn’t see past our own fear to think creatively on how to automate away errors. And yes, we still made errors doing everything carefully, with a second set of eyes, late night deployments, backups taken, etc.

Once our leadership at the time changed to espouse a culture of learning from a mistake and not being afraid to make one as long as you can recover and improve something, we reduced a lot of risk and actually automated away a lot of errors we typically made which were caused initially by fear culture.

Just my two cents.

qz2 said 2 days ago:

I’m not talking about fear culture. It’s a more personal thing. Call it risk management culture, if that helps - the inverse of fear culture.

Manual is wrong for a start. That would increase the probability of making a mistake and thus increase risk for example. The mitigations are automation, review and testing.

I agree with you. Perhaps fear was the wrong term. I treat it as my personal guide to how uneasy I feel about something on the quality front.

inglor_cz said a day ago:

I recalled Akimov pressing the AZ-5 button in Chernobyl...

ClumsyPilot said a day ago:

"fear required when working out what decision to make"

People like you keep making the same mistake, creating companies/organisations/industries/societies that run on fear of failure. We've tried it a thousand times, and it never works.

You can't solve crime by making every punishment a harsh death sentence; we tried that in 1700s Britain and the crime rate was sky high.

This culture gave us disasters in the USSR and famine in China.

The only thing that can solve this problem is structural change.

qz2 said a day ago:

I think my point is probably being misunderstood and that is my fault for explaining it poorly. See I fucked up :)

The fear I speak of is a personal barrier which is lacking in a lot of people. They can sleep quite happily at night knowing they did a shitty job and it's going to explode down the line. It's not their problem. They don't care.

I can't do that. Even if there are no direct consequences for me.

This is not because I am threatened but because I have some personal and professional standards.

mrmonkeyman said a day ago:

HN likes to downplay this, apparently, but not everything can be boiled down to bureaucracy.

Yes, medical professionals use checklists. They also have a harsh and very unforgiving culture that fosters craftsmanship and values professionalism above all else. You see this in other high-stakes professions too.

You cannot just take the checklist and ignore the relentless focus on quality, the feelings of personal failure and countless hours honing and caring for the craft.

Developers are notorious for being lazy AF, so it's not hard to explain our obsession with "just fix the system". It's a required but not sufficient condition.

ClumsyPilot said a day ago:

'The system' includes the attitudes of developers and people that pay them.

Everyone takes the job of a medical professional seriously, from the education, to the hospitals that employ them, to the lawmakers, to the patients.

When you pick a surgeon, you avoid the ones that killed people. Do you avoid developers that introduce bugs? We don't even keep track of that!

You can have the license taken away as a surgeon, I've never heard of anyone becoming unemployable as a developer.

You are not gonna get an equivalent outcome even if tomorrow all developers show up to work with an attitude of a heart surgeon.

However if suddenly all data loss and data breaches would result in massive compensation, and if slow and buggy software resulted in real lawsuits, you would see the results very quickly.

Basically same issues as in trading securities: no accountability for either developers or the decision makers.

mr_toad said 19 hours ago:

> medical professionals

operate in an environment where they don’t fully understand the systems they’re working with (human biology still has many unknowns), and many mistakes are irreversible.

If you look at the worst performing IT departments, they suffer from the same problems: they don’t fully understand how their systems work, and they lack easy ways to reverse changes.

johnisgood said a day ago:

> The only thing that can solve this problem is structural change.

Well, care to elaborate on this? What do we have to change, and to what end?

MaxBarraclough said a day ago:

You're speaking to the mistake. The comment you're replying to is speaking to the write-up analysing the mistake.

Blog posts analysing real-world mistakes should not be met with beratement.

corobo said a day ago:

Most will hide it away because being truthful will hurt current business or future career prospects because people like yourself exist who want everyone shitting themselves at the prospect of being honest.

In a blame-free environment you find the underlying issue and fix it. In a blame-filled environment you cover up the mistake to avoid being fired, and some other person does it again later down the line.

qz2 said a day ago:

No.

There’s a third option where people accept responsibility and are rewarded for that rather than hide from it one way or another.

I have a modicum of respect for people who do that. I don’t for people who shovel it under a rug or point a finger which are the points you are referring to. I’ve been in both environments and neither end up with a positive outcome.

If I fuck up I’m the first person to put my hand up. Call it responsibility culture.

rawoke083600 said a day ago:

I think you're missing the point. I love the idea of "bringing aviation methodology" to lower error/f-up rates in the software industry.

No one is saying don't take responsibility; they are saying - as I understood it:

Have a "systematic-approach" to the problem, the current system for preventing "drunk pilots or the wiping of production db's are not sufficient" - improve the system ! ! All the "taking responsibilities and "falling on one's own sword" won't improve the process for the future.

If we take the example of the space industry, where having 3x backup systems is common (like life support):

It seems some people's view in the comments stream is:

"No bugger that, the life-support systems engineers and co should just 'take responsibility' and produce flawless products. No need for this 3 x backups systems"

The "system" approach is that there is x-rates of failures by having 2 backups we have now reduce the possibility of error by y amount.

Or in the case of production-dbs:

If I were the CEO and the following happens:

CTO-1: "Worker A has deleted the production DB. I've scolded him, he is sorry, he got docked a month's pay and is feeling quite bad, and he has taken full responsibility for his stupid action, so this probably won't happen again!"

VS

CTO-2: "Worker A, has deleted the production DB, We Identified that our process/system for allowing dev-machines to access production db's was a terrible idea and oversight, we now have measures abc in place to prevent that in the future"

I'd go with CTO-2 EVERY day of the week !

qz2 said a day ago:

Yes. CTO-2 is my approach. As the CTO I fucked up because I didn't have that process in place to start with. The buck stops at me.

CTO-2 also has the responsibility of making sure that everyone is educated on this issue and can communicate those issues and worries (fears) to his/her level effectively because prevention is better than cure. Which is my other point.

corobo said a day ago:

See that leading with a "No." there

That's what we're talking about. I hope you don't have direct reports.

Next time be honest "Just shut the conversation down, everyone's a dumbass, I'm right, you're dumb" it'll be quicker than all this back and forth actually trying to get to a stable middle ground :)

qz2 said a day ago:

That's a tad ironic is it not?

All I am calling for is people to take responsibility.

rawoke083600 said a day ago:

What is the future value in that? From a system and reliability point of view? Genuine question - not trying to be a dk.

qz2 said a day ago:

The point is that if you take responsibility then you're taking pride in your work, are invested in it and willing to invest in self-improvement and introspection rather than doing the minimum to get to a minimum viable solution. The outcome of this is an increase in quality and a decrease in risk.

rawoke083600 said a day ago:

Wow - "...invest in self-improvement and introspection.."

I would hate for that to be our system reliability improvement methodology.

Ok fine, now I'm being slightly a "dk", but really?

qz2 said a day ago:

Well you can also apply a QMS if you want but all that does is generate paperwork full of accepted risks...

michelpp said 2 days ago:

> Computers are just too complex and there are days when the complexity gremlins win.

I'm sorry for your data loss, but this is a false and dangerous conclusion to make. You can avoid this problem. There are good suggestions in this thread, but I suggest you use Postgres's permission system to REVOKE DROP action on production except for a very special user that can only be logged in by a human, never a script.

And NEVER run your scripts or application servers as a superuser. This is a dangerous antipattern embraced by many an ORM and library. Grant CREATE and DROP to dedicated non-superuser roles instead.
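
As a rough sketch of what that separation can look like in Postgres (role, database and table names below are made up; note that DROP in Postgres follows ownership rather than a grantable privilege, so the practical way to take it away from the app is to make sure the app role owns nothing):

    -- the application connects as a plain role that owns nothing
    CREATE ROLE app_user LOGIN PASSWORD 'app-password-here';
    REVOKE CREATE ON SCHEMA public FROM PUBLIC;
    GRANT USAGE ON SCHEMA public TO app_user;
    GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_user;

    -- ownership (and therefore DROP) stays with an admin role only a human ever logs in as
    CREATE ROLE db_admin LOGIN PASSWORD 'admin-password-here';
    ALTER DATABASE prod OWNER TO db_admin;
    ALTER TABLE scoreboards OWNER TO db_admin;  -- repeat per table; table name is hypothetical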

sushshshsh said 2 days ago:

As a mid level developer contributing to various large corporate stacks, I would say the systems are too complex and it's too easy to break things in non obvious ways.

Gone are the days of me just being able to run a simple script that accesses data read-only and exports the result elsewhere as an output.

Nextgrid said 2 days ago:

This is why I am against the current trend of over-complicating stacks for political or marketing reasons. Every startup nowadays wants microservices and/or serverless and a mashup of dozens of different SaaS (some that can't easily be simulated locally) from day 1 while a "boring" monolithic app will get them running just fine.

tamrix said 2 days ago:

I think we're hitting peak tech. All this "technical" knowledge just dates itself in a year's time anyway.

Eventually, you come to realise that the more tech you've got, the more problems you have.

Now developers spend more time googling errors and plugging in libraries and webservices together than writing any actual code.

Sometimes I wish for a techless, cloudless revolution where we just go back to the foundations of computers and use plain text wherever possible.

a_imho said a day ago:

My point today is that, if we wish to count lines of code, we should not regard them as “lines produced” but as “lines spent”: the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.

I've yet to encounter a point in my career where KISS fails me. OTOH this is nothing new; I don't have my hopes up that the current trends of overcomplicating things are going to change in the near future.

adwn said a day ago:

> Sometimes I wish for a techless, cloudless revolution where we just go back to the foundations of computers and use plain text wherever possible.

... because software in the 60s/70s/80s was so reliable and bug-free?!

Nextgrid said a day ago:

It most likely had less moving parts & failure modes than a modern microservice mess.

mrmonkeyman said a day ago:

It actually was. Shocking, isn't it?

csomar said a day ago:

For the most part, we are not complicating stuff. Today's requirements are complicated. We used to operate from the command line on a single processor. Now things are complicated: people expect a web UI, high availability, integration with their phone, email notifications, 2FA authentication, and then you have things like SSL/HTTPS, compliance, and you need to log the whole thing for errors or compliance or whatever.

Sometimes it's simpler to go back to a commandline utility, sometimes it's not.

Nextgrid said a day ago:

All of these can be done just fine in a monolithic Django/Ruby/PHP/Java/etc app.

sushshshsh said 2 days ago:

Yup, 100% agree. It may be that you will eventually need an auto-scalable message queue and api gateway, but for most people a web server and csv will serve the first thousand customers

emodendroket said 2 days ago:

There is sense in not building more services than you need. But many folks end up finding it hard to break away from their monolith and then it becomes an albatross. Not sure how to account for that.

stepbeek said 2 days ago:

If a team doesn’t have the engineering know how to split a monolith into distinct services in a straightforward fashion, then I’m not sure that team will have the chops to start with microservices.

nucleardog said a day ago:

Dealing with existing code and moving it forward in significant ways without taking down production is always much more challenging than writing new code, whatever form those new ways take.

You can get by with one strong lead defining services and interfaces for a bunch of code monkeys that write what goes behind them.

Given an existing monolithic codebase, you can’t specify what high level services should exist and expect juniors to not only make the correct decisions on where the code should land up but also develop a migration plan such that you can move forward with existing functionality and customer data rather than starting from zero.

Aeolun said 2 days ago:

What you end up with is a set of tiny monoliths.

sushshshsh said 2 days ago:

parallel rewrites before the current prod system ever hits performance problems :)

arthurcolle said 2 days ago:

Sounds expensive

dwohnitmok said 2 days ago:

But potentially within the budget of a business with a large base of proven customers.

While annoying technically, for early stage startups, performance problems caused by an overly large number of users are almost always a good problem to have and are a far rarer sight than startups that have over-architected their technical solution without the concomitant proven actual users.

vp8989 said a day ago:

"for most people a web server and csv will serve the first thousand customers"

I hear this kind of thought-terminating cliche a lot on here and it makes absolutely no sense.

If # of users is a rough approximation of a company's success and more successful companies tend to hire more engineers ... then actually the majority of engineers would not have the luxury of not needing to think about scalability.

With engineering salaries being what they are, why would you think that "most people" are employed working on systems that only have 1000 users?

sushshshsh said a day ago:

"what is an MVP even", the post!

dnautics said 2 days ago:

Please don't use CSV. At the very least use SQLite. But hosted SQL databases are probably the smart thing to do.

theamk said 2 days ago:

Do use CSV (and other similar formats) for read-only data which fits entirely in the memory.

It is great for data safety -- chown/chmod the file, and you can be sure your scripts won't touch this. And if you are accessing live instance, you can be pretty sure that you won't accidentally break it by obtaining a database lock.

Now "csv" in particular is kinda bad because it never got standardized, so if you you have complex data (punctuation, newlines, etc..), it you might not be able to get the same data back using a different software.

So consider some other storage formats -- there are tons. Like TSV (tab-separated-values) if you want it simple; json if you want great tooling support; jsonlines if you want to use json tools with old-school Unix tools as well; protobufs if you like schemas and speed; numpy's npy if you have millions of fixed-width records; and so on...

There is no need to bother with SQL if the app will immediately load every row into memory and work with native objects.

oblio said a day ago:

> complex data (punctuation, newlines, etc..)

Oh, the irony! Text with punctuation and newlines is complex data.

CSV is doomed but the world runs on it and pays its cost with engineer tears.

daniellarusso said 2 days ago:

I agree with the JSON suggestion, but what advantage is there to TSV versus CSV?

I have experienced pain with both characters (tab and comma), particularly when I am not the one creating the output file.

junon said 2 days ago:

Tabs do not appear in common literature. It's easier to justify requiring inputs not to have them, in order to avoid a quotation or escaping mess.

Commas are _way_ too common.

CSV is an awful format anyway.

theamk said a day ago:

If you can make sure the data has no newlines or tabs, then TSV needs no quoting. It is just the "split" function, which is present in every language and very easy to use. When I use it, I usually add a check to the writer that there are no newlines or tabs in the data, and assert if this is not the case.

You use this TSV with Unix tools like "cut", "paste", and ad-hoc scripts.

There is also "tsv" as defined by Excel, which has quoting and stuff. It is basically a dialect of CSV (Python even uses the same module to read it), and has all the disadvantages of CSV. Avoid it.

ethbr0 said 2 days ago:

> Please don't use csv

Could you elaborate? I'm interested in the specific reasons.

roenxi said 2 days ago:

If you don't know how to use a database for whatever reason and are doing something as a test-of-concept then CSV is fine. But for anything serious - databases, particularly older featureful ones like postgres, have a lot of efficiency tricks; clever things like indexes and nobody has ever come up with anything that is decisively better organised than the relational model of data.

If you use a relational database, the worst-case outcome is you hit tremendous scale and have to do something special later on. The likely scenario is some wins from the many performance lessons databases have learned. The best-case outcome is avoiding a very costly excursion relearning lessons the database community has known about since 1970 (like reinventing transactions).

Managing data with a csv (without a great reason) is like programming a complex GUI in assembly without a reason - it isn't going to look like a good decision in hindsight. Most data is not special, and databases are ready for it.

duskwuff said 2 days ago:

High-quality software which interacts with real transactional SQL databases is readily available, free, and easy to use. Building an inferior solution yourself, like a CSV-based database, doesn't make your product any better. (If anything, it will probably make it worse.)

vorticalbox said 2 days ago:

It was never suggested to use a csv database, it was suggested that for small amounts of read only data a csv or other format file is a better option and I agree.

apta said 2 days ago:

Reliability, atomic updates, roll backs, proper backups, history management, proper schema, etc.

hibbelig said 2 days ago:

The way I understand sushsh‘s suggestion is that CSV is an alternative to an API gateway and message queue. I don’t think the suggestion was to replace a database with CSV.

ivan_gammel said 2 days ago:

Microservices are easy to build and throw away these days. In startups time to market is more important than future investment in devops. In terms of speed of delivery they are not worse than monolithic architecture (also not better). For similar reasons SaaS is and must be the default way to build IT infrastructure, because it has much lower costs and time to deploy, compared to custom software development.

danmaz74 said 2 days ago:

If you're talking about a web application or API back end with a smallish startup team, time to market is definitely going to be much longer for a microservices architecture compared to developing a "monolith" in a batteries included framework like eg rails, using a single database.

jand said 2 days ago:

Just to be fair: You are combining microservice characteristics and "using a single database" in your argument.

Please also consider that especially for smallish teams, microservices are not required to be the same as big corp microservices.

I have encountered a trend towards calling surprisingly many non-monolithic things a microservice. So what kind of microservice are you all referring to in your minds?

edit: orthography

ivan_gammel said 2 days ago:

If you think it’s much longer, then you haven’t done it with modern frameworks. I’m CTO of a German startup, which went from a two-founder team to over 250 employees in 70 locations in 3 years. For us microservice architecture was crucial to deliver plenty of things in time. Low coupling, micro-teams of 2 ppl max working on several projects at once... we did not have the luxury of coordinating monolith releases. Adding one more microservice to our delivery pipeline now takes no more than one hour end-to-end. Building infrastructure MVP with CI/CD on test and production environments in AWS took around a week. We use Java/Spring Cloud stack, one Postgres schema per service.

AdrianB1 said a day ago:

It is probably not what you intended, but this is how it sounds: we have a hundred micro-teams of 2 working in silos on low-coupling microservices and we don't have the luxury of coordinating an end-to-end design.

Edit: 2 questions were asked, too deep to reply. 1. You said 250 people, nothing about IT. Based on the info provided, this was the image reflected. 2. "the luxury of coordinating a monolith". Done well, it is not much more complicated than coordinating the design of microservices; some can argue it is the same effort.

ivan_gammel said a day ago:

That’s an interesting interpretation, but... 1. our whole IT team is 15 people, covering every aspect of automation of a business with significant complexity of the domain and big physical component (medical practices). 2. can you elaborate more on end to end design? I struggle to find the way of thinking which could lead to that conclusion.

greggman3 said 2 days ago:

What? I thought that was just the opposite. The advantage of serverless is that I pay AWS to make backups so I don't have to. I mean, under time pressure, if I do it myself I'll skip making backups, setting permissions perfectly, and making sure I can actually restore those backups. If I go with a microservice, the whole point is they already do those things for me. No?

jamil7 said 2 days ago:

What does serverless have to do with making backups? Any managed database can do this for you. Microservices attempt to facilitate multiple teams working on a system independently. They’re mostly solving human problems not technical.

The grandparent comment's point is that a single person or team can deploy a monolith on Heroku and avoid a huge amount of complexity. Especially in the beginning.

ozorOzora said 2 days ago:

I'm pretty sure the advantage of serverless is that you can use microservices for your MVPs. The gains for a one-man team might not be obvious, but I like to believe that once a project needs scaling it is not as painful as tearing down the monolith.

moksly said 2 days ago:

Why are those days gone? I do it all the time in an organisation with 10,000 employees. I obviously agree with the parent poster that you should only do such things with users that have only the right amount of access, but that's what having many users and schemas is for. I don't, however, see why you'd ever need to complicate your stack beyond a simple Python/PowerShell script, a correct SQL setup, an official SQL driver and maybe a few stored procedures.

I build and maintain our entire employee database with a Python script, from a weird non-standard XML-"like" daily dump from our payment system, and a few web services that hold employee data in other required systems. Our IT then builds/maintains our AD from a few PowerShell scripts, and finally we have a range of "micro services" that are really just independent scripts that send user data changes to the 500 systems that depend on our central record.

Sure, sure, we’re moving it to Azure services for better monitoring, but basically it’s a few hundred lines of scripting that, combined with AD and ADDS, does more than an IDM with a 1 million USD a year license.

theamk said 2 days ago:

Why are these days gone?

Just a few weeks ago, I set up a read-only user for myself, and moved all modify permissions to a role one must explicitly assume. Really helped me with peace of mind while developing the simple scripts that access data read-only. This was on our managed AWS RDS database.
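
Roughly, in Postgres terms (role names invented here; the write role cannot log in directly and has to be assumed deliberately with SET ROLE):

    CREATE ROLE me_ro LOGIN PASSWORD 'read-only-password';
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO me_ro;    -- day-to-day scripts connect as this

    CREATE ROLE me_rw NOLOGIN;                               -- cannot be used as a login directly
    GRANT INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO me_rw;
    GRANT me_rw TO me_ro;                                    -- membership allows SET ROLE me_rw

    -- then, only when a write is actually intended:
    SET ROLE me_rw;
    -- ...modifying statements here...
    RESET ROLE;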

akerro said 2 days ago:

I'm in a similar position as you, but I say systems are as complex as their designers made them and it's on you to change it.

auroranil said 2 days ago:

Tom Scott made a mistake with a similar outcome to the one in this article, but with an SQL query that is much more subtle than DROP.

https://www.youtube.com/watch?v=X6NJkWbM1xk

By all means, find ways to fool-proof the architecture. But be prepared for scenarios where some destructive action happens to a production database.

heavenlyblue said a day ago:

He would not have done that if he were simply using a database transaction for this operation.
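
Something along these lines (a sketch with made-up table and column names - nothing is permanent until the COMMIT, and the reported row count gives you a chance to notice a bad WHERE clause):

    BEGIN;
    UPDATE users SET email = 'new@example.com' WHERE id = 42;
    -- the client reports e.g. "UPDATE 1"; if it says "UPDATE 50000", the WHERE clause was wrong
    ROLLBACK;   -- throw the change away
    -- or, once the row count looks right:
    -- COMMIT;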

madbkarim said a day ago:

That’s exactly the point he’s trying to get across with that video.

thih9 said 2 days ago:

> You can avoid this problem.

The article isn’t claiming that the problem is impossible to solve.

On the contrary: “However, we will figure out what went wrong and ensure that this particular error doesn’t happen again.”.

DelightOne said a day ago:

If you use terraform to deploy the managed production database, do you use the postgresql terraform provider to create roles or are you creating them manually?

bsder said 2 days ago:

> You can avoid this problem.

No, you can't. No matter how good you are, you can always "rm -rf" your world.

Yes, we can make it harder, but, at the end of the day, some human, somewhere, has to pull the switch on the stuff that pushes to prod.

You can clobber prod manually, or you accidentally write an erroneous script that clobbers prod. Either way--prod is toast.

The word of the day is "backups".

ikiris said 2 days ago:

excuse me, but no. this is harmful bullshit.

Yes, backups are vitally important, but no it is not possible to accidentally rm -rf with proper design.

It's possible to have the most dangerous credentials possible and still make it difficult to do catastrophic global changes. Hell it's my job to make sure this is the case.

thunderrabbit said 2 days ago:

> not possible to accidentally rm -rf with proper design.

Can you say more about this?

I understand rm -rf, but not sure how I could design that to be impossible for the most dangerous credentials.

ikiris said a day ago:

You can make the most dangerous credentials involve getting a keycard from a safe, and multi party sign off, not possible to deploy to more than X machines at a time with a sliding window of application, independent systems with proper redundant and failback design, canary analysis, etc etc etc.

I didn't even mean you can only make it difficult, I meant you can make it almost impossible to harm a real production environment in such a nuclear way without herculean effort and quite frankly likely collusion from multiple parties.

auggierose said a day ago:

He said "difficult", not impossible.

heavenlyblue said a day ago:

Just don’t use the most dangerous credentials.

The most dangerous credentials are cosmic rays and we use the Earth’s atmosphere and ECC to fight that.

fuzxi said 2 days ago:

Difficult, but not impossible. Which was the point, I think.

centimeter said 2 days ago:

> but this is a false and dangerous conclusion to make

Until we get our shit together and start formally verifying the semantics of everything, their conclusion is 100% correct, both literally and practically.

oppositelock said 2 days ago:

You have to put a lot of thought into protecting and backing up production databases, and backups are not good enough without regular testing of recovery.

I have been running Postgres in production supporting $millions in business for years. Here's how it's set up. These days I use RDS in AWS, but the same is doable anywhere.

First, the primary server is configured to send write-ahead logs (WAL) to a secondary server. What this means is that before a transaction completes on the master, the slave has written it too. This is a hot spare in case something happens to the master.

Secondly, WAL logs will happily contain a DROP DATABASE in them, they're just the transaction log, and don't prevent bad mistakes, so I also send the WAL logs to backup storage via WAL-E. In the tale of horror in the linked article, I'd be able to recover the DB by restoring from the last backup, and applying the WAL delta. If the WAL contains a "drop database", then some manual intervention is required to only play them back up to the statement before that drop.

Third is a question of access control for developers. Absolutely nobody should have write credentials for a prod DB except for the prod services. If a developer needs to work with data to develop something, I have all these wonderful DB backups lying around, so I bring up a new DB from the backups, giving the developer a sandbox to play in, and also testing my recovery procedure, double-win. Now, there are emergencies where this rule is broken, but it's an anomalous situation handled on a case by case basis, and I only let people who know what they're doing touch that live prod DB.

azeirah said a day ago:

Quick tip for anyone learning from this thread.

If you're using MySQL, it's called a binary log and not a Write Ahead Log, it was very difficult to find meaningful Google results for "MySQL WAL"

x87678r said a day ago:

Interesting, I immediately thought they would have a transaction log, I didn't think it would have the delete as well.

It's a real problem that we used to have trained DBAs to own the data, whereas now devs and automatic tools are relied upon, and there isn't a culture or toolset built up yet to handle it.

mr_toad said 19 hours ago:

> I have all these wonderful DB backups lying around, so I bring up a new DB from the backups

It’s nice to have that capability, but some databases are just too big to have multiple copies lying around, or to be able to create a sandbox for everyone.

aszen said 2 days ago:

I had a narrow escape once doing something fancy with migrations.

We had several MySQL string columns stored as the LONGTEXT type in our database, but they should have been varchar(255) or so. So I was assigned to convert these columns to their appropriate size.

Being the good developer I was, I decided to download a snapshot of the prod database locally and check the maximum string length we had for each column via a script. From that, the script generated a migration query that would alter each column's type to match its maximum used length, with a minimum of varchar(255).

I tested that migration and everything looked good; it passed code review and was run on prod. Soon after, we started getting complaints from users that their old email texts had been truncated. I then realized the stupidity of the whole thing: the local dump of the production database always wiped certain columns clean for privacy, like the email body column. So the script thought that column had a max length of 0 and decided to convert it to varchar(255).

I realize the whole thing may look incredibly stupid; that's only because the naming of the db columns was in a foreign European language, so I didn't even know the semantics of each column.

Thankfully my seniors managed to restore that column and took the responsibility themselves since they had passed the review.

We still did fix those unusually large columns, but this time with simple, duplicated ALTER queries for each of those columns instead of fancy scripts.

I think a valuable lesson was learned that day to not rely on hacky scripts just to reduce some duplicate code.

I now prefer clarity and explicitness when writing such scripts instead of trying to be too clever and automating everything.

heavenlyblue said 2 days ago:

And you didn’t even bother to do a query of the actual maximum length value of the columns you were mutating? Or at least query and see the text in there?

Basically you just blindly ran the migration on the data and checked if it didn’t fail?

The lesson here is not about cleverness unfortunately.

aszen said a day ago:

I did see some values and found them reasonable. The problem was that there were at least 200 or so tables with dozens of columns each, and only two or so tables were excluded from being dumped locally.

So yes, I could have noticed their length of 0 if I had looked carefully amidst hundreds of rows, but since my faulty logic of prod db = local db didn't even consider this possible, I didn't bother.

If it had been just 10 to 20 migration queries, that would have been a lot easier to validate, but then I wouldn't even have attempted to write a script.

mr_toad said 18 hours ago:

> my faulty logic of prod db = local db didn't

It happens. “It worked in dev” is the database equivalent of “worked on my machine”.

detaro said 2 days ago:

The comment clearly states that they did.

heavenlyblue said a day ago:

If they did, they would have noticed that the columns were empty (because they were wiped clean of PII).

The parent is either misrepresenting the situation or they didn’t do what they say they did.

Also, in any production setup, before the migration and in the same transaction you would have something along the lines of "check whether any existing value is longer than the new column size, and abort if so", because you never know what might be added while you're working on the database.
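
Something along these lines, as a rough sketch (using mysql-connector-python; the table, column, and connection details are made up, and since MySQL DDL auto-commits, the check runs immediately before the ALTER rather than literally inside one transaction):

    import mysql.connector

    # Hypothetical connection details
    conn = mysql.connector.connect(host="localhost", user="app", password="...", database="app")
    cur = conn.cursor()

    # Measure the real data first...
    cur.execute("SELECT COALESCE(MAX(CHAR_LENGTH(body)), 0) FROM emails")
    (max_len,) = cur.fetchone()

    # ...and abort before any ALTER runs if the data wouldn't fit.
    if max_len > 255:
        raise RuntimeError(f"Refusing to shrink emails.body: longest value is {max_len} chars")

    cur.execute("ALTER TABLE emails MODIFY body VARCHAR(255)")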

aszen said a day ago:

I agree we could have done it a lot differently and safely, I especially like the last point you mentioned that is what would have been the correct way to do it.

But this happened as described, a local only script that generated a list of columns to modify then a migration to execute the alter queries for all of them.

danellis said 2 days ago:

> after a couple of glasses of red wine, we deleted the production database by accident

> It’s tempting to blame the disaster on the couple of glasses of red wine. However, the function that wiped the database was written whilst sober.

It was _written_ then, but you're still admitting to the world that your employees do work on production systems after they've been drinking. Since they were working so late, one might think this was emergency work, but it says "doing some late evening coding". I think this really highlights the need to separate work time from leisure time.

cle said 2 days ago:

No. Your systems and processes should protect you from doing something stupid, because we’ve all done stupid things. Most stupid things are done whilst sober.

In this case there were like 10 relatively easy things that could have prevented this. Your ability to mentally compile and evaluate your code before you hit enter is not a reliable way to protect your production systems.

Coding after drinking is probably not a good idea of course, but “think better” is not the right takeaway from this.

TedDoesntTalk said 2 days ago:

> Coding after drinking is probably not a good idea

I’ve done some of my most productive work this way. Not on production systems fortunately, and not in a long time.

ultrarunner said 2 days ago:

Riding the Ballmer Peak is a precarious operation, but I simply cannot deny its occasional effectiveness.

onlinejk said 2 days ago:

Well played, and it's always pleasant to find XKCD in one's search results. (And an on-topic one at that).

For ref: https://xkcd.com/323/

nicoburns said a day ago:

A highly experienced developer in their 50s who I used to work with said that they used to regularly sit down and code with a pint. Until on one occasion they introduced some Undefined Behaviour into their application during one of these sessions and it took them 3 days to track down! Probably less of an issue with modern tooling. Still, it certainly makes me think twice before drinking and coding.

marcosdumay said 2 days ago:

Yes, it can be productive (depends on exactly what you are doing). But I imagine you revise your work while sober before you deploy it.

said 2 days ago:
[deleted]
bigbubba said 2 days ago:

You know it's totally feasible to make a car that won't turn on for drunk people. Should those systems be installed on all cars, in pursuit of creating systems that don't permit stupid actions?

Maybe such a breathalyzer interlock could be installed on your workstation too. After all, your systems and processes should prevent you from stupid things.

KronisLV said a day ago:

Replace the breathalyzer with something less intrusive (like a camera with AI that would observe the person, AI with thermal imaging or air quality sensors, or another possibly-fictional-yet-believable piece of technology) and suddenly, in my eyes, the technology in this thought experiment becomes a no-brainer.

If there was more of a cultural pressure against drunk driving and actual mechanisms to prevent it that aren't too difficult to maintain and utilize, things like the Daikou services ( https://www.quora.com/How-effective-is-the-Japanese-daikou-s... ) would pop up and take care of the other logistical problems of getting your car home. And the world would be all the better for it, because of less drunk driving and accidents.

q6fqa said a day ago:

Mercedes[1] has a drowsiness detector; it observes the driver through sensors. It's freakishly accurate.

[1] https://media.daimler.com/marsMediaSite/en/instance/ko/ATTEN...

Schiendelman said a day ago:

I think a camera is far more intrusive than a breathalyzer.

cle said 2 days ago:

Don’t be absurd. Of course there are costs and tradeoffs to guardrails, and you have to balance the tradeoffs based on your requirements.

This person had to publicly apologize to their customers. One or two low-cost guardrails could have prevented it and would probably have been worth the cost.

bigbubba said 2 days ago:

Is that absurd? Systems should prevent mistakes, unless the part needed to implement that is a hundred dollars or so? That seems like quite the walk-back. Courts order alcoholics to install these things, they're established available technology. What tradeoffs are you balancing here?

TomVDB said 2 days ago:

It's a free website to keep scoreboards. Not a mission critical nuclear missile launcher.

(But if your mission critical system relies on that scoreboard website, that's on you...)

said 2 days ago:
[deleted]
emodendroket said 2 days ago:

Honestly not a bad idea to install the interlocks on all cars.

AdrianB1 said a day ago:

It will be a great idea when the reliability of the system has a large number of nines, so that the chance of being stranded in the middle of nowhere because the car refuses to start due to a fault is lower than that of being hit by an asteroid. Until then, people would consider it an unsafe product and refuse to use it, and people vote for what finally becomes law.

I heard the same argument for electronic gun safety measures, except that no government agency even considers using them for their own guns. Why? They are not reliable enough yet.

WarOnPrivacy said 2 days ago:

Yes. I could finally start a failed interlock story blog.

necovek said 2 days ago:

Or a blog on being unable to drive your kid to an emergency room because you just finished a glass of wine over dinner.

A problem with devices of that type is that they only test for a potential source of inability to drive safely. What we want is to test for an inability to drive safely.

And while one is easy and might give some quick wins, the drawbacks scare me too much.

Agentlien said 2 days ago:

Being unable to take your car when your child needs to go to the ER would be terrible.

Actually getting in the car while under the influence of that much stress and alcohol sounds worse.

I know someone who had a glass of wine just before her daughter needed to be brought to the hospital. This was just two days ago. She simply concluded she could not drive. Luckily, she was able to get a taxi.

kungtotte said 2 days ago:

Maybe we should, as a society, invest in taxis equipped with medical facilities and trained personnel so that they can provide first response medical treatment while on the way to the ER.

I'm sure that would save a lot of lives. An ambulatory medical service, if you will.

Agentlien said 2 days ago:

Sometimes ambulances are occupied and taking a taxi goes faster. Especially if it's something which isn't immediately life threatening.

I once dislocated my shoulder while on a large trampoline and was unable to get up from my hands and knees due to the intense pain whenever the trampoline wobbled. The ambulance was redirected to more serious injuries three times. I was stuck in that position waiting for two hours before it arrived.

kungtotte said 2 days ago:

Sure, that's true.

In that scenario it would also be appropriate to wait for a driver to sober up before driving you to the hospital if neither ambulance nor taxi were available (or delayed). One glass of wine would be out of most people's systems after two hours.

Thus poking a hole in the "drunk drive someone to the hospital" argument, which is what this was all about in the first place.

Agentlien said a day ago:

I did argue, in my original comment, that drunk driving should not be an option. I certainly stand by that. My original comment also mentioned a taxi, to which you replied about ambulances.

In my previous comment I just meant that sometimes ambulances can take a good while and a taxi might not.

In the unfortunate case of the trampoline there were several sober people with driver's license and cars available and a taxi would have been there immediately.

Unfortunately, they failed to get me out of there, meaning I still had to wait until an ambulance was available. It was beyond painful and exhausting, both physically and mentally. But it was still technically not an emergency.

Aeolun said 2 days ago:

That’s a problem with the ambulance service. Not with people being able to drive while drunk.

Agentlien said a day ago:

Yes. I was answering a comment suggesting the use of an ambulance (instead of a regular taxi). Simply pointing out that, in practice, there are times when a taxi can get you there faster.

rwbhn said a day ago:

Note that an ambulance ride (depending on insurance) may cost an order of magnitude or two more than the taxi. Well worth it in some circumstances - but not always the best option.

Agentlien said a day ago:

This didn't spring to my mind as I'm Swedish; here it's less than a taxi, and any medical costs beyond the first USD $130 per year are covered by the free health insurance.

nullsense said 2 days ago:

Uber Meds.

necovek said 17 hours ago:

You are still assuming that one glass of wine will affect everyone equally, and that is proven to be untrue. (Or stress, for that matter.)

While your friend made a call judging her own abilities and the level of emergency, that's exactly how it should be: cars should not stop us humans from making that decision.

(Fwiw, if you had just finished an alcoholic drink, a breathalyzer would show a much higher concentration even though the alcohol might not have kicked in yet, or there wasn't enough for it to kick in at all.)

said a day ago:
[deleted]
thom said 2 days ago:

There was a funny story recently in the UK, where a football team was late for a match because their breathalyser-equipped team bus refused to start. Turns out it wasn’t that the driver had been drinking, rather that the alcohol-based disinfectant they’d used to clean the bus triggered it.

watwut said 2 days ago:

> Most stupid things are done whilst sober.

That is because most companies these days have processes around drinking in the workplace, coming in drunk and working drunk.

Most mistakes are made sober only in environments where drinking a couple of glasses of wine and then making a production change is considered unacceptable. In environments where drunk people work, mistakes are made when drunk.

cranekam said 2 days ago:

The whole piece has a slightly annoying flippant tone to it. We were drunk! Computers just.. do this stuff sometimes! Better to sound contrite and boring in such a situation IMO.

Also I agree with other comments: doing some work after a glass or two should be fine because you should have other defences in place. “Not being drunk” shouldn’t be the only protection you have against disaster.

caspii said a day ago:

Yeah, I agree I'm being slightly flippant.

But it's just a side-project and I will continue late night coding with a glass of wine. I find it hugely enjoyable.

I would have a different mind-set if I was writing software for power stations as a professional.

alistairSH said a day ago:

But it's just a side-project and I will continue late night coding with a glass of wine.

Normally, this would be fine. But, it appears the site has paying members. Presumably, it's not "just a side-project" to them. You owe them better than tinkering with prod while tipsy.

hnlmorg said a day ago:

I don't think that's fair. We've all had occasions from time to time when we've had a drink at lunch time or even had to do emergency work in the evening after having a drink.

The bigger issue is the lack of guardrails to protect against accidental damage. This is also a common trait for hobby projects (after all, it's more fun to hack stuff together) but hopefully the maintainer will use this experience as a sobering reminder (pun intended) to put some guardrails in place now.

danjac said 2 days ago:

The Exxon Valdez comes to mind - the company blamed the drunk captain, but this was just part of huge systemic failures and negligence.

Aeolun said 2 days ago:

If your captain feels the need to be drunk, you’ve probably made a few mistakes before it got to that point.

danielh said 2 days ago:

According to the about page, "the employees" consist of one guy working on it in his spare time.

bstar77 said 2 days ago:

I am not a drinker myself (drink 1-3 times a year), but in the past I have coded while slightly buzzed on a few occasions. I could not believe the level of focus I had. I never investigated it further, but I'm pretty sure the effect of alcohol on our coding abilities is not nearly as bad as its effect on our motor skills. Imo, fatigue is far worse.

henearkr said 2 days ago:

When I was not yet a teetotaler, each time I was hitting my maths textbooks after a few drinks, I could not believe my level of focus, and everything was clear and obvious. Textbook pages were flying at a speed never seen.

Of course the next day, when re-reading the same pages, I would always discover that the previous day I had gotten everything wrong, nothing was obvious, and all my reasoning under alcohol was false because it was simplistic and oblivious to any mathematical rigor.

ThrowawayR2 said a day ago:

Not a great idea for studying anyway because of https://en.wikipedia.org/wiki/State-dependent_memory

In short, ability to recall memories is at least in part dependent on being in a similar state to the time when memories are formed, e.g. something learned while being intoxicated will be more easily recalled only when intoxicated again.

centimeter said 2 days ago:

Similar effect with psilocybin or LSD - you think you had a really profound and insightful experience, but once you think back on it you realize that (most of the time) you just got the impression that it was profound and insightful.

gassius said 2 days ago:

Is there any difference between having a profound experience and having "just the impression" of it?

Also, nothing is comparable between alcohol and psychedelics.

said a day ago:
[deleted]
emodendroket said 2 days ago:

I've found just the opposite. Any booze at all and I basically can't work for hours.

physicles said 2 days ago:

Me too, partly because of the reduced short-term memory (which coding relies on heavily). But more than that, my motivation drops through the floor because I can’t shake the thought that life is too short to fight with computers all day.

3np said 2 days ago:

For me it's really hit and miss - sometimes it increases focus and motivation and invigorates the mind while reducing distractions. Other times it has the opposite effect. Same with cannabis. Cannabis absolutely took a decade or so of recreational use before I discovered/developed the ability to code well under the influence.

Though I'm talking one or two drinks here, not firing up vscode after a night out or going through a bottle of rum.

said 2 days ago:
[deleted]
caspii said 2 days ago:

I have no employees. I only have myself to blame

murillians said 2 days ago:

I think it's just a joke

beervirus said 2 days ago:

Sounds like they weren't trying to do work on production.

yelloweyes said 2 days ago:

lol it's a scoreboard app

john_moscow said 2 days ago:

Just my 2 cents. I run a small software business that involves a few moderately-sized databases. The day I moved from fully managed hosting to a Linux VPS, I crontabbed a script like this to run several times a day:

    for db in `mysql [...] | grep [...]`
    do
        mysqldump [...] > $db.sql
    done
    
    git commit -a -m "Automatic backup"
    git push [backup server #1]
    git push [backup server #2]
    git push [backup server #3]
    git gc
The remote git repos are configured with denyNonFastForwards and denyDeletes, so regardless of what happens to the server, I have a full history of what happened to the databases, and can reliably go back in time.

I also have a single-entry-point script that turns a blank Linux VM into a production/staging server. If your business is more than a hobby project and you're not doing something similar, you are sitting on a ticking time bomb.

candiddevmike said 2 days ago:

Anyone reading the above: please don't do this. Git is not made for database backups, use a real backup solution like WAL archiving or dump it into restic/borg. Your git repo will balloon at an astronomical rate, and I can't imagine why anyone would diff database backups like this.

john_moscow said 2 days ago:

It really depends on your database size. This works just fine for ~300MB databases. Git gc takes pretty good care of the fluff and once every couple of years I reset the repository to prune the old snapshots. The big plus is that you can reuse your existing git infrastructure, so the marginal setup costs are minimal.

You can always switch to a more specialized solution if the repository size starts bugging you, but don't fall into the trap of premature optimization.

candiddevmike said 2 days ago:

Git GC won't do anything here unless you're deleting commits or resetting the repo constantly. Every commit will keep piling up, and you will never prune anything like you would a traditional backup tool. The day you do decide to start pruning things, expect your computer to burst into flames as it struggles to rewrite the commit history!

Using a real database backup solution isn't a premature optimization, it's basic system administration.

john_moscow said 2 days ago:

I haven't dug into too much detail, but doing git gc did have a noticeable effect on size and subsequent update performance. I assume, some temporary artifacts got consolidated.

Also resetting the repository once every 1-2 years, and keeping the old one for a while is fine for smaller setups.

Depending on your business size and the amount of resources you want to allocate towards "basic system administration", accomplishing the same task with fewer tools could have its advantages.

danmur said 2 days ago:

Agree. Nothing wrong with doing what works for most things, but with backups it's easier to use an appropriate process from the start. It's easy enough and saves you grief later.

said 2 days ago:
[deleted]
Izkata said 2 days ago:

There wouldn't be anything to prune, but git gc also does compression.

fauigerzigerk said 2 days ago:

>It really depends on your database size.

This isn't just about size though. You're storing all customer data on all developer machines. You're just one stolen laptop away from your very own "we take the security of our customers' data very seriously" moment.

robjan said a day ago:

Nobody said the devs have access to those repos

fauigerzigerk said a day ago:

True. It depends on how exactly those repositories are structured and managed. Hopefully it's not quite as bad as I imagined.

I still think that database size is not the only consideration.

jugg1es said 2 days ago:

Not every database is huge. It could be a good solution in certain circumstances.

wolfgang000 said 2 days ago:

I don't believe having a massive repo with backups would be the ideal solution. Couldn't you just upload the backup to an s3 bucket instead?

Ayesh said 2 days ago:

This is what I do too.

The mysqldump command is tweaked to use individual INSERT clauses as opposed to one bulk one, so the diff hunks are smaller.

You can also sed and remove the mysqldump timestamp, so there will be no commits if there are no database changes, saving the git repo space.

mgkimsal said 2 days ago:

Any issues with the privacy aspect of that data that's stored in multiple git repos? PII and such?

john_moscow said 2 days ago:

These are private repos on private machines communicating over SSL on non-standard ports with properly configured firewall. The risk is always there, but it's low.

bufferoverflow said 2 days ago:

You should really compress them instead of dumping them raw into Git. LZ4 or ZStandard are good.

adzm said 2 days ago:

But then you don't have good diffs.

hinkley said 2 days ago:

Git repositories are compressed.

amingilani said 2 days ago:

Happens to all of us. Once I needed logs from the server. The log file was a few gigs and still in use, so I carefully duplicated it, grepped just the lines I needed into another file and downloaded the smaller file.

During this operation, the server ran out of memory—presumably because of all the files I'd created—and before I knew it I'd managed to crash 3 services and corrupt the database—which was also on this host—on my first day. All while everyone else in the company was asleep :)

Over the next few hours, I brought the site back online by piecing commands together from the `.bash_history` file.

tempestn said 2 days ago:

Seems unwise to have an employee doing anything with production servers on their first day, let alone while everyone else is asleep.

amingilani said 2 days ago:

It does but that was an exceptional role. The company needed emergency patches to a running product while they hired a whole engineering team. As such, I was the only one around doing things, and there wasn't any documentation for me to work off of.

I actually waited until nightfall just in case I bumped the server offline, because we had low traffic during those hours.

nullsense said 2 days ago:

What's the story behind this company/job? Was it some sort of total dumpster fire?

amingilani said 8 hours ago:

I wouldn't classify it as that, but they had had trouble in the past which led to a lot of their team leaving, and they were now looking to recover from it.

I was only there for a short time though. Hopefully they figured things out.

netheril96 said 2 days ago:

Why does the DB get corrupted? Does ACID mean anything these days?

theamk said 2 days ago:

Not original poster, but up to 2010, default MySQL table type was MyISAM, which does not support transactions.

thdrdt said 2 days ago:

When a server runs out of memory a lot of strange things can happen.

It can even fail while in the middle of a transaction commit.

So transactions won't fix this.

tannhaeuser said 2 days ago:

No. That is exactly what a transactional DB is designed to prevent. The journal gets appended with both the old and the new data and physically written to disk, and only then is the primary data representation (data and B-tree blocks) updated in memory; eventually that changed data is written to DB files on disk. If the app or DB crashes during any stage, it will reconstruct the primary data based on the journalled, committed changes. DBs shouldn't attempt to allocate memory during the critical phase, and should be able to recover from failed allocations at any time by just crashing and letting regular start-up recovery clean up. Though a problem on Linux might be memory overcommitting.

Edit: and another problem is disk drives/controller caches lying and reporting write completion when not all data has actually reached stable storage

vanviegen said 2 days ago:

Transactions should fix this. That's what the Write Ahead Log and similar techniques are for.

amingilani said 8 hours ago:

It was an older MongoDB in my case. :)

xtracto said 2 days ago:

This happened to me (someone on my team) a while ago, but with Mongo. The production database was ssh-tunneled to the default port on the guy's computer and he ran tests that cleaned the database first.

Now... our scenario was such that we could NOT lose those 7 hours because each customer record lost meant $5000 usd penalty.

What saved us is that I knew about the oplog (the MySQL binlog equivalent), so after restoring the backup I isolated the last N hours lost from the log and replayed it on the database.

Lesson learned and a lucky save.

fma said 2 days ago:

Same happened to me many years ago. QA dropped the prod db. It's been many years but if I recall, I believe in the dropdown menu of the MongoDB browser, exit & drop database were next to each other...Spent a whole night replaying the oplog.

No one owned up to it, but had a pretty good idea who it was.

vanviegen said 2 days ago:

> No one owned up to it, but had a pretty good idea who it was.

That sounds like you're putting (some of) the blame on whoever misclicked. As opposed to everyone who has allowed this insanely dangerous situation to exist.

rocqua said 2 days ago:

Misclicking is a tiny forgivable mistake.

Not immediately calling up your boss to say "I fucked up big" is not a mistake, it is a conscious bad action.

cutemonster said a day ago:

Another thought: the company culture and approach to hiring and firing, can cause people to try to hide mistakes, although they don't really want to?

ojnabieoot said a day ago:

It sounds like whoever did it might not even be aware they were responsible:

> in the dropdown menu of the MongoDB browser, exit & drop database were next to each other

So maybe they signed off for the night without realizing anything was wrong.

cutemonster said a day ago:

MongoDB can sell an enterprise version with the buttons further apart

xtracto said a day ago:

This. The person who erased the database in my case came forward to me as soon as we realized what had happened. At that moment I was very happy it was an "inside job"; it meant I could rule out hacking.

As was said before: he made a mistake. The error was allowing the prod database to be port-forwarded from a non-prod environment. As head of eng, that was MY error. So I owned up to it and we changed policies.

cutemonster said a day ago:

How do you prevent forwarding ports? Does one then need to disable ssh access?

Nice that you were a person he felt ok with sharing the mistake with, I suppose that's an important part of being head of eng.

fogihujy said a day ago:

`AllowTcpForwarding No`

There are ways around it, of course, but it prevents the scenario described above.

cutemonster said 16 hours ago:

Thanks

xtracto said a day ago:

Nope. The solution is to password-protect it and not give the password to developers. Or only give read-only access.

3np said 2 days ago:

A dangling port-forward was my first thought as to how this happened.

unnouinceput said 2 days ago:

Quote: "Note that host is hardcoded to localhost. This means it should never connect to any machine other than the developer machine. Also: of course we use different passwords and users for development and production. We’re too tired to figure it out right now.

The gremlins won this time."

No they didn't. Instead, one of your gremlins ran this function directly on the production machine. This isn't rocket science, just the common-sense conclusion. Now would be a good time to check those auditing logs / access logs you're supposed to have enabled on said production machine.

skytreader said 2 days ago:

> Instead one of your gremlins ran this function directly on the production machine.

Exactly my first hypothesis too. But then keepthescore claims,

> of course we use different passwords and users for development and production.

How would this hypothesis explain that?

---

Metadialogue:

John Watson: "I deduce that someone changed the dev config source so that it uses the production config values."

Sherlock Holmes: "My dear Watson, while that is sensible, it seems to me that the balance of probability leans towards their production instance also having the development access credentials."

---

Just my way of saying, I think this case isn't as open-and-shut as most comments (including parent) imply. I personally find the /etc/hosts mapping a likelier hypothesis, but even that can't explain how different credentials failed to prevent this. Without more details coming from a proper investigation, we are just piling assumptions on top of assumptions. We are making bricks without enough clay, as Holmes would say.

thamer said a day ago:

Agreed, it seems like most people making suggestions above are missing the point about credentials. The code they present explicitly references `config.DevelopmentConfig`:

    database = config.DevelopmentConfig.DB_DATABASE
    user = config.DevelopmentConfig.DB_USERNAME
    password = config.DevelopmentConfig.DB_PASSWORD
One way this could happen is if the objects under `config` were loaded from separate files, and the dev file was changed to a symlink to the prod file. So `config.DevelopmentConfig` always loads /opt/myapp/config/dev.cfg but a developer had dev.cfg -> prod.cfg and the prod credentials and connection details were loaded into `config.DevelopmentConfig`.

Just an idea.

ir123 said 2 days ago:

If managed DBs on DigitalOcean are anything like those on AWS, you can not run commands directly on them since SSH is prohibited. EDIT: there's also the deal with different credentials for dev and prod envs.

toredash said 2 days ago:

My 2 cents: his hosts file points localhost to the prod DB IP.

pvorb said 2 days ago:

Yep. Nowadays a kubectl port-forward makes something like this all too easy. They accidentally had the kubecontext point at the production cluster instead of dev, set up the port-forward to the database, and whoops! At least that's how this could happen to me, even with my years of experience in doing unexpected things to production databases.

junon said 2 days ago:

That whole ecosystem is a devexp nightmare. I try to stay away from it entirely having worked with it extensively.

Docker + Kubernetes are the biggest socially-acceptable hacks in the industry at the moment.

pvorb said a day ago:

Why do you think they are hacks? Could you please elaborate?

In my opinion, they nicely abstract over server hardware and services running on them, so one is able to have simple infrastructure-as-code management of otherwise complicated setups.

said a day ago:
[deleted]
radu_floricica said 2 days ago:

I'm betting on a tunnel, myself. And grandparent is probably wrong, they most likely have dedicated mysql machines so "localhost" will never be the db.

kjaftaedi said 2 days ago:

I think so as well.

It doesn't even make sense to connect to a managed database using 'localhost'.

Managed databases are never localhost. They are hosted outside your VPS and you use a DNS name to connect to them.

drdaeman said 2 days ago:

That could've happened if the database is not accessible from the Internet and they were using a tunnel which binds to localhost (e.g. `ssh -L`).

It does make sense to connect to localhost on dev machines. But if that's the setup, I guess one should avoid from tunneling to localhost to avoid potentially dangerous confusion (hmm... I think I'm guilty of that on a couple projects, need to check if that's true...)

dingdingdang said 2 days ago:

Yeah, my guess would be that the script got executed on the prod server by an "oops, was I in that terminal window?" accident. Localhost is, after all, the-local-host, no matter what server it's on. Better to also have a clear naming convention for the prod db versus the test db (i.e. "test_mysaas" versus "mysaas").

Plus, of course, using git with a hook specifically for preview versus production (i.e. "git push production"); that way local-specific scripts can be stripped even if they're in the same repo.

robryan said 2 days ago:

Yeah this is how I once deleted my production database. One thing I did to mitigate this was colour code the prompts for local/staging/production.

rwbhn said a day ago:

Yeah, if one has to regularly interact directly with prod, some sort of visual indicator is super helpful.

buzer said a day ago:

My guess would be something like PgBouncer. Someone may have installed it to the production server at some point in the past.

robryan said 2 days ago:

One solution to this is to make sure only the production servers can connect to the production database.

dschuetz said a day ago:

As someone else said below: "hardcoded to localhost" doesn't mean it's hardcoded. It means it goes to whatever localhost resolves to. Really hardcoded should ALWAYS mean 127.0.0.1.

muststopmyths said 2 days ago:

>Note that host is hardcoded to localhost. This means it should never connect to any machine other than the developer machine. We’re too tired to figure it out right now. The gremlins won this time.

Obviously, somehow the script ran on the database host.

some practices I've followed in the past to keep this kind of thing from happening:

* A script that deletes all the data can never be deployed to production.

* Scripts that alter the DB rename tables/columns rather than dropping them (you write a matching rollback script), for at least one schema upgrade cycle. You can always restore from backups, but this can make rollbacks quick when you spot a problem at deployment time. (A sketch of this follows the list.)

* The number of people with access to the database in prod is severely restricted. I suppose this is obvious, so I'm curious how the particular chain of events in TFA happened.
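
For the second point, a minimal sketch of what a rename-based migration pair can look like (psycopg2 against Postgres; table and column names are purely illustrative):

    import psycopg2

    def upgrade(conn):
        # Keep the data around for at least one schema upgrade cycle.
        with conn, conn.cursor() as cur:
            cur.execute("ALTER TABLE players RENAME COLUMN legacy_score TO legacy_score_deprecated")

    def rollback(conn):
        # The matching rollback script: cheap and instant, no restore needed.
        with conn, conn.cursor() as cur:
            cur.execute("ALTER TABLE players RENAME COLUMN legacy_score_deprecated TO legacy_score")

    # Only a later migration, once the release has proven itself, actually drops:
    #   ALTER TABLE players DROP COLUMN legacy_score_deprecated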

amluto said 2 days ago:

I have a little metadata table in production that has a field that says "this is a production database". The delete-everything script reads that flag via a SQL query that will error out if it's set, in the same transaction as the deletion. To prevent the flag from getting cleared in production, the production software stack will refuse to run if the "production" flag is not set.
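
Roughly like this, as a simplified sketch rather than the real thing (table and column names differ, and here the abort is raised in Python rather than by the SQL itself, but the effect is the same: the transaction rolls back before anything is dropped):

    import psycopg2

    def wipe_and_recreate(conn):
        # The check and the deletion share one transaction; the raise rolls everything back.
        with conn, conn.cursor() as cur:
            cur.execute("SELECT is_production FROM instance_metadata LIMIT 1")
            row = cur.fetchone()
            if row is None or row[0]:
                raise RuntimeError("Refusing to wipe: this looks like a production database")
            cur.execute("DROP SCHEMA public CASCADE")
            cur.execute("CREATE SCHEMA public")
            # ... recreate tables ...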

jerf said 2 days ago:

This is also one place where defense-in-depth is useful. "Has production flag" OR "name contains 'prod'" OR "hostname contains 'prod'" OR "one of my interfaces is in the production IP range" OR etc. etc. You really can't have too many clauses there.

Unfortunately, the "wipe & recreate database" script, while dangerous, is very useful; it's a core part of most of my automated testing because automated testing wipes & recreates a lot.
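
A rough illustration of the idea (all names and checks here are assumptions, not anyone's actual code):

    import os
    import socket

    def looks_like_production(dsn: str) -> bool:
        # OR together several independent signals; any one of them is enough to refuse.
        return any([
            os.environ.get("APP_ENV", "").lower() == "production",
            "prod" in dsn.lower(),                    # db name or host in the connection string
            "prod" in socket.gethostname().lower(),   # the machine we are running on
        ])

    def wipe_and_recreate(dsn: str):
        if looks_like_production(dsn):
            raise RuntimeError("Refusing to wipe: at least one production signal matched")
        # ... the wipe & recreate used by the automated tests ...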

asddubs said 2 days ago:

One silly last-resort measure I used on a project a while back was an IS_STAGING file somewhere, only existing on localhost; on every request the app would check whether the hostname was that of the live site and, if so, delete that file. The file itself wasn't enough to make the server think it was in staging mode, but it was the only thing in the chain that, if it went wrong, would fix itself automatically almost immediately (and log an error).

asimjalis said 2 days ago:

I would flip the logic. If database does not have flag that says it is non-production assume it is production.

amluto said a day ago:

It's a BOOLEAN NOT NULL. I don't recall off the top of my head whether TRUE means production or TRUE means testing.

at_a_remove said 2 days ago:

Nice. Very nice.

mcpherrinm said 2 days ago:

The blog mentions it's a managed DigitalOcean database, so the script likely wasn't run on the host itself.

More likely, I'd suspect, is something like an SSH tunnel with port forwarding was running, perhaps as part of another script.

StavrosK said 2 days ago:

Someone SSHed to production and forwarded the database port to the local machine to run a report, then forgot about the connection and ran the deletion script locally.

cutemonster said a day ago:

That has happened? Or was it a thought about what could have happened elsewhere?

StavrosK said a day ago:

Oh, no, that's my guess as to what happened here.

PeterisP said 2 days ago:

One aspect that can help with this is separate roles/accounts for dangerous privileges.

I.e. if Alice is your senior DBA who would have full access to everything including deleting the main production database, then it does not mean that the user 'alice' should have the permission to execute 'drop database production' - if that needs to be done, she can temporarily escalate the permissions to do that (e.g. a separate account, or separate role added to the account and removed afterwards, etc).

Arguably, if your DB structure changes generally are deployed with some automated tools, then the everyday permissions of senior DBA/developer accounts in the production environment(s) should be read-only for diagnostics. If you need a structural change, make a migration and deploy it properly; if you need an urgent ad-hoc fix to data for some reason (which you hopefully shouldn't need to do very often), then do that temporary privilege elevation thing; perhaps it's just "symbolic" but it can't be done accidentally.
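
As a small sketch of what the everyday read-only part can look like on Postgres (connection details, table, and role name are hypothetical), a diagnostic session can force read-only mode so destructive statements fail unless you deliberately escalate:

    import psycopg2

    conn = psycopg2.connect(
        host="prod-db.internal",   # hypothetical
        dbname="app",
        user="alice",              # everyday account, not the escalated DBA role
        password="...",
        options="-c default_transaction_read_only=on",
    )

    with conn, conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM scoreboards")   # diagnostics work fine
        print(cur.fetchone())
        # cur.execute("DROP TABLE scoreboards") would fail with
        # "cannot execute DROP TABLE in a read-only transaction";
        # doing it for real requires an explicit escalation (SET ROLE / separate account).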

jlgaddis said 2 days ago:

> the number of people with access to the database in prod is severely restricted

And of those people, there should be an even fewer number with the "drop database" privilege on prod.

Also, from a first glance, it looks like using different database names and (especially!) credentials between the dev and prod environments would be a good idea too.

colechristensen said 2 days ago:

This is bad operations.

That it happened means there were many things wrong with the architecture, and summing the problem up as "these things happen" is irresponsible. Most importantly, your response to a critical failure needs to be in the mindset of figuring out how you would have prevented the error without knowing it was going to happen, and doing so in several redundant ways.

Fixing the specific bug does almost nothing for your future reliability.

jrochkind1 said a day ago:

The lack of the seriousness/professionalism of the postmortem seemed odd to me too. So, okay, what is this site?

> KeepTheScore is an online software for scorekeeping. Create your own scoreboard for up to 150 players and start tracking points. It's mostly free and requires no user account.

And also:

> Sat Sep 5, 2020, Running Keepthescore.co costs around 171 USD each month, whilst the revenue is close to zero (we do make a little money by building custom scoreboards now and then). This is an unsustainable situation which needs to be fixed – we hope this is understandable! To put it another way: Keepthescore.co needs to start making money to continue to exist.

https://keepthescore.co/blog/posts/monetizing-keepthescore/

So okay, it's basically a hobby site, for a service that most users probably won't really mind losing 7 hours of data, and that has few if any paying customers.

That context makes it make a little bit more sense.

ricksharp said a day ago:

Are you sure it was the production database that was affected?

If you are not sure how a hardcoded script targeting localhost affected a production database, how do you know that the database you were viewing as dropped was even the production one?

Maybe you were simply connected to the wrong database server?

I’ve done that many times - where I had an initial “oh no“ moment and then realized I was just looking at the wrong thing, and everything was ok.

I’ve also accidentally deployed a client website with the wrong connection string and it was quite confusing.

In an even more extreme case: I had been deploying a serverless stack to the entirely wrong aws account - I thought I was using an aws named profile but I was actually using the default (which changed when I got a new desktop system). I.e. the aws cli uses the --profile flag, but the serverless cli uses --aws-profile. (Thankfully this all happened during development.)

I now have deleted default profiles from my aws config.

cblconfederate said 2 days ago:

> Computers are just too complex and there are days when the complexity gremlins win.

Wow. But then again it's not like programmers handle dangerous infrastructure like trucks, military rockets or nuclear power plants. Those are reserved for adults

yunruse said 2 days ago:

I feel that computers make it easier for this danger to be more indirect, however. The examples you give are physical, and even the youngest of children would likely recognise they are not regular items. A production database, meanwhile, is visually identical to a test database if measures are not taken to make it distinct. Adults though we may be, we're human, and humans can make really daft mistakes without the right context to avoid them.

PurpleFoxy said 2 days ago:

There are also countless safety measures on physical items that have been iterated on over decades to prevent all kinds of accidents. Things like putting physical locks on switches to prevent machinery being turned on while people are working on it.

Can you imagine if, instead of a physical lock, it just said "are you sure you wish to turn on this machine?" "Of course I want to turn it on, that's why I pressed the button."

Some software makes it a lot harder for the user to mess up now. When deleting a repo on GitLab you have to type the name of the repo before pressing delete and then it puts it in a pending deletion state for a month before it’s actually deleted. Unfortunately for developers we typically get minimal cli tools which will instantly cause a lot of damage without any way to undo.

im3w1l said 2 days ago:

So, silly idea. What if, to work on the production database, you had to go into the special production room, colored in the special production color, scented with the special production perfume, and sit on a just tiny bit uncomfortable production chair.

Basically make it clear even to the caveman brain that things are different.

pontifier said 2 days ago:

I actually really like this idea... but who am I kidding, it's a luxury I don't have time for when I've got to fix stuff.

Waterluvian said 2 days ago:

All of those items have a lot of safety software in them.

ineedasername said 2 days ago:

Yep, I hate when I'm dealing with a system literally comprised of logic and a magic gremlin shows up to ruin my day.

Seems like they thought a casual "everyman" type of explanation would suffice, but really, who would trust them after this?

cblconfederate said 2 days ago:

I understand that web interfaces are trivial, unimportant work (it's what I do), but how can one sleep with such an unresolved mystery?

geofft said 2 days ago:

I'm not sure I follow your point - I think you'll find the same attitude towards complexity by operators of military rockets and nuclear power plants. If you look at postmortems/root-cause analyses/accident reports from those fields, you'll generally find that a dozen things are going wrong at any given time, they're just not going wrong enough to cause problems.

codegladiator said 2 days ago:

We already know humans make mistakes. But for this particular scenario lets blame computers.

eezurr said 2 days ago:

One explanation for the author feeling that way is that the system has too much automation. Being in a situation where you take on more of the system's responsibilities at a shallower level leads to less industry expertise. This, as it turns out, places the security of the system in a precarious position.

This is pretty common, as devs' tool belts have grown longer over time.

I think at some point we will stop automating or reverse some of the automation.

codegladiator said 2 days ago:

> too much automation

Literally just the automation of the test suite. That's 1 automation.

> This is pretty common

? Waiting for FB to delete their db

emodendroket said 2 days ago:

This post is embarrassing. "Yeah, we were drinking and accidentally nuked the prod DB. Not sure why. Shit happens!" Who would read this and think they should trust this company? Any number of protections could have been taken to prevent this, and production access in any state other than fully alert and attentive shouldn't happen unless it is absolutely necessary for emergency reasons.

bstar77 said 2 days ago:

I think it's kind of funny they chose to post this story rather than do a typical post mortem.

corobo said a day ago:

This reply is embarrassing. It's a person working on their side project. Have a glass of wine mate.

emodendroket said a day ago:

A previous post said it was "almost free", and I feel like a less cavalier attitude is called for if people are paying for the service. Otherwise, sure, it doesn't matter.

tcbasche said 2 days ago:

Yeah why should I treat anything this company does with any level of seriousness? Why should anyone?

It's lucky it's just some online scoreboard, because I'm sure as shit this stuff has happened before with more critical systems, and it scares the hell out of me that engineers are fine blaming "gremlins" instead of taking responsibility for their own incompetence.

Aeolun said a day ago:

> taking responsibility for their own incompetence.

I think they’re doing that with this post? At least I find it hard to imagine myself writing down that I’d drunk a few glasses of wine and dropped the production database.

You cannot expect all engineers to be fully versed in the vagaries of database administration. Especially if they're the only ones working on something.

tcbasche said a day ago:

Not really, they blamed 'complexity' and 'computer gremlins' rather than admitting that, perhaps, they made a shitty mistake.

> It’s a function that deletes the local database and creates all the required tables from scratch

Why would anyone have this? It's just dumb and embarrassing

Aeolun said 9 hours ago:

> Why would anyone have this? It's just dumb and embarrassing

Anyone doing some form of decent integration testing?

Generally it’s a different database than the one used to develop locally, but the concept is the same.

emodendroket said 7 hours ago:

If you don't put your database in a known state for tests they're not repeatable

mbroshi said 2 days ago:

I love this post. This sort of thing happens to everyone, most people just are not willing to be so open about it.

I was once sshed into the production server, cleaning up some old files that had been created by an errant script, one of which was named '~'. So, to clean it up, I typed `rm -rf ~`.

meowface said 2 days ago:

Somewhat similar story from many years ago. Was in ~/somedirectory, wanted to clear the contents, ran `rm -rf *`. Turns out somewhere in between I had done a `cd ..`, but I thought I was still in the child directory. Fastest Ctrl+C ever once I saw some permission errors, but most of the home directory was wiped in that second or two.

Didn't have a backup of it unfortunately, though thankfully there wasn't anything too critical in there. Mostly just lost a bunch of utility scripts and dotfiles. I feel like it's beneficial in the long run for everyone to make a mistake like this once early on in their career.

wruza said 2 days ago:

It would be beneficial for `rm` to have a way of making `trash` mode the default mode of operation. The Unix shell is prone to errors like this beyond all reason. And nobody would die typing:

  fs.delete(shell.expand('*'), recursive:yes)
  fs.undo()
or something like that with completion helpers. Our instruments usually lack any safety in general, and making them safe is hard, especially when you're just a worker and not a safety expert. All the world today benefits from mistake-friendly software, except for developers, who constantly walk on UI minefields.

heelix said 2 days ago:

Ah man, these things happen. One of our developers - very new to elastic - was asked to modify some indexes. Folks were a bit too busy to help or heading out on holiday. One stack overflow answer later... delete and recreate it... and she was off to the races. When the test was tried, it looked like things still worked. A quick script did the same to stage and prod, in both data centers. Turns out that is not a great way to go about it. It deleted the documents. We got lucky, as we still had not killed off the system we were migrating off of and it only took three days of turn and burn to get the data back on the system.

So many lessons learned that day. I trust her with the master keys at this point, as nobody is more careful with production than her now. :)

fideloper said 2 days ago:

RDS is very much worth paying for to avoid this type of issue (in many cases; obviously $60 to multiple thousands a month isn't great for everything).

Otherwise having a binlog based backup (or WAL, I guess, but i don’t know PG that well) is critical.

The key point there is they provide point in time recovery possibilities (and even the ability to rewrite history).

latch said 2 days ago:

Barman (1) is really easy to set up and lets you avoid the many pitfalls of RDS (lower performance, high cost, no superadmin, no arbitrary extensions, waiting for releases, bad log interface).

(1) https://www.pgbarman.org/

lysp said 2 days ago:

I had a client who had prod database access due to it being hosted internally. They called up saying "their system is no longer working".

After about an hour of investigation, I found that one of the primary database tables was empty - completely blank.

I then spent the next hour looking through code to see if there was any chance of a bug that would wipe their data, and couldn't find anything that would do that.

I then had to make "the phone call" to the client saying that their primary data table had been wiped and I didn't know what we did wrong.

Their response: "Oh I wrote a query and accidentally did that, but thought I stopped it".

dvdbloc said 2 days ago:

At my job, the company computers are configured to send “localhost” to the local company DNS servers, which happily reply with the IP address of the last machine that got a DHCP lease with the hostname “localhost”. Which happens often. Needless to say, our IT dept isn’t the best.

madsbuch said 2 days ago:

Things like these happen, and we should be compassionate towards them.

Often small changes to the structure drastically reduce probability of stuff like this happening.

E.g. we use Docker to set up test and dev databases and seed them from (processed) dumps. When we need to clean our database, we simply tear down the Docker container. I.e. we do not need to implement destructive database cleanup, eliminating structure that could potentially fail.

Having policies about not accessing the production database directly (and allowing the extra time for building tooling around that policy), good preview / staging environments, etc. - these are all failure-eliminating structure.

junglejoose said 2 days ago:

You aren’t a real engineer until you do this. So congrats on the promotion! :)

zmmmmm said 2 days ago:

Indeed - after incidents like this I usually say, "This is experience that you can't buy at any price. Learn from it well, and value your lesson."

ISL said 2 days ago:

I'm a physicist, interested in consulting on both data analysis and precision metrology/hardware projects in general.

That said, I will happily accept consulting fees in return for deleting someone's database in prod, should they so desire.

Edit: Heck, being a white-hat licensed-to-create-mayhem chaos monkey for a few hours a week sounds pretty fun. Email in profile.

li4ick said a day ago:

Yeah, imagine if a bridge engineer said the same thing: "You aren't an engineer until your bridge collapses. Congrats!" I am starting to hate tech culture. Nobody cares about correctness and discipline. Mention "math" and everybody scatters like cockroaches.

emerongi said 2 days ago:

I wrote a migration that dropped columns for a functionality that was no longer to be used.

Then the client wanted that functionality back. Oops.

Guest42 said 2 days ago:

Definitely, the likelihood of these things happening goes way up alongside the number of concurrent tasks, meetings, or other forms of distraction. I’ve seen my fair share of production administration as a side task.

smadge said 2 days ago:

SSH tunnel from localhost to prod on database port?

macNchz said 2 days ago:

A likely culprit. Having worked on a bunch of early-stage products where best practices are a distant future dream, I’ve developed a few “seatbelt” habits I use to avoid these kinds of things. One of them is to always use a random high-number local port if I’m tunneling to a production service.

Another is to change my terminal theme to a red background before connecting to anything in production...never want to click the ‘psql ...’ tab, run “truncate table app_user cascade” and realize afterwards it was a lingering connection to production...

rmrfrmrf said a day ago:

what's a situation where you'd be tunneling to a production service?

macNchz said a day ago:

I think the most common reason I’ve had in the past was to connect to the RabbitMQ web admin while dealing with some emergent issue with task throughput, a common problem area with hacked together web apps that start to get real traffic. It’s also handy to be able to use a more advanced SQL client that’s not installed on the server (pgcli, emacs, etc) when digging around for something that’s causing errors in production.

sfkdjf9j3j said 2 days ago:

That's my guess too. Naughty naughty. That means his production creds are the same as his development creds!

EdwardDiego said 2 days ago:

Yeah that's my guess, and likewise biggest concern - if the credentials are the same in dev and prod, it increases risk surface.

rmrfrmrf said 2 days ago:

vscode remote extension perhaps

said 2 days ago:
[deleted]
Negitivefrags said 2 days ago:

If you are using postgres, configure it to keep the WAL logs for at least 24 hours.

They could have used point-in-time recovery to not lose any data from this at all.

yjftsjthsd-h said 2 days ago:

If you can do this, then yes by all means do it, but that has significant impact on disk usage.

koolba said 2 days ago:

It doesn’t have to be local. In fact it shouldn’t be local anyway, as backups that can be deleted from the source aren’t real backups.

You can then restore from a recent base backup and roll forward the WAL to just before the snafu.

yjftsjthsd-h said a day ago:

Yeah, but you can still end up generating WAL faster than it gets uploaded. (And if there's a good solution to that, I'm interested, because we ran a server out of disk this way last month at work)

prh8 said 2 days ago:

Very unsettled by the flippancy of the entire article

3np said 2 days ago:

And all the emojis on top of that

rafamvc said a day ago:

A very similar thing happened to LivingSocial during their brightest years, except the replication and the backups had failed too. The oldest valid backup was about a week old. It took the whole company offline for 2 days. It took a team of 10 people plus some extra consultants to come up with a half-baked version of the latest database based on ElastiCache instances, Redis caches and other "not meant to be a backup" services. It was insane walking into an office that had hundreds of employees and seeing them all gone while we rebuilt this cobbled-together DB.

At one point someone called it "good enough" and they basically had to honor the customer's word if they said they had purchased something and it wasn't there.

It was a mess.

It made all the major news outlets, and it was really bad press. In the end, they actually had a massive bump in sales afterwards. Everyone went to check on their own purchases and ended up buying something else, and the news coverage was like free ads.

https://www.washingtonpost.com/business/capitalbusiness/the-...

suzzer99 said 2 days ago:

We have something similar with AWS Cognito. If a user signs up but doesn't go through with the verification process, there's no setting to say "remove them after X days". So we have to run a batch job.

If I screw up one parameter, instead of deleting only unconfirmed users, I could delete all users. I have two redundant checks, first when the query is run to get the unconfirmed users, and then again checking the user's confirmed status before deleting them. And then I check one more time further down in the code for good measure. Not because I think the result will be different, but just in case one of the lines of code is altered somehow.

I put BIG LOUD comments everywhere of course. But it still terrifies me.
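
The core of the job looks roughly like this (a simplified sketch, not our actual code - the pool id is made up and pagination/age checks are omitted):

    import boto3

    cognito = boto3.client("cognito-idp")
    USER_POOL_ID = "us-east-1_EXAMPLE"  # hypothetical pool id

    # First check: only ask Cognito for users it reports as UNCONFIRMED
    resp = cognito.list_users(
        UserPoolId=USER_POOL_ID,
        Filter='cognito:user_status = "UNCONFIRMED"',
    )

    for user in resp["Users"]:
        # Second, redundant check on the returned record before deleting anything
        if user["UserStatus"] != "UNCONFIRMED":
            continue
        cognito.admin_delete_user(UserPoolId=USER_POOL_ID, Username=user["Username"])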

6nf said 2 days ago:

Soft deletes reduce the scariness.
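
E.g. with an ORM like peewee (which the article appears to use), a minimal sketch - the model and field names are made up:

    from datetime import datetime
    from peewee import Model, DateTimeField, PostgresqlDatabase

    db = PostgresqlDatabase("app_dev")  # hypothetical dev database

    class Player(Model):  # hypothetical table
        deleted_at = DateTimeField(null=True)

        class Meta:
            database = db

    def soft_delete(player):
        # "Delete" just stamps the row; the data stays recoverable
        player.deleted_at = datetime.utcnow()
        player.save()

    # Normal reads simply filter the flagged rows out
    live_players = Player.select().where(Player.deleted_at.is_null())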

a-b said 2 days ago:

Recreating and seeding the test database is totally OK in the RoR world.

I think the main cause of this accident is the lack of separation between development and operations.

jacquesm said 2 days ago:

This happens more often than you might think.

xnyan said 2 days ago:

localhost is an abstraction - it's a network that isn't routable outside your machine... except that it's not. It's nothing more than normal TCP traffic, plus a hint to the OS and other programs that whatever is on that local network shouldn't be routed outside the local computer.

There's absolutely nothing stopping anything with access to localhost from routing it anywhere that process wants. Does not even take a malicious actor, all kinds of legit programs expose localhost. It's really not something you should use for anything except as a signal to other well-behaving programs that you are using the network stack as a machine-local IPC bus.

defen said 2 days ago:

The fact that the production db has the same username/password as the development one is perhaps more troubling.

cortesoft said 2 days ago:

It likely doesn't... it probably reads them from the environment or a config file, and since it was running in production it picked up the production credentials.

defen said 2 days ago:

The code explicitly referenced “DevelopmentConfig” though

Biganon said 2 days ago:

They say in the article that it's not the case.

kerng said 2 days ago:

Yes! This is probably the biggest mistake.

bArray said 2 days ago:

Hmm, there seem to be some holes in their system. A database might go down for any reason.

I also have daily backups, but I additionally write logs of all database actions to disk (locally, and regularly copied off the production server) for the purpose of checking through them if something goes wrong, or replaying them in case something like this happens. So you have your database backups as "save points" and the logs to replay all the "actions" for that day.

0kl said a day ago:

> Note that host is hardcoded to localhost. This means it should never connect to any machine other than the developer machine.

Just to help with the postmortem:

1) “localhost” is just a loopback to whatever machine you’re on

2) the user and pw are pulled from config

So someone was running this from the production server, or had the production DB mapped to localhost, and ran it with a production config for some reason (working with prod data, maybe). Hardcoding localhost only ensures that it connects to whatever machine it's run on - in this case the prod server.

There's a wide spread of things you might do to avoid this in the future; the main recommendations I'd have are:

1) only put production artifacts on prod

2) limit developer access to prod data

Best of luck

boltefnovor said 2 days ago:

Companies tend to get really good at backups after events like this.

jlgaddis said 2 days ago:

The wonderful thing about computers is that they do exactly what they are told to do.

The worst thing about computers? They do exactly what they are told to do.

fma said 2 days ago:

I do a lot of work with middle/high school students. Without fail someone would yell "Why is it doing x!"...to which my standard reply is "Because you told it to".

wruza said 2 days ago:

they do exactly what they are told to do

Well, tell them to UNDROP and see what happens.

yla92 said 2 days ago:

I am sorry this happened.

> local_db = PostgresqlDatabase(database=database, user=user, password=password, host='localhost', port=port)

I am guessing this part. Even though the host is hardcoded as "localhost", when you do SSH port-forwarding, localhost might actually be the real production database, e.g. sudo ssh user@myserverip -L 3333:localhost:3306

webel0 said 2 days ago:

This is my greatest fear when it comes to terraform:

> terraform destroy

(And either a confirmation or a flag) and everything is deleted.

I know you can add some locks but still :/

malwrar said 2 days ago:

You can save yourself from scary operations like deleting everything by a.) not rooting your entire infra in the same main.tf and b.) using Terraform's lifecycle meta-argument: https://www.terraform.io/docs/configuration/resources.html#l...

I like to use the lifecycle feature for suuuper core things that will never be deleted (VPC, r53 zone, etc), and when I start targeting multiple DCs with lots of infra I'll eventually move to many state roots (or use tools like Terragrunt, which make things mildly scary again).

gkop said 2 days ago:

Also a human should review every plan and confirm before applying, right?

codegladiator said 2 days ago:

> Computers are just too complex and there are days when the complexity gremlins win.

> However, we will figure out what went wrong and ensure that that particular error doesn’t happen again.

How can you say statement 2 right after statement 1? Isn't statement 1 just plain acceptance of defeat?

And looking at all the replies here, is this a feel-good thread for the mistakes you made?

caconym_ said 2 days ago:

In context, statement 1 regards proactively eliminating all bugs and risks. Statement 2 regards understanding the root cause of this particular incident and reactively fixing it so it won’t happen again.

Acknowledging statement 1 doesn’t mean giving up—it simply means being clear and realistic about the nature and scale of the problem we’re facing when we try to build complex software systems. In the face of that we can give up, or we can just do the best we can, and it sounds more like these people are doing the latter.

ssalka said 2 days ago:

I like to think that, by addressing the known bugs as they pop up, over time you can box the complexity gremlins into tighter and more predictable spaces. Though as long as humans are building these systems, that box will always be there, and the predictability of those gremlins' behavior will only go so far.

codegladiator said 2 days ago:

I'm not sure what the bug was here? Everything worked as it was intended to.

smarx007 said a day ago:

A question to the DBA experts from a developer: is there a way in MySQL and Postgres to configure a log specifically for destructive SQL queries so that it's easier to investigate a situation like this? I.e. to log most queries except for usual SELECT/INSERTs.

Also, @oppositelock pointed out that WAL would contain the destructive query too. How does one remove a single query from a WAL for replay or how does one correctly use WAL to recover after a 23-hour old backup was restored?

Finally, how does one work on the WAL level with managed DBs on AWS or DO if they block SSH access?

fencepost said 2 days ago:

Does the database name value allow specifying a host as part of it?

mijoharas said 2 days ago:

This is what I came to say. Postgres usually allows a connection string that contains all the details (postgres://username:password@host:port/dbname) to be passed as the database name, and I think it takes priority over a separately specified host, depending on the client.

Tough way to learn that lesson though.

exabrial said 2 days ago:

If you keep configuration in the environment (/etc/default/app-name) rather than in the application package, it's nearly impossible to make this mistake (especially with proper firewall rules). You can even package your config as a deb and keep it, encrypted, in version control.
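
A minimal sketch of that idea in Python (the variable names are illustrative): the app reads its DB settings from the host's environment and fails loudly if they're missing, so a dev machine never has prod credentials baked into the code or repo:

    import os
    from peewee import PostgresqlDatabase

    # Populated per-host from /etc/default/app-name (or the container environment).
    # No defaults: a missing or wrong environment raises KeyError instead of
    # silently falling back to something dangerous.
    db = PostgresqlDatabase(
        os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        host=os.environ["DB_HOST"],
        port=int(os.environ["DB_PORT"]),
    )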

restlessbytes said 2 days ago:

While I'm very sympathetic to "we accidentally nuked our prod DB" because, let's admit it, we've all been there at some point, I'm also a bit baffled here: I don't think the problem lies with too much wine, Postgres permissions or scripts on localhost. Recreating a database by FIRST dropping all tables and THEN recreating them is like deliberately inviting those gremlins in.

But, as I said, that happens and blaming doesn't fix anything, so, for the future:

1. make a temporary backup of your database
2. create tables with new data
3. drop old tables
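
One way to get that ordering (a sketch with peewee; table and column names are made up, and renaming the old table doubles as the temporary backup):

    from peewee import PostgresqlDatabase

    db = PostgresqlDatabase("app_dev")  # hypothetical dev database

    # 1. keep the old table around as a cheap online backup
    db.execute_sql("ALTER TABLE scoreboard RENAME TO scoreboard_backup;")

    # 2. create and populate the replacement
    db.execute_sql(
        "CREATE TABLE scoreboard (id serial PRIMARY KEY, name text, score integer);"
    )

    # 3. only after verifying the new data, drop the old copy
    db.execute_sql("DROP TABLE scoreboard_backup;")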

glintik said 2 days ago:

"at around 10:45pm CET, after a couple of glasses of red wine, we deleted the production database by accident". That's not an accident, guys..

Stop drinking and deploying to production, especially late in the evening.

vox17 said a day ago:

This line's the winner for me : "Thankfully nobody’s job is at risk due to this disaster. The founder is not going to fire the developer – because they are one and the same person."

ClumsyPilot said a day ago:

To be fair he could, at least in theory - he could get someone else to do the development on the project, for money or equity, and do something else himself.

benhurmarcel said a day ago:

It's just one guy working in the evening on a side project. There's no revenue, and it runs on $171/month [1]. He's not going to hire anyone.

[1] https://keepthescore.co/blog/posts/monetizing-keepthescore/

cmeacham98 said a day ago:

While several other users have posted takeaways for how to prevent this from happening, I'd be interested to hear if anybody has an idea of how this happened, given the code that was posted.

Presumably, a managed DB service should essentially never be available on `localhost`. Additionally, it would be very weird for `config.DevelopmentConfig` to return the production database credentials.

flurdy said a day ago:

In one of my first jobs I deleted the last 30 days of our production data.

Shit happens. You learn and try to never repeat it. And share with others so hopefully they learn.

Ps. Don't do knee-jerk quick patches late at night. For example, don't just stop a database that has run out of disk space - try to migrate the in-memory data first... And do proper backup monitoring, and test your restores. Having 30 days of 0-byte backups is not that helpful. :)
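
A hedged sketch of the "backup monitoring" bit - a cron job that refuses to stay silent about stale or empty dumps (paths and thresholds are made up):

    import glob
    import os
    import time

    backups = sorted(glob.glob("/var/backups/db/*.dump"), key=os.path.getmtime)
    assert backups, "no backups found at all!"

    newest = backups[-1]
    age_hours = (time.time() - os.path.getmtime(newest)) / 3600

    # A backup that is too old or suspiciously small is as good as no backup
    assert age_hours < 26, f"newest backup {newest} is {age_hours:.0f}h old"
    assert os.path.getsize(newest) > 1024 * 1024, f"newest backup {newest} looks empty"

And even that only proves a file exists - nothing beats actually restoring it somewhere on a schedule.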

jugg1es said 2 days ago:

I totally sympathize with you and yours, I've made sphincter-clenching mistakes a handful of times during my 20 years of experience.

This is an object lesson that understanding human psychology is actually a huge part of good architecture. Automating everything you do in production with a script that is QA-tested prior to use is the best way to avoid catastrophic production outages.

It does take a bit longer to get from start to finish, and younger devs often try to ignore it, but it is worth putting a layer of oversight between your employees and your source of revenue.

ing33k said 2 days ago:

I run a replicated ClickHouse server setup; ClickHouse uses ZooKeeper to enable replication. The ZooKeeper instance was not replicated - it was a single node. The server ZooKeeper was running on ran out of disk space and ClickHouse went into read-only mode. Luckily, no data was lost while this happened, because we use RabbitMQ to store the messages before they get written to the DB - thanks to RabbitMQ's ACK mechanism.

jzer0cool said 2 days ago:

Could someone explain more what caused the prod wipe? The snip here indicates it is using a 'dev' credential (it is a different pass than prod right?) - how does a db connection occur at all?

fma said 2 days ago:

Good catch. Wouldn't surprise me if there's 1 username/password for all their DB environments.

jzer0cool said 2 days ago:

Looking at the snip again, there is a note that says it uses a different username/password for prod and dev.

I can't see how the OP's snippet could make a connection to the prod DB without the correct credentials to begin with.

ystad said a day ago:

> Thankfully our database is a managed database from DigitalOcean, which means that DigitalOcean automatically do backups once a day.

Do cloud providers provide a smaller window for backups? Are there better ways to reduce the backup window for DBs? Would love to understand any techniques folks use to minimize the backup window.

luord said 2 days ago:

This is what nightmares are made of.

> We’ve learned that having a function that deletes your database is too dangerous to have lying around.

Indeed, anything that might compromise the data - anything that involves deletion, anyway - should require manual confirmation, whether you manage the database yourself or it's a managed service.

Sadly, I learned this the hard way too, but at least it was a single column with a non-critical date and not the entire database.
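
Even a guard as dumb as this helps (a sketch; the names are made up): make whoever runs it type the target's name before anything destructive happens:

    def confirm_destructive(action: str, target: str) -> None:
        """Require the operator to retype the target before a destructive step runs."""
        typed = input(f"You are about to {action} '{target}'. Type its name to confirm: ")
        if typed != target:
            raise SystemExit("Confirmation failed, aborting.")

    # e.g. right before any drop/recreate helper does its thing:
    confirm_destructive("drop and recreate all tables in", "keepthescore-prod")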

dschuetz said a day ago:

Someone might have copy-pasted it elsewhere and it propagated from there. This is why writing code in the open can also be dangerous. Whoever writes code should be sensible enough to judge whether it could be dangerous in the wild. Once it's out there (or worse: on Stack Overflow) it can wreak havoc.

axegon_ said 2 days ago:

I did something similar once - I had been fiddling with my /etc/hosts and subsequently connected to the production database without realizing it. I dropped a table, but thankfully it wasn't a big deal - the monitoring rang the bell and I recreated it a few seconds later. All that happened was that I logged out several hundred users.

aerxes said a day ago:

This postmortem is incomplete: it fails to address the three main roots of the problem:

1. This business is too flippant with their write-able production access.

2. No user should have DROP DATABASE grants on production.

3. Clearly one of their employees was using a port forward to access production.

xupybd said a day ago:

I don't understand how this happened if localhost is hard coded and the password is different. I don't think they fully understand why this happened. At least enough to prevent it from happening again.

rawgabbit said 2 days ago:

The article should be renamed to “We coded a function that drops database tables and were surprised when it broke production.”

sdepablos said 2 days ago:

VPN to production and same hostname for dev and prod?

kerng said 2 days ago:

...and same credentials apparently also - there are lots of things that could have prevented something like this.

tedk-42 said 2 days ago:

Yeah this stuck out to me as well. To be honest if they ARE using the same creds in prod as they are in dev, this is the first thing they should fix because it would have easily prevented the whole mess.

I'll give them the benefit of the doubt and suggest that it wasn't the case and that it's unlikely that a dev or OPs person was tunnelling through a bastion of some kind to run the script.

If they pulled their credentials from a store and populated the config object that way, it's very possible someone actually loaded the production credentials by mistake into the development secrets file. The CI/pipeline system has permissions/network access to deploy to any environment, hence how it ran the script to drop the table.

I'm purely speculating about an alternative to the much worse tunnelling scenario outlined above.

asddubs said 2 days ago:

could also be that they just have an if staging/else somewhere for the credentials (but of course this isn't a counterpoint to the root of your point - the script wiping the database shouldn't be using anything that does something like this, and you probably shouldn't do it at all)

k__ said 2 days ago:

Keep going, I'm writing this down!

jlgaddis said 2 days ago:

Take the "drop database" bit (on the production database) away from your developers, too.

As well as pretty much every other privilege they don't legitimately need to use on a daily basis -- which, for prod, should be most of them (quite possibly including "delete").

If or when they really need to delete a ton of rows all in one go, they can be given a (temporary) set of credentials that they can use to do that, once, and which are then revoked immediately afterwards -- after another set of eyes reviews the script they've written to actually perform the operation, of course.

Basic best practices aren't difficult or some secret thing that only the experts know about. It is necessary to actually implement and follow them, however!

Sure, it can be a pain in the ass sometimes. Will it be worth it when it eventually saves your ass one day, "after a couple of glasses of red wine"? Absolutely.
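
In Postgres terms, a sketch of that (the role, database name and password are made up): the day-to-day application role gets only the privileges it needs and owns nothing, so DROP is simply refused:

    -- run as the admin/owner role, not as the application role
    CREATE ROLE app_rw LOGIN PASSWORD 'change-me';
    GRANT CONNECT ON DATABASE app TO app_rw;
    GRANT USAGE ON SCHEMA public TO app_rw;
    GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA public TO app_rw;
    GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO app_rw;
    -- no DELETE grant, no table ownership: DROP TABLE / DROP DATABASE from this role just errors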

astura said 2 days ago:

As a developer at all the jobs I've had, I've never even had access to the production database, full stop - only the server admins did. If I needed something from prod I'd go through them. I don't even consider it inconvenient.

Izmaki said 2 days ago:

The distance between the developer and the production database is directly proportional to the size of the company. The same is true of the amount of paperwork needed for even the simplest things. For some company sizes, a feedback cycle measured in days is not good enough and may kill the business.

There are a whole lot of things that could be applied which would have prevented this, before needing to hire a separate server admin as an interface to the production database. :D

asddubs said 2 days ago:

that's true, I totally forgot about SQL permissions, there's so many failsafes for this

minkeymaniac said a day ago:

That is why we can only access the prod DB from a jump box in our shop... and even then it's just certain people, with fewer privileges than a sysadmin account. No way you can do this accidentally from your laptop then.

pachico said a day ago:

I still remember how, many years ago, someone on my team asked me one Friday afternoon: "there's no such thing as 'undo' for 'drop table', right?" He spent the weekend recreating the data.

pachico said a day ago:

Yes, Manuel, I'm talking about you! :)

hannofcart said 2 days ago:

It takes a great deal of integrity to admit that you deleted a database because you were mucking around in your infra after red wine.

And it bodes well for your firm that that doesn't get you fired either.

These things happen to the best of us but having dealt with it responsibly and honestly as a team is something you can be proud of IMO.

bachmeier said 2 days ago:

> And it bodes well for your firm that that doesn't get you fired either.

Maybe there's more than one interpretation of "bodes well" but not knowing how to do the one thing customers were paying you to do isn't consistent with my definition.

> having dealt with it responsibly and honestly as a team is something you can be proud of IMO.

"We were drinking wine and deleted the database and now your data's gone LOL" is not something that should make you proud.

hannofcart said 2 days ago:

Please re-read what I wrote. I was praising their honesty. Not egging them on to be more sloppy.

throwdbaaway said 2 days ago:

The script also unnecessarily complicates things. If it just did the equivalent of rake db:drop, this incident wouldn't have happened, since Postgres won't allow a database with active connections to be dropped.

majkinetor said a day ago:

It happened to me once, and for the last 10 years all my scripts have had

    if (Env.Name -like '*prod*') { throw }

and similar guards on all destructive stuff.
jbverschoor said a day ago:

> Why? This is something we’re still trying to figure

Probably the admin has set the hba config to trust localhost. Solution: don't use the same DB name in prod, just to be sure.

nix23 said a day ago:

All the people who say "that could never happen to me" have worked less than 5 years in the industry. It can happen to anyone, anytime.

Remember: you only fix the errors YOU can think of.

info781 said a day ago:

Why did they not have archive log mode on in production? Losing a DB is one thing, but it should have only been one hour of data lost.

AdrianB1 said a day ago:

Focus, time, expertise and cost. For a side project with no revenue, cost is a very important factor. The others come with the territory of side projects.

beefbroccoli said 2 days ago:

Anyone else think this is just a clever ad for DigitalOcean?

info781 said a day ago:

Not really - they lost a day of data on a toy database.

noja said a day ago:

A function called database_model_create() should not drop something.

Without the drop, it would simply have failed to create the already-existing tables and raised an error.

freitasm said 2 days ago:

Lost seven hours of data? Daily backup with no transaction log backup?

Whoa.

lisper said 2 days ago:

Yeah, this. The problem is not that the production database was deleted by accident. The problem is that it was possible to (unrecoverably) delete the production database by accident.

jrockway said 2 days ago:

You'll find that most users of cloud databases are in this boat. For example, on GCP, deleting the database instance deletes the backups! You have to write your own software if you want to survive that button click.
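
A hedged sketch of that "own software": dump the database yourself on a schedule and push it somewhere the provider's delete button can't reach (the bucket and paths are made up; pg_dump picks up its connection details from the usual PG* environment variables):

    import datetime
    import subprocess

    import boto3

    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"/tmp/app-{stamp}.dump"

    # Take a consistent logical dump of the database
    subprocess.run(["pg_dump", "--format=custom", "--file", dump_path, "app"], check=True)

    # Upload it to a bucket owned by a different account (ideally a different provider)
    boto3.client("s3").upload_file(dump_path, "offsite-db-backups", f"app/{stamp}.dump")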

anonunivgrad said 2 days ago:

I wouldn’t want to be on the wrong side of a lawsuit, defending drunk employees working on production data. What outrageous recklessness. And how imprudent to admit this to the public. Some things are best kept to yourself. No one needs to know that.

tedk-42 said 2 days ago:

If you build good systems that have reliable backups or rollback mechanisms, a dev should be able to have a beer on a Friday and do a deployment without worrying about the fire they might cause.

I'd rather have a culture where people admit to their mistakes than one where they try to hide them or get whipped for owning up to them.

We're people after all, and some of us like to have a glass of wine and unwind while still performing our duties as engineers. After all, it's not like we're in charge of life support or critical systems which absolutely cannot fail.

robjan said 2 days ago:

This is someone's side project. I'm doubtful there will be any lawsuits involved

colesantiago said 2 days ago:

thank you for your highly valued expert opinion.

edit: a stunning new record of bots flagging my post within 3 minutes. woo hoo...

never change hn.

zerr said a day ago:

It reads like the DB was deleted intentionally for the sake of the blog post - marketing, that is. :)

ummonk said 2 days ago:

What a frustrating post :( Provides just enough technical detail to pique our curiosity then leaves us hanging.

jw360 said 2 days ago:

If your script connects to a production database by accident, you have a whole different issue.

jlgaddis said 2 days ago:

Convenience and/or laziness, typically.

Things like this would have been much less likely to happen in the past -- you know, when the developers only had access to the development database.

But then someone had an idea... think of how great it would be if we got rid of the operations folks and gave responsibility for prod to the developers, too!

fallingfrog said a day ago:

I once replaced a bunch of customer photos with a picture of Spock as part of a test in my first week on the job. The DB admin had just overwritten a Salesforce dev DB from production, and a previous developer had hardcoded the IP address of production in a script somewhere.

tzs said 2 days ago:

Here is how we had our database deletion error, about 15 years ago. Our DBs had been on leased servers at a hosting company in New York City. They were getting out of the datacenter business so we had to move. We were moving to colocated servers at a Seattle datacenter.

This was the procedure:

1. Restore DB backups in Seattle.

2. Set up replication from NYC to Seattle.

3. Start changing things to read from Seattle, with writes still going to NYC.

4. After everything is reading from Seattle and has been doing so with no problems for a while, change the replication to be two-way between NYC and Seattle.

5. Start switching writes to Seattle.

6. After both reads and writes are all going to Seattle and it has been that way for a while with no problems, turn off replication.

7. Notify me that I can wipe the NYC servers, for which we had root access but not console access. I wasn't in the IT department and wasn't involved in the first 6 steps, but had the most Unix experience and was thought to be the best at doing a thorough server wipe.

My server wipe procedure was something like this.

8. "DELETE FROM table_name" for each DB table.

9. "DROP TABLE table_name" for each DB table.

10. Stop the DB server.

11. Overwrite all the DB data files with random data.

12. Delete all the DB data files.

13. Delete everything else of ours.

14. Uninstall all packages we installed after the base system install.

15. Delete every data file I could find that #14 left behind.

16. Write files of random data to fill up all the free space.

The problem was with step #6. They declared it done and turned it over to me for step #7 without actually having done the "turn off replication" part of step #6. Step #8 was replicated to Seattle.

It took them a while to figure out that data was being deleted and why that was happening.

We were split across three office buildings, and the one I was in had not yet had phones installed in all the offices, and mine was one of the ones with a phone. None of the people whose offices did have phones were in, so they lost a few more minutes before realizing that someone would have to run a couple blocks to my office to tell me to stop the wipe.

It took about 12 hours or so afterwards for them to restore Seattle from the latest backup, and then replay the logs from between the backup time and the start of the deletes.

After that they were overly cautious, taking a long time to let me resume the NYC wipe. They went right up to the point where I told them if we didn't start now we might not finish, and reminded them that those machines had sensitive customer personal information on them and were probably going to end up being auctioned off on eBay by the hosting company. They came to their senses and told me to go ahead.

rullelito said 2 days ago:

LOL, I forgot how few safety measures startups have in place.

tluyben2 said 2 days ago:

As if this never happens in big corps: sure, mostly at the departmental level, but still enterprises, not startups. They try to escape the annoying, slow-as-molasses DBAs/devops and install/create/buy SaaS systems to avoid the red tape. But then the same things go wrong as with startups. And rather often too; we often got calls asking if we could restore.

Obviously there are also plenty of large enterprise-wide data breaches, which I would say are actually worse than losing a day of data in a lot of cases. So, again, not so many safety measures - worse than startups, which at least have the excuse of being understaffed and underfunded.

xwdv said 2 days ago:

Given the short time frames that startups aim to capture value in, it's not worth investing time in safety until you have plenty of capacity to spare later.

ineedasername said 2 days ago:

Related: Little Bobby Tables (https://xkcd.com/327/)

tus88 said 2 days ago:

Well, you have backups, right... or did the gremlins eat them?

tijuco2 said 2 days ago:

That's the reason I never authorize dev machines to connect to production. And that's the reason developers hate my security team.

EugeneOZ said 2 days ago:

I love the honesty, self-irony and transparency of the article. It's sad and annoying to see so many young naive devs writing ”oh they are so bad, it will never happen to me”.

Yes, people are not perfect and computer systems are complex. Admit it and don't be so overconfident.

”Errare humanum est”, prepare your backups.

sergiotapia said 2 days ago:

I did the same thing once by accident and thankfully only lost 1 hour of data. The single lowest point of my career and thinking about the details of that day makes my stomach sink even now as I type this.

I ran a rake db schema dump command of some kind, and instead of returning the schema of the database, it decided to just completely nuke my entire database. Instantly. It's very easy to fuck up, so cover your ass, gents: back up often and run your restores periodically to make sure you can actually do them in case of an emergency.

gcc_programmer said 2 days ago:

The immaturity of the article is laughable. I'm sure I'm the same age as the people who wrote it, but this is unacceptable: dropping databases in prod is a serious issue, not a joke. I think the culture of the company is toxic and unprofessional at every level. #change-my-mind

azeirah said a day ago:

It's one individual's side-project, chill.

sanmak said a day ago:

That's awesome and crazy at the same time - thrilling, but also foolish. It happened to a colleague of mine and he was fired on the spot. Best of luck!

temptemptemp111 said 2 days ago:

Step 0) automated backups
Step 1) full manual testing of said automated backup
Step 2) weekly test of the supposedly automated process
...
Now you can continue with your harebrained engineering "processes"

known said 2 days ago:

Sack whoever is responsible;

Biganon said 2 days ago:

Yeah, this person working alone on this hobby project should sack themself. Thank you for your macho no-bs insight.

boltefnovor said 2 days ago:

So sack management?

f223ff23 said 2 days ago:

Haha, alcoholics complaining they deleted a production database by mistake :D

NikolaeVarius said 2 days ago:

Thinking that localhost is anything special is a year-1.5 developer mistake.