Hacker News

PostgreSQL's Imperfections(medium.com)

331 pointschmaynard posted 2 months ago138 Comments
topspin said 2 months ago:

If I could have one thing on that list fixed it would be #9 - no planner hints.

I used Oracle (6 through 11) for both bespoke applications and to back large third party systems. I never saw widespread abuse of hints. Yet they were immensely helpful during development and troubleshooting. I put perhaps two queries into production with hints over 10+ years. No one ever had a reason to complain about either.

There are no perfect query planners. There is no perfect analysis. The notion that one must submit entirely to the mercy of PostgreSQL's no-hints dogma causes me to harbor some resentment. Fortunately you can frequently abuse CTEs to achieve a desired access pattern because (until recently) CTEs were an "optimization fence." But I've also resorted to creating functions and other hacks.

So I conclude the policy is simply wrong headed; there is no legitimate reason to fear hint abuse and the premise that hints aren't necessary is false.

hibikir said 2 months ago:

While I won't say that the Postgres ideosyncrasies are correct here, I can tell you that your experience is not necessarily typical. In my career, hint abuse has always been rampant. From telcos to biotech companies, a high percentage of complex queries I had to interact with had hints in them. In one extremely egregious case, the company decided to purchase an Exadata server, and since its performance characteristics had little to do with the previous servers, a vast majority of hinted queries were very wrong, leading to months of developers rebuilding every query, despite having no schema changes.

So yes, a lot of people in the industry have been burned by hints to make the Postgres perspective understandable. At the same time, this doesn't make it good: Postgres' default settings over the years have lead to specific kinds of tables requiring extra love and care to make the query planner not do silly things. The most traditional failure case being a transaction table with an always increasing timestamp, where a vast majority of queries only care about today: The traditional thing to do was to convince Postgres that yes, this table needs very frequent stats recalculation, as to make it learn that there are more than 3 rows today, so nested loops will not do. Whether the Postgres quirks are better or worse than hint hell, I am still not sure of: A competent engineer can handle things either way.

viraptor said 2 months ago:

> In one extremely egregious case, the company decided to purchase an Exadata server, and since its performance characteristics had little to do with the previous servers

I don't get this point. If the hinted queries changed performance for the worse, why wouldn't you expect unhinted queries to also change for the worse after migration? After all, the hints were there to overcome such issues already.

It sounds like the lesson should be "with large enough system, plan for extended time for query rewriting if you plan to replace your db engine", rather than anything about hints themselves.

said 2 months ago:
ric2b said 2 months ago:

> If the hinted queries changed performance for the worse, why wouldn't you expect unhinted queries to also change for the worse after migration?

The hints can introduce wrong assumptions to the planner, which can lead to worse performance than if those wrong assumptions weren't there.

viraptor said 2 months ago:

Same with lack of hints. We're not talking about simple queries which can be expected to just work. These were previously found to be badly optimised by the first engine, so we're in unknown territory either way.

raverbashing said 2 months ago:

Bad hints can always be dropped.

Every system is prone to abuse, I don't think hints are too out there in this aspect

asah said 2 months ago:

Came here to say this. We are not (all) children.

hinkley said 2 months ago:

> leading to months of developers rebuilding every query, despite having no schema changes.

But doing so is not free. You may be on a completely different scale than they were.

jaakl said 2 months ago:

Few years back we had nation-wide panic that Parliament election results were unavailable for hours after closing polls due to “IT issue”. Analysis showed that everything was right, load tested, cached etc, except real life situation of added few million rows and few thousands queries a second late. Automatic query planner just failed and all the smartest experts in country could do was to wait until it self-heals. Imagine national TV screening for hours “no results available” and You giving updates to the PM every 5 minutes. And all you know that your most important DB in country cannot figure it out... True story

bragh said 2 months ago:
yjftsjthsd-h said 2 months ago:

> Analysis showed that everything was right, load tested, cached etc, except real life situation of added few million rows and few thousands queries a second late

...then in no useful sense was it load tested.

scosman said 2 months ago:

Related, after a major version upgrade it doesn't automatically run an ANALYZE on every table. Without this the query planner can't properly plan. It treats a billion row table the same as a 20 row table until you think to manually kick off the magic command.

orthoxerox said 2 months ago:

I once couldn't convince Oracle planner to use the index at all. It was an index on the one-char status column and all the query had to do was to return the count of unauthorized rows. Almost always 100% of the rows were authorized, so no matter how many times you ran the analyzer on the table the planner remained convinced that the index was useless and opted for a full table scan instead, wasting minutes. I had to nail down the index with a hint.

blattimwind said 2 months ago:

In postgres, queries involving aggregates on joins generally seem to result in join-then-aggregate, even if the join is entirely foreign-key-independent of the aggregate.

foxhill said 2 months ago:

so, i absolutely agree. i do find it odd that no one has mentioned pg_hint_plan[1]. we use it, and it is great.

[1]: https://pghintplan.osdn.jp/pg_hint_plan.html

SigmundA said 2 months ago:

Completely agree coming from years of experience with MSSQL using it before it had good hinting, its very necessary.

I don't care how good your optimizer is, its not going to replace a human that knows what they are doing anytime soon.

I try real hard not to use hints and most of the time you don't, but sometimes you just do. To not have it will simply make you find whatever hacky way there is to beat the optimizer into submission to get the job done.

Its funny to see similar complaints about the V8 jitter now app developers get to experience a black box optimizer making different brain dead choices in production vs development environments.

DrJokepu said 2 months ago:

A series of smaller queries moving data through temporary tables is what I do when the query is too complicated for the query planner. It’s also easier to maintain than a giant query, with or without CTEs or hints.

topspin said 2 months ago:

Indeed. I too have resorted to implementing my own plan using temporary tables.

Easier to maintain though? I can't recall ever resorting to this technique and thinking it was a maintenance improvement. When what might have been a single query evolves into a facade to hide the temporary table gymnastics I always feel it's at least a maintenance setback.

radiowave said 2 months ago:

Yes, I regard temporary tables as a last-resort. Although, a sometimes very useful thing that this approach has going for it is that you can add indexes to those temporary tables which are much more specialized than what could be achieved by working solely with the indexes of the source tables, no matter how sophisticated the query plan.

DrJokepu said 2 months ago:

Maybe it’s a subjective thing, for me at least I think breaking up a larger problem into a series of smaller problems tends to make it easier to understand what’s going on.

hobs said 2 months ago:

I see hint abuse constantly and everywhere in enterprise databases, it's really more about the lineage you work with.

topspin said 2 months ago:

> it's really more about the lineage you work with.

(disclaimer: I'm not entirely certain what 'lineage' means here, but I'll assume it refers to the nature of the products for which database schemas are created and the quality of the development process.)

In my case the lineage was two unrelated ERPs at consecutive employers and a variety of in-house applications. I don't recall ever seeing a hint in either ERP except in some "one off" upgrade operations. Both employers had development guidelines and peer review processes for bespoke work. They did not explicitly preclude hints but if you had been foolish enough to offer slapdash work -- abusive use of hints, for instance -- you wouldn't get very far.

That was my experience with hints. I don't doubt there are reckless people who abuse them. I just resent being denied an affordance because they exist.

jeffdavis said 2 months ago:

I largely agree, but a couple points I'd like to challenge:

Is hardware corruption really happening and making it into the WAL stream with checksums on?

The next point, on planner hints: it's really just something that hasn't been done. If a few engineers made plans to tackle the problem, a lot could be done in a couple releases' worth of work. In the mean time, people are getting by with various half-measures anyway, such as extensions[1], planner tunables[2], and statistics tweaks[3].

The only dogma is that a half-baked solution isn't wanted. It's got a lot of architectural impact and long-term supportability implications. And a lot of different use cases that need to be considered that may drive different technological solutions. "Make plans stable/managable" is a different use case than "I know something the planner doesn't" which is different from "Make the planner do this thing because I said so".

[1] https://github.com/ossc-db/pg_hint_plan/blob/master/doc/pg_h...

[2] https://www.postgresql.org/docs/current/runtime-config-query...

[3] https://www.postgresql.org/docs/12/sql-createstatistics.html

DiabloD3 said 2 months ago:

Hardware corruption can happen anywhere to anything.

If you care about your data: Use ECC, and use a checksumming filesystem like ZFS, and also on top of this all, export your WALs to a second machine.

jeffdavis said 2 months ago:

Well, I'm asking specifically if the author saw corruption that snuck past the checksumming features that postgres has.

Extra layers are always good, but since I was one of the main authors of checksums in Postgres, I'd like to know if there's room for improvement. (Aside: the page checksum is only 16 bits, so if you have frequent corruption it's entirely believeable that a few sneak past. But I haven't seen it personally.)

ahachete said 2 months ago:

Few production clusters have checksums turned on. Since they cannot be turned online, only clusters of newer project turn it on --and even then, they are still unknown to many. So we cannot rely on this, and WAL stream corruption is a notable risk.

ptman said 2 months ago:

Also, pg has initdb --data-checksums

scurvy said 2 months ago:

Doesn't this require a full dump/import? Can you stream from a primary server w/out checksums to a replica that has checksums enabled?

Or does this fall into the "everything neat with PostgreSQL requires major downtime" category? (features, version upgrades, etc).

jeffdavis said 2 months ago:

No dump/reload, but you do need to shut down the server cleanly, enable checksums[1], and start it back up. Note that enabling checksums is expensive, but not as expensive as dump/reload.

[1] https://www.postgresql.org/docs/current/app-pgchecksums.html

scurvy said 2 months ago:

That tool is only available in the latest versions of Postgresql (11 and 12), which very few stable shops are running in production.

Checksums were introduced in 9.3, yet not made the default until 9.6. Converting it requires lengthy downtime, back to the original author's point of major version upgrade pain.

Everything worthwhile in PG is introduced over a long time, requires a lot of pain to adopt, and if not adopted quickly, suddenly becomes "lol why aren't you doing this it's been there for years." That's the disconnect between people who actually run these clusters and the somewhat ivory-tower views of the postgresql devlopers.

jeffdavis said 2 months ago:

The postgres developers are not living in an ivory tower, they are solving real problems for real customers. There is so much enthusiasm that the original author felt there was not enough dissent.

Maybe you should learn and understand why people are happy with postgres in the first place, and then understand your complaints in that broader picture. Then you wouldn't resort to mockery and other unproductive comments.

scurvy said 2 months ago:

I'm not resorting to mockery. I'm sympathizing with the original article's pain. Everything worthwhile in Postgresql involves major downtime and pain to adopt. It's not nearly as seamless as other RDBMS. That was his point, and that is mine. I use checksums as an example, but it is relevant to major version upgrades and other feature introductions.

I don't understand why it takes 4 releases before a tool like pg_checksum becomes available after the larger feature is introduced. If the whole thing isn't ready, don't release it.

I'm not saying it's easy. I never said it was.

mb4nck said 2 months ago:

There is a fork/spoon here, which supports 9.4+: https://github.com/credativ/pg_checksums

It is also available as a Debian package for the above versions via apt.postgresql.org.

karyon said 2 months ago:

There's also one inherent data privacy problem in MVCC that I've been running into.

Suppose you have an app that lets people anonymously vote or comment on stuff, but only once. The vote in the DB must not have any connection to the person. So, you give the person a flag whether or not they voted already, and store the vote separately.

Now, you'd want to set both values in the same transaction for obvious reasons. But, since Postgres uses MVCC, the two tuples that are added to the database both contain the same transaction ID (XID), so there's the connection between user and vote again.

There seems to be no way to instruct postgres to "clean" those XIDs in any way. What we're doing now is periodically and manually updating every tuple in each affected table with dummy changes, essentially duplicating all tuples with all new XIDs, and then running VACUUM to delete the tuples with the old, potentially-deanonymizing XIDs. We haven't found anything easier...

dragonwriter said 2 months ago:

> Now, you'd want to set both values in the same transaction for obvious reasons. But, since Postgres uses MVCC, the two tuples that are added to the database both contain the same transaction ID (XID), so there's the connection between user and vote again.

The simple solution is that at each change, you rewrite the whole election, not just the new votes, and clear out all outdated tuples (basically, that you make the "periodic and manual" process you are currently doing automatic and integrated with the "real" transactions rather than additional side process.)

Alternatively, you don't do the changes in the same database transaction but in the same business domain transaction which is managed outside the database, and where any database artifacts related to the management of the business transaction are deleted and vacuumed after the transaction is completed.

karyon said 2 months ago:

Re "same business transaction": That would probably still allow for correlating the votes with the flags because they'll still be added with some XID, which are monotonically increasing. Of course, it would require more effort and probably even guesswork, but still :)

Rewriting the election in the same transaction sounds smarter than the periodic solution. I need to discuss that with the team, thanks!

yardstick said 2 months ago:

Interesting issue, never thought about xid being used in that way before!

Rather than store the user ID, can you store a bcrypt hash of it?

What’s the attack vector you are addressing? I guess access is given to the database for outside auditing?

I’m just thinking if the db or app server is compromised anyway then someone could install a trigger to log the table changes, or change the app logic to log the vote somewhere else, etc.

karyon said 2 months ago:

The attack vector would be an attacker getting their hands at a DB dump including XIDs. Without any particular measures, they'd be able to de-anonymize every vote ever made, which is kind of a worst case scenario for an anonymity-focused app. Granted, since usually there are no dumps including XIDs lying around (e.g. backups don't contain them), this would effectively require an attacker to gain access to the database. When they're that far they could of course do all sorts of bad things, but most of those could probably be fixed with recreating the VM with a backup. Deanonymizing existing votes cannot be fixed :)

We also thought about hashing user IDs, but with any practical number of users in the database it would be possible for an attacker to just hash all existing user IDs and checking them all. So that would be more an obfuscation, making de-anonymization harder, but not impossible.

edit: we considered putting the user's password into that hash as well. that would also enable us to let the user edit their vote later, while still retaining anonymity. But then we'd need to ask them for their PW when voting, or we put a hash of the PW into their session data, and we'd need to restrict changing the PW until the election is over, and it seemed not worthwhile.

said 2 months ago:
devit said 2 months ago:

The other problem is that PostgreSQL doesn't rewrite data in place, so an attacker can determine the order of the user votes from the physical order of data in the database, and the order of issue votes from constantly scraping your website, thus allowing to deanonimize everything.

You need to use another database for this, specifically one designed to always overwrite data in place, and erase the WAL immediately after commit: it should be easy to write it yourself, assuming the dataset fits in RAM and so you don't need any data structures on disk other than a simple array of records.

Also you need to ensure that higher storage layers don't keep snapshots and don't do copy-on-write.

Could also look into a cryptography-based solution, although not sure if there is a feasible one.

karyon said 2 months ago:

Good points. We believe we have fixed the physical-order-and- scraping-issue by using random uuids as primary keys, showing stuff in the UI always ordered by PK, and periodically doing a CLUSTER, which physically rearranges the table after some index (pk in our case).

We haven't thought about higher-order storage layers. I guess we should do that... Thanks!

rossmohax said 2 months ago:

> There seems to be no way to instruct postgres to "clean" those XIDs in any way


karyon said 2 months ago:

Sadly, since 9.4, that does not clear the XID anymore, see that blue box here: https://www.postgresql.org/docs/9.4/routine-vacuuming.html#V...

rossmohax said 2 months ago:

Thanks, that was news to me. Interestingly replacing XID with frozen one still makes a valid tuple, so it might be just a few lines patch for you to get what you want.

Doctor_Fegg said 2 months ago:

> The on-disk binary format is incompatible across major versions

This is my major bugbear. If Postgres were able to upgrade its datastore on the fly (optionally, of course) that would make a massive difference. Instead I’ve had heart-in-mouth moments when Homebrew has decided that it wants to upgrade Postgres. (Yes, I do now use brew pin, until I transition off Homebrew for good.)

#2 for me is inefficient enum storage. Each value takes up 4 bytes. A single-byte enum would vastly reduce my database size.

yardstick said 2 months ago:

Re upgrades, have you tried pg_upgrade for upgrades in place?

Re enums, we had a similar thing and simply went with a smallint column instead of enum.

yjftsjthsd-h said 2 months ago:

> Re upgrades, have you tried pg_upgrade for upgrades in place?

This only works when you have both versions available at the same time, which seems likely to break when a package manager jumps major versions, and also doesn't work when running postgres in Docker (https://github.com/docker-library/postgres/issues/37).

nasalgoat said 2 months ago:

I've found that it's incompatible with the PostGIS extension due to version differences. I'm still on 9.5 because I can't afford the downtime to properly migrate.

sho said 2 months ago:

Homebrew added an extremely useful `brew postgresql-upgrade-database` option which will handle that. You do have to remember to run it separately.

By the way, I used to have your encyclopaedia of all world knowledge

int_19h said 2 months ago:

This sounds like an issue with Homebrew, to some extent. On Debian, every major Postgres version gets its own package.

mulander said 2 months ago:

> Every time an on-disk database page (4KB) needs to be modified by a write operation, even just a single byte, a copy of the entire page, edited with the requested changes, is written to the write-ahead log (WAL). Physical streaming replication leverages this existing WAL infrastructure as a log of changes it streams to replicas.

First, the PostgreSQL page size is 8KB and has been that since the beginning.

The remaining part. According to PostgreSQL documentation[1] (on full page writes which decides if those are made), a copy of the entire page is only written fully to the WAL after the first modification of that page since the last checkpoint. Subsequent modifications will not result in full page writes to the WAL. So if you update a counter 3 times in sequence you won't get 3*8KB written to the WAL, instead you would get a single page dump and the remaining two would only log the row-level change which is much smaller[2]. This is further reduced by WAL compression[3] (reducing the segment usage) and by increasing the checkpointing interval which would reduce the amount of copies happening[4].

This irked me because it sounded like whatever you touch produces an 8KB copy of data and it seems to not be the case.

[1] - https://www.postgresql.org/docs/11/runtime-config-wal.html#G...

[2] - http://www.interdb.jp/pg/pgsql09.html

[3] - https://www.postgresql.org/docs/11/runtime-config-wal.html#G...

[4] - https://www.postgresql.org/docs/11/runtime-config-wal.html#G...

anarazel said 2 months ago:

That is correct. And neither is a full page write logged if the page is initialized from scratch. And even without WAL compression, the "hole" in the middle of the page if the page is not full, is "compressed" out.

That's not to say that FPWs are not a problem. The increase in WAL volume they can cause can be seriously problematic.

One interesting thing is that they actually can often very significantly increase streaming replication / crash recovery performance. When replaying the incremental records the page needs to be read from the os/disk if the page is not in the postgres' page cache. But with FPWs we can seed the page cache contents with the page image. For the pretty common case where the number of pages written between two checkpoints fits into the cache, that can be a very serious performance advantage.

SigmundA said 2 months ago:

Great list I would add one thing to it:

No query plan caching not even for sprocs or functions.

Was surprised by this one, looks like the optimizer is much simpler than other db's so it usually take less time to create to the plan but the overhead is still there.

This is why you see the recommendation to use prepared statements and many client libraries try to automatically, but a prepared statement cannot be shared between sessions so its only good if your repeating the same statement over and over on the same connection.

If your app calls the same statements over and over from different connections which most apps tend to do it can save significant overhead and reduce response times. It was pretty much mandatory to make sure you where using parameterised SQL or sprocs back in the day to make sure it was using a cached plan properly.

jedberg said 2 months ago:

I agree with everything on this list. Been bit by most of these at one point or another. XID wraparound was the worst.

Ironically we avoided a lot of the replication bugs by accidentally deciding to use logical replication from the start, but that of course brought in a whole different set of bugs instead.

I'm surprised there wasn't a complaint about the vacuumer. That was probably my biggest single pain of running a large active cluster. There was never a good time to vacuum, but if you skipped it, it eventually happened automatically, usually at the worst possible time, when the database was most active.

To be fair I haven't managed postgres since 8.3, so maybe that got better?

Tostino said 2 months ago:

It's pretty darn different than 8.3 today.

KingOfCoders said 2 months ago:

The 'best database' article recently linked here, says

"PostgreSQL picks a method of concurrency control that works best for high INSERT and SELECT workloads. [...] tracking overhead for UPDATE and DELETE."

This one says "INSERT and UPDATE operations create new copies (or “row versions”) of any modified rows, leaving the old versions on disk until they can be cleaned up."

I do think this one is wrong, but this is my wild guess as I am no expert in any way. INSERT should be fine. Or how would INSERT "create new copies"?

[Edit] See authors comment.

rbranson said 2 months ago:

Author here. This should not have included INSERT. I updated the post to reflect that. Thanks for picking this up.

KingOfCoders said 2 months ago:

Thanks for your fast reply, much appreciated.

DeathArrow said 2 months ago:

I'm a humble web developer and I'm not very knowledgeable about databases.

I am glad I deal with an ORM for both personal and work projects instead relying on database specifics. That way, the app is DB agnostic and I can switch the database with ease. If your resource are limited, I think that is good.

When you have the resources, it's better to hire an architect and a DBA to tell you what DB to use and maintain it.

goatinaboat said 2 months ago:

I can switch the database with ease

In my experience an organisation is far, far more likely to switch operating systems or hardware platforms or programming languages than they are the database. But no programmer bothers to code in a clever but restricted syntax that would be a valid program in both C# and Java. Or restricts themselves to a core set of OS features or hardware instructions just in case. It really is quite bizarre to watch.

karatestomp said 2 months ago:

Plus it’s fairly common to end up with more than one program, in more than one language, reading from and writing to a DB. The more you’ve avoided using DB features (in featureful databases, anyway) the slower (in performance and dev time) and riskier (you will have more bugs, and maybe some pretty bad ones) this is.

vp8989 said 2 months ago:

I don't think anyone would bother to port code from C# to Java in 2020, but I agree with your point. Not using a technology to it's fullest in favor of portability is not smart.

I think most things we build on top are sufficiently complex now that even with these self-imposed limits, porting everything over from X to Y is a major undertaking.

jtdev said 2 months ago:

Every developer that works with relational data should learn the basics of ANSI SQL and at least one or two of the major SQL platforms (Postgres, MySQL, SQL Server, etc.).

It’s really not that complicated (if you can figure out redux...) and the knowledge will make you a more well rounded developer. These are skills and knowledge that will be valuable and applicable for many years.

Don’t allow your ORM to be a knowledge crutch.

gizmondo said 2 months ago:

If your app can be DB agnostic, then what DB you use doesn't matter, so you will never need to switch it.

Also you don't need an "architect" and a "DBA" to know how to use databases properly.

pinkfoot said 2 months ago:

Er, once you know how to use databases properly, you are a DBA.

vp8989 said 2 months ago:

They are 2 distinct skillsets:

#1 - The administration of a database system #2 - Being able to write code that uses said system effectively

If you don't have skillset #2, you are going to design and build bad systems and eventually a DBA will need to bail you out. The trend is that companies are reducing the number of DBAs on the payroll because of things like AWS Aurora, so you had better get skillset #2.

And why wouldn't you want it? It's like knife skills for a chef. You should know your tools inside out.

hacker_newz said 2 months ago:

You definitely need a DBA to optimally set up a production database.

mamcx said 2 months ago:

Ok, I have mine:

When declare a custom type with a check, the error not show the row/table that cause the problem, only that something happen:

    CONSTRAINT non_empty CHECK (length(VALUE) > 0);
However, doing the check inline show the error in full.

This cause me to rewrite all the tables, twice (one adding the new type thinking will help, once again inlining everything).

rossmohax said 2 months ago:

Criticism is valid, but he talks about cases of millions connections to a single db, that is a significant scale many companies will never see. In addition to that, probably no database can serve under significant load without careful tuning, preferably with understanding of DB internals and knowing compromises DB authors took when designin it.

PostgreSQL is constantly improving. At least some of the problems with scaling with number of connections have more to do with locking rather than process-per-connection architecture, it is being worked on with impressive results doubling number of transactions per second for 200 connections: https://www.postgresql.org/message-id/20200301084638.7hfktq4...

scurvy said 2 months ago:

I'll trust postgresql more when it can support a few thousand connections without resorting to running middleware (pgbouncer) all over. That was his point. PostgreSQL is just abysmally bad in this area.

The process per connection model works great for "my first rails project" so every developer brings it to $dayjob. Then they are caught off guard when they start getting real traffic. It's terrifying to watch a couple hundred connections take a moderately sized server (~100 threads) down the native_queued_spin_lock_slowpath path to ruin. That's just sad.

anarazel said 2 months ago:

> I'll trust postgresql more when it can support a few thousand connections without resorting to running middleware (pgbouncer) all over. That was his point. PostgreSQL is just abysmally bad in this area.

Depending on your workload it's entirely possible to run PG with 2000 connections. The most important thing is to configure postgres / the operating system to use huge pages, that gets rid of a good bit of the overhead.

If the workload has a lot of quick queries it's pretty easy to hit scalability issues around snapshots (the metadata needed to make visibility determinations). It's not that bad on a single-socket server, but on 2+ sockets with high core counts it can be significant.

We're working on it (I'm polishing the patch right now, actually :)). Here's an example graph https://twitter.com/AndresFreundTec/status/12346215343642419...

My local 2 socket workstation doesn't have enough cores to show the problem to the same degree unfortunately, so the above is from an azure VM. The odd dip in the middle is an issue with slow IPIs on azure VMs, and is worse when the benchmark client and server run on the same machine.

> It's terrifying to watch a couple hundred connections take a moderately sized server (~100 threads) down the native_queued_spin_lock_slowpath path to ruin. That's just sad.

Which spinlock was that on? I've seen a number of different ones over time. I've definitely hit ones in various drivers, and in both the generic parts of the unix socket and tcp stacks.

anarazel said 2 months ago:

One more thing: There's definitely a significant overhead implied by the process-per-connection model - I don't want to deny that.

In my opinion it's at the moment not the most urgent issue wrt connection scalability (the snapshot scalability is independent from process v threads, and measurably the bottleneck), and the amount of work needed to change to a different model is larger.

But I do think we're gonna have to change to threads, in the not too far away future. We can work around all the individual problems, but the cost in complexity is bigger than the advantages of increased isolation. We had to add too much complexity / duplicated infrastructure to e.g. make parallelism work (which needs to map additional shared memory after fork, and thus addresses differ between processes).

scurvy said 2 months ago:

>> native_queued_spin_lock_slowpath path >Which spinlock was that on? I've seen a number of different ones over time. I've definitely hit ones in various drivers, and in both the generic parts of the unix socket and tcp stacks.

Not sure yet. It was on a server with 1000 stable connections. Things were fine for a while, then suddenly system would jump to 99% on all 104 threads and native_queued_spin_lock_slowpath was indicated by perf.

Ironically we cleared it up by having sessions disconnect when they were done. Boggled the mind that increasing connection churn improved things, but it did.

anarazel said 2 months ago:

That sounds like you could have hit transparent hugepage / compaction related issues. They, IME, tend to hit more often with lots of long running processes, than when there's a lot of churn. It has gotten a lot better in more recent kernel versions, but if you're on an older kernel, it can be really bad.

scurvy said 2 months ago:

THP => never. I thought about that, too.

minusf said 2 months ago:

pgbouncer is also a speed optimisation, not just scaling, all agreed, pgsql lags in this currently.

but the default 150ish connections ouf of the box mean 150 workers which means 20ish 8 core VMs for your e.g. django app (1 worker/core), which is a lot of scaling already and a good business problem to have, not just an app demo. Most internal projects never make it even there.

phamilton said 2 months ago:

Moving to PostgreSQL on Amazon Aurora simplifies all the replication issues listed. We (Remind) use an autoscaled PostgresQL Aurora cluster and have been pretty happy with it.

icheishvili said 2 months ago:

We were quite unhappy with Aurora because of the terrible performance of writes against tables with secondary indices. What does your workload look like?

phamilton said 2 months ago:

We trimmed out most of our secondary indexes long before we moved to Aurora because performance of postgres writes with secondary indexes is always a bit rough. The few we have remaining do well enough.

Overall our workload is very read heavy. At peak, if we compare our CPU on the writer vs the aggregate CPU on the readers, we do about 10x more read work than writes.

Remind is an education messaging program, so our workload is partially user management (which users belong to which schools and which classes) and partially user generated content (messages being sent). Our user generated content (more like 2-3x read vs write) is all backed by DynamoDB and our user management is in a couple of Aurora database clusters.

MuffinFlavored said 2 months ago:

How much do you pay a month for what size?

phamilton said 2 months ago:

Cost is complicated because we autoscale from 2-3 readers up 10+ readers during peak traffic. Instead of running 10 replicas 24/7, we can spin one up and within about 10 minutes it's handling reads. So we have a few instances that only run for about 3 hours a day and others that run maybe 6 hours. That is a big cost savings over running them all 24 hours a day. We couldn't bring up new replicas like that when we were on normal Postgres RDS.

Pricing is also surprising compared to vanilla Postgres RDS because reader nodes double as spare writers. A multi-az deployment of Postgres RDS plus two single-az replicas is more expensive than a 3 node (1 writer and two readers) Aurora cluster. E.g. on 2xlarge instances, this Aurora setup is $3.48/hour vs $4/hr on RDS for similar effective hardware and fault tolerance.

Running directly on EC2 is going to be much cheaper obviously. $1.51/hr for three 2xlarge instances(if you want to failover to an active replica) or $2.01/hr for 4 if you want a dedicated failover instance (like RDS does).

agacera said 2 months ago:

Aurora is indeed amazing. However, my experience with it regarding costs is that main cost you have running it is not compute, but IO ($0.2 per million requets [1]).

This cost is kind of hidden since to estimate this in the early stages of a project is an art. In one project on my team the IO cost is about 8x more than cost of instances. But imo it is still worth and I never actually calculated how much we would pay if we were running on RDS + provisioned IOPS.

phamilton said 2 months ago:

It is really hard to estimate because it isn't quite apples to apples. With Aurora we pay for what we use, not what we provision. We used to have some big write spikes for about 5 minutes each hour and so had to massively over-provision IOPS. With Aurora we just pay for the IOPs we use on those spikes.

Right now, on a multiple of the traffic we had before we moved to Aurora we are paying less than half what we used to for IO.

rossmohax said 2 months ago:

How much knoweldge is transferable? Isn't Aurora just protocol and SQL dialect compatible, but underneath it has nothing to do with postgres?

phamilton said 2 months ago:

Underneath it is postgres, with a few key things rewritten.

It has a custom storage layer that isn't too different from a very fancy SAN. Replication is where things get to be very different. All instances use the same underlying store, so replication of storage isn't part of the postgres layer. However, reader nodes need to invalidate cache when writes occur. So they use postgres replication, but the readers skip writing to storage.

Lots of rewritten components under the hood to support this different storage paradigm, but the engine itself is still postgres.

vp8989 said 2 months ago:

Each flavor (and version) of Aurora is compatible with a corresponding version of the open source software. For example Aurora MySQL 1.* is compatible with MySQL 5.6

At my current gig we use it in Prod, but we are also able to run our software during development pointing to locally installed open-source versions of MySQL just fine. I imagine it's the same for Postgres.

setr said 2 months ago:

I believe he was asking whether an understanding of "under-the-hood" of postgres transfers to Aurora -- that is, does your tuning knowledge transfer as well, or is it just you can migrate your codebase transparently

Though now that I think about it, I think Aurora gives relatively little in tuning accees, so it's more of whether the hueristics transfer (eg the ol' avoid all joins, which I've always been suspicious of, but still don't know if it's a useful saying)

phamilton said 2 months ago:

There are a number of gotchas around tuning though. Buffer cache, for example, was fully redesigned for Aurora postgres. Aurora defaults are pretty good, but in our case we had some tuning of cache settings in place before moving over and the result was terrible. Wiping out our tuning and just using the defaults was a good fix. Also, pgbouncer didn't play very well with the aurora reader endpoints and we had to mess around with that.

But from the application perspective, it all runs pretty seamlessly. We've never had a behavioral difference between development against postgres and production with Aurora. Perf has really been the only difference, and perf in development never represents perf in prod at large scale anyway.

setr said 2 months ago:

I believe he was asking whether an understanding of "under-the-hood" of postgres transfers to Aurora -- that is, does your tuning knowledge transfer as well, or is it just you can migrate your codebase transparently

kqr said 2 months ago:

What are latencies like compared to a pg server on physical servers?

phamilton said 2 months ago:

Query latency is about the same, especially if you are serving from cache. If you have to hit the storage layer it probably slows down but we have a 99%+ buffer cache hit rate so we don't see it that often.

Replication lag is pretty steady at 15ms.

gregn610 said 2 months ago:

I'm surprised that no-one is complaining about the lack of Active Directory integration for authorization.

#disclaimer - Author of an AD integration solution that never got off the ground.


LunaSea said 2 months ago:

I think it's a bad idea for a database to start implementing third-part vendor related features.

That's the type of feature that should be implemented as a plugin.

nijave said 2 months ago:

You could just call it "ldap authentication". AD comes with an LDAP interface.

anarazel said 2 months ago:

Postgres does have ldap based auth, and also can authenticate against AD using sspi/gssapi.

The problem with that is that it requires users to have been created inside postgres first, and that you can't manage group membership inside AD.

gregn610 said 2 months ago:

yup, that's what I meant by authorisation, keeping the roles and groups in pgsql up to date.

fanf2 said 2 months ago:

Or Kerberos authentication (which AD also supports) https://www.postgresql.org/docs/current/gssapi-auth.html

jtdev said 2 months ago:

Yes, and that AD LDAP interface is riddled with MS specific deviations that require complying with MS’s way of doing LDAP.

mrfusion said 2 months ago:

Can I bounce this idea off you guys.

Would there be a market for a dba to charge maybe 100-200. Just comes in, listens to your DB use cases, and recommends various config/setting changes, hardware, etc?

It seems so much better than having a team of programmers study Postgres settings for a week. That was my last experience with it at least.

pas said 2 months ago:

Percona does that, and they grew to a pretty decent size. If you can find enough clients and visit them every few weeks/months it can work.

codegladiator said 2 months ago:

Why wouldn't an automated interface be able to give similar recommendations ?

no-s said 2 months ago:

>Why wouldn't an automated interface be able to give similar recommendations ?

you being sarcastic? If only we had an automated interface to make decisions about what code to write, then we wouldn't need programmers. Look how well that turned out IRL. We don't really need many assembly language programmers any more, but now we have all these nifty new programming languages...

codegladiator said 2 months ago:

I am under the impression that all the combinations of all configurations would still be small enough to present neatly maybe in a wizard.

jakearmitage said 2 months ago:

Does anyone know any good book with PostgreSQL tips and tricks and cool snippets? Coming from MySQL, I'm always finding exciting things: checks, date ranges and GIST, to_tsvector(title) @@ websearch_to_tsquery('foobar')...

codegladiator said 2 months ago:

Wow this article and this thread is so full of useful information for getting started to ventur into DB management/engineering.

osrec said 2 months ago:

Not sure if this makes sense, however, I've always preferred MariaDB to Postgres because it feels lighter. And I haven't really come up against any significant limitations in MariaDB that would make me want to switch to Postgres.

stickfigure said 2 months ago:

Transactional DDL. I for the life of me cannot understand how anyone survives in production without this. Half-applied migrations will prevent you from both deploying new code and rolling back to the old. Expect significant downtime.

edoceo said 2 months ago:

Heavy writes, mulitstep transactions is where PG has always won for me vs Maria and MySQL - it was even more pronounced when I decided I'm all in on PG c2003

redis_mlc said 2 months ago:

1) 2003 was 17 years ago.

2) PG has a write amplification problem with multiple indexes that MySQL Innodb doesn't have.

edoceo said 2 months ago:

2003 was when I picked PG as preferred. Since then I've continues to work with DBs and of course evaluated Maria at least every 18 months (or a client was paying me to do this). PG continues to win for nearly every case I throw at it. Like when NoSQL was a fad (why tho!?) but PG had native json, then jsonb types. I didn't even have to look at that silly Mongo, PG baba!

djd20 said 2 months ago:

You have a problem if you're trying to index every column...

asah said 2 months ago:

I'm curious how innodb solves this, given that the theory of indices pretty much demands write amplification by definition. Postgres' WA is worse?

edoceo said 2 months ago:

Its from an Uber post - more story and follow up in this post https://dba.stackexchange.com/questions/226879/did-postgresq...

djd20 said 2 months ago:

Yeah - they indexed every single column.... also - what may not have worked for uber, will be just fine for 99% of usecases. How often do you work with that kind of load. What they moved to isn't so much mysql, rather mysql used as a key value store with 0 relational database usage. Not really a postgres issue, rather one of being too big for relational databases.

philliphaydon said 2 months ago:

Does MariaDB have documentation?

osrec said 2 months ago:
philliphaydon said 2 months ago:

Ahh under server documentation. The site could use some TLC. Not easy to navigate.


rishav_sharan said 2 months ago:

As a not very tech savvy person, object/hierarchical data querying in pg is horrible. The query syntax is alien and just feels tacked on.

Tostino said 2 months ago:

Mind giving an example?

asah said 2 months ago:

What do you consider natural? It's not hard to add something else.

said 2 months ago:
Mountain_Skies said 2 months ago:

Anyone else getting a 500 error for the link?

andy_ppp said 2 months ago:

Hopefully a database problem


said 2 months ago:
thayne said 2 months ago:

why does postgresql use a process per connection? Is there some advantage to doing this over using threads for each connection (or even asynchronous connection handling)?

int_19h said 2 months ago:

Postgres dates back to the era when multiple processes were the normal way of doing this kind of thing on Unix-like OSes, and POSIX threads were often poorly supported on free ones.

animalnewbie said 2 months ago:

Anyone got postgres (or general db) tips for a CoW-fs? I'm using postgres on zfs.

p_l said 2 months ago:

Ensure that the dataset containing your postgres data is configured with record size equal to postgres page size or close enough (Lots of places use 8kB ZFS records for 4kB pages).

This will reduce write amplification due to excessive read-modify-write cycles.

yjftsjthsd-h said 2 months ago:

So, with the major caveat that I am not an expert and your mileage will vary: After some playing around with it, I intentionally reverted our postgres datasets back to the default ZFS size (EDIT: 128K) because we weren't super performance sensitive and the smaller pages killed compression. Obviously compression ratio vs speed is going to depend very heavily on exactly what you're doing, but it seems to have been a good trade for us.

p_l said 2 months ago:

an interesting hack is to create two tablespaces, one with record size of 8kB, one with recordsize set to maximum, and then appropriately assign tables to them according to ones performance needs. Rarely-written (for example historical) data can be put into partitions living on the large record tablespace (for example 1M recordsize) and have indexes redone with 100% fillfactor.

Of course all of that should be informed by getting actual data about performance first ;)

Tostino said 2 months ago:

Pg uses 8kb pages by default. You do really want your fs/db page size to match though except in very very specific scenarios.

p_l said 2 months ago:

Ahh, I didn't know it used 8kB page - that makes it even better, because it removes any read-modify-write cycle if you run with record size of 8k.

animalnewbie said 2 months ago:

Why doesn't pg query the fs for the default scenario I wonder

Tostino said 2 months ago:

Because it's a compile time flag, not a init or config parameter.

I agree it would be nice if the page size was more adaptive to just not have FS page size alignment issues.

rb808 said 2 months ago:

Its scary that my company is replacing old Oracle DBs and teams of specialist DBAs with developers running their own PostSQL instances. Sure it saves money but its gonna blow one day soon...

Tostino said 2 months ago:

Getting rid of Oracle, fine... But getting rid of specialists is probably not the smartest move. I don't think Postgres will have anything to do with the problems your company may face.

skyde said 2 months ago:

is it one postSQL per app replacing many app on a single Oracle instance ?

moving away from some sharing fix most of the problem DBA are hired to mitigate