We do the same thing, every employee can access our main financial/back office SQL database, but we just use PostgreSQL with row level security[0]. We never bothered to complicate it like the post does.
I want to build a shared postgres db with hundreds of small apps (OLTP) accessing shared tables using a RLS model model against well defined tables.
What are other limitations and mitigations folks have used or encountered to support stability and security? Things like
- Query timeouts to prevent noisy neighbors
- connection pooling (e.g. pgbouncer) also for noisy neighbors
- client schema compatibility (e.g. some applications running older versions, have certain assumptions about the schema that may change over time)
If you have people running crappy SQL SELECT, it can be a problem. statement-timeout[0] is your friend here. You still have to be on watch, and teach your users not to write crappy SQL.
You can also limit it by creating read-only replica's and making SELECT's happen on the replica. We don't usually bother, since 99% of our users are employees of ours, we can teach them to not be stupid. Since their usage doesn't change much over time, we can usually just hand them a SQL query and say: here run this instead.
Most of our employees don't even know they have SQL access, it's not like we force people to learn SQL to get their job done. Because of RLS and views, the ones that do SQL don't have to know much SQL, even if they do happen to use it. SELECT * from employees; gets them access to basically all the employee info they could want, but only to the employees they have access to. If you are a manager with 10 people, your select returns only your 10 people.
The payroll staff runs the same query and gets all of the employees they handle payroll for. Since our payroll is done inside of PostgreSQL(thanks plpython[1]), we can do some crazy access control stuff that most systems would never even dream about. Whenever new auditors come in and see that our payroll staff is limited to seeing only the info they need to do payroll, and only for their subset of employees they actually pay, they are awestruck.
The random vendors that can't be taught, we usually hand them a nightly SQLite dump instead. I.e let them pay the CPU cost of their crappy SQL.
Around client schema compatibility. This happens with other models too(API, etc). It's not unique to PG or SQL Databases. You have to plan for it. Since most all of our users interact with views and not with the actual underlying tables, it's not usually that big of a deal. In the extreme cases, where we can't just keep around a view for them, we have to help them along(sometimes kicking and screaming) into a new version.
I'd be so uncomfortable with this. It sounds like you're placing the full burden of access on a single boundary. I mean, maybe there's more to it that you haven't spoken about here, but "everything rests on this one postgres feature" is an unacceptably unsafe state to me.
It's not like RLS is just some random feature they are misusing. It's specifically for security and is absolutely reliable. Maybe you should do a bit more research before making comments like this.
You can (and in some cases should) combine this with other boundaries, like access control or specific query logic. RLS moves the core checks closer to the data being controlled (i.e. the database) specifically to prevent the errors like forgetting to add the "where user_id = xxx" clause. It is super-valuable in compliance scenarios where someone like your DB Admin has permission to control access but not see any data, and consumers (both devs and end users) can write queries to see different levels of access but are not allowed to control data.
Obviously it's not a silver bullet and the isolation can be confusing when debugging, but generally a single point for your applying RBAC is a feature not a shortcoming. The next level of security might be how you define your roles.
I actually believe the simplest, most secure client scenario is physical isolation, where you give the user/consumer only the data they are allowed to use and then don't try to control it (someone mentioned this above, using parquet & duckdb). There's downsides here too: doesn't work for write scenarios, can be resource intensive or time delayed, doesn't handle chain of custody well, etc. You typically have two strategies:
1. pick the best approach for the specific situation.
2. pick your one tool as your hammer and be a d!ck about it.
Correct, but you have yet to mention any other solution where there is more than one boundary at the data access layer. Because there really isn't any.
You can code it yourself in your bespoke app, have your vendor maintain it with their bespoke access control, or let RLS do it. There aren't really any other options that I'm aware of.
Personally, having done the "code it yourself in your bespoke app" it's a PITA and it's generally not nearly as good as RLS. That's what we did before RLS and it sucked.
On top of that, you can do things like SSO, data encryption, etc, but those are not data access layers, those are different layers. We do these things too(tho very little of the data encryption part, since it's such a PITA to make work very reliably even with vault/boa holding the encryption keys for us).
> Correct, but you have yet to mention any other solution where there is more than one boundary at the data access layer. Because there really isn't any.
I have. I gave the example of using RLS where users still provide the token to gain RLS privileges but an app brokers and constraints the connections. I have also given the example of encryption, to which your response is that encryption is hard, which I don't think is true but doesn't really change anything. Encryption is absolutely a data access layer control.
This is the real world not everybody can get a perfectly isolated database instance. Also you do realize that is not necessarily any more secure than RLS right? Something still has to control what database the code connects to. That could have a flaw just as much as you could have a flaw when setting up RLS.
RLS is one of the simplest things to set up properly, if you can't figure it out I don't think I'm the one who doesn't know what they're talking about.
row level security is not a feature specific to Postgres, but more a pretty standard and acceptable way to control access in a multitenant or multicontext environment that pretty much every data provider supports/implements. When it comes to answering a single specific question (like the one RLS targets) I believe you DO want a single, simple answer, vs. something like "it uses these n independent things working in conjunction..."
Your SSO system is a lot of pressure on one control too. Nobody seems to have problems with Azure or Okta or whatever SSO system you use having every key to the kingdom.
RLS has been around a long time and is very stable and doesn't change much. SSO providers keep adding stuff ALL the time, and they regularly have issues. PG RLS is very boring in comparison.
I don't remember the last CVE or outage we had with PG that broke stuff. I can't remember a single instance of RLS causing us access control problems on a wide scale. Since we tied their job(s) to their access control many years ago, it's very rare that we even have the random fat-fingered access control issue for a single user anymore either. I think the last one was a year ago?
> Your SSO system is a lot of pressure on one control too. Nobody seems to have problems with Azure or Okta or whatever SSO system you use having every key to the kingdom.
Some do, which is why they want MFA on the target side as well as on their SSO. But yes, SSO is very scary and there's a ton of security pressure on it. I don't think that's a very good argument for why we should think that every system should only require one layer of defense.
I'm going to sort of skip over any comparison to SSO since I'm not going to defend the position of "SSO is fine as a single barrier", especially as SSO is rarely implemented with one policy - there's device attestation, 2FA, etc.
> RLS has been around a long time and is very stable and doesn't change much.
RLS is great, I'm a fan.
> I don't remember the last CVE or outage we had with PG that broke stuff.
It doesn't really matter. The fact is that you're one CVE away from every employee having access to arbitrary data, including financial data. I feel a bit like a broken record saying this.
> It doesn't really matter. The fact is that you're one CVE away from every employee having access to arbitrary data, including financial data. I feel a bit like a broken record saying this.
Sure, but it's the same with pretty much any other app architecture.
Either your app has all the data access and you put your access control there, or you do the access control in the database. There really aren't other options here. There isn't access control defense in depth here. The best you can really do is do some data encryption of the data in the tables. We do some of that, but it's such a PITA that we relegate it to special stuff only.
> especially as SSO is rarely implemented with one policy - there's device attestation, 2FA, etc.
Sure but ALL of that relies on the SSO system behaving as advertised, so you think of it as separate policies, but it really isn't. It's once SSO CVE away from giving away the store. We use SSO with PG, that's how they authenticate to PG, we are fans of SSO too.
> Either your app has all the data access and you put your access control there, or you do the access control in the database. There really aren't other options here.
That's odd, I just clearly delineated an option in which this is not the case. The DB enforces RLS, users provide the RLS token, and an app gates access to the db.
It's not really any different than using pgbouncer or something similar. All it buys you is not having to use the PG protocol on the client.
There is no extra security here though. One could even argue you lose security here, since now you have to rely on the app to not get tokens confused, since they hold and use the tokens on behalf of the user. A single bad CVE in the app and one can become any user the app currently knows about.
I think the conversation is suffering from trying to solve a problem that's underspecified, which is going to just lead to a lot of "what if" questions. I've explained my concern with relying on a single boundary, which should not be contentious and I have given multiple examples of potential additional controls.
Most companies, or at least the ones I've worked at, don't use row level security at all. Instead, the application just accesses the multi tenant database.
It's absolutely littered with broken access control vulnerabilities. You have to remember to put the user key and group in every query across the entire application. And then there's dynamic queries and ORMs, which make everything worse. Now you cant even audit the source code, you have to observe behavior.
Most people don't know their applications have these vulnerabilities, but they're very common.
I have never suggested that there's something wrong with RLS nor that RLS and an application should be mutually exclusive, as I've pointed out multiple times now.
Right, I'm just saying that, in practice, most web applications don't use it. From what I've seen. So in a relative sense, this is an improvement. And I think that's generally how we should view security. Within the context of the threat model.
Well, I mean it's not only RLS, but yes it's only PostgreSQL doing the access control as far as if they can see a particular table or row.
Every user gets their own role in PG, so the rest of the PG access control system is also used.
We have your normal SSO system(Azure) and if Tootie employee doesn't need access to Asset Control, they don't get any access to the asset schema for instance.
What would be your method?
You would have some app that your dev team runs that handles access control, so your app gets unrestricted access to the DB. Now your app is the single boundary, and it forces everyone to go through your app. How is that better? It also complicates your queries, with a ton of extra where conditions.
A bunch of bespoke access control code you hope is reliable or a feature of the database that's well tested and been around for a long time. pgtap[0] is amazing for ensuring our access control (and the rest of the DB) works.
If some random utility wants to access data, you either have to do something special access wise, or have them also go through your app(let's hope you have an API and it allows for whatever the special is). For us, that random utility gets SQL access just like everyone else. They get RLS applied, etc. They can be naive and assume they have total control, because when they do select * from employees; they get access to only the employee column and rows we want that utility to have.
We have a bunch of tools over the decades that need access to various bits of our data for reason(s). Rather than make them all do wacky stuff with specialized API's, they just get bog standard PG SQL. We don't have to train vendor Tito how to deal with our stuff, we just hand them their auth info to PG and they can go to town. When people want Excel spreadsheets, they just launch excel, do a data query and their data just shows up magically. All from within Excel, using the standard excel data query tools, no SQL needed.
Conceptually that's no different to any security measures that prevent you from accessing data you're not supposed to? At the end of the day with all data that is colocated you're trusting that some permission feature somewhere is preventing you from accessing data you're not supposed to.
We trust that Amazon or Google or Microsoft are successful in protecting customer data for example. We trust that when you log into your bank account the money you see is yours, and when you deposit it we trust that the money goes into your account. But it's all just mostly logical separation.
We just create mini data "ponds" on the fly by copying tenant isolated gold tier data to parquet in s3. The users/agent queries are executed with duckdb. We run this process when the user start a session and generate an STS token scoped to their tenant bucket path. Its extremely simple and works well (at least with our data volumes).
Reasons 1-3 could very well be done with ClickHouse policies (RLS) and good data warehouse design. In fact, that’s more secure than a compiler adding a where to a query ran by an all mighty user.
Reason 4 is probably an improvement, but could probably be done with CH functions.
The problem with custom DSLs like this is that tradeoff a massive ecosystem for very little benefit.
as a clickhouse architect, i can’t help but feel they would have been better served by understanding their db’s native capabilities better before jumping into implementing this.
row level access control, resource quotas, scheduling policies, session settings, etc. all could have been used in concert to achieve a very similar outcome with a dozen or so ddl/dcl statements.
We did this with MotherDuck, and without introducing a new language. Every tenant has their own isolated storage and compute, so it’s trivial to grant internal users access to specific tenants as needed. DuckDB’s SQL dialect is mostly just Postgres’ with some nice ergonomic additions and a host of extra functionality.
> How do you let users write arbitrary SQL against a shared multi-tenant analytical database without exposing other tenants' data or letting a rogue query take down the cluster?
For query operations I would try to find a way to solve this with tools like S3 and SQLite. There are a few VFS implementations for S3 and other CDNs.
Curious howw you handle a user who accidentally writes a query that does a full table scan across billions of rows does the quota kick in fast enough or have you had cases where it still causes noticeable impact on other tenants?
78 comments
0: https://www.postgresql.org/docs/18/ddl-rowsecurity.html
How do you enforce tenant isolation with that method, or prevent unbounded table reads?
We do something similar for our backoffice - just with the difference that it is Claude that has full freedom to write queries.
What are other limitations and mitigations folks have used or encountered to support stability and security? Things like
You can also limit it by creating read-only replica's and making SELECT's happen on the replica. We don't usually bother, since 99% of our users are employees of ours, we can teach them to not be stupid. Since their usage doesn't change much over time, we can usually just hand them a SQL query and say: here run this instead.
Most of our employees don't even know they have SQL access, it's not like we force people to learn SQL to get their job done. Because of RLS and views, the ones that do SQL don't have to know much SQL, even if they do happen to use it. SELECT * from employees; gets them access to basically all the employee info they could want, but only to the employees they have access to. If you are a manager with 10 people, your select returns only your 10 people.
The payroll staff runs the same query and gets all of the employees they handle payroll for. Since our payroll is done inside of PostgreSQL(thanks plpython[1]), we can do some crazy access control stuff that most systems would never even dream about. Whenever new auditors come in and see that our payroll staff is limited to seeing only the info they need to do payroll, and only for their subset of employees they actually pay, they are awestruck.
The random vendors that can't be taught, we usually hand them a nightly SQLite dump instead. I.e let them pay the CPU cost of their crappy SQL.
Around client schema compatibility. This happens with other models too(API, etc). It's not unique to PG or SQL Databases. You have to plan for it. Since most all of our users interact with views and not with the actual underlying tables, it's not usually that big of a deal. In the extreme cases, where we can't just keep around a view for them, we have to help them along(sometimes kicking and screaming) into a new version.
0: https://www.postgresql.org/docs/current/runtime-config-clien...
1: https://www.postgresql.org/docs/current/plpython.html
Obviously it's not a silver bullet and the isolation can be confusing when debugging, but generally a single point for your applying RBAC is a feature not a shortcoming. The next level of security might be how you define your roles.
I actually believe the simplest, most secure client scenario is physical isolation, where you give the user/consumer only the data they are allowed to use and then don't try to control it (someone mentioned this above, using parquet & duckdb). There's downsides here too: doesn't work for write scenarios, can be resource intensive or time delayed, doesn't handle chain of custody well, etc. You typically have two strategies:
1. pick the best approach for the specific situation.
2. pick your one tool as your hammer and be a d!ck about it.
You can code it yourself in your bespoke app, have your vendor maintain it with their bespoke access control, or let RLS do it. There aren't really any other options that I'm aware of.
Personally, having done the "code it yourself in your bespoke app" it's a PITA and it's generally not nearly as good as RLS. That's what we did before RLS and it sucked.
On top of that, you can do things like SSO, data encryption, etc, but those are not data access layers, those are different layers. We do these things too(tho very little of the data encryption part, since it's such a PITA to make work very reliably even with vault/boa holding the encryption keys for us).
> Correct, but you have yet to mention any other solution where there is more than one boundary at the data access layer. Because there really isn't any.
I have. I gave the example of using RLS where users still provide the token to gain RLS privileges but an app brokers and constraints the connections. I have also given the example of encryption, to which your response is that encryption is hard, which I don't think is true but doesn't really change anything. Encryption is absolutely a data access layer control.
> every employee can access our main financial/back office SQL database
This means that there is no access gate other than RLS, which includes financial data. That is a lot of pressure on one control.
RLS has been around a long time and is very stable and doesn't change much. SSO providers keep adding stuff ALL the time, and they regularly have issues. PG RLS is very boring in comparison.
I don't remember the last CVE or outage we had with PG that broke stuff. I can't remember a single instance of RLS causing us access control problems on a wide scale. Since we tied their job(s) to their access control many years ago, it's very rare that we even have the random fat-fingered access control issue for a single user anymore either. I think the last one was a year ago?
> Your SSO system is a lot of pressure on one control too. Nobody seems to have problems with Azure or Okta or whatever SSO system you use having every key to the kingdom.
Some do, which is why they want MFA on the target side as well as on their SSO. But yes, SSO is very scary and there's a ton of security pressure on it. I don't think that's a very good argument for why we should think that every system should only require one layer of defense.
I'm going to sort of skip over any comparison to SSO since I'm not going to defend the position of "SSO is fine as a single barrier", especially as SSO is rarely implemented with one policy - there's device attestation, 2FA, etc.
> RLS has been around a long time and is very stable and doesn't change much.
RLS is great, I'm a fan.
> I don't remember the last CVE or outage we had with PG that broke stuff.
It doesn't really matter. The fact is that you're one CVE away from every employee having access to arbitrary data, including financial data. I feel a bit like a broken record saying this.
> It doesn't really matter. The fact is that you're one CVE away from every employee having access to arbitrary data, including financial data. I feel a bit like a broken record saying this.
Sure, but it's the same with pretty much any other app architecture.
Either your app has all the data access and you put your access control there, or you do the access control in the database. There really aren't other options here. There isn't access control defense in depth here. The best you can really do is do some data encryption of the data in the tables. We do some of that, but it's such a PITA that we relegate it to special stuff only.
> especially as SSO is rarely implemented with one policy - there's device attestation, 2FA, etc.
Sure but ALL of that relies on the SSO system behaving as advertised, so you think of it as separate policies, but it really isn't. It's once SSO CVE away from giving away the store. We use SSO with PG, that's how they authenticate to PG, we are fans of SSO too.
> Either your app has all the data access and you put your access control there, or you do the access control in the database. There really aren't other options here.
That's odd, I just clearly delineated an option in which this is not the case. The DB enforces RLS, users provide the RLS token, and an app gates access to the db.
It's not really any different than using pgbouncer or something similar. All it buys you is not having to use the PG protocol on the client.
There is no extra security here though. One could even argue you lose security here, since now you have to rely on the app to not get tokens confused, since they hold and use the tokens on behalf of the user. A single bad CVE in the app and one can become any user the app currently knows about.
Most companies, or at least the ones I've worked at, don't use row level security at all. Instead, the application just accesses the multi tenant database.
It's absolutely littered with broken access control vulnerabilities. You have to remember to put the user key and group in every query across the entire application. And then there's dynamic queries and ORMs, which make everything worse. Now you cant even audit the source code, you have to observe behavior.
Most people don't know their applications have these vulnerabilities, but they're very common.
Every user gets their own role in PG, so the rest of the PG access control system is also used.
We have your normal SSO system(Azure) and if Tootie employee doesn't need access to Asset Control, they don't get any access to the asset schema for instance.
What would be your method?
You would have some app that your dev team runs that handles access control, so your app gets unrestricted access to the DB. Now your app is the single boundary, and it forces everyone to go through your app. How is that better? It also complicates your queries, with a ton of extra where conditions.
A bunch of bespoke access control code you hope is reliable or a feature of the database that's well tested and been around for a long time. pgtap[0] is amazing for ensuring our access control (and the rest of the DB) works.
If some random utility wants to access data, you either have to do something special access wise, or have them also go through your app(let's hope you have an API and it allows for whatever the special is). For us, that random utility gets SQL access just like everyone else. They get RLS applied, etc. They can be naive and assume they have total control, because when they do select * from employees; they get access to only the employee column and rows we want that utility to have.
We have a bunch of tools over the decades that need access to various bits of our data for reason(s). Rather than make them all do wacky stuff with specialized API's, they just get bog standard PG SQL. We don't have to train vendor Tito how to deal with our stuff, we just hand them their auth info to PG and they can go to town. When people want Excel spreadsheets, they just launch excel, do a data query and their data just shows up magically. All from within Excel, using the standard excel data query tools, no SQL needed.
0: https://pgtap.org/
We trust that Amazon or Google or Microsoft are successful in protecting customer data for example. We trust that when you log into your bank account the money you see is yours, and when you deposit it we trust that the money goes into your account. But it's all just mostly logical separation.
Reason 4 is probably an improvement, but could probably be done with CH functions.
The problem with custom DSLs like this is that tradeoff a massive ecosystem for very little benefit.
row level access control, resource quotas, scheduling policies, session settings, etc. all could have been used in concert to achieve a very similar outcome with a dozen or so ddl/dcl statements.
It’s a sql that compiles to the real database sql based on configuration.
> How do you let users write arbitrary SQL against a shared multi-tenant analytical database without exposing other tenants' data or letting a rogue query take down the cluster?
For query operations I would try to find a way to solve this with tools like S3 and SQLite. There are a few VFS implementations for S3 and other CDNs.