Slightly Saner Server-Side Symmetric Encryption in PostgreSQL
This is the first of a series of articles on using advanced database techniques to hide the complexity of database encryption and improve its security. The focus is conceptual, not operational, with the hope that readers can apply some of these techniques to their databases. Every site is different and you must always take local requirements into account.
Some of the articles will seem to be a major step backward but it’s usually temporary and solely to make the concepts easier to understand.
Concepts
There are two types of ciphers used in encryption. Historically we only had “symmetric” ciphers. They were straightforward and relatively fast but required everyone have a copy of the key. This is a huge problem — how do you distribute the key? How can you prevent someone from illicitly modifying the data? It goes on and on.
The second type of cipher is an “asymmetric” cipher. This is a cipher based on mathematical concepts where it’s easy to go one way but difficult to go the other way unless you have some private information. A good example is that it’s easy to multiply together two large numbers but difficult to perform to identify the individual numbers given only the product (given some basic assumptions).
Despite popular belief, public-key encryption doesn’t encrypt everything using an asymmetric cipher. Instead, it uses a hybrid approach — the message itself is encrypted using a standard symmetric cipher and a random key — and the random key is encrypted using an asymmetric cipher. This has a lot of benefits, esp. when used with features like HMAC digests that can detect whether the encrypted content has been modified, but it results in a “ciphertext” that’s far larger than what a symmetric cipher creates.
PostgreSQL PGCrypto Extension
PostgreSQL has supported symmetric ciphers and the ability to encrypt database contents for many years, via the pgcrypto extension. Unfortunately, it was just the raw tools to properly support symmetric ciphers and typical implementations were painful to maintain and insecure.
For this reason, among others, the extension added support for public-key encryption a few years back. The implementation is much more robust and easier to use — but it means that each record has become bloated relative to symmetric ciphers.
This isn’t a problem if you only encrypt potentially large text fields — but could be a problem if you wish to encrypt fields containing nothing but a numeric value or date. A properly encrypted field will always require a minimum of around 40 bytes but that’s a fraction of what’s required by the smallest public key encrypted field.
Threat Model
Every discussion of security must start with the threat model. We want to ensure our actions are sufficient to address the identified threats but not so excessive that it interferes with usability.
The precise threat model will depend on the industry (e.g., legal requirements), individual company, individual purpose, etc., and far beyond what I can discuss here. I can address several high-level concerns.
Data in Flight
“Data in Flight” refers to the security of the data when it’s in transit between systems (or even between processes on a single system). For our purposes I consider this a solved problem:
- use TLS encryption — even on internal networks
- use a VLAN (if possible) or VPN
In the past few companies bothered with TLS encryption is entirely inside of a firewall but after the Snowden documents came out many companies became concerned that this isn’t enough. TLS and VPNs come at a modest performance hit but they’ve available and well-tested if you need to protect data in flight.
Data at Rest
“Data at Rest” refers to the security of the data when it’s written to a file, e.g., the database’s journal. It can also refer to temporary files (e.g., anything written to swap space) or even something sitting in system memory. This can quickly become a very deep problem once you start thinking about implementation details like the CPU’s internal cache memory.
For our purposes we’re only concerned about two places:
- the database’s journal (aka the server’s hard disk). This must be protected against both attackers with access to a live system and attackers with access to backup media
- the database client as it receives the data from the network.
In current practice, it’s common for the client application to be responsible for encrypting and decrypting the data. This is the most flexible — and it allows the application to be written so that only the code that needs to see the unencrypted data has access to it — but it’s also the riskiest since it requires distribution and maintenance of all encryption keys to all clients.
The alternative is to have the database itself manage encryption and decryption of the data. This might seem pointless — but this is where user permissions come into play. It is trivial for a database to limit access to schemas, tables, stored procedures, etc., on a user-by-user basis. An authorized user will see a standard table while an unauthorized user wouldn’t be able to see the table at all, or could get a database error message if they tried to retrieve encrypted information from it.
This is the approach I’ll be taking. The ultimate goal is to have tables that look like any other table to authorized users — but which are inaccessible to unauthorized users even if they’re illicitly been granted access to the schemas, tables, etc.
SQL Injection
“SQL Injection” refers to the security of the database even if an attacker can modify the SQL produced by the application software. In the worst case, the user can run arbitrary commands as the application user.
There are a few standard countermeasures to limit the damage possible, e.g., there should always be separate “owner” and “user” accounts. The “owner” can modify the schema (e.g., what tables are present) but often can’t modify the contents of the tables. The “user” can modify the data but can’t modify the schema.
Another approach, currently out of fashion, is to eliminate direct SQL queries directly and perform all actions via stored procedure calls. The stored procedures may still have security vulnerabilities but fewer than code written by the average application developer.
Note to self: this would be a very good topic to return to, e.g., why it’s much better to use a stored procedure or user-defined function (UDF) like bool isValidPassword(username text, password text) than to return the salted password hash and rely on the application to perform the same tests.
Topics
I plan to cover the following topics, subject to change:
- type-safe functions using user-defined ENUM types
- type-safe functions using user-defined composite type
- restricted access via a separate schema and SECURITY DEFINER
and possibly
- user-defined types in C and/or PL/Java
- accessing external key vaults such as Hashicorp Vault
Additional Resources
This work is inspired by Cryptography in the Database but the world has changed in the 16 years since that book was published. This series will show how to apply some of the concepts using advanced SQL and the pgcrypto extension but will ultimately diverge since we will now rely on REST services such as Hashicorp Vault for our credentials instead of custom key servers.