Postgres: Thread safe insert/update
I don’t like stored procedures, but sometimes creating a materialized view can be reasonable. Business logic doesn’t belong in database procedural code, but, I guess, everyone has a few stories about how easily performance considerations overcome design principles. I think I had just one materialized view that I consider nice in my carrier until now. It was maintaining aggregated view of one quite heavy table. That materialized view was used instead of queries like
SELECT ... GROUP BY event_time::date It was long time ago, but up to present day I think that it was performance optimization in its purest form. No business logic was involved.
I’d like to share a trick that served me well in those triggers and I think it was not the last time. Every once in a while we need to create a row in a table if it isn’t there yet or update it otherwise. It can be implemented as easy as it sounds, with solid feeling of accomplishment, until one day you face process races.
Let’s suppose that we have two processes running the same code. Then the following sequence is possible. You probably already know that if you came here by googling, looking for a solution. But anyway:
- process A does
SELECTand finds out that row does not exist
- process B does
SELECT– still nothing there
- B inserts a row
- A tries to insert the same row
- BOOM! depending on whether you have unique constraints or not, process A gets an exception or database gets an extra row
Both outcomes, extra row and exception, may not be the worst things that can happen to some “blog” application, but fixing such bugs is good for self esteem. So let’s see what can be done here.
Race conditions are fixed with proper synchronization, and proper synchronization is based on locks. Postgres has nice implementation of shared and exclusive locks. We need an exclusive lock here which is gained by performing
SELECT * FROM table WHERE conditions FOR UPDATE. After such query, transaction owns the lock up to it’s end. There is no other pretty way to release locks.
We have locks implementation at hand, but, in above scenario, we may not have a row to lock yet. Okay, let’s make one. Hello, Giant Mutex:
-- somewhere in schema initialization / migration CREATE TABLE locks( name VARCHAR PRIMARY KEY ); INSERT INTO locks VALUES('MATVIEW_INSERT'); -- and then somewhere in materialized view update trigger ... SELECT 1 FROM locks WHERE name='MATVIEW_INSERT' FOR UPDATE; -- triggers are run inside transactions, so everything below is run under global exclusive lock SELECT INTO id_ id FROM my_matview WHERE .... IF NOT FOUND THEN INSERT INTO my_matview VALUES(...) RETURN; END IF; UPDATE my_matview SET ... WHERE id = id_;
Well, that’s nice. But we killed concurrency quite a bit. And that’s not nice because rows actually do exist most of the time. There is a pattern which is called “double checked locking” and often used in singleton resources initialization. An adaptation for our example would look like this:
SELECT INTO id_ id FROM my_matview WHERE .... IF NOT FOUND THEN SELECT 1 FROM locks WHERE name='MATVIEW_INSERT' FOR UPDATE; SELECT INTO id_ id FROM my_matview WHERE .... IF NOT FOUND THEN INSERT INTO my_matview VALUES(...) RETURN; END IF; END IF; UPDATE my_matview SET ... WHERE id = id_;
Note how we obtain lock only when first “easy” attempt to find row fails. Then we go “serious”, and when we do, we need to perform
SELECT check once again. That’s because, like at the very beginning, several threads can get down to serious business at the same time.
The next things to worry about are deadlocks. But that probably deserves a separate post.