A couple of weeks ago I was given the task of helping a developer figure out why his PostgreSQL cluster wouldn’t fail over. The specific problem was to test failover prior to launching the PostgreSQL cluster into Production. Sounds easy enough, right? Setting things up wasn’t too bad, but finding answers quickly was a challenge.

Setting up

The last thing I wanted to do was to start potentially destructive testing on a cluster that was to be used in Production. Instead I opted to create a couple of PostgreSQL instances that I could set replication up on. Without thinking much, I set up instances on my laptop and ran them under Windows. I got to thinking about that, and realized it wasn’t going to reproduce the same results as the existing Linux cluster. So I spun up a couple of Ubuntu VMs.

Once the VMs were spun up on my laptop, I had to install PostgreSQL, which wasn’t that difficult. In fact, it was a pretty simple process of making sure Ubuntu was updated, installing PostgreSQL per the documentation, and then configuring the HBA file (pg_hba.conf) to allow me to connect. The challenge came in setting up replication for PostgreSQL. However, once the instances were set up, it was pretty simple to get replication configured.

Getting the failover to work

The way PostgreSQL handles replication is counterintuitive compared to the way SQL Server does it. Of course, to be fair, I’m sure a PostgreSQL DBA would think the same about SQL Server. The Master-Slave relationship is just that: the Master telling the Slave what to do. If the Master goes down, the Slave is still available for reads. If the Slave goes down, the Master keeps going. In this relationship, the Master is the only write node. So when you promote the Slave to Master, the previous Master is orphaned. To rejoin the cluster, the previous Master must be reconfigured as a Slave. Good times there. No failing over in a back and forth fashion.
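To make the one-way nature of that relationship concrete, here is a toy model in Python. It is only a sketch of the roles described above, not anything PostgreSQL itself exposes: the Node class, the role names, and the promote and rejoin_as_slave helpers are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A hypothetical cluster member; role is 'master', 'slave', or 'orphaned'."""
    name: str
    role: str

    @property
    def is_cluster_write_node(self):
        # Only the current Master takes writes for the cluster.
        return self.role == "master"


def promote(slave, old_master):
    """Promote the Slave; the previous Master drops out of the cluster."""
    if slave.role != "slave":
        raise ValueError(f"{slave.name} is not a slave")
    slave.role = "master"
    old_master.role = "orphaned"


def rejoin_as_slave(node):
    """An orphaned former Master can only come back as a Slave."""
    if node.role != "orphaned":
        raise ValueError(f"{node.name} is not orphaned")
    node.role = "slave"


if __name__ == "__main__":
    a = Node("pg1", "master")
    b = Node("pg2", "slave")
    promote(b, a)        # pg2 is now the write node, pg1 is orphaned
    rejoin_as_slave(a)   # pg1 returns, but only as a slave of pg2
    print(a, b)
```

The point of the model is that there is no "swap back" step: after a promotion the old Master has to be rebuilt as a Slave of the new one.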

My issue was that the promotion step kept failing. I would get this jacked error:

pg_ctl: server did not promote in time

You have to love Linux for its vagueness. Turns out, it was a pretty easy fix. When I was following the setup guide, I had copied in a recovery.conf file. Only, I didn’t change the owner of the file. Had I taken more than 2 seconds to remind myself to check the log file, I would have discovered this problem much sooner. As it turns out, I made a post about this on DBA Stack Exchange and was reminded to check the log file. In my defense, my Linux is a little rusty, but that doesn’t excuse the fact that I know to always check the log.

Turns out, changing the owner of the recovery.conf file made the promotion work flawlessly. I should know better. Always check the log file.
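If you want to catch this kind of thing before attempting a promotion, a quick ownership check is easy to script. The sketch below is my own illustration, not part of any PostgreSQL tooling: it assumes recovery.conf sits in the data directory and that the data directory itself is owned by the account the server runs as. The function name and the example path are hypothetical.

```python
from pathlib import Path


def files_with_wrong_owner(data_dir, names=("recovery.conf",), expected_uid=None):
    """Return the named files in data_dir whose owner differs from expected_uid.

    If expected_uid is not given, the owner of data_dir is used, on the
    assumption that the data directory belongs to the server's account.
    """
    data_dir = Path(data_dir)
    if expected_uid is None:
        expected_uid = data_dir.stat().st_uid
    mismatched = []
    for name in names:
        path = data_dir / name
        # Skip files that are not there; only report real ownership mismatches.
        if path.exists() and path.stat().st_uid != expected_uid:
            mismatched.append(path)
    return mismatched


if __name__ == "__main__":
    # Hypothetical data directory; substitute the one your install uses.
    for path in files_with_wrong_owner("/var/lib/postgresql/data"):
        print(f"{path} is not owned by the data directory's owner")
```

An empty result means the files you listed are owned by the same user as the data directory. It does not replace reading the log, which is still the first place to look.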
