Strategies to Restore corrupted Postgres Database

I have posted this question on Server Fault as well just to see if I can get some help from any postgres gurus. I’m reaching out here as well to see if there’s anything I can try.

First and foremost, I’ve verified that all of the git repositories are ok and we have many backups of our storage box that contains them. As such, the only data that’s at risk of being lost are the 30 or so user accounts and the organization as well as any permissions I had setup. Even still, I would like to avoid the tedium if possible.

We’re running gitea and postgres through docker using the following docker compose config.

version: "3.9"
services:
  gitea:
    image: gitea/gitea:latest
    restart: always
    hostname: ###.###.###.###
    environment:
      - USER=git
      - USER_UID=1000
      - USER_GID=998
      - GITEA__database__DB_TYPE=postgres
      - GITEA__database___HOST=giteadb:5432
      - GITEA__database__NAME=gitea
      - GITEA__database__USER=gitea
      - GITEA__database__PASSWD=###################
    networks:
      - gitea
    ports:
      - 3000:3000
      - 22:22
    volumes:
      - /mnt/###################/gitea/data:/data
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    shm_size: 256m
    depends_on:
      - giteadb

  giteadb:
    image: postgres:14
    restart: always
    environment:
      - POSTGRES_USER=gitea
      - POSTGRES_PASSWORD=###################
      - POSTGRES_DB=gitea
    networks:
      - gitea
    volumes:
      - /srv/###################/gitea/db:/var/lib/postgresql/data

  runner:
    image: gitea/act_runner:latest
    restart: always
    depends_on:
      - gitea
    volumes:
      - /mnt/###################/gitea/data/act_runner:/data
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - GITEA_INSTANCE_URL=https://git.###################.com
      # When using Docker Secrets, it's also possible to use
      # GITEA_RUNNER_REGISTRATION_TOKEN_FILE to pass the location.
      # The env var takes precedence.
      # Needed only for the first start.
      - GITEA_RUNNER_REGISTRATION_TOKEN=###################
networks:
  gitea:
    external: false

As mentioned in the server fault post, the database was corrupted because I mistakenly spun up two postgres instances that were pointing at the same volume, as seen in the docker ps taken just before shutting down the instances.

$ docker ps
CONTAINER ID   IMAGE                           COMMAND                  CREATED         STATUS                  PORTS                                                                          NAMES
845b1e2b0725   ##################-website   "/usr/bin/supervisor…"   4 months ago    Up 4 months (healthy)   0.0.0.0:8081->8080/tcp, :::8081->8080/tcp                                      ####################-website-1
46cd2f360104   gitea/act_runner:latest         "/sbin/tini -- /opt/…"   6 months ago    Up 4 months                                                                                            docker-runner-1
2173badc8b07   postgres:14                     "docker-entrypoint.s…"   6 months ago    Up 4 months             5432/tcp                                                                       docker-giteadb-1
0990a6a53465   gitea/act_runner:latest         "/sbin/tini -- /opt/…"   7 months ago    Up 4 months                                                                                            docker-compose-configs-runner-1
34fd2b4b3cd9   gitea/gitea:latest              "/usr/bin/entrypoint…"   7 months ago    Up 4 months             0.0.0.0:22->22/tcp, :::22->22/tcp, 0.0.0.0:3000->3000/tcp, :::3000->3000/tcp   docker-compose-configs-gitea-1
84a8132cc11d   postgres:14                     "docker-entrypoint.s…"   7 months ago    Up 4 months             5432/tcp                                                                       docker-compose-configs-giteadb-1
bda89d4a5cfe   176399451347                    "docker-entrypoint.s…"   18 months ago   Up 4 months             5432/tcp                                                                       docker-compose-configs-odoodb-1

Once I shut down the instances and brought them back up, the postgres container was bootlooping with the logs:

$ docker logs 8377a9155386
PostgreSQL Database directory appears to contain a database; Skipping initialization

2024-10-15 02:07:25.301 UTC [1] LOG:  starting PostgreSQL 14.11 (Debian 14.11-1.pgdg120+2) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-10-15 02:07:25.302 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-10-15 02:07:25.302 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2024-10-15 02:07:25.309 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-10-15 02:07:25.319 UTC [27] LOG:  database system was shut down at 2024-10-14 23:25:46 UTC
2024-10-15 02:07:25.319 UTC [27] LOG:  invalid primary checkpoint record
2024-10-15 02:07:25.319 UTC [27] PANIC:  could not locate a valid checkpoint record
2024-10-15 02:07:25.706 UTC [1] LOG:  startup process (PID 27) was terminated by signal 6: Aborted
2024-10-15 02:07:25.707 UTC [1] LOG:  aborting startup due to startup process failure
2024-10-15 02:07:25.736 UTC [1] LOG:  database system is shut down

I found many posts about “PANIC: could not locate a valid checkpoint record” and This stack overflow post in particular was very helpful. Following the steps from the top two answers, I was able to get the gitea server and database back up and running. However, there was significant data loss. While there were some accounts still accessible, a seemingly random set of accounts (as seen in the admin panel) were missing.

A couple github issues that I also found relevant were:

What Now?

So at this point, I’m faced with a few possibilities and associated questions:

  1. It’s possible that I might be able to recover the database, but I don’t know enough about postgres to know all of my options with that. As such, if anyone has suggestions or ideas for things I can try, let me know and I’ll give them a chance.
  2. I may need to re-setup the server and have everyone recreate their accounts and permissions. In this case I’d like to know how one might go about linking back all of the repositories to their respective accounts and organizations in a clean way, and hopefully a way that reduces the burden for everyone that was using this server having to re-configure their repositories.

I might be able to look later at how the DB is structured but you could potentially look at whatever schema gitea uses (should be a doc somewhere.) You could run a sql query against the partially restored DB and try to figure out how repos are assigned but given that some users are missing, this might be a fool’s errand.

Quite frankly, I’m wondering if it’s not worth the trouble salvaging, if it’s even possible.

So my two cents-

  1. Set up a new DB and start reconstructing as best you can.

  2. Use something like prodrigestivill/postgres-backup-local docker image in your compose that automatically backs up your DB and safeguard these backups. I doubt you’ll have this issue again since you seem to understand how it happened, how to avoid it, etc. but if it ever does (for a different reason either) you could just throw this into whatever postgres uses (/.docker-entrypoint.db or something?) to restore a backup.

Thank you for your response. Yeah, at this point I’ve decided to have everyone recreate their accounts. Luckily with the repo-adopt feature it’s relatively painless to get everything back in order for our small numbers of users. Much of this was through trial-and-verification to determine exactly what can be salvaged after wiping the entire database.

This experience taught me a few things worth documenting here:

  • The bare repos stored locally on the gitea server are valid repos you can clone, HOWEVER, their LFS files are not stored using standard LFS implementations. In my case, I wanted to try to get some active repos back up somewhere quickly so that my users weren’t blocked, but the LFS system that gitea uses is intentionally incompatible with LFS from the commandline. This is because gitea does some storage saving measures and combines the LFS files of all repositories into one directory. As such, it would be unsafe to reference that directory from all repos as a git lfs prune from any single repository would wipe out the LFS files for all other repos.
  • Packages do not have an adopt feature like repositories have. This might be a good feature for the future, but as of now it would be more work than it’s worth to try to figure out how one might fix package references when it’s easier for users to just re-upload their packages directly.
  • docker compose generates new container names using the config filename and the directory, /unless/ you use the container_name directive in the config. Doing this will stop the creation of identical containers that could cause conflicts like this with a name conflict error. There may be a better way to do this but most people want volumes to be double mounted instead of explicitly /not/ double mounted.

After having everyone remake their accounts we’re back in operation, and now with some well tested backup measures in place.

1 Like