Potential database inconsistency with very large transactions

11 May 2026
Product: YSQL
Affected Versions: All
Related Issues: #30772
Fixed In: v2025.2.3.0, v2025.1.4.0, v2024.2.9.0

Description

A rare race condition can cause rows written in a very large transaction (more than 100K storage records) to overwrite rows written by later, smaller transactions when a specific sequence of SST flush, restart, and compaction events occurs.

This issue was observed during DDL operations in the following sequence:

  1. A full-database ANALYZE runs on a database whose tables have a total column count greater than or equal to 4300 (a query to estimate this count is shown after this list).
  2. A restart of the YB-Master process(es) happens before the system catalog is flushed to disk.
  3. A DDL statement such as ALTER TABLE runs.
  4. In the background, a flush and compaction occurs.
  5. Another DDL runs. At this point, caches are invalidated, and the inconsistency can cause the following effects:
    • Queries against the altered table can fail in v2025.2.0.0 and later, unless additional catalog cache preload flags are enabled on YB-TServer.
    • In other cases, all connections to the database can fail.
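
A rough way to estimate whether a database is close to the 4300-column threshold is to count the column definitions of its user tables, as in the following sketch. The exact number of internal records produced by ANALYZE depends on storage details, so treat the result only as an approximation.

  -- Approximate the total number of columns across user tables in the current database.
  SELECT count(*) AS total_user_columns
  FROM pg_attribute a
  JOIN pg_class c ON c.oid = a.attrelid
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE c.relkind = 'r'
    AND a.attnum > 0
    AND NOT a.attisdropped
    AND n.nspname NOT IN ('pg_catalog', 'information_schema');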

To check for a latent inconsistency when your cluster may be at step 4, run a pg_class consistency check against each database on the cluster to verify that the PostgreSQL catalog is consistent.
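
A minimal sketch of such a check follows; it assumes the goal is to find relations whose relnatts value in pg_class does not match the number of corresponding pg_attribute entries. Any rows returned indicate a mismatch that should be reviewed before proceeding.

  -- List relations whose pg_class.relnatts disagrees with their pg_attribute entries.
  SELECT c.oid::regclass AS relation,
         c.relnatts,
         count(a.attnum) AS attribute_entries
  FROM pg_class c
  LEFT JOIN pg_attribute a
         ON a.attrelid = c.oid
        AND a.attnum > 0
  GROUP BY c.oid, c.relnatts
  HAVING c.relnatts <> count(a.attnum);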

Mitigation

To reduce the likelihood of encountering this issue on affected versions:

  • Do not run full-database ANALYZE on versions earlier than v2025.2.3.0, v2025.1.4.0, or v2024.2.9.0. Instead, run ANALYZE on individual tables in batches (see the example after this list). On v2025.2.0.0 and later, prefer Auto Analyze, which is enabled by default when the cost-based optimizer (CBO) is enabled.
  • Upgrade to a release that contains the fix. Upgrading does not correct any data inconsistencies which have already occurred.
  • If you have already encountered the issue and are at Step 5 as described above with database connection failures, setting the YB-TServer flag --ysql_minimal_catalog_caches_preload=true should allow you to connect to the database, though queries on certain tables may still fail. The pg_class entries for affected tables may require manual correction; contact Yugabyte Support for assistance.
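
One way to run ANALYZE table by table is to generate a statement per user table and execute the output in small batches. This is only a sketch; the batch size and scheduling are left to you.

  -- Generate one ANALYZE statement per user table in the current database.
  SELECT format('ANALYZE %I.%I;', schemaname, relname)
  FROM pg_stat_user_tables
  ORDER BY schemaname, relname;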

Details

A sequence of events was identified in which changes from a later transaction can be partially undone by an earlier large transaction that updates more than 100K RocksDB rows. In theory this scenario can also occur on ordinary user data tables, but it is extremely unlikely there; it is more likely when DDL modifies the PostgreSQL catalog on a database with roughly 4300 or more columns in total, because catalog writes use the same transaction infrastructure.

The following DDL example describes the sequence of events that can trigger this bug. The example uses a full-database ANALYZE as the large, earlier transaction and ALTER TABLE ADD COLUMN as the later transaction.

  1. Large transaction commits. A large transaction commits with an update of more than 100K RocksDB rows in unpacked form.
    • A full-database ANALYZE can reach this limit when the total number of columns across all tables in the database is approximately 4300 or greater.
  2. Server restart. Before those rows are flushed to SST files, the YB-Master or the YB-TServer holding the transaction rows restarts.
  3. Partial WAL replay. During restart, WAL replay of these rows occurs. The transaction transfer code path erroneously transfers only the first batch of 100K RocksDB rows to regular RocksDB; the remaining rows of the transaction stay in the IntentsDB memtable, awaiting the next WAL replay.
    • In this example, remaining rows from the full-database ANALYZE can include pg_class rows that describe important table metadata, such as the number of columns (relnatts) and the relfilenode.
  4. Newer transaction commits. A newer transaction can update some of those rows that still reside in the intents memtable. No issues are seen at this point, and reads continue to behave as expected because reads merge pending intents with newer updates. Note that on commit, these newer writes move from intents to regular RocksDB.
    • In this example, an ALTER TABLE ADD COLUMN can update the pg_class entry to change the number of columns in the table (the relnatts column of pg_class).
  5. Flush and compaction occur. When enough rows accumulate, a flush and compaction can then occur on regular RocksDB. During compaction, newer updates that use the per-column format (such as those from ALTER TABLE) are merged into existing packed rows. The resulting packed row can retain an older timestamp after the per-column updates are applied.
  6. Inconsistency becomes visible. After this compaction, the remaining writes from the older large transaction, still waiting in IntentsDB to be transferred to regular RocksDB, can have the highest timestamp among the remaining entries and become visible to readers.
    • In this example, the compaction step causes the pg_class rows from the earlier ANALYZE transaction in IntentsDB to appear to have the latest timestamp. Consequently, the pg_class row reverts to the column count from the ANALYZE, undoing the ALTER TABLE ADD COLUMN change. This catalog inconsistency causes PostgreSQL backends to error out.
  7. Inconsistency is persisted. A subsequent restart replays the remaining writes correctly, and they are eventually compacted into the existing packed row. However, because they merge into the packed row after the later transaction's changes were already merged, the older values persist.
    • In this example, the PostgreSQL catalog now contains the pg_attribute entries for the columns added by the ALTER TABLE ADD COLUMN, but the relnatts column of pg_class is inconsistent with those new columns.
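
To see what this final state looks like for a single table, you can compare the two catalog entries directly; the table name my_table below is a placeholder. In the inconsistent state described above, the pg_attribute count reflects the added column while relnatts still shows the pre-ALTER value.

  -- relnatts as recorded in pg_class for the table (my_table is a placeholder name).
  SELECT relnatts FROM pg_class WHERE oid = 'my_table'::regclass;

  -- Number of column entries actually present in pg_attribute for the same table.
  SELECT count(*) FROM pg_attribute
  WHERE attrelid = 'my_table'::regclass AND attnum > 0;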