# Potential database inconsistency with very large transactions
| Product | Affected Versions | Related Issues | Fixed In |
|---|---|---|---|
| YSQL | All | #30772 | v2025.2.3.0, v2025.1.4.0, v2024.2.9.0 |
## Description
A rare race condition can cause rows written in a very large transaction (more than 100K storage records) to overwrite rows written by later, smaller transactions when a specific sequence of SST flush, restart, and compaction events occurs.
This issue was observed during DDL operations in the following sequence:
1. A full-database `ANALYZE` runs on a database whose tables have a total column count of 4300 or more.
2. A restart of the YB-Master process(es) happens before the system catalog is flushed to disk.
3. A DDL statement such as `ALTER TABLE` runs.
4. In the background, a flush and compaction occur.
5. Another DDL runs. At this point, caches are invalidated, and the inconsistency can cause the following effects:
   - Queries against the altered table can fail in v2025.2.0.0 and later, unless additional catalog cache preload flags are enabled on YB-TServer.
   - In other cases, all connections to the database can fail.
If you suspect you may be at step 4 of this sequence, run a `pg_class` consistency checker query against each database in your cluster to verify that the PostgreSQL catalog has no latent inconsistency.
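One plausible form of such a check (an illustrative sketch, not necessarily the exact query Yugabyte provides) compares `pg_class.relnatts` against the number of `pg_attribute` rows recorded for each relation:

```sql
-- Illustrative consistency check: relnatts should equal the number of
-- pg_attribute rows with attnum > 0 for the relation (dropped columns
-- keep their pg_attribute entries, so they are counted as well).
-- Run this in each database; an empty result indicates consistency.
SELECT c.oid::regclass AS relation,
       c.relnatts,
       count(a.attnum)  AS attribute_rows
FROM   pg_class c
JOIN   pg_attribute a
  ON   a.attrelid = c.oid
 AND   a.attnum > 0
GROUP  BY c.oid, c.relnatts
HAVING c.relnatts <> count(a.attnum);
```

Any row returned identifies a relation whose `pg_class` metadata disagrees with its `pg_attribute` entries.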
## Mitigation
To reduce the likelihood of encountering this issue on affected versions:
- Do not run full-database `ANALYZE` on versions earlier than v2025.2.3.0, v2025.1.4.0, or v2024.2.9.0. Instead, run `ANALYZE` on individual tables in batches. On v2025.2.0.0 and later, prefer Auto Analyze, which is enabled by default when the cost-based optimizer (CBO) is enabled.
- Upgrade to a release that contains the fix. Note that upgrading does not correct any data inconsistencies that have already occurred.
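For example, instead of a single full-database `ANALYZE`, the per-table statements can be generated and then run in small batches (a sketch; `pg_stat_user_tables` lists only user tables):

```sql
-- Generate one ANALYZE statement per user table; run the resulting
-- statements in small batches rather than analyzing the whole database
-- in one large transaction.
SELECT format('ANALYZE %I.%I;', schemaname, relname)
FROM   pg_stat_user_tables
ORDER  BY schemaname, relname;
```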
- If you have already encountered the issue and are at step 5 as described above, with database connection failures, setting the YB-TServer flag `--ysql_minimal_catalog_caches_preload=true` should allow you to connect to the database, though queries on certain tables may still fail. The `pg_class` entries for affected tables may require manual correction; contact Yugabyte Support for assistance.
## Details
A sequence of events was identified in which changes from a later transaction can be partially undone by an earlier large transaction that updates more than 100K RocksDB rows. In theory, this scenario can also occur on ordinary user data tables, because DDL and DML share the same transaction infrastructure, but it is extremely unlikely there; it is far more likely when DDL modifies the PostgreSQL catalog in a database with roughly 4300 or more columns in total.
The following DDL example describes the sequence of events that can trigger this bug. The example uses a full-database ANALYZE as the large, earlier transaction and ALTER TABLE ADD COLUMN as the later transaction.
1. **Large transaction commits.** A large transaction commits with an update of more than 100K RocksDB rows in unpacked form.
   - A full-database `ANALYZE` can reach this limit when the total number of columns across all tables in the database is approximately 4300 or greater.
2. **Server restart.** Before those rows are flushed to SST files, the YB-Master or the YB-TServer holding the transaction rows restarts.
3. **Partial WAL replay.** During restart, WAL replay of these rows occurs. The transaction transfer code path erroneously transfers only the first batch of 100K RocksDB rows to regular RocksDB. The remaining rows of the transaction stay in the IntentsDB memtable, awaiting the next WAL replay.
   - In this example, the remaining rows from the full-database `ANALYZE` can include `pg_class` rows that describe important table metadata, such as the number of columns (`relnatts`) and the `relfilenode`.
4. **Newer transaction commits.** A newer transaction can update some of the rows that still reside in the intents memtable. No issues are seen at this point, and reads continue to behave as expected because reads merge pending intents with newer updates. Note that on commit, these newer writes move from intents to regular RocksDB.
   - In this example, an `ALTER TABLE ADD COLUMN` can update the table's `pg_class` entry to reflect the new number of columns (the `relnatts` column of `pg_class`).
5. **Flush and compaction occur.** When enough rows accumulate, a flush and compaction can occur on the regular RocksDB. During compaction, newer updates (such as those from the `ALTER TABLE`) that use per-column format are merged into existing packed rows. The resulting packed row can retain an older timestamp after the per-column updates are applied.
6. **Inconsistency becomes visible.** After this compaction, the remaining writes from the older large transaction, still waiting in IntentsDB to transfer to regular RocksDB, can have the highest timestamp among remaining entries and become visible to readers.
   - In this example, the compaction step causes the `pg_class` rows from the earlier `ANALYZE` transaction in IntentsDB to appear to have the latest timestamp. Consequently, the `pg_class` row reverts to the column count from the `ANALYZE`, undoing the `ALTER TABLE ADD COLUMN` change. This catalog inconsistency causes PostgreSQL backends to error out.
7. **Inconsistency is persisted.** A subsequent restart replays the remaining writes correctly, and they are eventually compacted into the existing packed row. However, they merge into the packed row after the later transaction has already been merged.
   - In this example, the PostgreSQL catalog now contains the `pg_attribute` definitions of the new columns added by the `ALTER TABLE ADD COLUMN`, but the `relnatts` column of `pg_class` is inconsistent with those columns.
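Under the assumptions above, the trigger sequence can be sketched as follows. The restart, replay, and compaction steps are cluster-side events shown as comments, and the table and column names are illustrative:

```sql
-- Step 1: large, earlier transaction (>100K RocksDB rows) commits.
ANALYZE;  -- full-database ANALYZE: a large catalog/statistics update

-- Steps 2-3: the YB-Master or YB-TServer restarts before these rows are
-- flushed; WAL replay moves only the first 100K rows to regular RocksDB,
-- leaving the rest (including pg_class rows) in the IntentsDB memtable.

-- Step 4: a later, smaller transaction updates one of those pending rows.
ALTER TABLE demo ADD COLUMN extra int;  -- bumps pg_class.relnatts for demo

-- Steps 5-6: a background flush and compaction merge the ALTER TABLE
-- update into the packed pg_class row with an older timestamp; the
-- leftover ANALYZE intents now appear newest, so relnatts reverts.
SELECT * FROM demo;  -- can now fail against the inconsistent catalog
```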