chore(docs): add resume WAL troubleshooting (#42)
Co-authored-by: Nick Woolmer <[email protected]>
puzpuzpuz and nwoolmer authored Aug 15, 2024
1 parent a5ff935 commit 92fbc15
Showing 2 changed files with 63 additions and 8 deletions.
67 changes: 59 additions & 8 deletions reference/sql/alter-table-resume-wal.md
@@ -40,27 +40,78 @@ to investigate the table status:
wal_tables();
```

| name | suspended | writerTxn | sequencerTxn |
| ----------- | --------- | --------- | ------------ |
| sensor_wal | false | 6 | 6 |
| weather_wal | true | 3 | 5 |
| name | suspended | writerTxn | sequencerTxn |
| ------ | --------- | --------- | ------------ |
| trades | true | 3 | 5 |

The table `weather_wal` is suspended. The last successful commit in the table is
The table `trades` is suspended. The last successful commit in the table is
`3`.

The following query restarts transactions from the failed transaction, `4`:

```questdb-sql
ALTER TABLE weather_wal RESUME WAL;
ALTER TABLE trades RESUME WAL;
```

Alternatively, specify the `sequencerTxn` to resume from, which skips the failed
commit (`4` in this case):

```questdb-sql
ALTER TABLE weather_wal RESUME WAL FROM TRANSACTION 5;
ALTER TABLE trades RESUME WAL FROM TRANSACTION 5;
-- This is equivalent to
ALTER TABLE weather_wal RESUME WAL FROM TXN 5;
ALTER TABLE trades RESUME WAL FROM TXN 5;
```

## Diagnosing corrupted WAL transactions

:::note

If you have [data deduplication](/concept/deduplication/) enabled on your tables and you have access to the original events (for instance, they're stored in Apache Kafka or another replayable source), you may re-ingest the data after skipping the problematic transactions.

:::

Sometimes a table may get suspended due to a full disk or [kernel limits](/docs/deployment/capacity-planning/#os-configuration). In this case, an entire WAL segment may be corrupted. This means that multiple transactions may rely on the corrupted segment, and finding the transaction number to resume from may be difficult.

When you run `RESUME WAL` on such a suspended table, you may see an error like this:

```
2024-07-10T01:01:01.131720Z C i.q.c.w.ApplyWal2TableJob job failed, table suspended [table=trades~3, error=could not open read-only [file=/home/my_user/.questdb/db/trades~3/wal45/101/_event], errno=2]
```

In such a case, you should try skipping all transactions that rely on the corrupted WAL segment. To do that, first you need to find the last applied transaction number for the `trades` table:

```questdb-sql
SELECT writerTxn
FROM wal_tables()
WHERE name = 'trades';
```

| writerTxn |
| --------- |
| 1223 |

Next, query the last transaction number that relies on the corrupted segment:

```questdb-sql
SELECT max(sequencertxn)
FROM wal_transactions('trades')
WHERE sequencertxn > 1223
AND walId = 45
AND segmentId = 101;
```

Here, `1223` stands for the last applied transaction number, `45` stands for the WAL ID that may be seen in the error log above (`trades~3/wal45`), and `101` stands for the WAL segment ID from the log (`trades~3/wal45/101`).

| max |
| ---- |
| 1242 |

Since the last problematic transaction is `1242`, you can resume the table from transaction `1243`:

```questdb-sql
ALTER TABLE trades RESUME WAL FROM TXN 1243;
```

Note that in rare cases, subsequent transactions may also have corrupted WAL segments, so you may have to repeat this process.
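
The steps above can also be sketched as a small script. This is a minimal illustration, not part of QuestDB: the helper names `parse_wal_error` and `resume_statement` are hypothetical, and the regular expression assumes the log path follows the `<table>~<n>/wal<walId>/<segmentId>/` layout seen in the sample error above.

```python
import re

# Assumed path layout, inferred from the sample log line:
#   .../<table>~<n>/wal<walId>/<segmentId>/_event
WAL_PATH = re.compile(
    r"/(?P<table>[^/~]+)~\d+/wal(?P<wal_id>\d+)/(?P<segment_id>\d+)/"
)


def parse_wal_error(log_line):
    """Return (table, wal_id, segment_id) from an error log line, or None."""
    match = WAL_PATH.search(log_line)
    if match is None:
        return None
    return (
        match.group("table"),
        int(match.group("wal_id")),
        int(match.group("segment_id")),
    )


def resume_statement(table, last_bad_txn):
    """Build the RESUME WAL statement that skips up to last_bad_txn inclusive.

    The table name is interpolated as-is, so only pass trusted values.
    """
    return f"ALTER TABLE {table} RESUME WAL FROM TXN {last_bad_txn + 1};"
```

With the sample log line, `parse_wal_error` yields the table name, WAL ID, and segment ID to plug into the `wal_transactions()` query. Passing the `max` value returned by that query to `resume_statement` produces the final statement.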
4 changes: 4 additions & 0 deletions troubleshooting/faq.md
@@ -184,3 +184,7 @@ scrollable cursors in PostgreSQL.
For more information and for tips to work around, see the
[PostgreSQL compatibility section](/docs/reference/sql/overview/#postgresql-compatibility)
in our Query & SQL overview.

## My table has corrupted WAL data due to a previous full disk or kernel limits error. What do I do?

You need to skip the problematic transaction using the [RESUME WAL](/docs/reference/sql/alter-table-resume-wal/) SQL statement. If there are multiple transactions that rely on the corrupted WAL data, you may need to follow [these instructions](/docs/reference/sql/alter-table-resume-wal/#diagnosing-corrupted-wal-transactions).