Date: 30th January 2023Status: Resolved
Introduction
In the lead up to cheqd v1.x mainnet upgrade, and during the eventual release and upgrade itself, we faced a number of challenges which resulted in a brief network halt of mainnet and the need for a fast-follow patched release.
As of the 8th February 2023, this patched release has now been applied across both testnet and mainnet validators, and the network is working as expected, with the issue identified during the upgrade now resolved.
This RCA provides a brief overview of the root cause of the issues identified, the fix itself released in v1.2.5, and the lessons learnt by the cheqd Product & Engineering team.
Summary of events
On Monday 30th January, the cheqd team initiated a major network upgrade, v1.x, which introduced identity transaction pricing, among other features and fixes to the network (v1.x changelog).
Following the passing of the upgrade proposal (proposal #12), At roughly 09:30 GMT, the network halted at block height 6,427,279, with the upgrade set to begin at height 6,427,280.
Over the subsequent 40 minutes, to 10:10 GMT, validators successfully upgraded to the new version, with consensus reached at 10:19 GMT. Shortly thereafter, following just ~10 blocks signed (6,427,290), a number of validators reported errors and timeouts with their nodes, examples:
ERR Stopping peer for error err="pong timeout" module=p2p peer={"Data":{},"Logger":{\}}Jan 30 13:01:32 ip-172-31-16-181 cosmovisor\[695507]: 1:01PM INF Timed out dur=3000 height=6427331 module=consensus round=0 step=3
Diagnosis
By the end of the day, we were able to identify three major three major types of issues we encountered:
1. Cosmos SDK v0.46.x upstream bug which requires “pruning = nothing”
The pruning issue was largely related to the DragonBerry patch, which was a fix of a high-risk security vulnerability (dubbed “Dragonberry”) related to IBC protocol / ICS23.
Within the patch a strict check was introduced on horizontal states in trees, and as a result, uneven heights, pruning + state-sync no longer were behaving as expected. After introducing the Dragonberry patch it’s been common that the store height does not equal the historic height. The store height not equalling the historic height is what caused our pruning issue. This is largely due to an known issue in the iavl state tree package + Cosmos SDK which occurs on store writes during upgrades.
2. Leftover legacy cheqd-noded versions
On some validators we found there was a v0.6.x binary in the /usr/bin
folder, whereas under Cosmovisor that gets turned to a symlink from /usr/bin/cheqd-noded
to the actual binary in /home/cheqd/.cheqdnode/cosmovisor
.
3. Corrupted database
Some nodes had an issue executing the migration, where they seemed to run into an issue with the new group module that Cosmos SDK introduced in v0.46.x, although this seems to be an edge case.
Separately, we also experienced dependency breaks (dependency hell) on a number of the internal and external servies which use the network APIs following changes to API paths within this upgrade.
How come we didn’t pick this up on testnet?
During the testnet upgrade we experienced a consensus error due to an issue with module versions among other issues (fortunately this is what testnet is for). As a result we started a fresh test network from block 0 + state export (testnet-6). With this fresh network, pruning was set to default ( pruning="default").
This is the default setting for every node. However, this default option also means that approximately 3.5 weeks of the state is kept which is not enough to catch the pruning issue early. Switching to more aggressive setting would’ve caught issues earlier, for example: pruning="custom"
, with:
pruning="custom"pruning-keep-recent=50pruning-interval=10
Resolution
Immediate resolution
Immediate resolution to get the network running was offered to us by our validator community, as it was an issue ~10 Cosmos chains had faced when getting to the latest version.
In order to get the network back up and running, the validators were advised to simply switch their pruning parameters, setting them to pruning = “nothing”. Once consensus was reached with validators who had switched to this, the network was restored.
Long-term solution
Setting pruning to nothing was not an ideal solution, given it meant the network was not being pruned, resulting in an ever increasing chain size and increased storage requirements / higher costs for validators.
After investigating a route to solve this, we found a patch authored by Chill Validator, released in late November 2023, which had already been used as a solution for other Cosmos SDK based networks facing the same issue.
Lessons learnt
1. Testnet extensive checks & snapshots
Going forward we should endeavour to create as near an identical environment for testing upgrades on testnet as possible. While it may not always be possible, due to modules like IBC not being available on testnet, we should make sure that the delta between mainnet and testnet is detailed comprehensively in advance of upgrades to make sure any potential issue can be quickly diagnosed.
Where there is any risk vector of a consensus fault going into an upgrade, full snapshots should be taken of the network before any upgrades are attempted.
2. Check other networks upgrade issues and downtimes in Cosmos ecosystem
The issue we encountered during this upgrade could have been discovered in advance by aligning more proactively with other Cosmos chains. With a Validator community that spans far more than the cheqd Network, we should try and identify any potential issues experienced on other chains before bumping Cosmos SDK or IBC versions. Likewise, we should be circulating any issues we uncover. We’ll be more actively using this space within the product site to report on these going forward.
3. Consider how Cosmosvisor preparations need to change if manual upgrade is required
While we’re thrilled that the Cosmovisor automatic installation process largely worked as intended for this upgrade, however we still need to be mindful of how technical minutiae like symlinks are handled differently with manual upgrades. Having a clear overview of the delta between manual and Cosmovisor upgrade patterns will help isolate any potential issues in future upgrades.
4. Have a non-critical (high voting power) node take backups and snapshots
Reaching consensus initially at upgrade height took significantly longer than expected, due to the node managed by the cheqd team not being upgraded (cheqd node). This was because the cheqd node was taking a snapshot, at the halt height, to ensure there was a backup immediately before the upgrade to roll back to in case of a major upgrade failure. Going forward, rather than using this node for snapshots, a secondary node will be used to take snapshots to avoid delays.
5. Upgrades should NOT take place on Mondays
Generally we have intended to upgrade on Tuesdays or Wednesdays, however this time due to urgency of this upgrade and team vacation we opted for a Monday. This however doesn’t allow much time to get everyone prepared, which we’ll avoid going forward.
6. Improve upgrade height / time forecasting calculations
When submitting the mainnet upgrade proposal, like all other Cosmos-based network, we log the intended upgrade block height. This is calculated by taking the current height, and adding on the number of blocks required to reach the agreed upgrade time and date. The number of blocks to add depends on the average time per block. Unfortunately we were out with our estimated time, with the upgrade height being reached two hours earlier then planned. As a result, certain members of the team were unavailable at the upgrade time. In the future, we'll improve the accuracy of our estimations, ensure team availability is confirmed for a wider time window, and add sufficient buffer time to allow for block time being faster or slower for whatever reason.
7. Create a dependent Internal & External services check-list
With network upgrades on this scale, a number of changes are required in our internal and external services that depend on the core ledger. Following the upgrade we begun identifying which services were knocked out due to dependency changes, but this wasn’t done in a systematic, coordinated manner. Going forward, we’ll create a check-list of all internal and external services that need to be checked and updated following upgrade, to reduce downtime.
8. Streamlining status updates, messaging & communications
Although generally running two forums (Discord & Slack) has worked, during this upgrade there was a clear misalignment which could have been resolved if everyone was conversing in one place. Slack has remained as the dominant location for communications with SSI vendor validators, and Discord for the Cosmos-based validators. Going forward we will review the use of two communities, and decide on how to coordinate better, likely reducing to one group.
We also failed to effectively utility status.cheqd.net to provide updates. This generally needs to be incorporated more into the Product & Engineering teams ways of working, and in particular at critical points like upgrades.
. . .
Thank you to our validators for your patience and support throughout. Fortunately this didn’t cause any significant downtime, however it could have been avoided through more stringent checks on testnet, and more communication with other Cosmos SDK chains.
You can also read more about about plans for the year ahead in our Our Product Vision for 2023 at cheqd 🔮.