v1.x upgrade RCA

Date: 30th January 2023
Status: Resolved

Introduction

In the lead-up to the cheqd v1.x mainnet upgrade, and during the eventual release and upgrade itself, we faced a number of challenges that resulted in a brief halt of mainnet and the need for a fast-follow patch release.

As of the 8th February 2023, this patched release has now been applied across both testnet and mainnet validators, and the network is working as expected, with the issue identified during the upgrade now resolved.

This RCA provides a brief overview of the root cause of the issues identified, the fix released in v1.2.5, and the lessons learnt by the cheqd Product & Engineering team.

Summary of events

On Monday 30th January, the cheqd team initiated a major network upgrade, v1.x, which introduced identity transaction pricing, among other features and fixes to the network (v1.x changelog).

Following the passing of the upgrade proposal (proposal #12), at roughly 09:30 GMT the network halted at block height 6,427,279, with the upgrade set to begin at height 6,427,280.

Over the subsequent 40 minutes, up to 10:10 GMT, validators successfully upgraded to the new version, with consensus reached at 10:19 GMT. Shortly thereafter, after only ~10 blocks had been signed (up to height 6,427,290), a number of validators reported errors and timeouts on their nodes, for example:

ERR Stopping peer for error err="pong timeout" module=p2p peer={"Data":{},"Logger":{}}
Jan 30 13:01:32 ip-172-31-16-181 cosmovisor[695507]: 1:01PM INF Timed out dur=3000 height=6427331 module=consensus round=0 step=3

Diagnosis

By the end of the day, we were able to identify three major types of issue:

1. Cosmos SDK v0.46.x upstream bug which requires “pruning = nothing”

The pruning issue was largely related to the Dragonberry patch, a fix for a high-risk security vulnerability (dubbed “Dragonberry”) in the IBC protocol / ICS23.

Within the patch, a strict check was introduced on states across the store's trees; as a result, uneven heights, pruning, and state-sync no longer behaved as expected. Since the Dragonberry patch was introduced, it has been common for the store height to differ from the historic height, and this mismatch is what caused our pruning issue. It stems largely from a known issue in the IAVL state tree package and Cosmos SDK that occurs on store writes during upgrades.

2. Leftover legacy cheqd-noded versions

On some validators we found a leftover v0.6.x binary in the /usr/bin folder, whereas under Cosmovisor /usr/bin/cheqd-noded should be a symlink pointing to the actual binary in /home/cheqd/.cheqdnode/cosmovisor.
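As a rough illustration of the check involved (paths as above; output will vary per machine), something like the following distinguishes a stale standalone binary from the expected Cosmovisor symlink:

# Check whether /usr/bin/cheqd-noded is a regular file (stale binary) or a symlink.
ls -l /usr/bin/cheqd-noded

# Resolve the symlink target; under Cosmovisor this should point into
# /home/cheqd/.cheqdnode/cosmovisor rather than a standalone binary.
readlink -f /usr/bin/cheqd-noded

# Confirm which version is actually being executed.
cheqd-noded version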

3. Corrupted database

Some nodes failed to execute the migration, seemingly running into a problem with the new group module introduced in Cosmos SDK v0.46.x, although this appears to be an edge case.

Separately, we also experienced dependency breakages (“dependency hell”) in a number of the internal and external services that use the network APIs, following changes to API paths in this upgrade.

How come we didn’t pick this up on testnet?

During the testnet upgrade we experienced a consensus error due to an issue with module versions, among other problems (fortunately, this is what testnet is for). As a result, we started a fresh test network from block 0 plus a state export (testnet-6). On this fresh network, pruning was set to the default (pruning="default").

This is the default setting for every node. However, it also means that approximately 3.5 weeks of state is kept, which is not enough to catch the pruning issue early. Switching to a more aggressive setting would have caught the issue sooner, for example pruning="custom" with:

pruning="custom"
pruning-keep-recent=50
pruning-interval=10
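As a sketch of how these values could be applied on a testnet node (assuming the default node home of ~/.cheqdnode; back up app.toml before editing):

# Point at the node's app.toml (default node home assumed).
APP_TOML="$HOME/.cheqdnode/config/app.toml"

# Switch to the more aggressive custom pruning strategy.
sed -i.bak \
  -e 's|^pruning *=.*|pruning = "custom"|' \
  -e 's|^pruning-keep-recent *=.*|pruning-keep-recent = "50"|' \
  -e 's|^pruning-interval *=.*|pruning-interval = "10"|' \
  "$APP_TOML"

# Verify the settings took effect.
grep -E '^pruning' "$APP_TOML"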

Resolution

Immediate resolution

The immediate resolution to get the network running again was offered by our validator community, as roughly ten other Cosmos chains had faced the same issue when upgrading to the latest version.

To get the network back up and running, validators were advised to simply switch their pruning parameter to pruning="nothing". Once consensus was reached among the validators who had made this change, the network was restored.
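As a rough sketch (not the exact commands each operator ran), this amounts to the same kind of app.toml edit as above, followed by a restart of the node service; the systemd unit name below is an assumption and may differ per setup:

# Disable pruning entirely, as advised for the emergency restart.
sed -i.bak 's|^pruning *=.*|pruning = "nothing"|' "$HOME/.cheqdnode/config/app.toml"

# Restart the node; the unit name is illustrative only.
sudo systemctl restart cheqd-cosmovisor.service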

Long-term solution

Setting pruning to "nothing" was not an ideal solution: it meant node state was never pruned, resulting in an ever-increasing chain size and therefore greater storage requirements and higher costs for validators.

After investigating routes to solve this, we found a patch authored by Chill Validator, released in late November 2022, which had already been used as a solution by other Cosmos SDK based networks facing the same issue.

Lessons learnt

1. Extensive testnet checks & snapshots

Going forward, we should endeavour to make the environment used for testing upgrades on testnet as close to mainnet as possible. Where this is not feasible, for example because modules like IBC are not available on testnet, we should ensure the delta between mainnet and testnet is documented comprehensively in advance of upgrades, so that any potential issue can be diagnosed quickly.

Where there is any risk vector of a consensus fault going into an upgrade, full snapshots should be taken of the network before any upgrades are attempted.
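A minimal sketch of what such a snapshot could look like on a single node (stop the node first so the databases are consistent; the default node home and an illustrative service name are assumed):

# Stop the node so the databases are not being written to.
sudo systemctl stop cheqd-cosmovisor.service

# Archive the data directory with a timestamped name.
tar -czf "cheqd-data-$(date +%Y%m%d-%H%M%S).tar.gz" -C "$HOME/.cheqdnode" data

# Bring the node back up once the archive is safely stored elsewhere.
sudo systemctl start cheqd-cosmovisor.service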

2. Check other networks' upgrade issues and downtime in the Cosmos ecosystem

The issue we encountered during this upgrade could have been discovered in advance by aligning more proactively with other Cosmos chains. With a validator community that spans far more than the cheqd network, we should try to identify any potential issues experienced on other chains before bumping Cosmos SDK or IBC versions. Likewise, we should circulate any issues we uncover ourselves. We'll be using this space within the product site more actively to report on these going forward.

3. Consider how Cosmovisor preparations need to change if a manual upgrade is required

While we're thrilled that the Cosmovisor automatic installation process largely worked as intended for this upgrade, we still need to be mindful of how technical minutiae such as symlinks are handled differently in manual upgrades. Having a clear overview of the delta between manual and Cosmovisor upgrade paths will help isolate any potential issues in future upgrades.

4. Have a non-critical node, rather than one with high voting power, take backups and snapshots

Reaching initial consensus at the upgrade height took significantly longer than expected because the node managed by the cheqd team (the cheqd node) had not yet been upgraded. This node was taking a snapshot at the halt height, to ensure there was a backup from immediately before the upgrade to roll back to in case of a major upgrade failure. Going forward, a secondary node will be used to take these snapshots instead, to avoid delays.

5. Upgrades should NOT take place on Mondays

We generally aim to upgrade on Tuesdays or Wednesdays; however, this time, due to the urgency of the upgrade and team vacation, we opted for a Monday. A Monday upgrade does not leave much time to get everyone prepared, so we'll avoid it going forward.

6. Improve upgrade height / time forecasting calculations

When submitting the mainnet upgrade proposal, like all other Cosmos-based networks, we specify the intended upgrade block height. This is calculated by taking the current height and adding the number of blocks required to reach the agreed upgrade date and time, based on the average time per block. Unfortunately our estimate was off, and the upgrade height was reached two hours earlier than planned. As a result, certain members of the team were unavailable at the upgrade time. In future, we'll improve the accuracy of our estimates, confirm team availability for a wider time window, and add sufficient buffer to allow for block times being faster or slower than expected.
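As a back-of-the-envelope sketch of that calculation (all numbers below are placeholders, not the figures used for this upgrade), the average block time term dominates the error:

# Illustrative inputs only -- substitute live values at proposal time.
CURRENT_HEIGHT=6350000                      # height when the proposal is drafted
AVG_BLOCK_TIME=6                            # observed average seconds per block
SECONDS_TO_TARGET=$(( 5 * 24 * 60 * 60 ))   # e.g. upgrade planned 5 days out

# Estimated upgrade height = current height + blocks expected before the target time.
UPGRADE_HEIGHT=$(( CURRENT_HEIGHT + SECONDS_TO_TARGET / AVG_BLOCK_TIME ))
echo "Estimated upgrade height: $UPGRADE_HEIGHT"

# A 5% error in AVG_BLOCK_TIME over 5 days shifts the upgrade by roughly 6 hours,
# so build in a buffer and re-check the estimate as the date approaches.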

7. Create a checklist of dependent internal & external services

With network upgrades of this scale, a number of changes are required in the internal and external services that depend on the core ledger. Following the upgrade, we began identifying which services were knocked out by dependency changes, but this wasn't done in a systematic, coordinated manner. Going forward, we'll maintain a checklist of all internal and external services that need to be checked and updated after an upgrade, to reduce downtime.

8. Streamlining status updates, messaging & communications

Although running two forums (Discord & Slack) has generally worked, during this upgrade there was a clear misalignment that could have been resolved if everyone had been conversing in one place. Slack has remained the dominant channel for communication with SSI vendor validators, and Discord for the Cosmos-based validators. Going forward, we will review the use of two communities and decide how to coordinate better, likely consolidating into one group.

We also failed to effectively utilise status.cheqd.net to provide updates. The status page needs to be incorporated more fully into the Product & Engineering team's ways of working, particularly at critical points such as upgrades.

. . .

Thank you to our validators for your patience and support throughout. Fortunately this incident did not cause any significant downtime; however, it could have been avoided through more stringent checks on testnet and more communication with other Cosmos SDK chains.

You can also read more about our plans for the year ahead in Our Product Vision for 2023 at cheqd 🔮.
