Blog Article

Decentralized to the Core

Author

Stellar Development Foundation

Decentralization

Protocol upgrade

How SDF’s validators halted, but the network continued to operate as designed

Update:

  • On April 10, 2021, Stellar network validators voted to upgrade the network to Protocol 16, which contains a fix that resolves the issue that caused several nodes to go offline earlier this week.
  • If you are a node operator, and you are having difficulty accessing the network, you probably just need to install up-to-date software.
  • For more information on how to do that, see our Protocol 16 Upgrade Guide.

Early Tuesday morning, the Stellar Development Foundation’s validator nodes temporarily stopped processing ledgers, and the SDF public Horizon instance stopped ingesting them, which meant that for a brief period of time, it was unable to serve requests or submit transactions to the network. To resolve the situation, SDF’s engineering team worked quickly to spin up new nodes and a new Horizon instance, both of which were available by the end of the day.

However, several nodes run by others in the ecosystem were also affected, and so in parallel, SDF engineers continued to investigate the root cause of the problem. They identified it Tuesday evening. On Wednesday afternoon, they rolled out a patch release to address it, and on Thursday, they released Stellar Core v16.0.0, which should address outstanding issues. On Saturday, network validators voted to accept the release and upgrade the network to Protocol 16. At this point, all affected nodes should be able to upgrade to Protocol 16-compatible software and restore network access.

Throughout the SDF node downtime, there were still enough validators available to securely process transactions, and so the Stellar network remained online, which is how a decentralized network is intended to work. Many validators also continued to publish archives that keep track of ledger history, and those archives allow halted nodes to fill in gaps so they can recover quickly.
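To make the idea of "enough validators" concrete, here is a minimal sketch in Python. It assumes a simplified quorum set: a threshold plus a list of members, where each member is either a validator name or a nested set. The organization and node names are hypothetical, and real Stellar consensus involves considerably more than this availability check.

```python
from dataclasses import dataclass, field
from typing import List, Set, Union


@dataclass
class QuorumSet:
    """Simplified quorum set: satisfied when at least `threshold` members are."""
    threshold: int
    members: List[Union[str, "QuorumSet"]] = field(default_factory=list)


def is_satisfied(qset: QuorumSet, online: Set[str]) -> bool:
    """Return True if enough members of `qset` are online (or satisfied, if nested)."""
    satisfied = 0
    for member in qset.members:
        if isinstance(member, QuorumSet):
            ok = is_satisfied(member, online)  # nested set, e.g. one organization
        else:
            ok = member in online  # a single validator
        if ok:
            satisfied += 1
    return satisfied >= qset.threshold


# Hypothetical topology: four organizations with three validators each.
# Each organization needs 2 of its 3 nodes; the network needs 3 of 4 organizations.
org_a = QuorumSet(2, ["a1", "a2", "a3"])
org_b = QuorumSet(2, ["b1", "b2", "b3"])
org_c = QuorumSet(2, ["c1", "c2", "c3"])
org_d = QuorumSet(2, ["d1", "d2", "d3"])
network = QuorumSet(3, [org_a, org_b, org_c, org_d])

all_nodes = {f"{org}{i}" for org in "abcd" for i in (1, 2, 3)}

# One organization loses all three validators: the threshold is still met.
one_org_down = all_nodes - {"a1", "a2", "a3"}
print(is_satisfied(network, one_org_down))  # True

# Two organizations go down: the threshold is no longer met.
two_orgs_down = one_org_down - {"b1", "b2", "b3"}
print(is_satisfied(network, two_orgs_down))  # False
```

In this toy topology, losing every validator of one organization leaves the set satisfied, which mirrors the situation described above; losing a second organization would not.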

The following FAQ is intended to share what we know about the issue, and is a companion to a Protocol 16 Upgrade Guide that includes information to help affected organizations bring their nodes back online.

Do you know what caused nodes to go offline?

After careful analysis, SDF engineers narrowed the trigger down to a single ledger, and to a single operation in that ledger. They compared how various nodes dealt with that operation, pushed out a release that includes a fix, and began to prepare the network for an upgrade that will allow affected nodes to recover history and come back online. For more on the protocol upgrade, check out the Protocol 16 Upgrade Guide.

While certain nodes — including those run by SDF and Lobstr — encountered the issue, most of the nodes on the network did not. Again, because there is sufficient validator redundancy, the network continued to function as normal despite the temporary unavailability of SDF's infrastructure. While we realize this was inconvenient for organizations that rely on public network access, it also demonstrates that the Stellar network persists independent of SDF validators.

How was the issue discovered?

At 08:20 UTC, our Stellar ops infrastructure monitoring — which consists of Runscope and Prometheus alerts — fired an alert to our on-call ops team. They immediately began to investigate the alert, realized there was a problem with the SDF validators, and began to contact the rest of the ops team as well as the ops teams for other Tier 1 validators.
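The kind of check that could sit behind such an alert can be sketched as follows. This is a minimal illustration in Python, not SDF's actual monitoring configuration: the function name, the 30-second threshold, and the sample timestamps are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def ledger_stalled(
    last_close: datetime,
    now: datetime,
    max_gap: timedelta = timedelta(seconds=30),  # assumed alert threshold
) -> bool:
    """Return True if no ledger has closed within `max_gap` of `now`."""
    return now - last_close > max_gap


# Hypothetical timestamps, for illustration only.
last_close = datetime(2021, 1, 1, 8, 0, 0, tzinfo=timezone.utc)

# Five seconds after the last close: within a normal close interval.
healthy = ledger_stalled(last_close, last_close + timedelta(seconds=5))

# Two minutes with no new ledger: the check would fire.
stalled = ledger_stalled(last_close, last_close + timedelta(minutes=2))

print(healthy, stalled)  # False True
```

A threshold several times the normal close time of about 5 seconds avoids firing on a single slow ledger while still catching a genuine halt quickly.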

How long were your nodes offline? 

Initially, SDF focused solely on locating the root cause of the halt. We then opted to spin up new validators and a new public Horizon API to restore service more quickly, while continuing in parallel to search for the root cause. Once we made that decision, we had the validators and the new API up and running within an hour, but the total downtime lasted about 10 hours. SDF’s three validator nodes and SDF’s public Horizon API instance were back online at 18:28 UTC.


What did SDF do to resolve the problem? 

As soon as our node monitoring system alerted us to the problem, SDF engineers started working to isolate the cause. They dug into the log files, and started comparing what they found in SDF logs to information gleaned from other validators.

Once they identified the problematic ledger, they split up into two teams to cover more ground: one to debug and develop a patch for the problem; one to spin up new validator nodes and a new Horizon instance to sub in for the affected infrastructure.

By the end of the day Tuesday, replacement validators were up, and the public could once again access horizon.stellar.org to submit transactions just as they did before this temporary disruption. SDF worked with organizations throughout the ecosystem to inform them that the public infrastructure was once again available, and to help other affected organizations restore their infrastructure. By Wednesday afternoon, SDF engineers had released a patch as an interim solution to bring affected nodes back online.

On Thursday afternoon, SDF engineers released Stellar Core 16.0.0, along with instructions to help node operators install the software necessary to prepare for a network upgrade. On April 10, 2021, at 15:00 UTC, validators voted to accept the upgrade, which allows all affected nodes to restore service while keeping a full history of transactions intact.


How much has the network slowed?

The network did not slow down. Validators that stayed in sync continued to process operations and close ledgers in ~5 seconds without disruption. However, some nodes, including those run by SDF and Lobstr, ceased to process transactions for about 10 hours. If you accessed the network via one of the affected nodes, you were not able to submit transactions during that time. If, however, you relied on one of the many unaffected nodes, your network access continued unabated.


Is there data missing from the ledger from the time the nodes were offline?

Because the network remained online, validators continued to securely process transactions even though the SDF Horizon instance was temporarily unavailable. While there was a brief gap in SDF's knowledge of the ledger history, we were able to fill it in from the archives maintained by other organizations, specifically the other Tier 1 organizations, which publish the full network history for exactly this purpose.
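As a rough illustration of what filling such a gap involves, the Python sketch below estimates how many ledgers a node misses during a 10-hour halt at roughly 5 seconds per ledger, and lists which archive checkpoints would cover a missing range. It assumes history archives publish a checkpoint every 64 ledgers, at sequence numbers one less than a multiple of 64. The function names and ledger numbers are hypothetical.

```python
from typing import List

CHECKPOINT_FREQUENCY = 64  # assumed: one archive checkpoint per 64 ledgers


def estimated_missed_ledgers(downtime_seconds: int, close_time_seconds: int = 5) -> int:
    """Approximate number of ledgers closed by the network during a halt."""
    return downtime_seconds // close_time_seconds


def checkpoint_for(seq: int) -> int:
    """Sequence number of the checkpoint ledger whose range includes `seq`."""
    return (seq // CHECKPOINT_FREQUENCY) * CHECKPOINT_FREQUENCY + CHECKPOINT_FREQUENCY - 1


def checkpoints_covering(first_missing: int, last_missing: int) -> List[int]:
    """All checkpoint ledgers needed to replay first_missing..last_missing."""
    if first_missing > last_missing:
        raise ValueError("first_missing must not exceed last_missing")
    start = checkpoint_for(first_missing)
    end = checkpoint_for(last_missing)
    return list(range(start, end + 1, CHECKPOINT_FREQUENCY))


# A 10-hour halt at about 5 seconds per ledger.
missed = estimated_missed_ledgers(10 * 3600)
print(missed)  # 7200

# Hypothetical gap: ledgers 100 through 200 are missing locally.
print(checkpoints_covering(100, 200))  # [127, 191, 255]
```

Under these assumptions, a 10-hour gap spans on the order of 7,200 ledgers, or a little over a hundred checkpoints to fetch from another organization's archive.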

The Stellar network technically stayed online, but many transactions were disrupted. Did this impact deposits and withdrawals from exchanges?

Out of an abundance of caution, several centralized crypto exchanges decided to pause XLM deposits and withdrawals while our engineers worked on a fix. We have been communicating with those exchanges directly, and notified them when we re-established access to public infrastructure. At this point, most of those exchanges have restored service, and are once again processing deposits and withdrawals. SDF will continue to work with them to ensure they have successfully upgraded their software.

Do you know how many transactions were halted from organizations that rely on SDF’s public infrastructure access points?

Stellar is an open-participation network: anyone can build on it, and anyone can submit transactions to it. Because SDF does not control access to the network, and we don't have insight into the internal accounting of the various organizations that use it, there is no way for us to get complete or reliable information about who decided to halt transaction submission.

However, we have been in communication with the ecosystem about the situation, and continue to work with affected organizations to make sure they can continue to access Stellar on behalf of their customers.