
State archival issue post-mortem

Author

Stellar Development Foundation

Executive summary

On October 9, 2025 (UTC), the Stellar Development Foundation (SDF) identified an issue on the public network related to a new feature called “Soroban Live State Prioritization.” This feature, introduced in Whisk (Protocol 23, or P23), aimed to improve how the network handles and stores data. However, a bug in the feature began corrupting certain data entries as early as September 4, 2025, and remained undetected for 35 days.

The Foundation quickly worked with impacted parties and validators to contain the issue (by October 10, 2025) and resolve it (by October 23, 2025).

In total, 478 data entries were originally corrupted, but most of them could be repaired to their expected values. After the fixes, 84 entries remained corrupted, and the affected protocols and issuers had to apply mitigations of their own.

Bug description & technical terms explained

The root cause of the incident was a bug in the way the network managed the transition of data entries between different storage areas. The feature is described in more detail in “Soroban Live State Prioritization,” which introduces a novel two-tier ledger storage design that helps unlock innovation not seen on other chains.

To make this clearer, here are brief explanations of key terms:

  • “Live” ledger state (working storage area): this is the main storage space that represents all states that can be accessed and modified directly by transactions.
  • “Hot-archive” (backup storage area): this is like a secondary storage space where less active data is kept, freeing up space in the main area for more urgent work.
  • Bucket List (structured data history): each storage area is organized as Bucket Lists. A Bucket List is a structured way for the system to organize all historical data, similar to a layered filing cabinet. Each “level” contains records from different periods, helping the system keep track of changes over time.
  • Eviction (moving data from “live” state into “hot-archive”): when the network decides that certain pieces of data are not being used frequently, it moves them from the main working area (“live” state) to a backup storage area called the “hot-archive.”
  • Restoration (moving data from “hot-archive” into the “live” state): when users need to use data that was previously evicted, they invoke a restoration process that moves data back into the “live” state.
  • TTL (Time-To-Live) (how long data remains valid): This setting determines how long a piece of data should stay active before it is considered expired and eligible for removal or archiving.
  • Persistent and Temporary Entries: persistent entries move between these storage areas, while temporary entries are simply deleted on eviction instead of being moved to the “hot-archive.”

The bug occurred during “eviction”: instead of moving the most up-to-date version of each persistent entry, the system sometimes moved an outdated copy (like saving an old draft instead of the latest one). When these outdated entries were later restored, the network ended up operating on inconsistent information.
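The terms above can be tied together with a small model. This is a minimal sketch for illustration only, not stellar-core code: the `Entry` and `Ledger` classes, the `live_until` field, and the `extend_by` parameter are all hypothetical names, and the TTL is simplified to "the last ledger at which the entry is live."

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    value: str
    persistent: bool   # persistent entries are archived, temporary ones deleted
    live_until: int    # simplified TTL: last ledger at which the entry is live


@dataclass
class Ledger:
    live: dict = field(default_factory=dict)          # "live" state
    hot_archive: dict = field(default_factory=dict)   # "hot-archive"

    def evict_expired(self, current_ledger: int) -> None:
        """Eviction: remove expired entries from the live state."""
        for key in list(self.live):
            entry = self.live[key]
            if entry.live_until >= current_ledger:
                continue  # TTL has not expired yet
            del self.live[key]
            if entry.persistent:
                self.hot_archive[key] = entry  # persistent: moved to the archive
            # temporary: deleted, nothing is archived

    def restore(self, key: str, current_ledger: int, extend_by: int) -> None:
        """Restoration: move an archived entry back into the live state."""
        entry = self.hot_archive.pop(key)
        entry.live_until = current_ledger + extend_by
        self.live[key] = entry


ledger = Ledger()
ledger.live["balance"] = Entry("100", persistent=True, live_until=10)
ledger.live["nonce"] = Entry("7", persistent=False, live_until=10)

ledger.evict_expired(current_ledger=11)   # both entries have expired
ledger.restore("balance", current_ledger=12, extend_by=100)
```

In this model the persistent `balance` entry survives a round trip through the archive with its value intact, while the temporary `nonce` entry is gone for good. The incident was a case where that round trip did not preserve the latest value.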

Anatomy of the bug

Bug Description

During the eviction scan when applying ledgers, the system generated a list of candidates for the “hot-archive,” including their values. The issue was that these candidates were entries encountered during the scan of the Bucket List at arbitrary levels, which represent different points in time. To draw on the analogy above, they were sometimes pulled from the wrong layer of the layered filing cabinet. While the code correctly checked for expiration by loading the latest Time-To-Live (TTL) value, it failed to load the latest version of the entry itself when a newer one existed, so an outdated value was recorded in the “hot-archive.”

Note that prior to Whisk (P23), this mechanism was not an issue because eviction was done within the live state, and for that we only needed the up-to-date TTL value: temporaries would be deleted, and persistent entries would just stay untouched in the ledger.
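The failure mode described above can be reproduced in a few lines. The sketch below is an assumption-laden toy, not the actual stellar-core scan: the Bucket List is modeled as a plain list of dictionaries ordered newest to oldest, and `bucket_levels`, `latest_ttl`, and `buggy_eviction_scan` are hypothetical names.

```python
# Toy Bucket List: levels ordered newest -> oldest.
# Each level maps a key to the value it had during that level's period.
bucket_levels = [
    {"balance": "v3 (latest)"},    # level 0: most recent changes
    {},                            # level 1
    {"balance": "v1 (outdated)"},  # level 2: older history
]

# Latest TTL per key, simplified to "last ledger at which the entry is live".
latest_ttl = {"balance": 10}


def buggy_eviction_scan(scan_level: int, current_ledger: int) -> dict:
    """Mimic the bug: expiry uses the latest TTL, the value does not."""
    archived = {}
    for key, value_at_level in bucket_levels[scan_level].items():
        if latest_ttl[key] < current_ledger:   # correct: latest TTL is checked
            archived[key] = value_at_level     # bug: value from the scanned level
    return archived


archived = buggy_eviction_scan(scan_level=2, current_ledger=11)
```

Because the scan happens to visit level 2, the archive receives `v1 (outdated)` even though level 0 holds a newer version of the same key.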

The fix

The fix is for every eviction candidate to load the latest version of the entry from the Bucket List snapshot.
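As a rough illustration of that fix, the sketch below looks up the newest version of each candidate before archiving it. It is a simplified model under stated assumptions, not the actual patch: the snapshot is a list of dictionaries ordered newest to oldest, and `load_latest` and `eviction_scan` are hypothetical helpers.

```python
def load_latest(snapshot: list, key: str):
    """Return the newest version of key, searching levels newest -> oldest."""
    for level in snapshot:
        if key in level:
            return level[key]
    return None


def eviction_scan(snapshot: list, latest_ttl: dict,
                  scan_level: int, current_ledger: int) -> dict:
    """Archive expired candidates using their latest value, not the scanned one."""
    archived = {}
    for key in snapshot[scan_level]:
        if latest_ttl[key] >= current_ledger:
            continue  # not expired according to the latest TTL
        archived[key] = load_latest(snapshot, key)
    return archived


snapshot = [{"balance": "v3"}, {}, {"balance": "v1"}]
archived = eviction_scan(snapshot, {"balance": 10},
                         scan_level=2, current_ledger=11)
```

Even though the candidate is discovered at the oldest level, the archived value is the one from the newest level.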

How we got there

  • 2023-08-24: Initial eviction implementation (CAP-0046-12, included in the original Soroban smart contract release).
    • https://github.com/stellar/stellar-core/pull/3874.
    • This initial implementation did not have the bug.
    • Initial eviction scan checked that both TTL and code/data entry were up to date.
    • Expired temporary entries were deleted (the only change in scope), persistent entries were left untouched.
  • 2024-04-25: Code/data entry check removed for performance.
  • 2025-01-06: Bug introduced with persistent eviction (implementation of CAP-0062).
    • https://github.com/stellar/stellar-core/pull/4585.
    • Persistent eviction retrofitted but code/data entry version check not added.
    • The change was classified as low risk because it looked surgical, but the problem lay not only in what changed but also in what did not change. As a consequence, it had only a single reviewer, in contrast to many other changes in P23 that received the right level of scrutiny.
  • 2025-06-13: Bad test introduced for multiple versions of a key in the “hot-archive”
    • https://github.com/stellar/stellar-core/pull/4773.
    • One test included multiple versions of an entry in the Bucket List, but did not actually check the restored entry contents.
    • Another test also should have covered this case, “shadowed entries not evicted”, but was not updated for the persistent entry case.
  • 2025-08-07: 23.0.0 stable containing the bug is published.
  • 2025-08-21: security audit of P23 completed, no findings related to state archival, clearing P23 for production.
  • 2025-09-03: P23 voted on public network.
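The testing gap in the timeline above (a test that set up multiple versions of an entry but never checked the restored contents) can be illustrated with a toy comparison. Everything here is hypothetical and far simpler than the real test suite: `evict_stale` stands in for the buggy behavior, `evict_latest` for the fixed one, and the two test functions differ only in what they assert.

```python
def newest(levels: list, key: str):
    """Newest version of key in a list of levels ordered newest -> oldest."""
    return next(level[key] for level in levels if key in level)


def evict_stale(levels, key, scan_level):
    return levels[scan_level][key]   # mimics the bug: value at the scanned level


def evict_latest(levels, key, scan_level):
    return newest(levels, key)       # mimics the fix: always the newest value


def weak_test(evict) -> bool:
    levels = [{"k": "new"}, {"k": "old"}]
    hot_archive = {"k": evict(levels, "k", scan_level=1)}
    restored = hot_archive.pop("k")
    return restored is not None      # only checks that something was restored


def strong_test(evict) -> bool:
    levels = [{"k": "new"}, {"k": "old"}]
    hot_archive = {"k": evict(levels, "k", scan_level=1)}
    restored = hot_archive.pop("k")
    return restored == newest(levels, "k")   # checks the restored contents


results = {
    "weak/buggy": weak_test(evict_stale),
    "strong/buggy": strong_test(evict_stale),
    "strong/fixed": strong_test(evict_latest),
}
```

The weak test passes against the buggy eviction, which is why it gave false confidence. Only the test that compares the restored contents against the latest version tells the two implementations apart.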

Lessons on avoiding this class of bugs

We spent time with the internal team, external developers, and partners to discuss areas for improvement that we began to action immediately, including:

  • Monitoring And Detection: The team working on the core protocol layer will ensure that invariants are properly implemented so as to bias towards safety over liveness whenever possible. SDF will then work with the ecosystem to enable invariant checks specific to protocols and issuers that may not be detectable at the core layer. This will require ensuring that the necessary capabilities exist and are used to provide as much redundancy as possible when monitoring critical parts of the ecosystem.
  • Validator Coordination And Governance: There is an opportunity to work with validators so that they have the information and tooling needed to align on a plan and be better prepared for emergencies.
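To make the idea of an invariant that favors safety over liveness concrete, here is one possible shape such a check could take for this class of bug. It is a hypothetical sketch, not an invariant that exists in stellar-core: `InvariantViolation` and `check_archived_matches_live` are invented names, and state is modeled as plain dictionaries.

```python
class InvariantViolation(Exception):
    """Raised to stop processing rather than continue with bad state."""


def check_archived_matches_live(live_state: dict, evicted: dict) -> None:
    """Raise if any entry being archived differs from its current live value."""
    for key, archived_value in evicted.items():
        if live_state.get(key) != archived_value:
            raise InvariantViolation(f"stale value archived for {key!r}")


live_state = {"balance": "v3"}

# A consistent eviction passes silently.
check_archived_matches_live(live_state, {"balance": "v3"})

# An eviction carrying an outdated value halts instead of proceeding.
try:
    check_archived_matches_live(live_state, {"balance": "v1"})
    halted = False
except InvariantViolation:
    halted = True
```

Raising an exception here is the "safety over liveness" choice: the node stops making progress rather than writing an inconsistent value into the archive.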

To further strengthen our network and processes, we are implementing specific strategies:

  • Decoupling Risky and Time-Sensitive Work: High-risk features will be developed and validated separately from urgent updates to avoid issues stemming from complex changes.
  • Ensure Mission Critical Code Is Identified As Such: We will ensure that mission critical code is properly identified as such at design and review time and goes through a systematic code quality checklist during code review. This includes making proper use of abstractions, reducing code complexity, and having proper code invariants.
  • Improve Testing And Validation: As with coding standards, we will ensure that the testing techniques deployed against specific high-risk areas are adequate. We already use many techniques, from formal verification to property testing and fuzzing. We will make sure that at least one of these is deployed on critical components.
  • Enable Security Auditing Companies: We will rethink how we engage and collaborate with Security Auditing Companies to leverage their strengths. While code review has its merits, we can do more to help those companies understand the code better and to write better automation so that the code stays resilient over the long term.

By embracing these improvements, SDF reaffirms our commitment to transparency, reliability, and continued collaboration with all stakeholders in support of the Stellar network.