By Urvi Savla
This post is the next in the Composable Data Platform (CDP) blog series. The previous post, Hubble: Now Faster than Light, went into the details of the re-architecture of stellar-ETL using CDP and the benefits it brought. This post explains how we made Horizon reingestion 9x faster using CDP components.
Background
The size of Horizon data is enormous, and it is impractical for each instance to store the entire network history. Most Horizon instances, including those hosted by the Stellar Development Foundation (SDF), are configured with a specific retention window (we recommend 30 days) to manage data size.
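As a point of reference, here is a minimal sketch of how that window is configured, assuming the HISTORY_RETENTION_COUNT setting that Horizon exposes; the arithmetic in the comment is ours:

```bash
# Keep roughly 30 days of history. At about 5 seconds per ledger,
# 30 days * 86,400 s/day / 5 s per ledger ~= 518,400 ledgers.
export HISTORY_RETENTION_COUNT=518400
stellar-horizon   # start Horizon as usual; ledgers outside the window are trimmed
```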
However, there are scenarios where you may want to retrieve and process older ledger data outside this retention window. Horizon supports this through a process called reingestion, which fetches and reprocesses historical Stellar ledger data.
There are several key situations where a Horizon user might need to perform reingestion:
After a pause in operations: If a Horizon instance has been inactive for a period, reingestion may be necessary to bring it up to date, as it can be faster and more efficient than using live ingestion to catch up to the current network state.
Filling gaps in history: If ingestion was interrupted, reingestion backfills the missing ledger ranges so the stored history is contiguous.
Extending the retention window: If you widen the configured retention window, reingestion is needed to load the older ledgers that now fall within it.
After certain Horizon upgrades: Some releases change the ingestion schema and require historical data to be reingested.
Problem Statement
Reingestion is both time-consuming and resource-intensive, taking several days to process just one month of recent ledger history and months to reprocess the entire history.
The process is demanding because it relies on Captive Core to retrieve and process ledger data: Captive Core must replay historical transactions to regenerate each ledger's metadata, which is heavy on both CPU and disk.
Solution
Central to CDP is the data lake of precomputed ledger metadata. For details on how to create this data lake using Galexie, refer to the Introducing Galexie: Efficiently Extract and Store Stellar Data blog post.
Ingesting data from the data lake is significantly faster. What would normally take months to reingest now takes just a few days.
This speed improvement is due to two main factors:
Ledger metadata is readily available for direct download from Google Cloud Storage (GCS). Each file is compressed, making it very small, so minimal network bandwidth is required.
No transaction replay is needed. Because the ledger metadata in the data lake is precomputed, Horizon can ingest it directly rather than running Captive Core to regenerate it.
To accommodate the new architecture, the Horizon db reingest command now supports reingestion from the datastore. You invoke it much as you would for Captive Core, but with additional configuration for the datastore, such as the bucket address and data schema. For command and configuration details on using CDP for reingestion, refer to the reingestion guide.
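As a rough sketch of what that looks like (the bucket path, ledger range, and schema values below are placeholders, and flag spellings may differ across Horizon versions, so treat the reingestion guide as authoritative):

```bash
# Hypothetical datastore config; the values must match how Galexie
# wrote your data lake.
cat > datastore-config.toml <<'EOF'
type = "GCS"

[params]
destination_bucket_path = "my-gcs-bucket/stellar-ledgers"

[schema]
ledgers_per_file = 1
files_per_partition = 64000
EOF

# Reingest a ledger range from the data lake instead of Captive Core.
stellar-horizon db reingest range 100000 200000 \
  --ledgerbackend datastore \
  --datastore-config datastore-config.toml
```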
Benchmarking
To benchmark the reingestion performance of Captive Core and CDP, we conducted tests using an identical hardware setup for both methods.
Horizon supports parallel reingestion, meaning the reingestion range is divided into subranges that are ingested simultaneously. We wanted to assess the level of parallelization achievable with each method (Captive Core and CDP), so we reingested 10,000 ledgers with both at varying levels of parallelization. These are the results:
Captive Core: Performance was constrained by disk I/O and showed diminishing returns with more than four workers.
CDP: In contrast, CDP achieved much better parallelization, with optimal results using 16 workers.
For details on configuring parallel ingestion, refer to the guide on parallel ingestion workers.
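For illustration, here is how the earlier sketch extends to parallel reingestion, using the 16 workers that performed best for CDP in our tests (the ledger range is again a placeholder):

```bash
# Split the range into subranges and ingest them concurrently.
stellar-horizon db reingest range 100000 200000 \
  --parallel-workers 16 \
  --ledgerbackend datastore \
  --datastore-config datastore-config.toml
```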
Using the best parallel setup for each method, we measured the time to reingest 10,000 ledgers. However, older ledgers are less dense than recent ones, so older data takes less time to reingest. To account for this, we sampled 10,000 ledgers from each year since the inception of the Stellar network and extrapolated the time required to reingest the entire history.
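Put differently, the extrapolation works out to the following (the notation is ours, not from the benchmark):

$$T_{\text{total}} \approx \sum_{y} \frac{N_y}{10{,}000}\, t_y$$

where $N_y$ is the number of ledgers the network closed in year $y$ and $t_y$ is the measured time to reingest the 10,000-ledger sample drawn from year $y$.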
The results show that reingestion using Captive Core is projected to take approximately 66 days, while CDP (with precomputed ledger metadata) is expected to take around 7 days.
In this evaluation, Captive Core ran with 2 parallel workers, while CDP ran with 16 parallel workers.
Conclusion
With CDP, Horizon reingestion is now up to 9x faster, cutting down processing times by over 85%. However, even with these improvements, using Horizon to serve full historical data requires massive amounts of storage, around 40 TB and growing fast. In most cases, building your own applications using CDP offers a better path forward.
CDP’s precomputed ledger metadata allows you to build a much smaller custom dataset. And if you’re looking to reingest large amounts of data to populate your new dataset, CDP provides your application with the same huge performance benefits, making it ideal for creating efficient, focused applications.
This makes CDP a game-changer for anyone wanting to build flexible, scalable applications beyond Horizon. We encourage you to explore all that CDP offers for your own data needs!
See it in action yourself.