Author
Urvi Savla
Publishing date
Data
Developer Tool
Welcome to the latest installment in our series on the Composable Data Platform, the next generation data-access platform on Stellar. Today, we're excited to introduce the first component, Galexie, a lightweight application that extracts ledger data from the Stellar network. Read on to find out more!
One of the main challenges in working with Stellar network data is the tight coupling between the data extraction and transformation processes. Developers have limited tools available to read and save data directly from the Stellar network, which limits flexibility in their application design. To address this, we created Galexie as a dedicated extractor application, decoupling data ingestion from transformation. This approach offers greater versatility compared to hosting a monolithic service, like Horizon or Hubble.
The primary function of Galexie is to efficiently export the entire corpus of Stellar network data and store it in a cloud-based data lake. This creates a foundation of preprocessed data that can be rapidly accessed and transformed by various custom applications tailored to specific needs.
The data exported by Galexie supports tools like Hubble and Horizon in building derived data. Unlike the raw ledgers written to the History Archives, which is an unprocessed view of the network’s history, Galexie exports complete transaction metadata, offering precomputed data that is ready for use.
This design not only supports existing tools but also allows for more efficient data processing and opens up new possibilities for data analysis within the Stellar ecosystem. In the following sections, we'll explore Galexie's features, architecture, and potential applications.
Galaxie is a simple, lightweight application that bundles Stellar network data, processes it and writes it to an external data store. It includes the following capabilities:
Galexie is built around a few key design principles:
Non-goal: Galexie is not intended or designed for live ingestion or interacting with the Stellar network. The captive core is much better suited for live ingestion.
The following an architecture diagram depicting the various components within Galexie.
Like Stellar Core, Galaxie emits transaction metadata in the standard XDR format. This format is compatible with existing systems, like Horizon and Hubble, which allows developers to effortlessly swap the XDR data source of their application backends.
We decided to support storing multiple ledgers in each exported file. This reduces the number of files written to a data store and makes it significantly more efficient to download large ranges of data. We found during experimentation that bulk downloading data for the same ledger range is twice as fast when files contain a bundle of 64 ledgers compared with files containing a single ledger.
All uploaded files are compressed using Zstandard as the default compression algorithm to save upload and download bandwidth as well as storage space.
To allow further organization of data, we added support for partitioning the data via a directory structure. The number of ledgers per file and number of files per partition are both configurable.
We decided to encode additional information into the file and directory names. These names act as a sorting token and eliminate the need for separate indexing. The filename format includes two components:
This naming scheme ensures that files and directories are sorted in descending order, with the most recent ledgers appearing first, optimizing the retrieval for the latest ledgers.
For example, using 64,000 files per directory and 64 ledgers per file, the directory and file containing ledgers 64065 to 64128 would be named:
Galexie’s data store approach significantly enhances the operational efficiency and scalability of Stellar tools. Instead of relying on Captive Core to generate transaction metadata, these tools can now access precomputed transaction metadata from the datastore. This approach improves data retrieval speed by an order of magnitude!
In addition, by removing the dependency on Captive Core which is a compute-heavy component, Galaxie lowers hardware requirements and costs.
Here are two examples demonstrating the effectiveness of using this data store approach.
Before Galexie, Hubble, the Stellar Analytics Platform, used Captive Core to ingest data from the Network. For batch processing, this was resource-intensive because Captive Core’s catch-up phase took longer than reading and processing that batch of data from the Network. By leveraging a data lake, Hubble is decoupled from Captive Core, meaning it can access transaction metadata directly from GCS without incurring the expensive startup cost of Captive Core. This significantly reduces execution time and overhead.
Horizon, the Stellar API server, has the ability to reingest historical data within specified ledger ranges using an internal Captive Core instance. To manage large ledger ranges efficiently, Horizon can run multiple reingestion processes simultaneously, each accessing its own Captive Core. However, this approach is highly resource-intensive. Decoupling Horizon's reingestion from Captive Core enables direct data re-ingestion from the data lake, enhancing both speed and scalability.
Running your own Galexie instance allows you to create a customized data lake with extracted raw ledger data. This approach offers a couple of benefits:
For detailed instructions on how to install and run the service, refer to the Admin Guide.
We created a data lake using Galexie and noted the following observations:
Note: Time estimates related to backfilling a data lake depend on various factors such as the number of ledgers bundled in a single file, network latency, and bandwidth.
Our goal at SDF is to advance truly decentralized applications and build smaller, reusable software components. Galexie aligns with this vision, playing an important role in building the Composable Data Platform. We aim to expand Galexie functionality by adding support for other object stores and integrating with new transformation tools.
We welcome community contributions to support these future developments. Your involvement can help extend Galexie’s capabilities, add support for additional cloud storage solutions, and integrate new tools. For more details on how to get involved, refer to our Developer Guide.
In upcoming blog posts, we will focus on specific use cases, including refactoring Hubble and Horizon backends to use a data lake generated by Galexie. Stay tuned as we delve deeper into these applications and their benchmarks, and provide insights into how Galexie can enhance data handling and performance.
If you have questions, suggestions, or feedback, join us on Discord in the #hubble and #horizon channels. Our team and community members are there to discuss and provide support.
More to Explore
Article
Developer Tool
Data
This article is the first in an expansive series on the Composable Data Platform, the next generation data-access platform on Stellar. The Composable…
Article
Beginning August 1st, 2024, the Stellar Development Foundation's Mainnet instances of Horizon will provide one year of historical data, which will be…
The Newsletters
Hear it from us first. Sign up to get the real-time scoop on Stellar ecosystem news, functionalities, and resources.