Author
Molly Karcher
Publishing date
Developer Tool
Data
This article is the first in an expansive series on the Composable Data Platform, the next-generation data-access platform on Stellar. It gives a high-level overview of the architecture, as well as its goals and capabilities. Follow-ups will go into depth on the different components of the architecture, dive into meaningful use cases, and explore how you can use the platform to supercharge your own applications.
The State of the World
Today, SDF owns and operates two products that mediate data from the Stellar blockchain: Horizon, an API that interacts with the network, and Hubble, a full-history analytics dataset. At present, SDF hosts public versions of these products to encourage small-scale development and experimentation on the network. This model has led to challenges in the ecosystem:
To support and foster a healthy, decentralized network, it is important to us that data storage and access are distributed across as many ecosystem participants as possible.
Following both the launch of Soroban and the change in SDF Horizon’s retention window, we are seeing a sharp rise in interest in providing network data as a service from analytics providers, infrastructure providers, indexers, and the like. Unfortunately, given the current state of the world, it’s harder than it should be for these kinds of providers to on-ramp quickly when trying to serve Stellar data. Because Horizon and Hubble are monolithic, they tend to drive an all-or-nothing approach: you’re either nudged into adopting SDF’s chosen data model, or you roll your own integration entirely independent of these tools.
The Composable Data Platform (hereafter referred to as CDP) is a collection of open-source tools and libraries that work together to streamline data access for the Stellar ecosystem. The intent is to let each ecosystem participant pick the pieces they need and customize their solution based on their individual application needs.
The key components that make up CDP include:
At first glance, this might appear quite simple, even obvious. Indeed, this is in essence what Hubble and Horizon both do today. The two share code for some of these components (as they both use the ingest SDK), but for the most part they represent parallel, divergent codebases that often tightly couple most of this functionality together. This makes it extremely difficult (if not impossible) for you to customize deployments of them to suit your own data needs.
CDP reimagines that monolithic architecture by clearly defining and then externally exposing each of those inner components. Components are represented as standalone interfaces that can be configured and operated independently, while seamlessly interlocking with every other component to provide a customized data layer for your application.
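As a rough illustration of that idea, the Go sketch below models a data pipeline as three independently swappable pieces. Every name in it (`Source`, `Transform`, `Sink`, `RawLedger`, `PaymentSummary`, and the string payload format) is hypothetical and chosen only for illustration; none of them are CDP’s actual interfaces.

```go
package main

import (
	"fmt"
	"strings"
)

// RawLedger is a hypothetical stand-in for one ledger's raw metadata.
type RawLedger struct {
	Sequence uint32
	Payload  string // placeholder for the real XDR-encoded ledger metadata
}

// Source yields ledgers in order; it reports false when exhausted.
type Source interface {
	Next() (RawLedger, bool)
}

// Transform maps a raw ledger to an application-specific record,
// returning false to drop ledgers the application does not care about.
type Transform[T any] func(RawLedger) (T, bool)

// Sink persists records in whatever data store the application chooses.
type Sink[T any] interface {
	Write(T) error
}

// Run wires the three pieces together. Each piece can be replaced
// without touching the other two.
func Run[T any](src Source, tf Transform[T], sink Sink[T]) error {
	for {
		ledger, ok := src.Next()
		if !ok {
			return nil
		}
		record, keep := tf(ledger)
		if !keep {
			continue
		}
		if err := sink.Write(record); err != nil {
			return fmt.Errorf("writing ledger %d: %w", ledger.Sequence, err)
		}
	}
}

// sliceSource is an in-memory Source used only for this example.
type sliceSource struct {
	ledgers []RawLedger
	pos     int
}

func (s *sliceSource) Next() (RawLedger, bool) {
	if s.pos >= len(s.ledgers) {
		return RawLedger{}, false
	}
	ledger := s.ledgers[s.pos]
	s.pos++
	return ledger, true
}

// memorySink is an in-memory Sink used only for this example.
type memorySink[T any] struct {
	records []T
}

func (m *memorySink[T]) Write(record T) error {
	m.records = append(m.records, record)
	return nil
}

// PaymentSummary is one possible application-defined schema.
type PaymentSummary struct {
	Sequence uint32
	Note     string
}

// paymentsOnly keeps ledgers whose placeholder payload marks a payment.
func paymentsOnly(l RawLedger) (PaymentSummary, bool) {
	if !strings.HasPrefix(l.Payload, "payment:") {
		return PaymentSummary{}, false
	}
	return PaymentSummary{Sequence: l.Sequence, Note: strings.TrimPrefix(l.Payload, "payment:")}, true
}

func main() {
	src := &sliceSource{ledgers: []RawLedger{
		{Sequence: 100, Payload: "payment:alice->bob"},
		{Sequence: 101, Payload: "trade:XLM/USDC"},
		{Sequence: 102, Payload: "payment:carol->dave"},
	}}
	sink := &memorySink[PaymentSummary]{}
	if err := Run[PaymentSummary](src, paymentsOnly, sink); err != nil {
		fmt.Println("pipeline failed:", err)
		return
	}
	for _, r := range sink.records {
		fmt.Printf("ledger %d: %s\n", r.Sequence, r.Note)
	}
}
```

In this sketch, swapping the in-memory sink for a database writer, or the payment filter for a different transform, changes one value passed to `Run` and nothing else, which is the kind of independence the component model aims for.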
This unlocks endless possibilities! Importantly, it gives you, the developer, the power to completely customize your data consumption and access patterns, depending on what kind of application you’re building. For example:
Consider the vastly different backend architectural choices that could be made across these examples:
Ultimately, only you know what data your application needs, so only you can decide the optimal schema (and data store) to hold it. If you need help thinking through options or figuring out how CDP fits in, reach out to us on Discord! We have two channels: #hubble for analytics questions, and #horizon for operational, real-time questions. We’re available to help brainstorm, and your feedback will help us decide what meaningful extensions we add next to CDP.
The first piece of the puzzle, Galexie, is out and available for public use; check it out on GitHub or Docker Hub! It currently supports a single object storage option, Google Cloud Storage (GCS), and we’re eager to hear feedback on which storage backends would be most valuable to you in the future.
We are taking advantage of the performance gains and simplicity of CDP by refactoring portions of our existing products, Hubble and Horizon. SDF’s Hubble now uses a Galexie-exported GCS data lake as its backend; see stellar-etl for details. Horizon support for re-ingesting from a Galexie-exported data lake is available in v2.32.0. Later posts in this series will go in-depth into how we refactored our services, as well as what you can expect in terms of cost and wall-clock time if you opt to use these new components.
To start building your own application independent of Horizon or Hubble, take a look at our ingest SDK. The SDK encapsulates the Ledger Backend interface of CDP, which you can use to build your own ingestion pipeline configured to consume from a Galexie-exported data lake.
Stay tuned for next week’s post on Galexie, where we’ll be doing a deep-dive on installation, development, and usage. Galexie is the first major component of CDP, enabling developers to efficiently export and memoize Stellar network data for processing.
We’re actively developing a library to house our processors (or transforms), which will help transform the raw XDR format into data models that will be more familiar if you’re used to accessing data through Horizon or Hubble.
This all may sound a little overwhelming and abstract, but we’ll be publishing extensive example implementations to demystify the new platform. We’ll also be releasing more content in this series, where we’ll highlight existing key use cases and illustrate how you could use the full power of CDP in your own application.
In the meantime, join us in our Developer Discord to chat through any questions, concerns, or feature requests as we work to modernize data access on Stellar!