Architecture#
The diagram below shows the high level architecture of the OP Labs Data Platform.

Ingestion#
We ingest raw-onchain data from Goldsky, using a Goldsky-owned ClickHouse instance as the data interface.
We also ingest data from numerous third-party sources. Primarily from API endpoints, but also by other means (public gcs buckets, reading onchain contracts, etc).
Storage#
Our primary storage layer is Google Cloud Storage (GCS). All of the data we ingest is first stored in GCS before any further processing.
We use a single GCS bucket for all data, with the following naming convention:
gs://BUCKET/ROOT_PATH/partitionA=0/partitionB=1/file.parquet
The ROOT_PATH
of a dataset is almost like a table name. It can be nested, so we can group related
datasets together in a hierarchy.
The most common partition used is dt
(for date). We use it in almost all our datasets as we
generally stick to daily data processing. For multichain datasets we also use the chain
partition.
The file format used is always Parquet.
Below we show some example root paths in GCS. These paths are for a family of tables called
blockbatch
(for the approach used to process them). As you can see we have multiple levels of
neseting:

ClickHouse Data Warehouse#
We use ClickHouse as our data warehouse. From ClickHouse we can use the S3 Table Engine to read data directly from GCS. We have a number of convenience functions that let us read by specifying the root path of the data in GCS and a partition value.
Processing and Orchestration#
We use dagster running on kubernetes to orchestrate ingestion and processing of data. This includes:
Ingestion, which reads from our input data sources and writes to GCS.
Processing, which reads from GCS and writes back to GCS.
Transformations, which runs on the ClickHouse Data Warehouse.
Exports, which read from GCS or ClickHouse and write to BigQuery. We use this for public tables and also to expose data to insight toosl that cannot read from ClickHouse directly.
Insights#
We use a number of tools to help us create dashboards and reports. Our main tool is Hex but we also use Streamlit and Google Sheets.