Blockbatch Models#
Blockbatch models are a way to process onchain data from the data lake and store results back in the data lake.

The image above shows our mental model of blockbatch
processing (for a single chain), where each
cell represents a batch. For each batch we have the core ingestion root paths, plus many other
root paths that are computed starting from the raw data. We refer to these derived datasets as the
“blockbatch models”.
Processing raw data is expensive, so we compute as many blockbatch models as we can to make further analysis cheaper.
At OP Labs we have simple filtering models, such as:
blockbatch/contract_creation/create_traces_v1
blockbatch/token_transfers/erc20_transfers_v1
blockbatch/token_transfers/erc721_transfers_v1
We also have models that compute fee breakdowns:
blockbatch/refined_traces/refined_transactions_fees_v2
blockbatch/refined_traces/refined_traces_fees_v2
And we have models where we do more complex contract analysis (including decoding):
blockbatch/account_abstraction/enriched_entrypoint_traces_v2
blockbatch/account_abstraction/useroperationevent_logs_v2
blockbatch/account_abstraction_prefilter/entrypoint_logs_v1
blockbatch/account_abstraction_prefilter/entrypoint_traces_v1
Inexpensive Joins#
Blockchain data processing usually involves joining data from the different raw datasets. For example, you may want to join logs with traces, or transactions with traces.
Outside of blockbatch
data processing this is difficult because compute engines have to perform
a distributed join across many blocks to produce the joined result. It is possible, but difficult,
to get a compute engine to perform a storage-layout-based optimization of these joins
(see, for example, Storage Partitioned Joins in Apache Spark).
Within blockbatch
processing we can load the data for a specific batch into memory without
scanning the storage layer (batches are addressable in O(1) thanks to their deterministic paths).
We can leverage this to perform computations that require joins across root_paths much more
effectively, regardless of whether a root_path is one of the four core datasets or one derived from a model.
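As an illustration, below is a minimal sketch of a single-batch join using DuckDB (discussed in the next section). The bucket layout, file names, and column names are assumptions made for the example, not the actual schema.
import duckdb

con = duckdb.connect()

# Hypothetical deterministic paths for one batch of one chain (illustrative only).
# Reading gs:// paths requires DuckDB's httpfs extension and GCS credentials.
logs_path = "gs://example-bucket/ingestion/logs_v1/chain=op/000012345000.parquet"
traces_path = "gs://example-bucket/ingestion/traces_v1/chain=op/000012345000.parquet"

# Both inputs fit in memory for a single batch, so this is a plain local join,
# not a distributed one.
joined = con.sql(
    f"""
    SELECT l.block_number, l.transaction_hash, l.address, t.trace_address
    FROM read_parquet('{logs_path}') AS l
    JOIN read_parquet('{traces_path}') AS t
      ON l.block_number = t.block_number
     AND l.transaction_hash = t.transaction_hash
    """
)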
DuckDB Compute Engine#
Having tackled the problem of getting the right data into memory in the local context, we now have to make a choice: how do we perform computations on this data? Enter DuckDB.
We chose DuckDB as the compute engine for blockbatch
for the following reasons:
Performance. DuckDB is very fast for SQL queries.
Extensibility. DuckDB can be enhanced with native UDFs and Python based UDFs. The Python UDFs are critical for our ability to decode smart contract data.
Streaming capabilities. DuckDB can stream data from the storage layer to the compute engine and, most importantly, it can also stream results back to the storage layer. This minimizes the memory footprint required to perform computations.
Documentation. DuckDB has excellent documentation and a very active community.
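To make the extensibility and streaming points concrete, here is a minimal sketch with illustrative paths and a toy UDF standing in for real decoding logic: it registers a Python UDF and streams results back to storage with COPY ... TO.
import duckdb
from duckdb.typing import VARCHAR


def method_selector(calldata: str) -> str:
    # Toy stand-in for a decoding UDF: return the 4-byte method selector.
    return calldata[:10]


con = duckdb.connect()
con.create_function("method_selector", method_selector, [VARCHAR], VARCHAR)

# COPY ... TO streams the query results to parquet without materializing the
# full result set in memory. Input and output paths are illustrative.
con.execute(
    """
    COPY (
        SELECT transaction_hash, method_selector(input) AS selector
        FROM read_parquet('input/transactions_batch.parquet')
    )
    TO 'output/selectors.parquet' (FORMAT PARQUET)
    """
)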
Interface#
We wanted a single interface to perform all blockbatch
processing. The goal is to be able to
specify three pieces of information and have the system figure out the rest. The three pieces of
information are:
The date range we want to process.
The chains we want to process.
The models we want to compute.
Behind the scenes the system figures out which batches need to be processed, using markers to determine which batches have already been processed. Based on the required input data it also decides whether each batch is ready to be processed, which avoids running models on incomplete or missing input.
The interface is implemented as a single Python function which can be called like this:
compute_blockbatch(
    chains=normalize_chains("ALL,-kroma,-unichain_sepolia"),
    models=["refined_traces"],
    range_spec="@20240301:20240310",
    read_from=DataLocation.GCS,
    write_to=DataLocation.GCS,
    dryrun=False,
    force_complete=False,
    fork_process=True,
)
The read_from
and write_to
parameters specify where the input data is read from and where the results
will be stored. We support both GCS and the local file system, but in practice we use GCS most
of the time. The local file system is useful for development and testing.
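For local development, a call might look like the sketch below. The DataLocation member name for the local file system is an assumption here, so check the enum in the repo for the exact name.
compute_blockbatch(
    chains=normalize_chains("op"),
    models=["refined_traces"],
    range_spec="@20240301:20240302",
    read_from=DataLocation.LOCAL,  # assumption: local file system member name
    write_to=DataLocation.LOCAL,
    dryrun=True,  # presumably reports the plan without writing outputs
    force_complete=False,
    fork_process=False,
)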
Model Definitions in Python#
In the interface above, models are referenced by name as strings. The actual definition of
a model is a Python function decorated with the @register_model
decorator, which specifies
the input datasets, auxiliary templates, and outputs that the model produces.
An example model definition is shown below:
@register_model(
    input_datasets=[
        "ingestion/blocks_v1",
        "ingestion/traces_v1",
        "ingestion/transactions_v1",
    ],
    auxiliary_templates=[
        "refined_transactions_fees",
        "refined_traces/traces_projection",
        "refined_traces/traces_with_gas_used",
        "refined_traces/traces_txs_join",
    ],
    expected_outputs=[
        "refined_transactions_fees_v2",
        "refined_traces_fees_v2",
    ],
)
def refined_traces(
    ctx: DuckDBContext,
    input_datasets: dict[str, ParquetData],
    auxiliary_templates: dict[str, AuxiliaryTemplate],
) -> NamedRelations:
    ... LOGIC GOES HERE ...
The input_datasets
and expected_outputs
parameters in the decorator specify the data that is
needed and the data that will be produced. The auxiliary_templates
parameter specifies the location
of .sql
files that contain SQL logic that can be leveraged to compute the model. How the templates
are used is up to the model definition.
When a model is executed, the blockbatch
processing system finds out where the input data lives
in GCS and injects it into the model function via the input_datasets
argument. It also
injects the loaded SQL templates via the auxiliary_templates
argument.
The model function must return a NamedRelations
object, which is a mapping of output name to
a DuckDB relation or view object that contains the results.
The system will materialize the output DuckDB objects as parquet files in the storage layer.
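To show how these pieces fit together, here is a minimal sketch of a model body (not the actual refined_traces logic). The methods called on ParquetData and AuxiliaryTemplate are illustrative assumptions about the API; the real model definitions in the repo are the authoritative reference.
@register_model(
    input_datasets=["ingestion/logs_v1"],
    auxiliary_templates=["erc20_transfers"],
    expected_outputs=["erc20_transfers_v1"],
)
def erc20_transfers(
    ctx: DuckDBContext,
    input_datasets: dict[str, ParquetData],
    auxiliary_templates: dict[str, AuxiliaryTemplate],
) -> NamedRelations:
    # Register the batch's logs as a view in the DuckDB context so that the
    # SQL template can reference it by name (method name is an assumption).
    input_datasets["ingestion/logs_v1"].create_view(ctx, "logs")

    # Execute the SQL template to produce a DuckDB relation with the results
    # (method name is an assumption).
    transfers = auxiliary_templates["erc20_transfers"].run(ctx)

    # Map each expected output name to the relation that produces it. The
    # system materializes these relations as parquet files in storage.
    return {"erc20_transfers_v1": transfers}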
Prototyping#
The tricky part when prototyping a model is the scaffolding needed to set up and call
the model definition Python function. We have created a helper function that can be used from
an IPython notebook to do that. For reference, we recommend looking at any notebook under the
notebooks/adhoc/blockbatch_models
path in the repository.
Integration Testing#
Integration tests are crucial to ensure that the model logic is correct. Our integration test
framework has a very useful feature that allows you to fetch real production data and use it in
tests. When setting up an integration test you specify the range_spec (as blocks or dates) that
you want to test over; the first time you run the test it fetches the real data and stores it
as a parquet file in the testdata
directory in the repo.
With test data in hand the integration test framework will run the model. You can define any kind of assertions you want to verify the results.
For an example of an integration test see the refined_traces
test in the
tests/op_analytics/datapipeline/etl/models/test_refined_traces.py
file.
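As a rough illustration, the sketch below shows the kind of assertions such a test might make once a model's outputs are available as parquet files. The path, column names, and the choice to read the output back with DuckDB are all illustrative assumptions; the refined_traces test referenced above shows the real pattern.
import duckdb


def test_erc20_transfers_batch():
    con = duckdb.connect()

    # Illustrative path: output parquet produced by running the model over
    # previously fetched test data.
    output_path = "testdata/erc20_transfers_v1.parquet"

    # The output should not be empty.
    row_count = con.sql(
        f"SELECT count(*) FROM read_parquet('{output_path}')"
    ).fetchone()[0]
    assert row_count > 0

    # Every transfer row should have a contract address.
    null_count = con.sql(
        f"SELECT count(*) FROM read_parquet('{output_path}') "
        "WHERE contract_address IS NULL"
    ).fetchone()[0]
    assert null_count == 0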