# Ingestion

## Data Pipeline

```{warning}
Our data pipeline is currently under construction. The information below reflects our latest plans.
```

### Overview

Diagram ([link](https://excalidraw.com/#json=1MWxonIgxwWNA5xGOLJuk,W2_v6uGd8zfk_yln9cno6w))

```{image} oplabs-data-pipeline.png
:alt: pipeline
:class: bg-primary
:width: 60%
:align: center
```

### Audited Core Datasets

[Goldsky](https://docs.goldsky.com/) is our current raw onchain data source. Our ingestion process runs incrementally based on block number. For a given block range it will:

1. Fetch raw onchain data.
2. Run audits.
3. If audits pass, write to the audited dataset Iceberg tables in GCS.
4. If audits fail, alert the pipeline operators.

The audit process will help us ensure that the data is valid and self-consistent. For example, there should not be any block number gaps, and the transaction count in a block should match the number of records in the transactions table.

Audited datasets are stored in GCS as standalone Iceberg tables. This makes it easy to access the data with a variety of query engines, which should future-proof our operations.

### Processed Datasets

Custom processing also runs incrementally based on block number. This is where the most upstream data transformations defined by OP Labs take place. Some are generic, like decoding logs and traces. Some are more specific, like extracting transfers or user actions from the core data. The data is also written to GCS as Iceberg at this point.

Note that the Ingestion and Processing steps can be fused for efficiency. After auditing, we can run processing with the data in hand, saving one write-read round trip to GCS.

Our plan is to process data in block batches small enough that all of the core datasets (transactions, logs, traces) fit in memory. Any logic that requires joining across core datasets should be executed during processing, to take advantage of data locality. Performing joins downstream, after the core datasets have been written out, can lead to expensive data shuffling.

The intermediate Iceberg tables stored in GCS will be useful when iterating on the processing logic. We will be able to **backfill** any processed dataset without having to go all the way upstream to the raw onchain data provider.

### Public Datasets

Public core datasets will be date-partitioned and written to BigQuery. Starting with Parquet files in GCS will allow us to schedule load jobs, which are the most cost-effective way to ingest data into BQ.

### Downstream Data Modeling

For models downstream of the public BigQuery datasets we will use a more standard dbt setup. This would be similar to other solutions available in the ecosystem, such as [OSO](https://docs.opensource.observer/docs/how-oso-works/architecture#dbt-pipeline) or [Dune Spellbook](https://github.com/duneanalytics/spellbook).
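
### Illustrative Sketches

The sketches below illustrate a few of the steps described above. All names, schemas, and configuration values are placeholders rather than the production implementation.

To make the audit step in the Audited Core Datasets section concrete, here is a minimal Python sketch of the two consistency checks mentioned there: block number gaps and per-block transaction counts. The record shapes (`number`, `transaction_count`, and `block_number` fields) are assumptions for illustration.

```python
from typing import Iterable, Mapping


def audit_block_gaps(blocks: Iterable[Mapping]) -> list[str]:
    """Return failure messages if block numbers in the batch are not contiguous."""
    numbers = sorted(b["number"] for b in blocks)
    failures = []
    for prev, curr in zip(numbers, numbers[1:]):
        if curr != prev + 1:
            failures.append(f"block number gap between {prev} and {curr}")
    return failures


def audit_transaction_counts(
    blocks: Iterable[Mapping], transactions: Iterable[Mapping]
) -> list[str]:
    """Check each block's transaction count against rows in the transactions table."""
    tx_per_block: dict[int, int] = {}
    for tx in transactions:
        tx_per_block[tx["block_number"]] = tx_per_block.get(tx["block_number"], 0) + 1

    failures = []
    for b in blocks:
        expected = b["transaction_count"]
        actual = tx_per_block.get(b["number"], 0)
        if expected != actual:
            failures.append(
                f"block {b['number']}: expected {expected} transactions, found {actual}"
            )
    return failures


def run_audits(blocks: list[Mapping], transactions: list[Mapping]) -> None:
    """Raise if any audit fails; the real pipeline would alert operators instead."""
    failures = audit_block_gaps(blocks) + audit_transaction_counts(blocks, transactions)
    if failures:
        raise RuntimeError("audit failures: " + "; ".join(failures))
```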
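
Writing a processed batch out as an Iceberg table in GCS could look roughly like the following sketch, assuming a `pyiceberg` catalog has already been configured to point at the GCS warehouse. The catalog name, table identifier, and schema are hypothetical.

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Hypothetical catalog and table names; the real catalog configuration for
# the GCS warehouse would live in the pipeline's config, not inline here.
catalog = load_catalog("default")
table = catalog.load_table("processed.transfers")

# A tiny illustrative batch; the pipeline would build this from the in-memory
# core datasets (transactions, logs, traces) for the block range being processed.
batch = pa.table(
    {
        "block_number": pa.array([1000, 1001], type=pa.int64()),
        "from_address": pa.array(["0xaaa", "0xbbb"]),
        "amount": pa.array([10, 25], type=pa.int64()),
    }
)

# Appends the Arrow batch to the Iceberg table as new data files.
table.append(batch)
```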
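
For the Public Datasets step, a scheduled load job from date-partitioned Parquet files in GCS into BigQuery might be set up along these lines with the `google-cloud-bigquery` client. The project, bucket, dataset, and partition column (`dt`) are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # One partition per day, matching the date-partitioned public datasets.
    time_partitioning=bigquery.TimePartitioning(field="dt"),
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/transactions/dt=2024-01-01/*.parquet",  # placeholder bucket/path
    "my-project.public_core.transactions",  # placeholder destination table
    job_config=job_config,
)
load_job.result()  # Block until the load job completes.
```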