# General Purpose Data Pipelines: Transforms

In the previous section we went over how to build data pipelines that process onchain data using ClickHouse as the compute engine. In this section we cover data pipelines for a wider variety of use cases.

The "transforms" system was developed before we built the blockbatch load system, and so we also use it for onchain data (e.g. the `interop` transform group). That said, it would be a good idea to migrate some of the onchain data tables from the transforms system to the blockbatch system, since the latter has better data integrity guarantees and better monitoring.

## Transform Groups

A transform group is a collection of tables and views that are processed together. You can think of a group as a mini-pipeline where a series of tables are populated one after the other.

Let's take the `interop` transform group as an example. This group has the following ClickHouse tables:

- `dim_erc20_with_ntt_first_seen_v1`
- `fact_erc20_oft_transfers_v1`
- `fact_erc20_ntt_transfers_v1`
- `dim_erc20_ntt_first_seen_v1`
- `dim_erc20_oft_first_seen_v1`
- `fact_erc20_create_traces_v2`
- `export_fact_erc20_create_traces_v2`

## Transform Specification

Each table in a transform group is defined by two SQL statements: a `CREATE` statement that defines the table schema and ClickHouse engine (including the very important `ORDER BY` clause), and an `INSERT` statement that is used to populate the table.

### Directory Structure and Naming Convention

The directory structure for a transform group consists of two folders:

- `create/`: Has the `CREATE` statements for all the tables in the group.
- `update/`: Has the `INSERT` statements that populate the tables in the group.

The naming convention for files in each of the folders is the following:

- `[INDEX]_[tablename].sql`

where `[INDEX]` is a number that indicates the order of execution and `[tablename]` is the name of the table.

Let's take the `fees` transform group as an example:

```
src/op_analytics/transforms/fees
├── create
│   ├── 01_agg_daily_transactions_grouping_sets.sql
│   ├── 02_agg_daily_transactions.sql
│   ├── 10_view_agg_daily_transactions_tx_from_tx_to_method.sql
│   ├── 11_view_agg_daily_transactions_tx_to_method.sql
│   └── 12_view_agg_daily_transactions_tx_to.sql
└── update
    ├── 01_agg_daily_transactions_grouping_sets.sql
    └── 02_agg_daily_transactions.sql
```

This group has 2 tables and 3 views. For each of the two tables we have a corresponding `update` SQL statement that is used to populate the table.

Note that it is perfectly fine to have any kind of SQL file in the `create` folder, although it's better not to get too creative. If you inspect the SQL files for the views you will see that they are straight-up `CREATE VIEW` ClickHouse statements.

## Transform Execution Model

The execution model for a transform group is very simple:

- When the system runs a transform group it first runs all the `CREATE` statements in order according to their index.
- It then runs all the `INSERT` statements in order according to their index.
- The system provides the execution date as a query parameter (`dt`) that may or may not be utilized by the `INSERT` statement.

## Building Data Pipelines

To build a `transforms` pipeline all you need to do is create a new directory in the `src/op_analytics/transforms` directory and add the appropriate SQL files, following the conventions described above.
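To make the conventions concrete, here is a minimal sketch of the two SQL files for a single table in a new group. The group, table name, columns, and source table (`raw_transactions`) are all hypothetical, invented for illustration; only the `CREATE`/`INSERT` pairing, the file naming, and the `dt` query parameter (ClickHouse `{dt:Date}` syntax) follow the conventions this section describes.

```sql
-- create/01_fact_daily_example_v1.sql
-- Hypothetical table. The engine and ORDER BY clause are the
-- important parts of any transform CREATE statement.
CREATE TABLE IF NOT EXISTS fact_daily_example_v1
(
    dt       Date,
    chain    String,
    tx_count UInt64
)
ENGINE = ReplacingMergeTree
ORDER BY (dt, chain);

-- update/01_fact_daily_example_v1.sql
-- Populates a single dt partition. The runner supplies the
-- execution date as the `dt` query parameter.
INSERT INTO fact_daily_example_v1
SELECT
    dt,
    chain,
    count() AS tx_count
FROM raw_transactions  -- hypothetical source table
WHERE dt = {dt:Date}
GROUP BY dt, chain;
```

Because the `INSERT` filters on `{dt:Date}`, re-running the transform for a date reprocesses only that date, which is what makes notebook-driven backfills (described below) straightforward.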
### Execution

We provide a function to execute a transform group for a given date range:

- `src.op_analytics.transforms.main.execute_dt_transforms`

For a detailed description of the parameters see the function's docstring.

### Prototyping and Backfilling

The `execute_dt_transforms` function is used from an IPython notebook to prototype and also to manually backfill data. Notebooks for transforms are located in the `notebooks/adhoc/clickhouse_transforms` directory. Browse that directory for examples.

### Scheduling

All transforms-related Dagster assets are defined in the `src/op_analytics/dagster/assets/transforms.py` file. There is no hard and fast rule for how to define the assets. Generally we schedule one asset per group, but there are cases where we want more control over what is executed on each day, and so each asset can decide how it calls the `execute_dt_transforms` function.

Similarly for Dagster jobs, there is no hard and fast rule, but we generally define one scheduled job per transform group. The jobs are defined in the `src/op_analytics/dagster/defs.py` file.

### Markers

We use markers to track the execution of the transforms. The markers are stored in the `etl_monitor.transform_dt_markers` table in ClickHouse.

### Monitoring

Unfortunately we have not yet built monitoring tools for the transforms system. One idea we have discussed is to include `quality` tables as part of a transform group: a `quality` table contains information about anything that might be wrong with the data in the group. A sketch of what such a table could look like is shown below.
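Since quality tables are only an idea at this point, the following is purely a hypothetical sketch, not an existing convention. The table name, check name, and the fact table it inspects are invented; the point is that a quality table could be just another `CREATE`/`INSERT` pair in the group, executed last by virtue of its high index.

```sql
-- create/90_quality_my_group_v1.sql
-- Hypothetical quality table: one row per (dt, check).
CREATE TABLE IF NOT EXISTS quality_my_group_v1
(
    dt         Date,
    check_name String,
    observed   UInt64,
    is_failure UInt8
)
ENGINE = ReplacingMergeTree
ORDER BY (dt, check_name);

-- update/90_quality_my_group_v1.sql
-- Flags an empty dt partition in the (hypothetical) fact table.
INSERT INTO quality_my_group_v1
SELECT
    {dt:Date}             AS dt,
    'empty_partition'     AS check_name,
    count()               AS observed,
    if(count() = 0, 1, 0) AS is_failure
FROM fact_daily_example_v1
WHERE dt = {dt:Date};
```

A scheduled job or ad hoc query could then alert whenever rows with `is_failure = 1` show up for a recent `dt`.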