# ClickHouseDataset

`ClickHouseDataset` is the successor to the [`DailyDataset`](./2-dailydata-gcs.md) from the previous section. This data ingestion pattern was created out of the realization that writing data to GCS is good for portability (it can be read from both BigQuery and ClickHouse) but does not match the functionality we get by writing directly to ClickHouse, in particular:

- The state of the ingestion table can be used to do incremental ingestion (only ingesting what we don't already have).
- The `ReplacingMergeTree` engine helps deduplicate input data.

## Subclassing the `ClickHouseDataset` class

This is the same as for the `DailyDataset` class. We also follow the same directory structure convention where we have one directory and one `dataaccess.py` per datasource, for example:

```python
# FILE: src/op_analytics/datasources/governance/dataaccess.py
from op_analytics.coreutils.clickhousedata import ClickhouseDataset


class Governance(ClickhouseDataset):
    # Delegation events
    DELEGATE_CHANGED_EVENTS = "ingest_delegate_changed_events_v1"

    # Delegates
    DELEGATES = "ingest_delegates_v1"

    # Proposals
    PROPOSALS = "ingest_proposals_v1"
```

Each enum member corresponds to a table where data will be ingested.

## CREATE TABLE

For each ingestion table defined in `dataaccess.py` you are expected to provide a corresponding `CREATE TABLE` SQL file, which goes in the `ddl/` folder right beside `dataaccess.py`. The file name should match the enum member value used in the `ClickHouseDataset` subclass:

```
src/op_analytics/datasources/governance/ddl
├── ingest_delegate_changed_events_v1.sql
├── ingest_delegates_v1.sql
├── ingest_proposals_v1.sql
├── ingest_votes_v1.sql
└── ingest_voting_power_snaps_v1.sql
```

The `ClickHouseDataset` class provides a `create_table()` function which locates the DDL and runs it against the database. Most of the time users don't need to call `create_table()` directly because it is called every time `write()` is called.

## Writing Data

The `ClickHouseDataset` class has a `write()` method, very similar to the one in the `DailyDataset` class. The method accepts a polars dataframe and inserts it into the destination table. For example:

```python
DaoPowerIndex.CPI_SNAPSHOTS.write(powerindex.snapshots_df)
```

Keep in mind that there will be cases where going through a polars dataframe first is not the best approach. In those cases you can always insert data directly into the ClickHouse table. The enum members of a `ClickHouseDataset` subclass give you access to the db and table name strings so you can use them as needed. If you want to, you can even set up your own `INSERT INTO` statements in SQL files and then coordinate execution of the statements from the `execute.py` file (see the direct-insert sketch after the Execution section below).

## Execution

Similar to what we do for `DailyDataset` data sources, we have an `execute.py` file that contains the entrypoint function for the code that runs to ingest data. Here is an example:

```python
from .historical import HistoricalCPI
from .powerindex import ConcentrationOfPowerIndex

from .dataaccess import DaoPowerIndex


def execute_pull():
    powerindex = ConcentrationOfPowerIndex.fetch()
    historical = HistoricalCPI.fetch()

    DaoPowerIndex.CPI_SNAPSHOTS.write(powerindex.snapshots_df)
    DaoPowerIndex.CPI_COUNCIL_PERCENTAGES.write(powerindex.council_percentages_df)
    DaoPowerIndex.CPI_HISTORICAL.write(historical.df)
```

The business logic is isolated in separate dedicated modules (`historical.py`, `powerindex.py`) and the prepared dataframes are written out using the `write()` method.
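As referenced in the Writing Data section, below is a minimal sketch of the direct-insert approach that skips the polars dataframe. The `db` and `table` attribute names, the connection settings, and the `staging.proposals_raw` source table are assumptions for illustration only; check the `ClickhouseDataset` source for the actual attribute names and use the repo's shared ClickHouse client helpers where available.

```python
# A minimal sketch, not the repo's actual client setup. The enum member is
# assumed to expose `db` and `table` attributes; verify against ClickhouseDataset.
import clickhouse_connect

from .dataaccess import Governance


def insert_proposals_from_staging():
    # Connection details are illustrative; in practice you would go through
    # the repo's own ClickHouse client utilities.
    client = clickhouse_connect.get_client(host="localhost")

    # Build the fully qualified destination table name from the enum member.
    target = f"{Governance.PROPOSALS.db}.{Governance.PROPOSALS.table}"  # hypothetical attributes

    # Issue a plain INSERT INTO ... SELECT, bypassing polars entirely.
    client.command(f"INSERT INTO {target} SELECT * FROM staging.proposals_raw")
```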
## Scheduling

This is the same as for `DailyDataset`. We have custom-built Dagster assets that get assigned to a specific scheduled job in our Dagster `defs.py` (a minimal sketch is included at the end of this page).

## Monitoring

Monitoring is an area where we can improve `ClickHouseDataset`. We don't use markers here at the moment, so we don't have easy visibility into when data was written to a given table. This is something that must be implemented in the future and will become important as new data sources start adopting `ClickHouseDataset` over `DailyDataset`.
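For reference, here is the scheduling sketch mentioned above. It wires the `execute_pull()` entrypoint into a Dagster asset and assigns it to a scheduled job. The asset, job, and schedule names, the import path, and the cron expression are all hypothetical and do not reflect the repo's actual `defs.py`.

```python
from dagster import Definitions, ScheduleDefinition, asset, define_asset_job

# Hypothetical import path for the datasource's execute module.
from op_analytics.datasources.governance import execute


@asset
def governance_ingest():
    # Materializing the asset runs the datasource's ingestion entrypoint.
    execute.execute_pull()


# Assign the asset to a job and run it on a daily cron schedule.
governance_job = define_asset_job("governance_job", selection=[governance_ingest])
governance_schedule = ScheduleDefinition(job=governance_job, cron_schedule="0 4 * * *")

defs = Definitions(
    assets=[governance_ingest],
    jobs=[governance_job],
    schedules=[governance_schedule],
)
```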