
GCP-Native BigQuery Data Mart — CME Group (Public Data)

Conceptual portfolio wiki using public CMEGroup.com pages: governed ingestion, layered BigQuery (Bronze→Silver→Gold), Python ETL (batch + incremental), validation contracts, schema evolution, Composer orchestration, and Terraform IaC.

Tech: BigQuery • Cloud Storage • Pub/Sub • Dataflow • Cloud Composer • Python • Terraform
Tags: GCP-Native, BigQuery

1) Scope & Assumptions

  • Domain: Public, non-authenticated CMEGroup.com pages (e.g., product specs, calendar summaries, education). No login · No paywall
  • Purpose: Educational portfolio architecture; not for redistributing licensed market data.
  • Acquisition: Respect robots.txt and Terms of Use; throttle requests; cache raw HTML.
  • PII: None ingested; low-risk posture.

2) Governance & Compliance

Data Contracts

  • Source allowlist, crawl cadence, user-agent, throttling.
  • Dataset contracts: schema, freshness SLA, retention, lineage.
  • Quality bars: null thresholds, code regex, timestamp sanity.
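A dataset contract can be expressed as a small checkable structure. A minimal sketch, assuming illustrative field names (`freshness_sla_hours`, `null_thresholds`, `code_regex`) rather than any fixed GCP or CME schema:

```python
# Sketch only: contract fields below are illustrative, not a real schema.
import re
from datetime import datetime, timedelta, timezone

CONTRACT = {
    "dataset": "ds_silver.contract_specs",
    "freshness_sla_hours": 2,
    "null_thresholds": {"instrument_code": 0.0},  # max allowed null ratio
    "code_regex": r"^[A-Z0-9_-]{1,20}$",
}

def check_freshness(last_load_ts: datetime, contract: dict) -> bool:
    """True if the latest load is within the freshness SLA."""
    age = datetime.now(timezone.utc) - last_load_ts
    return age <= timedelta(hours=contract["freshness_sla_hours"])

def check_code(code: str, contract: dict) -> bool:
    """True if an instrument code matches the contract regex."""
    return re.fullmatch(contract["code_regex"], code) is not None
```

In practice these checks would run as a pre-merge gate in the pipeline, failing the load rather than silently publishing stale or malformed rows.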

Access & Security

  • Least-privilege IAM; RW separation by dataset.
  • Service accounts per stage; CMEK optional; VPC-SC optional.
  • Labels: layer, owner, domain for governance & cost.

3) Target Architecture

CMEGroup.com Public Pages → Cloud Storage (raw_html/ • extracted_json/) → Pub/Sub (events) → Dataflow / Beam (Parse • Normalize) → BigQuery (Bronze) → BigQuery (Silver) → BigQuery (Gold), orchestrated end-to-end by Cloud Composer.
Figure 1. End-to-end architecture with governed ingestion, eventing, and layered BigQuery design.

4) Data Model (Bronze → Silver → Gold)

Bronze (Staging)

  • Raw extracts as normalized JSON with lineage: source_url, crawl_ts, hash_key.
  • Partition by crawl_date; cluster by instrument_code, content_type.
  • Raw HTML persisted in GCS with checksums.

Silver (Conformed)

  • Entity tables: products, contract_specs, calendar_events, education_articles.
  • Type enforcement; dedupe via hash_key; business keys instrument_code + as_of_date.
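The hash_key dedupe can be sketched in plain Python (records as dicts carrying a `hash_key` field); in the warehouse the same logic is the MERGE's `WHEN NOT MATCHED` guard:

```python
# Illustrative in-memory version of the hash_key dedupe applied in Silver.
def dedupe_by_hash(records: list[dict]) -> list[dict]:
    """Keep the first record per hash_key, preserving input order."""
    seen: set[str] = set()
    out: list[dict] = []
    for r in records:
        if r["hash_key"] not in seen:
            seen.add(r["hash_key"])
            out.append(r)
    return out
```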

Gold (Marts)

  • Analytics-ready facts/views: coverage, attribute snapshots, event counts, freshness indicators.
  • Materialized views for hot paths; optional row-level security.

Example DDL

-- bronze.cme_pages
CREATE TABLE IF NOT EXISTS `proj.ds_bronze.cme_pages` (
  instrument_code STRING,
  content_type    STRING,
  source_url      STRING,
  crawl_ts        TIMESTAMP,
  hash_key        STRING,
  payload_json    JSON
)
PARTITION BY DATE(crawl_ts)
CLUSTER BY instrument_code, content_type;

5) Ingestion & ETL Framework (Batch + Incremental)

Python Extractors (Batch)

from datetime import datetime, timezone
from hashlib import sha256

HEADERS = {"User-Agent": "ResearchBot/1.0 (+contact@example.com)"}

def extract_payload(html: str) -> dict:
    # Parse & normalize metadata: instrument_code, title, attributes...
    return {"content_type": "contract_specs", "instrument_code": None, "attributes": {}}

def make_record(url: str, html: str) -> dict:
    # Hash the first 1 MB of URL + HTML for a stable dedupe key.
    h = sha256((url + html)[:1_000_000].encode()).hexdigest()
    return {"source_url": url,
            "crawl_ts": datetime.now(timezone.utc).isoformat(),
            "hash_key": h,
            "payload_json": extract_payload(html)}

Incremental Merge (BigQuery)

MERGE `proj.ds_silver.contract_specs` T
USING `proj.ds_bronze.cme_pages` S
ON T.hash_key = S.hash_key
WHEN NOT MATCHED AND S.content_type = 'contract_specs' THEN
  INSERT (instrument_code, as_of_ts, tick_size, contract_unit, hash_key)
  VALUES (JSON_VALUE(S.payload_json, '$.instrument_code'), S.crawl_ts,
          JSON_VALUE(S.payload_json, '$.attributes.tick_size'),
          JSON_VALUE(S.payload_json, '$.attributes.contract_unit'), S.hash_key);

Changed content can be handled via SCD-2 in Silver or snapshot logic in Gold.
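The SCD-2 option can be sketched in memory, assuming Silver rows carry `key`, `attrs`, `valid_from`, `valid_to`, and `is_current` (names are illustrative); the warehouse MERGE would perform the same close-and-insert:

```python
# Hedged sketch of SCD-2 close-and-insert; field names are assumptions.
from datetime import datetime, timezone

def apply_scd2(history: list[dict], key: str, attrs: dict) -> list[dict]:
    """Close the current row for `key` if attrs changed, then open a new one."""
    now = datetime.now(timezone.utc)
    current = next((r for r in history
                    if r["key"] == key and r["is_current"]), None)
    if current and current["attrs"] == attrs:
        return history            # no change: nothing to do
    if current:                   # change detected: close the old version
        current["valid_to"] = now
        current["is_current"] = False
    history.append({"key": key, "attrs": attrs,
                    "valid_from": now, "valid_to": None, "is_current": True})
    return history
```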

Validation Hooks

def validate(record):
    errs = []
    if not record.get("source_url"): errs.append("missing source_url")
    if not record.get("hash_key"):   errs.append("missing hash_key")
    return errs

Storage Layout

gs://<bucket>/cme/raw_html/{date}/<instrument|misc>/<hash>.html
gs://<bucket>/cme/extracted_json/{date}/<instrument|misc>/<hash>.json

6) Data Validation & Contracts

Expectation Suite

expect_table_row_count_to_be_between(min_value=1)
expect_column_values_to_not_be_null(column="source_url")
expect_column_values_to_not_be_null(column="hash_key")
expect_column_values_to_not_be_null(column="crawl_ts")
expect_column_values_to_match_regex(column="instrument_code", regex="^[A-Z0-9_-]{1,20}$")

Lineage & Provenance

  • _LINEAGE meta tables: source_url, crawl_ts, code_version, container digest.
  • Data Catalog + policy tags (optional).

7) Schema Evolution & Versioning

Pattern

  • Bronze: JSON payload absorbs drift.
  • Silver: typed columns; additive changes via ALTER TABLE ADD COLUMN.
  • Breaking changes: versioned tables (contract_specs_v2).
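The additive path can be automated: diff a desired Silver schema against the live one and emit only `ADD COLUMN` statements, leaving breaking changes to versioned tables. A sketch with illustrative column maps:

```python
# Sketch: generate additive-only DDL from a schema diff (column maps
# are simple name -> type dicts for illustration).
def additive_alters(table: str, live: dict, desired: dict) -> list[str]:
    """Return ALTER TABLE statements for columns in desired but not live."""
    stmts = []
    for col, typ in desired.items():
        if col not in live:
            stmts.append(
                f"ALTER TABLE `{table}` ADD COLUMN IF NOT EXISTS {col} {typ};")
    return stmts
```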

DDL Examples

ALTER TABLE `proj.ds_silver.contract_specs`
ADD COLUMN IF NOT EXISTS tick_value NUMERIC;

CREATE OR REPLACE VIEW `proj.ds_gold.fact_products` AS
SELECT
  instrument_code,
  DATE(as_of_ts) AS as_of_date,
  ANY_VALUE(tick_size) AS tick_size,
  ANY_VALUE(contract_unit) AS contract_unit
FROM `proj.ds_silver.contract_specs`
GROUP BY 1,2;

8) Orchestration (Cloud Composer / Airflow)

DAG Sketch

from datetime import datetime
from airflow import DAG

with DAG("cme_public_pipeline", schedule="@hourly",
         start_date=datetime(2024, 1, 1), catchup=False) as dag:
    # Tasks elided: each is a PythonOperator / BigQuery job in practice.
    start >> extract >> normalize >> load_bronze >> validate >> merge_silver >> build_gold >> end

SLAs & Alerts

  • Freshness: ≤ 2h publish→mart.
  • Retries: exponential backoff; idempotent tasks.
  • Notify: Slack on failure; on-call if SLA breached twice.
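The retry policy above can be sketched as exponential backoff around an idempotent task (delay parameters are illustrative; in Composer this maps to the task's `retries` and `retry_exponential_backoff` settings):

```python
# Sketch of exponential backoff; base_delay and retries are illustrative.
import time

def run_with_backoff(task, retries: int = 3, base_delay: float = 1.0,
                     sleep=time.sleep):
    """Run `task`; on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * (2 ** attempt))
```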

9) Infrastructure as Code (Terraform)

Core Resources

resource "google_bigquery_dataset" "bronze" {
  dataset_id = "ds_bronze"
  location   = "US"
  labels     = { layer = "bronze", owner = "olu", domain = "futures" }
}

resource "google_storage_bucket" "raw" {
  name                              = "cme-public-raw"
  location                          = "US"
  uniform_bucket_level_access       = true
}

Composer

resource "google_composer_environment" "main" {
  name   = "composer-cme"
  region = "us-central1"
  config {
    software_config { image_version = "composer-3-airflow-2" }
  }
}

10) Cost & Observability

Optimization

  • Partition + cluster Silver/Gold; cache where possible.
  • Delta-only processing; compact load files (~64MB).
  • Autoscale/slots for predictable cost.
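Delta-only processing reduces to a high-watermark filter: each run touches only rows newer than the last successful load. A sketch assuming ISO-8601 `crawl_ts` strings (which compare correctly as text); names are illustrative:

```python
# Sketch of high-watermark incremental selection; field names assumed.
def select_delta(records: list[dict], watermark: str) -> list[dict]:
    """Return records with crawl_ts strictly after the stored watermark."""
    return [r for r in records if r["crawl_ts"] > watermark]

def advance_watermark(records: list[dict], watermark: str) -> str:
    """New watermark = max crawl_ts seen, or the old one if no rows."""
    return max([r["crawl_ts"] for r in records] + [watermark])
```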

Monitoring

  • Dashboards: freshness, row deltas, failure rate, cost/day.
  • Audit: validation failures, schema drift, DAG runtimes.
  • Correlation IDs: hash_key, dag_run_id.

11) Portfolio Links