1) Scope & Assumptions
- Domain: Public, non-authenticated CMEGroup.com pages (e.g., product specs, calendar summaries, education). No login · No paywall
- Purpose: Educational portfolio architecture; not for redistributing licensed market data.
- Acquisition: Respect robots.txt and Terms of Use; throttle requests; cache raw HTML (see the fetch sketch below).
- PII: None ingested; low-risk posture.
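A minimal polite-fetch sketch under those rules, assuming the requests library; the 5-second delay, cache directory, and contact address are illustrative defaults, not CME-specific values:

import time
from hashlib import sha256
from pathlib import Path
from urllib import robotparser

import requests

HEADERS = {"User-Agent": "ResearchBot/1.0 (+contact@example.com)"}
CACHE = Path("cache/raw_html")  # local raw-HTML cache (illustrative path)

robots = robotparser.RobotFileParser("https://www.cmegroup.com/robots.txt")
robots.read()

def polite_get(url: str, delay_s: float = 5.0) -> str | None:
    if not robots.can_fetch(HEADERS["User-Agent"], url):
        return None                                   # honor robots.txt
    cached = CACHE / (sha256(url.encode()).hexdigest() + ".html")
    if cached.exists():
        return cached.read_text()                     # serve from cache; no re-hit
    time.sleep(delay_s)                               # throttle between requests
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    CACHE.mkdir(parents=True, exist_ok=True)
    cached.write_text(resp.text)
    return resp.text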
2) Governance & Compliance
Data Contracts
- Source allowlist, crawl cadence, user-agent, throttling.
- Dataset contracts: schema, freshness SLA, retention, lineage.
- Quality bars: null thresholds, code regex, timestamp sanity.
Access & Security
- Least-privilege IAM; RW separation by dataset.
- Service accounts per stage; CMEK optional; VPC-SC optional.
- Labels: layer, owner, domain for governance & cost.
3) Target Architecture
Ingestion → Staging (Bronze) → Transform (Silver) → Mart (Gold) → Orchestration
Figure 1. End-to-end architecture with governed ingestion, eventing, and layered BigQuery design.
4) Data Model (Bronze → Silver → Gold)
Bronze (Staging)
- Raw extracts as normalized JSON with lineage: source_url, crawl_ts, hash_key.
- Partition by crawl_date; cluster by instrument_code, content_type.
- Raw HTML persisted in GCS with checksums.
Silver (Conformed)
- Entity tables: products, contract_specs, calendar_events, education_articles.
- Type enforcement; dedupe via hash_key; business keys instrument_code + as_of_date.
Gold (Marts)
- Analytics-ready facts/views: coverage, attribute snapshots, event counts, freshness indicators.
- Materialized views for hot paths; optional row-level security.
Example DDL
-- bronze.cme_pages (explicit schema; a CTAS over untyped NULLs would fail in BigQuery)
CREATE TABLE IF NOT EXISTS `proj.ds_bronze.cme_pages` (
  instrument_code STRING,
  content_type    STRING,
  source_url      STRING,
  crawl_ts        TIMESTAMP,
  hash_key        STRING,
  payload_json    JSON
)
PARTITION BY DATE(crawl_ts)
CLUSTER BY instrument_code, content_type;
5) Ingestion & ETL Framework (Batch + Incremental)
Python Extractors (Batch)
from datetime import datetime, timezone
from hashlib import sha256

HEADERS = {"User-Agent": "ResearchBot/1.0 (+contact@example.com)"}

def extract_payload(html: str) -> dict:
    # Parse & normalize metadata: instrument_code, title, attributes...
    return {"content_type": "contract_specs", "instrument_code": None, "attributes": {}}

def make_record(url: str, html: str) -> dict:
    h = sha256((url + html)[:1_000_000].encode()).hexdigest()  # stable dedupe key
    return {"source_url": url,
            "crawl_ts": datetime.now(timezone.utc).isoformat(),
            "hash_key": h,
            "payload_json": extract_payload(html)}
Incremental Merge (BigQuery)
MERGE `proj.ds_silver.contract_specs` T
USING `proj.ds_bronze.cme_pages` S
ON T.hash_key = S.hash_key
WHEN NOT MATCHED AND S.content_type = 'contract_specs' THEN
INSERT (instrument_code, as_of_ts, tick_size, contract_unit, hash_key)
VALUES (JSON_VALUE(S.payload_json, '$.instrument_code'), S.crawl_ts,
JSON_VALUE(S.payload_json, '$.attributes.tick_size'),
JSON_VALUE(S.payload_json, '$.attributes.contract_unit'), S.hash_key);
Changed content can be handled via SCD-2 in Silver or snapshot logic in Gold.
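A minimal SCD-2 sketch for the close-old-version half, assuming hypothetical is_current and valid_to columns on Silver and at most one new Bronze row per business key per batch; the insert half mirrors the MERGE above:

from google.cloud import bigquery

client = bigquery.Client()

# Close the current row when the same instrument arrives with a changed hash.
close_sql = """
MERGE `proj.ds_silver.contract_specs` T
USING `proj.ds_bronze.cme_pages` S
ON  T.instrument_code = JSON_VALUE(S.payload_json, '$.instrument_code')
AND T.is_current
AND T.hash_key != S.hash_key
WHEN MATCHED THEN
  UPDATE SET is_current = FALSE, valid_to = S.crawl_ts
"""
client.query(close_sql).result()  # then run the insert MERGE for the new versions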
Validation Hooks
def validate(record: dict) -> list[str]:
    # Minimal contract checks; a non-empty list marks the record for quarantine.
    errs = []
    if not record.get("source_url"): errs.append("missing source_url")
    if not record.get("hash_key"): errs.append("missing hash_key")
    return errs
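Typical wiring of the hook, where quarantine and load_to_bronze are hypothetical helpers (a dead-letter GCS prefix and a Bronze insert, respectively):

record = make_record(url, html)
if (errs := validate(record)):
    quarantine(record, errs)    # hypothetical: write record + errors to a dead-letter prefix
else:
    load_to_bronze(record)      # hypothetical: insert into ds_bronze.cme_pages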
Storage Layout
gs://<bucket>/cme/raw_html/{date}/<instrument|misc>/<hash>.html
gs://<bucket>/cme/extracted_json/{date}/<instrument|misc>/<hash>.json
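A sketch of writing into that layout with google-cloud-storage; the bucket name is a placeholder, and instrument falls back to misc as shown above:

from google.cloud import storage

def persist_raw(bucket_name: str, record: dict, html: str) -> str:
    # Mirrors raw_html/{date}/<instrument|misc>/<hash>.html
    date = record["crawl_ts"][:10]                       # YYYY-MM-DD from ISO timestamp
    inst = record["payload_json"].get("instrument_code") or "misc"
    path = f"cme/raw_html/{date}/{inst}/{record['hash_key']}.html"
    bucket = storage.Client().bucket(bucket_name)
    bucket.blob(path).upload_from_string(html, content_type="text/html")
    return f"gs://{bucket_name}/{path}"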
6) Data Validation & Contracts
Expectation Suite
expect_table_row_count_to_be_between(min_value=1)
expect_column_values_to_not_be_null(columns=["source_url","hash_key","crawl_ts"])
expect_column_values_to_match_regex(column="instrument_code", regex="^[A-Z0-9-_]{1,20}$")
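A plain-Python rendering of the same expectations, a sketch with no framework dependency; records are assumed to be the dicts produced by make_record:

import re

CODE_RE = re.compile(r"^[A-Z0-9-_]{1,20}$")

def run_suite(records: list[dict]) -> list[str]:
    failures = []
    if len(records) < 1:                                     # row count >= 1
        failures.append("row count below minimum")
    for r in records:
        for col in ("source_url", "hash_key", "crawl_ts"):   # not-null columns
            if not r.get(col):
                failures.append(f"null {col} in {r.get('source_url')!r}")
        code = r.get("instrument_code")
        if code is not None and not CODE_RE.match(code):     # code regex
            failures.append(f"bad instrument_code: {code!r}")
    return failures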
Lineage & Provenance
- _LINEAGE meta tables: source_url, crawl_ts, code_version, container digest.
- Data Catalog + policy tags (optional).
7) Schema Evolution & Versioning
Pattern
- Bronze: JSON payload absorbs drift.
- Silver: typed columns; additive changes via ALTER TABLE ADD COLUMN.
- Breaking changes: versioned tables (contract_specs_v2).
DDL Examples
ALTER TABLE `proj.ds_silver.contract_specs`
  ADD COLUMN IF NOT EXISTS tick_value NUMERIC;

CREATE OR REPLACE VIEW `proj.ds_gold.fact_products` AS
SELECT
  instrument_code,
  DATE(as_of_ts) AS as_of_date,
  ANY_VALUE(tick_size) AS tick_size,
  ANY_VALUE(contract_unit) AS contract_unit
FROM `proj.ds_silver.contract_specs`
GROUP BY 1, 2;
8) Orchestration (Cloud Composer / Airflow)
DAG Sketch
with DAG("cme_public_pipeline", schedule="@hourly", catchup=False) as dag:
    # Task definitions elided; see the runnable skeleton after the SLA bullets.
    start >> extract >> normalize >> load_bronze >> validate >> merge_silver >> build_gold >> end
SLAs & Alerts
- Freshness: ≤ 2h publish→mart.
- Retries: exponential backoff; idempotent tasks (wired in the skeleton below).
- Notify: Slack on failure; on-call if SLA breached twice.
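A runnable skeleton of the sketch above, assuming Airflow 2.4+ on Composer; EmptyOperator stands in for the real task bodies, and default_args implements the exponential-backoff retry bullet:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=2),
    "retry_exponential_backoff": True,     # backoff per the SLA bullets
    "max_retry_delay": timedelta(minutes=30),
}

with DAG(
    "cme_public_pipeline",
    schedule="@hourly",
    start_date=datetime(2024, 1, 1),       # placeholder start date
    catchup=False,
    default_args=default_args,
) as dag:
    names = ("start", "extract", "normalize", "load_bronze",
             "validate", "merge_silver", "build_gold", "end")
    tasks = [EmptyOperator(task_id=n) for n in names]
    for up, down in zip(tasks, tasks[1:]):
        up >> down                         # linear dependency chain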
9) Infrastructure as Code (Terraform)
Core Resources
resource "google_bigquery_dataset" "bronze" {
dataset_id = "ds_bronze"
location = "US"
labels = { layer = "bronze", owner = "olu", domain = "futures" }
}
resource "google_storage_bucket" "raw" {
name = "cme-public-raw"
location = "US"
uniform_bucket_level_access = true
}
Composer
resource "google_composer_environment" "main" {
name = "composer-cme"
region = "us-central1"
config {
software_config { image_version = "composer-3-airflow-2" }
}
}
10) Cost & Observability
Optimization
- Partition + cluster Silver/Gold; cache where possible.
- Delta-only processing; compact load files (~64MB).
- Autoscale/slots for predictable cost.
Monitoring
- Dashboards: freshness, row deltas, failure rate, cost/day.
- Audit: validation failures, schema drift, DAG runtimes.
- Correlation IDs: hash_key, dag_run_id.
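A freshness probe against the 2h SLA, a sketch assuming the Silver table above; the returned lag feeds the freshness dashboard and the on-call alert:

from google.cloud import bigquery

def silver_lag_minutes(client: bigquery.Client) -> int:
    # Minutes since the newest record landed in Silver; alert if > 120.
    sql = """
    SELECT TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(as_of_ts), MINUTE) AS lag_min
    FROM `proj.ds_silver.contract_specs`
    """
    return next(iter(client.query(sql).result())).lag_min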
11) Portfolio Links
- GitHub Pages: daramoladoes.github.io
- Repo: DaramolaDoes/daramoladoes.github.io