1) Scope & Assumptions
- Domain: Public, non-authenticated CMEGroup.com pages (e.g., product specs, calendar summaries, education); no login, no paywall.
- Purpose: Educational portfolio architecture; not for redistributing licensed market data.
- Acquisition: Respect robots.txt and Terms of Use; throttle requests; cache raw HTML (see the fetch sketch after this list).
- PII: None ingested; low-risk posture.
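A minimal fetch sketch consistent with these rules, using only the standard library; BASE_URL, CRAWL_DELAY_S, and CACHE_DIR are illustrative assumptions, not project constants:

import time
import urllib.robotparser
from pathlib import Path
from urllib.request import Request, urlopen

BASE_URL = "https://www.cmegroup.com"   # public pages in scope (section 1)
USER_AGENT = "ResearchBot/1.0 (+contact@example.com)"
CRAWL_DELAY_S = 5.0                      # assumed throttle; align with robots.txt / ToS
CACHE_DIR = Path("cache/raw_html")       # hypothetical local cache location

robots = urllib.robotparser.RobotFileParser(BASE_URL + "/robots.txt")
robots.read()

def fetch(path: str) -> str:
    """Fetch one public page if robots.txt allows, with throttling and raw-HTML caching."""
    url = BASE_URL + path
    if not robots.can_fetch(USER_AGENT, url):
        raise PermissionError(f"robots.txt disallows {url}")
    cached = CACHE_DIR / (path.strip("/").replace("/", "_") + ".html")
    if cached.exists():                  # reuse cached HTML instead of re-hitting the site
        return cached.read_text(encoding="utf-8")
    time.sleep(CRAWL_DELAY_S)            # fixed-delay throttle between live requests
    with urlopen(Request(url, headers={"User-Agent": USER_AGENT})) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cached.write_text(html, encoding="utf-8")
    return html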
2) Governance & Compliance
Data Contracts
- Source allowlist, crawl cadence, user-agent, throttling.
- Dataset contracts: schema, freshness SLA, retention, lineage (sketched below).
- Quality bars: null-rate thresholds, instrument-code regex checks, timestamp sanity checks.
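One lightweight way to make such a contract executable is a small in-repo registry; the field names and values below are assumptions for illustration:

from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetContract:
    # Illustrative contract shape; names/values are assumptions, not a formal spec.
    dataset: str
    schema_ref: str              # pointer to the canonical schema file
    freshness_sla_hours: int     # max publish-to-mart latency
    retention_days: int
    lineage_fields: tuple = ("source_url", "crawl_ts", "hash_key")

CONTRACTS = {
    "ds_bronze.cme_pages": DatasetContract(
        dataset="ds_bronze.cme_pages",
        schema_ref="schemas/cme_pages.json",  # hypothetical path
        freshness_sla_hours=2,                # matches the SLA in section 8
        retention_days=90,                    # assumed retention window
    ),
}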
Access & Security
- Least-privilege IAM; RW separation by dataset.
- Service accounts per stage; CMEK optional; VPC-SC optional.
- Labels: layer, owner, domain for governance & cost attribution.
3) Target Architecture
[Diagram] Ingestion → Staging (Bronze) → Transform (Silver) → Mart (Gold), coordinated by Orchestration.
Figure 1. End-to-end architecture with governed ingestion, eventing, and layered BigQuery design.
4) Data Model (Bronze → Silver → Gold)
Bronze (Staging)
- Raw extracts as normalized JSON with lineage: source_url, crawl_ts, hash_key.
- Partition by crawl_date; cluster by instrument_code, content_type.
- Raw HTML persisted in GCS with checksums.
Silver (Conformed)
- Entity tables: products, contract_specs, calendar_events, education_articles.
- Type enforcement; dedupe via hash_key; business keys: instrument_code + as_of_date.
Gold (Marts)
- Analytics-ready facts/views: coverage, attribute snapshots, event counts, freshness indicators.
- Materialized views for hot paths; optional row-level security.
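A hedged sketch of one hot-path materialized view, run through the BigQuery Python client; the view name and metric are assumptions (BigQuery restricts which aggregates a materialized view may use):

from google.cloud import bigquery

client = bigquery.Client()  # assumes ambient GCP credentials
client.query(
    """
    -- Hypothetical hot-path MV; name and metric are illustrative.
    CREATE MATERIALIZED VIEW IF NOT EXISTS `proj.ds_gold.mv_daily_coverage` AS
    SELECT instrument_code,
           DATE(as_of_ts) AS as_of_date,
           COUNT(*) AS row_cnt
    FROM `proj.ds_silver.contract_specs`
    GROUP BY instrument_code, DATE(as_of_ts)
    """
).result()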
Example DDL
-- bronze.cme_pages
CREATE TABLE IF NOT EXISTS `proj.ds_bronze.cme_pages` (
  instrument_code STRING,
  content_type    STRING,
  source_url      STRING,
  crawl_ts        TIMESTAMP,
  hash_key        STRING,
  payload_json    JSON
)
PARTITION BY DATE(crawl_ts)
CLUSTER BY instrument_code, content_type;
5) Ingestion & ETL Framework (Batch + Incremental)
Python Extractors (Batch)
from hashlib import sha256
from datetime import datetime, timezone

HEADERS = {"User-Agent": "ResearchBot/1.0 (+contact@example.com)"}

def extract_payload(html: str) -> dict:
    # Parse & normalize metadata: instrument_code, title, attributes...
    return {"content_type": "contract_specs", "instrument_code": None, "attributes": {}}

def make_record(url: str, html: str) -> dict:
    # Hash the first ~1MB of url+html for a stable dedupe key.
    h = sha256((url + html)[:1_000_000].encode()).hexdigest()
    return {
        "source_url": url,
        "crawl_ts": datetime.now(timezone.utc).isoformat(),
        "hash_key": h,
        "payload_json": extract_payload(html),
    }
Incremental Merge (BigQuery)
MERGE `proj.ds_silver.contract_specs` T
USING `proj.ds_bronze.cme_pages` S
ON T.hash_key = S.hash_key
WHEN NOT MATCHED AND S.content_type = 'contract_specs' THEN
  INSERT (instrument_code, as_of_ts, tick_size, contract_unit, hash_key)
  VALUES (
    JSON_VALUE(S.payload_json, '$.instrument_code'),
    S.crawl_ts,
    JSON_VALUE(S.payload_json, '$.attributes.tick_size'),
    JSON_VALUE(S.payload_json, '$.attributes.contract_unit'),
    S.hash_key
  );
Changed content can be handled via SCD-2 in Silver or snapshot logic in Gold.
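A hedged SCD-2 sketch under assumed housekeeping columns valid_from / valid_to / is_current; note that a MERGE source row fires only one WHEN clause, so the replacement row for changed content lands via a second pass (or a union-style source):

from google.cloud import bigquery

SCD2_SQL = """
MERGE `proj.ds_silver.contract_specs` T
USING `proj.ds_bronze.cme_pages` S
ON T.instrument_code = JSON_VALUE(S.payload_json, '$.instrument_code')
   AND T.is_current = TRUE
-- Close out the current row when the page content changed.
WHEN MATCHED AND T.hash_key != S.hash_key THEN
  UPDATE SET is_current = FALSE, valid_to = S.crawl_ts
-- Brand-new instruments get an open-ended current row.
WHEN NOT MATCHED AND S.content_type = 'contract_specs' THEN
  INSERT (instrument_code, as_of_ts, tick_size, contract_unit, hash_key,
          valid_from, valid_to, is_current)
  VALUES (JSON_VALUE(S.payload_json, '$.instrument_code'), S.crawl_ts,
          JSON_VALUE(S.payload_json, '$.attributes.tick_size'),
          JSON_VALUE(S.payload_json, '$.attributes.contract_unit'),
          S.hash_key, S.crawl_ts, TIMESTAMP '9999-12-31', TRUE)
"""

bigquery.Client().query(SCD2_SQL).result()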
Validation Hooks
def validate(record):
    errs = []
    if not record.get("source_url"):
        errs.append("missing source_url")
    if not record.get("hash_key"):
        errs.append("missing hash_key")
    return errs
Storage Layout
gs://<bucket>/cme/raw_html/{date}/<instrument|misc>/<hash>.html
gs://<bucket>/cme/extracted_json/{date}/<instrument|misc>/<hash>.json
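A small helper that mirrors this layout; the bucket name reuses the Terraform bucket from section 9 and is otherwise an assumption:

from datetime import date

BUCKET = "cme-public-raw"  # from the Terraform sketch in section 9

def object_paths(crawl_date: date, instrument: str | None, hash_key: str) -> tuple[str, str]:
    # Pages without a parsed instrument_code fall into the misc/ prefix.
    part = instrument or "misc"
    d = crawl_date.isoformat()
    return (
        f"gs://{BUCKET}/cme/raw_html/{d}/{part}/{hash_key}.html",
        f"gs://{BUCKET}/cme/extracted_json/{d}/{part}/{hash_key}.json",
    )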
6) Data Validation & Contracts
Expectation Suite
expect_table_row_count_to_be_between(min_value=1)
expect_column_values_to_not_be_null(column="source_url")
expect_column_values_to_not_be_null(column="hash_key")
expect_column_values_to_not_be_null(column="crawl_ts")
expect_column_values_to_match_regex(column="instrument_code", regex="^[A-Z0-9_-]{1,20}$")
Lineage & Provenance
- _LINEAGE meta tables: source_url, crawl_ts, code_version, container digest (builder sketch below).
- Data Catalog + policy tags (optional).
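A sketch of assembling one _LINEAGE row; the CONTAINER_DIGEST environment variable is an assumption about how the digest reaches the job:

import os
from datetime import datetime, timezone

def lineage_record(source_url: str, code_version: str) -> dict:
    # code_version would typically be a git SHA baked in at build time.
    return {
        "source_url": source_url,
        "crawl_ts": datetime.now(timezone.utc).isoformat(),
        "code_version": code_version,
        "container_digest": os.environ.get("CONTAINER_DIGEST", "unknown"),
    }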
7) Schema Evolution & Versioning
Pattern
- Bronze: JSON payload absorbs drift.
- Silver: typed columns; additive changes via ALTER TABLE ADD COLUMN.
- Breaking changes: versioned tables (contract_specs_v2).
DDL Examples
ALTER TABLE `proj.ds_silver.contract_specs`
  ADD COLUMN IF NOT EXISTS tick_value NUMERIC;

CREATE OR REPLACE VIEW `proj.ds_gold.fact_products` AS
SELECT
  instrument_code,
  DATE(as_of_ts) AS as_of_date,
  ANY_VALUE(tick_size) AS tick_size,
  ANY_VALUE(contract_unit) AS contract_unit
FROM `proj.ds_silver.contract_specs`
GROUP BY 1, 2;
8) Orchestration (Cloud Composer / Airflow)
DAG Sketch
with DAG("cme_public_pipeline", schedule="@hourly", catchup=False) as dag: start >> extract >> normalize >> load_bronze >> validate >> merge_silver >> build_gold >> end
SLAs & Alerts
- Freshness: ≤ 2h publish→mart.
- Retries: exponential backoff; idempotent tasks.
- Notify: Slack on failure; on-call if SLA breached twice.
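One way to wire the Slack notification, assuming an incoming-webhook URL stored as a secret (the URL below is a placeholder):

import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder; keep in Secret Manager

def notify_slack_on_failure(context):
    # Airflow passes task context to on_failure_callback; post a short summary.
    msg = {"text": (f"DAG {context['dag'].dag_id} failed: "
                    f"task={context['task_instance'].task_id}, run={context['run_id']}")}
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps(msg).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Attach via default_args={"on_failure_callback": notify_slack_on_failure} on the DAG.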
9) Infrastructure as Code (Terraform)
Core Resources
resource "google_bigquery_dataset" "bronze" { dataset_id = "ds_bronze" location = "US" labels = { layer = "bronze", owner = "olu", domain = "futures" } } resource "google_storage_bucket" "raw" { name = "cme-public-raw" location = "US" uniform_bucket_level_access = true }
Composer
resource "google_composer_environment" "main" { name = "composer-cme" region = "us-central1" config { software_config { image_version = "composer-3-airflow-2" } } }
10) Cost & Observability
Optimization
- Partition + cluster Silver/Gold; cache where possible.
- Delta-only processing; compact load files (~64MB).
- Autoscale/slots for predictable cost.
Monitoring
- Dashboards: freshness, row deltas, failure rate, cost/day.
- Audit: validation failures, schema drift, DAG runtimes.
- Correlation IDs: hash_key, dag_run_id.
11) Portfolio Links
- GitHub Pages: daramoladoes.github.io
- Repo: DaramolaDoes/daramoladoes.github.io