Documentation
BqForge is an MCP (Model Context Protocol) server that exposes 80+ curated BigQuery best practices across 12 categories as Tools and Resources. Connect it to Claude, Cursor, or any MCP-compatible client and your AI assistant gains deep, structured knowledge of BigQuery optimization, schema design, cost management, security, partitioning, BI Engine, authorized views, and more — plus 31 live GCP tools for real-time analysis.
MCP Tools
30+ callable tools exposed by the BqForge server — best-practice tools plus optional live GCP tools. Use them individually or let your AI chain them automatically.
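As a sketch of the call shape (the `sql` parameter name and the exact output format are assumptions, mirroring the workflow notation used below):

```
# Review a full statement against every applicable practice
review_query(sql="SELECT * FROM dataset.orders")
→ findings keyed by practice ID (e.g. QO-001 for SELECT *)
```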
Use review_query for complete coverage of a SQL statement in a single call.
Categories: query_optimization, schema_design, cost_management, security, materialized_views, monitoring, data_ingestion, workload_management, partitioning, storage_pricing, authorized_views, scheduled_queries.
Individual practices are addressed by stable IDs such as QO-002, SD-001, SE-003, PT-001.
Live query tools run under a max_bytes_billed safety cap.
Installation
Works with Claude Desktop and any MCP-compatible client. No API keys required.
git clone https://github.com/sreekanth-kc/BqForge
cd BqForge
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
Edit ~/Library/Application Support/Claude/claude_desktop_config.json
{
"mcpServers": {
"bqforge": {
"command": "/absolute/path/to/.venv/bin/python",
"args": ["/absolute/path/to/BqForge/server.py"]
}
}
}
Restart Claude Desktop after saving.
# Type this to activate automatic BqForge lookups
use bqforge
Recommended Workflow
Two-step pattern for best results.
# Step 1: resolve topic to ranked practice IDs
resolve_topic(query="reduce query cost")
→ [CO-001 (score 9), QO-001 (score 7), CO-002 (score 6), …]

# Step 2: fetch focused content within token budget
get_practices(topic="reduce query cost", max_tokens=3000)
→ Markdown with top practices, do/don't lists, SQL examples
max_tokens=1500: Quick answer
max_tokens=3000: Normal (default)
max_tokens=6000: Deep dive
Query Optimization
Best practices for writing efficient, performant BigQuery SQL that minimises bytes processed and uses slots effectively.
Select only the columns you need
BigQuery is columnar; SELECT * reads every column and bills you for all bytes in those columns. Always project only the fields required.
SELECT order_id, total_amount FROM orders WHERE status = 'COMPLETED'
Filter on partition and cluster columns early
Partition pruning eliminates entire partitions from the scan. Clustering reduces blocks read within a partition. Always filter on the partition column (usually a DATE/TIMESTAMP) and cluster keys.
SELECT *
FROM `project.dataset.events`
WHERE event_date = '2024-06-15'  -- partition filter
  AND country = 'IN'             -- cluster filter
Avoid data skew in JOINs and GROUP BY
When a join key or GROUP BY key has very uneven distribution, one worker slot gets overloaded. Pre-filter nulls and consider approximate aggregations.
SELECT a.user_id, SUM(b.amount)
FROM users a
JOIN transactions b ON a.user_id = b.user_id
WHERE a.user_id IS NOT NULL
GROUP BY a.user_id
Use CTEs instead of deeply nested subqueries
Common Table Expressions (WITH clauses) improve readability and allow BigQuery's optimiser to materialise intermediate results.
WITH orders_2024 AS (
  SELECT order_id, customer_id, amount
  FROM orders
  WHERE order_date >= '2024-01-01'
),
high_value AS (
  SELECT * FROM orders_2024 WHERE amount > 1000
)
SELECT customer_id, COUNT(*) AS big_orders
FROM high_value
GROUP BY customer_id
Prefer JOIN over correlated subqueries
Correlated subqueries execute once per row of the outer query. A JOIN is processed in parallel across slots and is almost always faster.
-- Instead of a correlated subquery
SELECT o.order_id
FROM orders o
LEFT JOIN refunds r ON o.order_id = r.order_id
WHERE r.order_id IS NULL  -- orders with no refund
Use approximate aggregation functions for analytics
For dashboards where ~1% error is acceptable, APPROX_COUNT_DISTINCT and HyperLogLog sketches are orders of magnitude faster than exact COUNT(DISTINCT).
SELECT event_date, APPROX_COUNT_DISTINCT(user_id) AS dau
FROM events
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY event_date
ORDER BY event_date
Optimise JOIN order — largest table first
BigQuery's distributed JOIN engine broadcasts smaller tables to all workers. Place the largest table first (left side) to enable broadcast joins, reducing shuffle overhead.
-- Largest table (orders) first, smaller lookup (products) last
SELECT o.order_id, p.product_name, o.amount
FROM dataset.orders o    -- large
JOIN dataset.products p  -- small lookup
  ON o.product_id = p.product_id
WHERE o.order_date >= '2024-01-01'
Replace self-joins with window functions
Self-joins scan the same table twice and can square the output row count. Window functions (LAG, LEAD, SUM OVER, RANK) achieve running totals, comparisons, and rankings in a single pass.
-- Running revenue per user: window function, no self-join
SELECT user_id, order_date, amount,
  SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS running_total
FROM dataset.orders
Use SEARCH() and search indexes for full-text lookups
The BigQuery SEARCH() function with a search index performs inverted-index lookups, dramatically outperforming LIKE '%term%' on large tables (>10 GB).
CREATE SEARCH INDEX log_idx ON dataset.app_logs(log_message);

SELECT request_id, log_message
FROM dataset.app_logs
WHERE log_date = CURRENT_DATE()
  AND SEARCH(log_message, 'NullPointerException')
Optimise UNNEST patterns for ARRAY and STRUCT columns
UNNEST explodes array elements into rows. Filter before unnesting and avoid unnesting multiple independent arrays in the same query (produces a cross-product).
-- Check array contains value WITHOUT full unnest
SELECT user_id
FROM dataset.users
WHERE EXISTS (
  SELECT 1 FROM UNNEST(interest_tags) AS tag WHERE tag = 'sports'
)

-- Access nested struct field directly (no UNNEST needed)
SELECT address.city FROM dataset.users
Use TABLESAMPLE for fast exploratory queries
TABLESAMPLE SYSTEM draws a random percentage of data blocks, reducing bytes scanned proportionally. Ideal for validating query logic before running full scans.
SELECT status, COUNT(*) AS cnt
FROM dataset.orders TABLESAMPLE SYSTEM (1 PERCENT)
GROUP BY status
ORDER BY cnt DESC
Use granular wildcard table prefixes
Wildcard tables scan every matching table. The more specific the prefix, the fewer tables scanned. Always add a _TABLE_SUFFIX filter.
SELECT event_type, COUNT(*) AS cnt
FROM `project.dataset.events_2024*`
WHERE _TABLE_SUFFIX BETWEEN '0101' AND '0131'  -- Jan 2024 only
GROUP BY event_type
Schema Design
Best practices for designing BigQuery table schemas that balance storage cost, query performance, and maintainability.
Partition tables on a date/timestamp column
Partitioning physically separates data by time unit. Queries that filter on the partition column only scan relevant partitions, dramatically reducing cost.
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
OPTIONS (partition_expiration_days = 730)
AS SELECT * FROM staging.raw_events WHERE FALSE
Cluster tables on frequently filtered/joined columns
Clustering sorts data within each partition by up to 4 columns. Queries that filter or join on cluster keys skip irrelevant blocks at no extra cost.
CREATE TABLE sales
PARTITION BY DATE(sale_date)
CLUSTER BY region, product_category, customer_id
AS SELECT * FROM staging.sales
Use STRUCT and ARRAY to denormalise related data
BigQuery is optimised for denormalised schemas. Storing nested/repeated fields as STRUCT or ARRAY avoids expensive JOINs and keeps related data co-located.
CREATE TABLE users (
  user_id STRING,
  name STRING,
  address STRUCT<street STRING, city STRING, country STRING>,
  tags ARRAY<STRING>
)
Choose the right data types
Correct data types reduce storage, speed up comparisons, and avoid implicit casts that can break partition pruning.
CREATE TABLE transactions (
  transaction_id INT64,
  amount NUMERIC(18, 2),
  created_at TIMESTAMP,
  is_refunded BOOL
)
Avoid wide tables with hundreds of columns
Tables with hundreds of sparse columns are hard to maintain. Use STRUCT grouping or split into logical sub-tables.
-- Instead of 50 separate metric columns:
metrics STRUCT<
  clicks INT64,
  impressions INT64,
  conversions INT64,
  revenue NUMERIC
>
Prefer time-partitioned tables over date-sharded tables
Date-sharded tables (events_20240101, events_20240102) require wildcard queries and increase management overhead. A single time-partitioned table gives the same pruning with better optimiser support.
CREATE OR REPLACE TABLE dataset.events
PARTITION BY event_date
CLUSTER BY user_id
OPTIONS (partition_expiration_days = 365)
AS SELECT * FROM `dataset.events_*`
WHERE _TABLE_SUFFIX >= '20230101'
Use BigQuery native JSON type for semi-structured data
The native JSON type stores data in a lossless binary encoding and supports path-based access without string parsing. It outperforms JSON stored as STRING.
CREATE TABLE dataset.events (
  event_id STRING,
  event_date DATE,
  event_type STRING,
  properties JSON  -- dynamic per-event fields
);

SELECT event_id, properties.user_agent
FROM dataset.events
WHERE JSON_VALUE(properties, '$.country') = 'IN'
Choose physical storage billing for compressed savings
For datasets with high compression ratios, physical billing can cut storage costs by 30–70%. Evaluate on a dataset-by-dataset basis.
-- Compare before switching
SELECT table_name,
  total_logical_bytes / 1e9 AS logical_gb,
  active_physical_bytes / 1e9 AS physical_gb
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = 'my_dataset'
Cost Management
Best practices to monitor, control, and reduce BigQuery spend without sacrificing analytical capability.
Always preview query cost before running
The BigQuery console dry-run and the --dry_run flag return bytes billed without executing the query. Build this check into CI pipelines and analyst workflows.
from google.cloud import bigquery
client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=True)
job = client.query('SELECT * FROM dataset.table LIMIT 100', job_config=job_config)
print(f'Bytes processed: {job.total_bytes_processed / 1e9:.2f} GB')
Leverage query result caching
BigQuery caches query results for 24 hours. Identical queries are free. Avoid defeating the cache with non-deterministic functions like NOW() or RAND().
-- Cache-friendly: date is a parameter, not NOW()
DECLARE report_date DATE DEFAULT '2024-06-01';
SELECT COUNT(*) FROM events WHERE event_date = report_date
Use materialised views and scheduled queries
Pre-aggregate expensive computations into materialised views. Dashboards query the small summary table instead of scanning the raw fact table.
CREATE MATERIALIZED VIEW dataset.daily_revenue
OPTIONS (enable_refresh = true, refresh_interval_minutes = 60)
AS SELECT DATE(order_date) AS day, SUM(amount) AS revenue
FROM dataset.orders
GROUP BY day
Set budgets, quotas, and cost alerts
Use Google Cloud Budget Alerts and BigQuery custom quotas to prevent runaway queries. Per-project and per-user daily byte limits act as a safety net.
# Set a hard 10 GB per-query limit
job_config = bigquery.QueryJobConfig(
maximum_bytes_billed=10 * 1024**3 # 10 GB
)
Use long-term storage pricing and table expiration
Tables not modified for 90+ days drop to long-term storage pricing (~50% cheaper). Set expiration on staging and temporary tables.
CREATE TABLE dataset.temp_staging
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
)
AS SELECT ...
Use BigQuery Editions for predictable workloads
On-demand pricing bills per TB scanned — unpredictable for heavy pipelines. BigQuery Editions provide reserved slots with autoscaling and commitments for significant discounts.
-- Daily slot-hours consumed, to estimate reservation size
SELECT DATE(creation_time) AS day,
  SUM(total_slot_ms) / (1000 * 60 * 60) AS slot_hours
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY day
ORDER BY slot_hours DESC
Avoid repeated table overwrites to reduce time-travel storage
Every WRITE_TRUNCATE retains the previous snapshot in time-travel storage (up to 7 days). Use WRITE_APPEND + MERGE or reduce the time-travel window for high-churn tables.
ALTER TABLE dataset.daily_staging SET OPTIONS(max_time_travel_hours = 48);
Preview data without running queries
Use the BigQuery console Preview tab, bq head, or the tabledata.list API — none trigger query billing.
# Free row preview via CLI (no query charge)
bq head --max_rows=20 project:dataset.tablename
Materialise intermediate stages to reduce repeated scanning
Break ETL into explicit destination table stages. Each intermediate table is written once and queried from its smaller result.
-- Stage 1: filter once
CREATE OR REPLACE TABLE dataset.orders_2024
OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 DAY))
AS SELECT * FROM dataset.raw_orders
WHERE order_date >= '2024-01-01';

-- Stage 2: aggregate from the small table (cheap)
SELECT region, SUM(amount) FROM dataset.orders_2024 GROUP BY region
Security & Access Control
Best practices to secure BigQuery data, control access at the right granularity, and maintain compliance with data-governance policies.
Apply principle of least privilege with IAM
Grant the minimum BigQuery IAM role required for each principal. Prefer predefined roles over primitive roles and grant at the dataset or table level, not the project level.
# Grant dataViewer on a single table, not the whole project
bq add-iam-policy-binding \
  --member='serviceAccount:pipeline@project.iam.gserviceaccount.com' \
  --role='roles/bigquery.dataViewer' \
  project:dataset.table
Use column-level and row-level security
Column-level security (Policy Tags) restricts access to PII columns. Row-level security (Row Access Policies) filters rows based on the querying user's attributes.
CREATE ROW ACCESS POLICY region_filter
ON dataset.sales
GRANT TO ('group:apac-sales@company.com')
FILTER USING (region = 'APAC')
Encrypt sensitive data with Cloud KMS CMEK
For regulatory requirements (HIPAA, PCI-DSS) use Customer-Managed Encryption Keys via Cloud KMS to retain key control.
CREATE TABLE dataset.phi_records
OPTIONS (
  kms_key_name = 'projects/proj/locations/us/keyRings/ring/cryptoKeys/bq-key'
)
AS SELECT * FROM staging.phi_staging
Audit and monitor data access with Cloud Audit Logs
Enable Data Access audit logs (DATA_READ, DATA_WRITE). Export logs to BigQuery for anomaly detection and compliance reporting.
-- Find top data consumers in the last 7 days
SELECT user_email, SUM(total_bytes_billed) / 1e12 AS tb_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY user_email
ORDER BY tb_billed DESC
LIMIT 20
Use VPC Service Controls and authorised views
VPC Service Controls create a security perimeter around BigQuery. Authorised views let you share query results without exposing the underlying table.
CREATE VIEW analytics.users_masked AS
SELECT user_id,
  CONCAT(LEFT(email, 2), '****@', SPLIT(email, '@')[OFFSET(1)]) AS email_masked,
  country, created_date
FROM pii_dataset.users
Apply dynamic data masking to sensitive columns
BigQuery dynamic data masking applies masking rules automatically based on the querying user's role. Analysts see masked values; authorised roles see the original data.
-- Masking rules (e.g. SHA256 hash, nullify, default masking value) are
-- created as data policies attached to policy tags via the console or the
-- BigQuery Data Policy API; there is no CREATE MASKING POLICY SQL statement.
Use Workload Identity Federation instead of service account keys
SA JSON key files can be leaked and are hard to rotate. Workload Identity Federation lets external workloads authenticate without a long-lived credential.
# GitHub Actions: authenticate via Workload Identity (no key file)
- uses: google-github-actions/auth@v2
with:
workload_identity_provider: 'projects/123/locations/global/workloadIdentityPools/pool/providers/github'
service_account: 'bq-pipeline@project.iam.gserviceaccount.com'
Enable and govern data lineage with Dataplex
Dataplex automatically tracks table-level and column-level data lineage across BigQuery jobs. Essential for impact analysis and GDPR right-to-erasure.
gcloud services enable datalineage.googleapis.com --project=PROJECT_ID
Materialized Views
Best practices for creating and managing BigQuery materialized views to pre-aggregate expensive computations, reduce query costs, and accelerate BI dashboards.
Use incremental materialized views for near-real-time aggregations
Incremental MVs only process new data since the last refresh rather than recomputing the entire result. Combined with max_staleness, they deliver sub-minute freshness at a fraction of the cost.
CREATE MATERIALIZED VIEW dataset.hourly_revenue
OPTIONS (
  enable_refresh = true,
  refresh_interval_minutes = 60,
  max_staleness = INTERVAL '90' MINUTE
)
AS SELECT TIMESTAMP_TRUNC(order_ts, HOUR) AS hour,
  region,
  SUM(amount) AS revenue,
  COUNT(*) AS order_count
FROM dataset.orders
GROUP BY 1, 2
Tune max_staleness to balance freshness vs cost
When a query hits a table with an associated MV, BigQuery serves the MV result if it is within max_staleness. Too tight forces full scans; too loose serves stale data.
ALTER MATERIALIZED VIEW dataset.live_orders SET OPTIONS(max_staleness = INTERVAL '15' MINUTE)
Avoid non-deterministic functions in MV definitions
Functions like CURRENT_TIMESTAMP(), RAND(), or GENERATE_UUID() are evaluated at refresh time, not query time — causing subtle correctness bugs.
-- GOOD: derive snapshot from the data itself
CREATE MATERIALIZED VIEW dataset.orders_summary AS
SELECT DATE(order_ts) AS order_date, SUM(amount) AS daily_revenue
FROM dataset.orders
GROUP BY 1
Align MV partition and cluster with query patterns
Choose cluster columns that match the filters your BI tool or downstream queries use most often.
CREATE MATERIALIZED VIEW dataset.sales_by_region
CLUSTER BY region, product_id
AS SELECT sale_date, region, product_id, SUM(revenue) AS revenue
FROM dataset.sales
GROUP BY 1, 2, 3
Monitor MV refresh cost and staleness via INFORMATION_SCHEMA
MV refreshes consume slot-time and are billed like regular queries. Track refresh frequency and last_refresh_time to catch runaway MVs.
SELECT table_schema, table_name, last_refresh_time,
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), last_refresh_time, MINUTE) AS stale_minutes
FROM `region-us`.INFORMATION_SCHEMA.MATERIALIZED_VIEWS_BY_PROJECT
ORDER BY stale_minutes DESC
Use BI Engine for sub-second dashboard query acceleration
BigQuery BI Engine is an in-memory analysis service that caches frequently-queried data and accelerates SQL workloads from Looker Studio and Looker.
-- Create a 10 GB BI Engine reservation in us-central1
-- (done via Console: BigQuery → BI Engine → Create reservation)
Use MERGE for incremental MV-like updates on unsupported patterns
When your aggregation logic exceeds native MV capabilities, implement an incremental MERGE pattern into a destination table, processing only new partitions each run.
MERGE dataset.daily_summary AS T
USING (
  SELECT DATE(order_ts) AS day, region, SUM(amount) AS revenue
  FROM dataset.orders
  WHERE DATE(order_ts) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  GROUP BY 1, 2
) AS S
ON T.day = S.day AND T.region = S.region
WHEN MATCHED THEN
  UPDATE SET revenue = S.revenue
WHEN NOT MATCHED THEN
  INSERT (day, region, revenue) VALUES (S.day, S.region, S.revenue)
Monitoring & Observability
Best practices for observing BigQuery job health, tracking cost attribution, detecting anomalies, and building operational dashboards.
Use INFORMATION_SCHEMA.JOBS as your primary audit trail
JOBS_BY_PROJECT records every query, load, extract, and copy job. First place to look for slow queries, expensive scans, failed jobs, and quota violations.
-- Top 10 most expensive queries in the last 24 hours
SELECT user_email, query,
  ROUND(total_bytes_billed / 1e12, 4) AS tb_billed,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_s
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY tb_billed DESC
LIMIT 10
Set up slot utilisation alerts in Cloud Monitoring
BigQuery exposes slot utilisation metrics via Cloud Monitoring. For reserved capacity, sustained high utilisation means queries are queuing.
gcloud alpha monitoring policies create \
  --display-name='BQ High Slot Utilisation' \
  --condition-filter='metric.type="bigquery.googleapis.com/slots/allocated_for_project"' \
  --condition-threshold-value=800 \
  --condition-threshold-duration=300s
Attribute cost by team and workload with labels
BigQuery job labels propagate to Cloud Billing exports, enabling per-team and per-pipeline cost breakdown.
job_config = bigquery.QueryJobConfig(
labels={
'team': 'analytics',
'pipeline': 'user-metrics',
'environment': 'production',
}
)
Monitor and alert on query errors and failed jobs
Failed BigQuery jobs are silent by default unless you explicitly monitor error_result in INFORMATION_SCHEMA.
-- Find failed jobs in the last hour
SELECT job_id, user_email, error_result.reason, error_result.message
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  AND error_result IS NOT NULL
ORDER BY creation_time DESC
Track query performance trends over time
Track p50/p95 query duration and bytes_billed per query signature over time to detect regressions.
-- Weekly p95 slot-time trend
SELECT TIMESTAMP_TRUNC(creation_time, WEEK) AS week,
  statement_type,
  APPROX_QUANTILES(total_slot_ms, 100)[OFFSET(95)] / 1000 AS p95_slot_s
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
GROUP BY 1, 2
ORDER BY 1 DESC, 3 DESC
Audit storage costs with INFORMATION_SCHEMA.TABLE_STORAGE
TABLE_STORAGE gives a per-table breakdown of active, long-term, time-travel, and fail-safe physical bytes.
SELECT table_name,
  ROUND(total_logical_bytes / 1e9, 2) AS logical_gb,
  ROUND(time_travel_physical_bytes / 1e9, 2) AS time_travel_gb
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
ORDER BY total_logical_bytes DESC
LIMIT 20
Monitor reservation slot utilisation and queue wait times
For capacity-based BigQuery editions, sustained high slot utilisation causes queries to queue. Track queue wait via INFORMATION_SCHEMA.JOBS.
-- Detect queue wait time by reservation
SELECT reservation_id, DATE(creation_time) AS day,
APPROX_QUANTILES(
TIMESTAMP_DIFF(start_time, creation_time, SECOND), 100
)[OFFSET(95)] AS p95_queue_s
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY 1, 2 ORDER BY p95_queue_s DESC
Data Ingestion
Best practices for loading data into BigQuery efficiently and reliably, covering batch loads, streaming, file formats, deduplication, and DML patterns.
Use Storage Write API instead of legacy streaming inserts
The legacy insertAll streaming API charges per row and has lower throughput limits. The Storage Write API supports exactly-once delivery and is ~10× cheaper for high-volume writes.
from google.cloud.bigquery_storage_v1 import BigQueryWriteClient, types
client = BigQueryWriteClient()
parent = client.table_path('project', 'dataset', 'table')
write_stream = client.create_write_stream(
parent=parent,
write_stream=types.WriteStream(type_=types.WriteStream.Type.COMMITTED)
)
Use columnar file formats for batch loads
BigQuery natively reads Parquet, Avro, and ORC. These formats are compressed, self-describing, and load faster while consuming fewer slots than CSV/JSON.
bq load \
  --source_format=PARQUET \
  --parquet_enable_list_inference=true \
  project:dataset.table \
  'gs://bucket/data/events/dt=2024-06-01/*.parquet'
Batch DML operations — avoid single-row mutations
BigQuery DML triggers background compaction after each statement. Issuing thousands of single-row DML statements generates excessive mutation metadata and can exhaust quotas.
-- GOOD: stage all rows then bulk-insert
INSERT INTO dataset.target_table (id, value, updated_at)
SELECT id, value, updated_at
FROM dataset.staging_table
WHERE load_date = CURRENT_DATE()
Use MERGE for upsert patterns instead of DELETE + INSERT
A MERGE statement performs both update and insert in a single atomic pass and is easier to make idempotent than DELETE + INSERT.
MERGE dataset.customers AS T
USING (
  SELECT customer_id, name, email, updated_at
  FROM dataset.customers_staging
  WHERE DATE(updated_at) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
) AS S
ON T.customer_id = S.customer_id
WHEN MATCHED AND S.updated_at > T.updated_at THEN
  UPDATE SET name = S.name, email = S.email, updated_at = S.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, updated_at)
  VALUES (S.customer_id, S.name, S.email, S.updated_at)
Implement idempotent deduplication for streaming pipelines
Streaming pipelines can deliver duplicate records during retries. Use Storage Write API COMMITTED streams for exactly-once semantics, or implement downstream deduplication with ROW_NUMBER().
CREATE OR REPLACE TABLE dataset.events_deduped AS
SELECT * EXCEPT(row_num) FROM (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY event_id
ORDER BY ingestion_timestamp DESC
) AS row_num
FROM dataset.events_raw
)
WHERE row_num = 1
Validate and enforce schema on load
Use autodetect only for exploration; enforce explicit schemas in production. Enable schema update options only intentionally.
schema = [
bigquery.SchemaField('event_id', 'STRING', mode='REQUIRED'),
bigquery.SchemaField('event_ts', 'TIMESTAMP', mode='REQUIRED'),
bigquery.SchemaField('amount', 'NUMERIC', mode='NULLABLE'),
]
job_config = bigquery.LoadJobConfig(
schema=schema,
source_format=bigquery.SourceFormat.PARQUET,
)
Workload Management
Best practices for managing BigQuery slot capacity, reservations, autoscaling, job concurrency, and workload isolation.
Isolate workloads with separate reservations
Sharing a single slot pool across interactive queries, batch ETL, and development workloads causes priority inversion. Use separate reservations to guarantee isolation.
bq mk --reservation --location=US --slots=500 prod-interactive
bq mk --reservation --location=US --slots=200 prod-batch
bq mk --reservation --location=US --slots=50 dev
Configure autoscaling baselines and max slots correctly
Setting baseline too high wastes money; setting max too low throttles queries during peaks. Use INFORMATION_SCHEMA history to size baselines.
-- Estimate p50 and p95 slot consumption
SELECT reservation_id,
  APPROX_QUANTILES(total_slot_ms / 1000, 100)[OFFSET(50)] AS p50_slot_s,
  APPROX_QUANTILES(total_slot_ms / 1000, 100)[OFFSET(95)] AS p95_slot_s
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY reservation_id
Set job timeouts and query priorities to prevent runaway jobs
Set job_timeout_ms to cancel queries exceeding expected duration, and use BATCH priority for non-urgent ETL to allow BigQuery to schedule opportunistically.
job_config = bigquery.QueryJobConfig(
job_timeout_ms=600_000, # Cancel after 10 minutes
maximum_bytes_billed=50 * 10**9,  # Hard 50 GB cost cap (must be an int)
priority=bigquery.QueryPriority.BATCH,
labels={'team': 'data-eng', 'pipeline': 'nightly-etl'},
)
Use capacity commitments for predictable cost and performance
Annual and 3-year slot commitments offer significant discounts (up to 25–50%). Use FLEX commitments for burst windows; annual commitments for the stable baseline.
-- View current capacity commitments
SELECT commitment_id, slot_count, plan, state,
  commitment_start_time, commitment_end_time
FROM `region-us`.INFORMATION_SCHEMA.CAPACITY_COMMITMENTS
ORDER BY commitment_start_time DESC
Label all jobs for cost attribution and workload analysis
Job labels are the only reliable way to attribute BigQuery costs to teams and pipelines in the Cloud Billing export.
-- Find top cost by team label over the last 30 days
-- (labels is an ARRAY<STRUCT<key, value>> in the JOBS views)
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'team') AS team,
  ROUND(SUM(total_bytes_billed) / 1e12, 3) AS tb_billed,
  COUNT(*) AS job_count
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY team
ORDER BY tb_billed DESC
Partitioning
Advanced partitioning strategies to minimise bytes scanned, reduce costs, and simplify data lifecycle management in BigQuery.
Always filter on the partition column
Queries that don't filter on the partition column perform full table scans, bypassing pruning entirely. Always include a partition filter in WHERE clauses on partitioned tables.
-- GOOD: partition filter enables pruning
SELECT * FROM orders WHERE DATE(created_at) = '2024-01-01'

-- BAD: no partition filter, full table scan
SELECT * FROM orders WHERE order_status = 'OPEN'
Use partition expiration to auto-delete old data
Partition expiration automatically removes data older than N days, reducing active storage costs without manual cleanup scripts.
CREATE TABLE sales.events
PARTITION BY DATE(event_time)
OPTIONS (partition_expiration_days = 90)
AS SELECT * FROM staging.events WHERE FALSE
Choose the right partition granularity
Daily partitioning suits most OLAP workloads. Hourly partitioning is appropriate only for very high-volume streaming tables where hour-level pruning provides meaningful savings.
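As a sketch (table and column names assumed), an hourly-partitioned table for a high-volume stream might look like:

```sql
-- Hypothetical hourly-partitioned stream table; the 30-day expiry keeps the
-- partition count (24 x 30 = 720) well under BigQuery's per-table limit
CREATE TABLE dataset.clickstream
PARTITION BY TIMESTAMP_TRUNC(event_ts, HOUR)
OPTIONS (partition_expiration_days = 30)
AS SELECT * FROM staging.clickstream WHERE FALSE
```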
Use integer-range partitioning for non-time keys
When the natural query filter is an integer ID (e.g. customer_id, shard_id), use integer-range partitioning instead of forcing a timestamp column.
-- Illustrative schema; buckets of 100000 IDs over the range 0 to 10M
CREATE TABLE my_dataset.customers (
  customer_id INT64,
  name STRING
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 10000000, 100000));
Storage Pricing
Practices to minimise BigQuery storage costs through intelligent use of long-term pricing, physical billing, table expiry, and data lifecycle policies.
Let tables age into long-term storage pricing
Tables and partitions that haven't been modified for 90+ consecutive days are automatically billed at the long-term rate (~50% cheaper). Avoid unnecessary table writes that reset this 90-day clock.
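One way to see how much of a dataset already bills at the long-term rate is the TABLE_STORAGE view (region and dataset names here are placeholders):

```sql
-- Split each table's logical bytes into long-term vs active storage
SELECT table_name,
  ROUND(long_term_logical_bytes / 1e9, 2) AS long_term_gb,
  ROUND(active_logical_bytes / 1e9, 2) AS active_gb
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = 'my_dataset'
ORDER BY long_term_gb DESC
```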
Enable physical storage billing for compressed savings
By default, BigQuery bills for logical (uncompressed) storage. Switching to physical billing charges for actual compressed bytes on disk — often 30–70% less for columnar data.
-- storage_billing_model is a dataset-level option, set via ALTER SCHEMA
ALTER SCHEMA mydataset SET OPTIONS (storage_billing_model = 'PHYSICAL')
Set table and dataset expiration policies
Temporary and staging tables that are never cleaned up are a silent cost driver. Use default table expiration at the dataset level to auto-delete tables after a defined period.
# Set dataset-level default table expiration (7 days)
bq update --default_table_expiration 604800 myproject:staging
Scheduled Queries
Best practices for building reliable, cost-efficient scheduled query pipelines using BigQuery Data Transfer Service.
Use incremental INSERT instead of full table overwrites
Full WRITE_TRUNCATE on large tables re-scans all source data on every run. Incremental patterns based on a run-time parameter only process new data, dramatically reducing cost and latency.
-- Use the @run_time and @run_date built-in parameters
INSERT INTO reporting.daily_summary
SELECT @run_date AS report_date, user_id, SUM(revenue) AS revenue
FROM events.transactions
WHERE DATE(event_time) = @run_date
GROUP BY 1, 2
Set maximum_bytes_billed on all scheduled queries
A logic error in a scheduled query can trigger a full scan of a petabyte table and generate a massive bill. A bytes-billed cap fails the job safely rather than processing runaway data.
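The Data Transfer Service does not expose this cap directly, so one approach is to launch the recurring query from your own scheduler and derive the cap from typical usage. A minimal sketch, assuming a headroom factor of 3x; the helper name and constants are illustrative, not part of BqForge or the BigQuery API:

```python
# Hedged sketch: derive a maximum_bytes_billed value for a recurring query
# from its typical scan size, with a headroom factor and a 1 GiB floor.
GB = 1024 ** 3

def bytes_cap(typical_bytes: int, headroom: float = 3.0, floor: int = GB) -> int:
    """Cap at `headroom` times the typical scan, never below the floor."""
    return max(int(typical_bytes * headroom), floor)

# A query that normally scans 12 GiB gets a 36 GiB cap; pass the result to
# google.cloud.bigquery.QueryJobConfig(maximum_bytes_billed=...) when the
# job is launched by your own scheduler instead of the Transfer Service.
cap = bytes_cap(12 * GB)
```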
Schedule during off-peak hours to utilise flex slots
Batch and scheduled queries run with BATCH priority can be queued during off-peak hours when slot capacity is available. This is especially effective with FLEX commitment slots, reducing on-demand costs.
-- Configure via Transfer Service API or Cloud Console
-- Set schedule: "every 24 hours" starting at 02:00 UTC
-- Priority: BATCH (enables opportunistic scheduling)
Use Dataform assertions to validate output data
Pipeline outputs should be validated after each run. Dataform assertions run automatically and block downstream dependencies if data quality fails.
-- Dataform SQLX assertion (fails when the query returns any rows):
config {
  type: 'assertion',
  description: 'daily_summary must have rows for every recent day'
}
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY(
  DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY),
  DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
)) AS day
LEFT JOIN ${ref('daily_summary')} t ON t.date = day
WHERE t.date IS NULL  -- any missing day surfaces as a row and fails the run
Stagger scheduled query start times to avoid slot contention
When many scheduled queries start at exactly the same time (e.g. midnight), they compete for slots and all run slower. Stagger start times by 5–15 minutes.
# Cron schedules spread across the hour:
#   Critical pipeline:    0 0 * * *   (00:00)
#   Secondary pipeline:   5 0 * * *   (00:05)
#   Reporting rollup:    15 0 * * *   (00:15)
#   Archive job:         30 0 * * *   (00:30)