Documentation
BqForge is an MCP (Model Context Protocol) server that exposes 80+ curated BigQuery best practices across 12 categories as Tools and Resources. Connect it to Claude, Cursor, or any MCP-compatible client and your AI assistant gains deep, structured knowledge of BigQuery optimization, schema design, cost management, security, partitioning, BI Engine, authorized views, and more — plus 31 live GCP tools for real-time analysis.
MCP Tools
30+ callable tools exposed by the BqForge server — best-practice tools plus optional live GCP tools. Use them individually or let your AI chain them automatically.
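As a sketch of the call shape (the `sql` parameter name and the exact output format are assumptions, mirroring the workflow notation used below):

```
# Review a full statement against every applicable practice
review_query(sql="SELECT * FROM dataset.orders")
→ findings keyed by practice ID (e.g. QO-001 for SELECT *)
```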
Use review_query for complete coverage of a SQL statement in a single call.
Categories: query_optimization, schema_design, cost_management, security, materialized_views, monitoring, data_ingestion, workload_management, partitioning, storage_pricing, authorized_views, scheduled_queries.
Individual practices are addressed by stable IDs such as QO-002, SD-001, SE-003, PT-001.
Live query tools run under a max_bytes_billed safety cap.
Installation
Works with Claude Desktop and any MCP-compatible client. No API keys required.
git clone https://github.com/sreekanth-kc/BqForge
cd BqForge
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
Edit ~/Library/Application Support/Claude/claude_desktop_config.json
{
"mcpServers": {
"bqforge": {
"command": "/absolute/path/to/.venv/bin/python",
"args": ["/absolute/path/to/BqForge/server.py"]
}
}
}
Restart Claude Desktop after saving.
# Type this to activate automatic BqForge lookups
use bqforge
Recommended Workflow
Two-step pattern for best results.
# Step 1: resolve topic to ranked practice IDs
resolve_topic(query="reduce query cost")
→ [CO-001 (score 9), QO-001 (score 7), CO-002 (score 6), …]

# Step 2: fetch focused content within token budget
get_practices(topic="reduce query cost", max_tokens=3000)
→ Markdown with top practices, do/don't lists, SQL examples
max_tokens=1500: Quick answer
max_tokens=3000: Normal (default)
max_tokens=6000: Deep dive
Query Optimization
Best practices for writing efficient, performant BigQuery SQL that minimises bytes processed and uses slots effectively.
Select only the columns you need
BigQuery is columnar; SELECT * reads every column and bills you for all bytes in those columns. Always project only the fields required.
SELECT order_id, total_amount FROM orders WHERE status = 'COMPLETED'
Filter on partition and cluster columns early
Partition pruning eliminates entire partitions from the scan. Clustering reduces blocks read within a partition. Always filter on the partition column (usually a DATE/TIMESTAMP) and cluster keys.
SELECT *
FROM `project.dataset.events`
WHERE event_date = '2024-06-15'  -- partition filter
  AND country = 'IN'             -- cluster filter
Avoid data skew in JOINs and GROUP BY
When a join key or GROUP BY key has very uneven distribution, one worker slot gets overloaded. Pre-filter nulls and consider approximate aggregations.
SELECT a.user_id, SUM(b.amount)
FROM users a
JOIN transactions b ON a.user_id = b.user_id
WHERE a.user_id IS NOT NULL
GROUP BY a.user_id
Use CTEs instead of deeply nested subqueries
Common Table Expressions (WITH clauses) improve readability and allow BigQuery's optimiser to materialise intermediate results.
WITH orders_2024 AS (
  SELECT order_id, customer_id, amount
  FROM orders
  WHERE order_date >= '2024-01-01'
),
high_value AS (
  SELECT * FROM orders_2024 WHERE amount > 1000
)
SELECT customer_id, COUNT(*) AS big_orders
FROM high_value
GROUP BY customer_id
Prefer JOIN over correlated subqueries
Correlated subqueries execute once per row of the outer query. A JOIN is processed in parallel across slots and is almost always faster.
-- Instead of a correlated subquery
SELECT o.order_id
FROM orders o
LEFT JOIN refunds r ON o.order_id = r.order_id
WHERE r.order_id IS NULL  -- orders with no refund
Use approximate aggregation functions for analytics
For dashboards where ~1% error is acceptable, APPROX_COUNT_DISTINCT and HyperLogLog sketches are orders of magnitude faster than exact COUNT(DISTINCT).
SELECT event_date, APPROX_COUNT_DISTINCT(user_id) AS dau
FROM events
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY event_date
ORDER BY event_date
Optimise JOIN order — largest table first
BigQuery's distributed JOIN engine broadcasts smaller tables to all workers. Place the largest table first (left side) to enable broadcast joins, reducing shuffle overhead.
-- Largest table (orders) first, smaller lookup (products) last
SELECT o.order_id, p.product_name, o.amount
FROM dataset.orders o    -- large
JOIN dataset.products p  -- small lookup
  ON o.product_id = p.product_id
WHERE o.order_date >= '2024-01-01'
Replace self-joins with window functions
Self-joins scan the same table twice and can square the output row count. Window functions (LAG, LEAD, SUM OVER, RANK) achieve running totals, comparisons, and rankings in a single pass.
-- Running revenue per user: window function, no self-join
SELECT user_id, order_date, amount,
  SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS running_total
FROM dataset.orders
Use SEARCH() and search indexes for full-text lookups
The BigQuery SEARCH() function with a search index performs inverted-index lookups, dramatically outperforming LIKE '%term%' on large tables (>10 GB).
CREATE SEARCH INDEX log_idx ON dataset.app_logs(log_message);

SELECT request_id, log_message
FROM dataset.app_logs
WHERE log_date = CURRENT_DATE()
  AND SEARCH(log_message, 'NullPointerException')
Optimise UNNEST patterns for ARRAY and STRUCT columns
UNNEST explodes array elements into rows. Filter before unnesting and avoid unnesting multiple independent arrays in the same query (produces a cross-product).
-- Check array contains value WITHOUT full unnest
SELECT user_id
FROM dataset.users
WHERE EXISTS (
  SELECT 1 FROM UNNEST(interest_tags) AS tag WHERE tag = 'sports'
)

-- Access nested struct field directly (no UNNEST needed)
SELECT address.city FROM dataset.users
Use TABLESAMPLE for fast exploratory queries
TABLESAMPLE SYSTEM draws a random percentage of data blocks, reducing bytes scanned proportionally. Ideal for validating query logic before running full scans.
SELECT status, COUNT(*) AS cnt
FROM dataset.orders TABLESAMPLE SYSTEM (1 PERCENT)
GROUP BY status
ORDER BY cnt DESC
Use granular wildcard table prefixes
Wildcard tables scan every matching table. The more specific the prefix, the fewer tables scanned. Always add a _TABLE_SUFFIX filter.
SELECT event_type, COUNT(*) AS cnt
FROM `project.dataset.events_2024*`
WHERE _TABLE_SUFFIX BETWEEN '0101' AND '0131'  -- Jan 2024 only
GROUP BY event_type
Schema Design
Best practices for designing BigQuery table schemas that balance storage cost, query performance, and maintainability.
Partition tables on a date/timestamp column
Partitioning physically separates data by time unit. Queries that filter on the partition column only scan relevant partitions, dramatically reducing cost.
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
OPTIONS (partition_expiration_days = 730)
AS SELECT * FROM staging.raw_events WHERE FALSE
Cluster tables on frequently filtered/joined columns
Clustering sorts data within each partition by up to 4 columns. Queries that filter or join on cluster keys skip irrelevant blocks at no extra cost.
CREATE TABLE sales
PARTITION BY DATE(sale_date)
CLUSTER BY region, product_category, customer_id
AS SELECT * FROM staging.sales
Use STRUCT and ARRAY to denormalise related data
BigQuery is optimised for denormalised schemas. Storing nested/repeated fields as STRUCT or ARRAY avoids expensive JOINs and keeps related data co-located.
CREATE TABLE users (
  user_id STRING,
  name STRING,
  address STRUCT<street STRING, city STRING, country STRING>,
  tags ARRAY<STRING>
)
Choose the right data types
Correct data types reduce storage, speed up comparisons, and avoid implicit casts that can break partition pruning.
CREATE TABLE transactions (
  transaction_id INT64,
  amount NUMERIC(18, 2),
  created_at TIMESTAMP,
  is_refunded BOOL
)
Avoid wide tables with hundreds of columns
Tables with hundreds of sparse columns are hard to maintain. Use STRUCT grouping or split into logical sub-tables.
-- Instead of 50 separate metric columns:
metrics STRUCT<
  clicks INT64,
  impressions INT64,
  conversions INT64,
  revenue NUMERIC
>
Prefer time-partitioned tables over date-sharded tables
Date-sharded tables (events_20240101, events_20240102) require wildcard queries and increase management overhead. A single time-partitioned table gives the same pruning with better optimiser support.
CREATE OR REPLACE TABLE dataset.events
PARTITION BY event_date
CLUSTER BY user_id
OPTIONS (partition_expiration_days = 365)
AS SELECT * FROM `dataset.events_*`
WHERE _TABLE_SUFFIX >= '20230101'
Use BigQuery native JSON type for semi-structured data
The native JSON type stores data in a lossless binary encoding and supports path-based access without string parsing. It outperforms JSON stored as STRING.
CREATE TABLE dataset.events (
  event_id STRING,
  event_date DATE,
  event_type STRING,
  properties JSON  -- dynamic per-event fields
);

SELECT event_id, properties.user_agent
FROM dataset.events
WHERE JSON_VALUE(properties, '$.country') = 'IN'
Choose physical storage billing for compressed savings
For datasets with high compression ratios, physical billing can cut storage costs by 30–70%. Evaluate on a dataset-by-dataset basis.
-- Compare before switching
SELECT table_name,
  total_logical_bytes / 1e9 AS logical_gb,
  active_physical_bytes / 1e9 AS physical_gb
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = 'my_dataset'
Cost Management
Best practices to monitor, control, and reduce BigQuery spend without sacrificing analytical capability.
Always preview query cost before running
The BigQuery console dry-run and the --dry_run flag return bytes billed without executing the query. Build this check into CI pipelines and analyst workflows.
from google.cloud import bigquery
client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=True)
job = client.query('SELECT * FROM dataset.table LIMIT 100', job_config=job_config)
print(f'Bytes processed: {job.total_bytes_processed / 1e9:.2f} GB')
Leverage query result caching
BigQuery caches query results for 24 hours. Identical queries are free. Avoid defeating the cache with non-deterministic functions like NOW() or RAND().
-- Cache-friendly: date is a parameter, not NOW()
DECLARE report_date DATE DEFAULT '2024-06-01';
SELECT COUNT(*) FROM events WHERE event_date = report_date
Use materialised views and scheduled queries
Pre-aggregate expensive computations into materialised views. Dashboards query the small summary table instead of scanning the raw fact table.
CREATE MATERIALIZED VIEW dataset.daily_revenue
OPTIONS (enable_refresh = true, refresh_interval_minutes = 60)
AS SELECT DATE(order_date) AS day, SUM(amount) AS revenue
FROM dataset.orders
GROUP BY day
Set budgets, quotas, and cost alerts
Use Google Cloud Budget Alerts and BigQuery custom quotas to prevent runaway queries. Per-project and per-user daily byte limits act as a safety net.
# Set a hard 10 GB per-query limit
job_config = bigquery.QueryJobConfig(
maximum_bytes_billed=10 * 1024**3 # 10 GB
)
Use long-term storage pricing and table expiration
Tables not modified for 90+ days drop to long-term storage pricing (~50% cheaper). Set expiration on staging and temporary tables.
CREATE TABLE dataset.temp_staging
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
)
AS SELECT ...
Use BigQuery Editions for predictable workloads
On-demand pricing bills per TB scanned — unpredictable for heavy pipelines. BigQuery Editions provide reserved slots with autoscaling and commitments for significant discounts.
-- Daily slot-hours consumed, to estimate reservation size
SELECT DATE(creation_time) AS day,
  SUM(total_slot_ms) / (1000 * 60 * 60) AS slot_hours
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY day
ORDER BY slot_hours DESC
Avoid repeated table overwrites to reduce time-travel storage
Every WRITE_TRUNCATE retains the previous snapshot in time-travel storage (up to 7 days). Use WRITE_APPEND + MERGE or reduce the time-travel window for high-churn tables.
ALTER TABLE dataset.daily_staging SET OPTIONS(max_time_travel_hours = 48);
Preview data without running queries
Use the BigQuery console Preview tab, bq head, or the tabledata.list API — none trigger query billing.
# Free row preview via CLI (no query charge)
bq head --max_rows=20 project:dataset.tablename
Materialise intermediate stages to reduce repeated scanning
Break ETL into explicit destination table stages. Each intermediate table is written once and queried from its smaller result.
-- Stage 1: filter once
CREATE OR REPLACE TABLE dataset.orders_2024
OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 DAY))
AS SELECT * FROM dataset.raw_orders
WHERE order_date >= '2024-01-01';

-- Stage 2: aggregate from the small table (cheap)
SELECT region, SUM(amount) FROM dataset.orders_2024 GROUP BY region
Security & Access Control
Best practices to secure BigQuery data, control access at the right granularity, and maintain compliance with data-governance policies.
Apply principle of least privilege with IAM
Grant the minimum BigQuery IAM role required for each principal. Prefer predefined roles over primitive roles and grant at the dataset or table level, not the project level.
# Grant dataViewer on a single table, not the whole project
bq add-iam-policy-binding \
  --member='serviceAccount:pipeline@project.iam.gserviceaccount.com' \
  --role='roles/bigquery.dataViewer' \
  project:dataset.table
Use column-level and row-level security
Column-level security (Policy Tags) restricts access to PII columns. Row-level security (Row Access Policies) filters rows based on the querying user's attributes.
CREATE ROW ACCESS POLICY region_filter
ON dataset.sales
GRANT TO ('group:apac-sales@company.com')
FILTER USING (region = 'APAC')
Encrypt sensitive data with Cloud KMS CMEK
For regulatory requirements (HIPAA, PCI-DSS) use Customer-Managed Encryption Keys via Cloud KMS to retain key control.
CREATE TABLE dataset.phi_records
OPTIONS (
  kms_key_name = 'projects/proj/locations/us/keyRings/ring/cryptoKeys/bq-key'
)
AS SELECT * FROM staging.phi_staging
Audit and monitor data access with Cloud Audit Logs
Enable Data Access audit logs (DATA_READ, DATA_WRITE). Export logs to BigQuery for anomaly detection and compliance reporting.
-- Find top data consumers in the last 7 days
SELECT user_email, SUM(total_bytes_billed) / 1e12 AS tb_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY user_email
ORDER BY tb_billed DESC
LIMIT 20
Use VPC Service Controls and authorised views
VPC Service Controls create a security perimeter around BigQuery. Authorised views let you share query results without exposing the underlying table.
CREATE VIEW analytics.users_masked AS
SELECT user_id,
  CONCAT(LEFT(email, 2), '****@', SPLIT(email, '@')[OFFSET(1)]) AS email_masked,
  country, created_date
FROM pii_dataset.users
Apply dynamic data masking to sensitive columns
BigQuery dynamic data masking applies masking rules automatically based on the querying user's role. Analysts see masked values; authorised roles see the original data.
-- Masking rules (e.g. SHA256 hash, nullify, default masking value) are
-- created as data policies attached to policy tags via the console or the
-- BigQuery Data Policy API; there is no CREATE MASKING POLICY SQL statement.
Use Workload Identity Federation instead of service account keys
SA JSON key files can be leaked and are hard to rotate. Workload Identity Federation lets external workloads authenticate without a long-lived credential.
# GitHub Actions: authenticate via Workload Identity (no key file)
- uses: google-github-actions/auth@v2
with:
workload_identity_provider: 'projects/123/locations/global/workloadIdentityPools/pool/providers/github'
service_account: 'bq-pipeline@project.iam.gserviceaccount.com'
Enable and govern data lineage with Dataplex
Dataplex automatically tracks table-level and column-level data lineage across BigQuery jobs. Essential for impact analysis and GDPR right-to-erasure.
gcloud services enable datalineage.googleapis.com --project=PROJECT_ID
Materialized Views
Best practices for creating and managing BigQuery materialized views to pre-aggregate expensive computations, reduce query costs, and accelerate BI dashboards.
Use incremental materialized views for near-real-time aggregations
Incremental MVs only process new data since the last refresh rather than recomputing the entire result. Combined with max_staleness, they deliver sub-minute freshness at a fraction of the cost.
CREATE MATERIALIZED VIEW dataset.hourly_revenue
OPTIONS (
  enable_refresh = true,
  refresh_interval_minutes = 60,
  max_staleness = INTERVAL '90' MINUTE
)
AS SELECT TIMESTAMP_TRUNC(order_ts, HOUR) AS hour,
  region,
  SUM(amount) AS revenue,
  COUNT(*) AS order_count
FROM dataset.orders
GROUP BY 1, 2
Tune max_staleness to balance freshness vs cost
When a query hits a table with an associated MV, BigQuery serves the MV result if it is within max_staleness. Too tight forces full scans; too loose serves stale data.
ALTER MATERIALIZED VIEW dataset.live_orders SET OPTIONS(max_staleness = INTERVAL '15' MINUTE)
Avoid non-deterministic functions in MV definitions
Functions like CURRENT_TIMESTAMP(), RAND(), or GENERATE_UUID() are evaluated at refresh time, not query time — causing subtle correctness bugs.
-- GOOD: derive snapshot from the data itself
CREATE MATERIALIZED VIEW dataset.orders_summary AS
SELECT DATE(order_ts) AS order_date, SUM(amount) AS daily_revenue
FROM dataset.orders
GROUP BY 1
Align MV partition and cluster with query patterns
Choose cluster columns that match the filters your BI tool or downstream queries use most often.
CREATE MATERIALIZED VIEW dataset.sales_by_region
CLUSTER BY region, product_id
AS SELECT sale_date, region, product_id, SUM(revenue) AS revenue
FROM dataset.sales
GROUP BY 1, 2, 3
Monitor MV refresh cost and staleness via INFORMATION_SCHEMA
MV refreshes consume slot-time and are billed like regular queries. Track refresh frequency and last_refresh_time to catch runaway MVs.
SELECT table_schema, table_name, last_refresh_time,
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), last_refresh_time, MINUTE) AS stale_minutes
FROM `region-us`.INFORMATION_SCHEMA.MATERIALIZED_VIEWS_BY_PROJECT
ORDER BY stale_minutes DESC
Use BI Engine for sub-second dashboard query acceleration
BigQuery BI Engine is an in-memory analysis service that caches frequently-queried data and accelerates SQL workloads from Looker Studio and Looker.
-- Create a 10 GB BI Engine reservation in us-central1
-- (done via Console: BigQuery → BI Engine → Create reservation)
Use MERGE for incremental MV-like updates on unsupported patterns
When your aggregation logic exceeds native MV capabilities, implement an incremental MERGE pattern into a destination table, processing only new partitions each run.
MERGE dataset.daily_summary AS T
USING (
  SELECT DATE(order_ts) AS day, region, SUM(amount) AS revenue
  FROM dataset.orders
  WHERE DATE(order_ts) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  GROUP BY 1, 2
) AS S
ON T.day = S.day AND T.region = S.region
WHEN MATCHED THEN
  UPDATE SET revenue = S.revenue
WHEN NOT MATCHED THEN
  INSERT (day, region, revenue) VALUES (S.day, S.region, S.revenue)
Monitoring & Observability
Best practices for observing BigQuery job health, tracking cost attribution, detecting anomalies, and building operational dashboards.
Use INFORMATION_SCHEMA.JOBS as your primary audit trail
JOBS_BY_PROJECT records every query, load, extract, and copy job. First place to look for slow queries, expensive scans, failed jobs, and quota violations.
-- Top 10 most expensive queries in the last 24 hours
SELECT user_email, query,
  ROUND(total_bytes_billed / 1e12, 4) AS tb_billed,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_s
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY tb_billed DESC
LIMIT 10
Set up slot utilisation alerts in Cloud Monitoring
BigQuery exposes slot utilisation metrics via Cloud Monitoring. For reserved capacity, sustained high utilisation means queries are queuing.
gcloud alpha monitoring policies create \
  --display-name='BQ High Slot Utilisation' \
  --condition-filter='metric.type="bigquery.googleapis.com/slots/allocated_for_project"' \
  --condition-threshold-value=800 \
  --condition-threshold-duration=300s
Attribute cost by team and workload with labels
BigQuery job labels propagate to Cloud Billing exports, enabling per-team and per-pipeline cost breakdown.
job_config = bigquery.QueryJobConfig(
labels={
'team': 'analytics',
'pipeline': 'user-metrics',
'environment': 'production',
}
)
Monitor and alert on query errors and failed jobs
Failed BigQuery jobs are silent by default unless you explicitly monitor error_result in INFORMATION_SCHEMA.
-- Find failed jobs in the last hour
SELECT job_id, user_email, error_result.reason, error_result.message
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  AND error_result IS NOT NULL
ORDER BY creation_time DESC
Track query performance trends over time
Track p50/p95 query duration and bytes_billed per query signature over time to detect regressions.
-- Weekly p95 slot-time trend
SELECT TIMESTAMP_TRUNC(creation_time, WEEK) AS week,
  statement_type,
  APPROX_QUANTILES(total_slot_ms, 100)[OFFSET(95)] / 1000 AS p95_slot_s
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
GROUP BY 1, 2
ORDER BY 1 DESC, 3 DESC
Audit storage costs with INFORMATION_SCHEMA.TABLE_STORAGE
TABLE_STORAGE gives a per-table breakdown of active, long-term, time-travel, and fail-safe physical bytes.
SELECT table_name,
  ROUND(total_logical_bytes / 1e9, 2) AS logical_gb,
  ROUND(time_travel_physical_bytes / 1e9, 2) AS time_travel_gb
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
ORDER BY total_logical_bytes DESC
LIMIT 20
Monitor reservation slot utilisation and queue wait times
For capacity-based BigQuery editions, sustained high slot utilisation causes queries to queue. Track queue wait via INFORMATION_SCHEMA.JOBS.
-- Detect queue wait time by reservation
SELECT reservation_id, DATE(creation_time) AS day,
APPROX_QUANTILES(
TIMESTAMP_DIFF(start_time, creation_time, SECOND), 100
)[OFFSET(95)] AS p95_queue_s
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY 1, 2 ORDER BY p95_queue_s DESC
Data Ingestion
Best practices for loading data into BigQuery efficiently and reliably, covering batch loads, streaming, file formats, deduplication, and DML patterns.
Use Storage Write API instead of legacy streaming inserts
The legacy insertAll streaming API charges per row and has lower throughput limits. The Storage Write API supports exactly-once delivery and is ~10× cheaper for high-volume writes.
from google.cloud.bigquery_storage_v1 import BigQueryWriteClient, types
client = BigQueryWriteClient()
parent = client.table_path('project', 'dataset', 'table')
write_stream = client.create_write_stream(
parent=parent,
write_stream=types.WriteStream(type_=types.WriteStream.Type.COMMITTED)
)
Use columnar file formats for batch loads
BigQuery natively reads Parquet, Avro, and ORC. These formats are compressed, self-describing, and load faster while consuming fewer slots than CSV/JSON.
bq load \
  --source_format=PARQUET \
  --parquet_enable_list_inference=true \
  project:dataset.table \
  'gs://bucket/data/events/dt=2024-06-01/*.parquet'
Batch DML operations — avoid single-row mutations
BigQuery DML triggers background compaction after each statement. Issuing thousands of single-row DML statements generates excessive mutation metadata and can exhaust quotas.
-- GOOD: stage all rows then bulk-insert
INSERT INTO dataset.target_table (id, value, updated_at)
SELECT id, value, updated_at
FROM dataset.staging_table
WHERE load_date = CURRENT_DATE()
Use MERGE for upsert patterns instead of DELETE + INSERT
A MERGE statement performs both update and insert in a single atomic pass and is easier to make idempotent than DELETE + INSERT.
MERGE dataset.customers AS T
USING (
  SELECT customer_id, name, email, updated_at
  FROM dataset.customers_staging
  WHERE DATE(updated_at) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
) AS S
ON T.customer_id = S.customer_id
WHEN MATCHED AND S.updated_at > T.updated_at THEN
  UPDATE SET name = S.name, email = S.email, updated_at = S.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, updated_at)
  VALUES (S.customer_id, S.name, S.email, S.updated_at)
Implement idempotent deduplication for streaming pipelines
Streaming pipelines can deliver duplicate records during retries. Use Storage Write API COMMITTED streams for exactly-once semantics, or implement downstream deduplication with ROW_NUMBER().
CREATE OR REPLACE TABLE dataset.events_deduped AS
SELECT * EXCEPT(row_num) FROM (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY event_id
ORDER BY ingestion_timestamp DESC
) AS row_num
FROM dataset.events_raw
)
WHERE row_num = 1
Validate and enforce schema on load
Use autodetect only for exploration; enforce explicit schemas in production. Enable schema update options only intentionally.
schema = [
bigquery.SchemaField('event_id', 'STRING', mode='REQUIRED'),
bigquery.SchemaField('event_ts', 'TIMESTAMP', mode='REQUIRED'),
bigquery.SchemaField('amount', 'NUMERIC', mode='NULLABLE'),
]
job_config = bigquery.LoadJobConfig(
schema=schema,
source_format=bigquery.SourceFormat.PARQUET,
)
Workload Management
Best practices for managing BigQuery slot capacity, reservations, autoscaling, job concurrency, and workload isolation.
Isolate workloads with separate reservations
Sharing a single slot pool across interactive queries, batch ETL, and development workloads causes priority inversion. Use separate reservations to guarantee isolation.
bq mk --reservation --location=US --slots=500 prod-interactive
bq mk --reservation --location=US --slots=200 prod-batch
bq mk --reservation --location=US --slots=50 dev
Configure autoscaling baselines and max slots correctly
Setting baseline too high wastes money; setting max too low throttles queries during peaks. Use INFORMATION_SCHEMA history to size baselines.
-- Estimate p50 and p95 slot consumption
SELECT reservation_id,
  APPROX_QUANTILES(total_slot_ms / 1000, 100)[OFFSET(50)] AS p50_slot_s,
  APPROX_QUANTILES(total_slot_ms / 1000, 100)[OFFSET(95)] AS p95_slot_s
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY reservation_id
Set job timeouts and query priorities to prevent runaway jobs
Set job_timeout_ms to cancel queries exceeding expected duration, and use BATCH priority for non-urgent ETL to allow BigQuery to schedule opportunistically.
job_config = bigquery.QueryJobConfig(
job_timeout_ms=600_000, # Cancel after 10 minutes
maximum_bytes_billed=50 * 10**9,  # Hard 50 GB cost cap (must be an int)
priority=bigquery.QueryPriority.BATCH,
labels={'team': 'data-eng', 'pipeline': 'nightly-etl'},
)
Use capacity commitments for predictable cost and performance
Annual and 3-year slot commitments offer significant discounts (up to 25–50%). Use FLEX commitments for burst windows; annual commitments for the stable baseline.
-- View current capacity commitments
SELECT commitment_id, slot_count, plan, state,
  commitment_start_time, commitment_end_time
FROM `region-us`.INFORMATION_SCHEMA.CAPACITY_COMMITMENTS
ORDER BY commitment_start_time DESC
Label all jobs for cost attribution and workload analysis
Job labels are the only reliable way to attribute BigQuery costs to teams and pipelines in the Cloud Billing export.
-- Find top cost by team label over the last 30 days
-- (labels is an ARRAY<STRUCT<key, value>> in the JOBS views)
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'team') AS team,
  ROUND(SUM(total_bytes_billed) / 1e12, 3) AS tb_billed,
  COUNT(*) AS job_count
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY team
ORDER BY tb_billed DESC
Partitioning
Advanced partitioning strategies to minimise bytes scanned, reduce costs, and simplify data lifecycle management in BigQuery.
Always filter on the partition column
Queries that don't filter on the partition column perform full table scans, bypassing pruning entirely. Always include a partition filter in WHERE clauses on partitioned tables.
-- GOOD: partition filter enables pruning
SELECT * FROM orders WHERE DATE(created_at) = '2024-01-01'

-- BAD: no partition filter, full table scan
SELECT * FROM orders WHERE order_status = 'OPEN'
Use partition expiration to auto-delete old data
Partition expiration automatically removes data older than N days, reducing active storage costs without manual cleanup scripts.
CREATE TABLE sales.events
PARTITION BY DATE(event_time)
OPTIONS (partition_expiration_days = 90)
AS SELECT * FROM staging.events WHERE FALSE
Choose the right partition granularity
Daily partitioning suits most OLAP workloads. Hourly partitioning is appropriate only for very high-volume streaming tables where hour-level pruning provides meaningful savings.
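As a sketch (table and column names assumed), an hourly-partitioned table for a high-volume stream might look like:

```sql
-- Hypothetical hourly-partitioned stream table; the 30-day expiry keeps the
-- partition count (24 x 30 = 720) well under BigQuery's per-table limit
CREATE TABLE dataset.clickstream
PARTITION BY TIMESTAMP_TRUNC(event_ts, HOUR)
OPTIONS (partition_expiration_days = 30)
AS SELECT * FROM staging.clickstream WHERE FALSE
```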
Use integer-range partitioning for non-time keys
When the natural query filter is an integer ID (e.g. customer_id, shard_id), use integer-range partitioning instead of forcing a timestamp column.
-- Illustrative schema; buckets of 100000 IDs over the range 0 to 10M
CREATE TABLE my_dataset.customers (
  customer_id INT64,
  name STRING
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 10000000, 100000));
Storage Pricing
Practices to minimise BigQuery storage costs through intelligent use of long-term pricing, physical billing, table expiry, and data lifecycle policies.
Let tables age into long-term storage pricing
Tables and partitions that haven't been modified for 90+ consecutive days are automatically billed at the long-term rate (~50% cheaper). Avoid unnecessary table writes that reset this 90-day clock.
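One way to see how much of a dataset already bills at the long-term rate is the TABLE_STORAGE view (region and dataset names here are placeholders):

```sql
-- Split each table's logical bytes into long-term vs active storage
SELECT table_name,
  ROUND(long_term_logical_bytes / 1e9, 2) AS long_term_gb,
  ROUND(active_logical_bytes / 1e9, 2) AS active_gb
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = 'my_dataset'
ORDER BY long_term_gb DESC
```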
Enable physical storage billing for compressed savings
By default, BigQuery bills for logical (uncompressed) storage. Switching to physical billing charges for actual compressed bytes on disk — often 30–70% less for columnar data.
-- storage_billing_model is a dataset-level option, set via ALTER SCHEMA
ALTER SCHEMA mydataset SET OPTIONS (storage_billing_model = 'PHYSICAL')
Set table and dataset expiration policies
Temporary and staging tables that are never cleaned up are a silent cost driver. Use default table expiration at the dataset level to auto-delete tables after a defined period.
# Set dataset-level default table expiration (7 days)
bq update --default_table_expiration 604800 myproject:staging
Scheduled Queries
Best practices for building reliable, cost-efficient scheduled query pipelines using BigQuery Data Transfer Service.
Use incremental INSERT instead of full table overwrites
Full WRITE_TRUNCATE on large tables re-scans all source data on every run. Incremental patterns based on a run-time parameter only process new data, dramatically reducing cost and latency.
-- Use the @run_time and @run_date built-in parameters
INSERT INTO reporting.daily_summary
SELECT @run_date AS report_date, user_id, SUM(revenue) AS revenue
FROM events.transactions
WHERE DATE(event_time) = @run_date
GROUP BY 1, 2
Set maximum_bytes_billed on all scheduled queries
A logic error in a scheduled query can trigger a full scan of a petabyte table and generate a massive bill. A bytes-billed cap fails the job safely rather than processing runaway data.
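The Data Transfer Service does not expose this cap directly, so one approach is to launch the recurring query from your own scheduler and derive the cap from typical usage. A minimal sketch, assuming a headroom factor of 3x; the helper name and constants are illustrative, not part of BqForge or the BigQuery API:

```python
# Hedged sketch: derive a maximum_bytes_billed value for a recurring query
# from its typical scan size, with a headroom factor and a 1 GiB floor.
GB = 1024 ** 3

def bytes_cap(typical_bytes: int, headroom: float = 3.0, floor: int = GB) -> int:
    """Cap at `headroom` times the typical scan, never below the floor."""
    return max(int(typical_bytes * headroom), floor)

# A query that normally scans 12 GiB gets a 36 GiB cap; pass the result to
# google.cloud.bigquery.QueryJobConfig(maximum_bytes_billed=...) when the
# job is launched by your own scheduler instead of the Transfer Service.
cap = bytes_cap(12 * GB)
```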
Schedule during off-peak hours to utilise flex slots
Batch and scheduled queries run with BATCH priority can be queued during off-peak hours when slot capacity is available. This is especially effective with FLEX commitment slots, reducing on-demand costs.
-- Configure via Transfer Service API or Cloud Console
-- Set schedule: "every 24 hours" starting at 02:00 UTC
-- Priority: BATCH (enables opportunistic scheduling)
Use Dataform assertions to validate output data
Pipeline outputs should be validated after each run. Dataform assertions run automatically and block downstream dependencies if data quality fails.
-- Dataform SQLX assertion (fails when the query returns any rows):
config {
  type: 'assertion',
  description: 'daily_summary must have rows for every recent day'
}
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY(
  DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY),
  DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
)) AS day
LEFT JOIN ${ref('daily_summary')} t ON t.date = day
WHERE t.date IS NULL  -- any missing day surfaces as a row and fails the run
Stagger scheduled query start times to avoid slot contention
When many scheduled queries start at exactly the same time (e.g. midnight), they compete for slots and all run slower. Stagger start times by 5–15 minutes.
# Cron schedules spread across the hour:
#   Critical pipeline:    0 0 * * *   (00:00)
#   Secondary pipeline:   5 0 * * *   (00:05)
#   Reporting rollup:    15 0 * * *   (00:15)
#   Archive job:         30 0 * * *   (00:30)