Your team’s BigQuery table is growing fast, tables are slowing down, and your bill is creeping up month by month. Someone on your team says “just partition it,” while someone else wants to “cluster it.” But which is the best choice to improve efficiency and keep your investment low while maintaining performance?
The honest answer: both strategies reduce the amount of data BigQuery scans per query, but they solve different problems. And in a lot of cases, you actually want both running together.
This article will break down:
- What BigQuery clustering and partitioning are
- How they compare
- How to pick the right approach for your tables
What Is BigQuery Partitioning?
BigQuery partitioning divides a table into logical partitions based on column value or ingestion time. Each partition holds a subset of your data based on the value of a single column – usually a date, timestamp, or integer range.
When you run a query that filters on the partition column, BigQuery skips the partitions that don’t match your filter criteria. This is called partition pruning. It happens before the query executes, which means BigQuery can give you a cost estimate upfront.
BigQuery supports three types of partitioning:
- Time-unit partitioning
- Ingestion-time partitioning
- Integer-range partitioning
Time-Unit Partitioning
Uses partition by DATE, TIMESTAMP, or DATETIME column with hourly, daily, monthly or yearly granularity. This is the most common type of BigQuery partitioning.
CREATE TABLE analytics.events
PARTITION BY DATE(event_timestamp)
AS SELECT * FROM analytics.events_raw;
Ingestion-Time Partitioning
BigQuery automatically partitions rows based on when they were loaded, using a pseudo-column called _PARTITIONTIME. It’s most useful when your data doesn’t have a reliable timestamp column.
CREATE TABLE analytics.events ( event_id STRING, user_id STRING, event_type STRING)PARTITION BY _PARTITIONTIMEOPTIONS ( require_partition_filter = TRUE);
Integer-Range Partitioning
This partitions on an INTEGER column with a defined start, end, and interval. It works best for things like customer ID ranges or geographical codes.
CREATE TABLE analytics.customers
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 10000000, 100000));
Pro tip: BigQuery enforces partition limits that vary by partition types, so teams should verify current limits in Google Cloud documentation before designing highly granular partition strategies.
What Is BigQuery Clustering?
BigQuery clustering organizes the data within a table (or within a partition) by sorting according to one or more columns. BigQuery uses those sorted values to group rows into storage blocks. When a query filters or aggregates on a clustered column, BigQuery only reads the relevant blocks, which is called block pruning.
Unlike partitioning, clustering works on up to four columns and is effective for high-cardinality filtering patterns. There’s no hard limit on distinct values.
CREATE TABLE analytics.events
CLUSTER BY user_id, event_type;
When you cluster on multiple columns, order is the most important. BigQuery starts sorting with the first column, then the second within that, and so on. A query filtering only on the second or third clustered column won’t get the full benefit.
-- This query benefits from clustering on (country, event_type)
SELECT * FROM analytics.events
WHERE country = 'US' AND event_type = 'purchase';
-- This query gets less benefit because it skips the first clustered column
SELECT * FROM analytics.events
WHERE event_type = 'purchase';
Pro tip: BigQuery can’t estimate exactly how much data a query will scan before running. The pre-run estimate in the query editor is often an upper-bound estimate because BigQuery determines block pruning efficiency during execution. The actual bytes processed – and your actual cost – are determined at execution time after block pruning starts.
How Partitioning and Clustering Are Similar
Both BigQuery partitioning and clustering exist for the same reason: BigQuery charges you based on how much data gets scanned, so anything that reduces scan volume directly reduces cost.
Both methods work by helping BigQuery skip data it doesn’t need:
- Partitioning skips whole partitions
- Clustering skips storage blocks within a table or partition
The end result gets your team faster queries and a smaller bill – especially when used at scale.
Both partitioning and clustering also integrate cleanly into standard DDL, requiring no code changes to your queries (as long as you’re filtering on the right columns), and maintain themselves automatically as new data is written. Your team can define the structure once and BigQuery just handles the rest.
Where Partitioning and Clustering Differ
The main ways that partitioning and clustering differ are:
- How data is organized
- Supported column types
- Cardinality limits
- Query cost predictability
- Partition-level management
- Table size threshold
Take a closer look at how each method performs for these different features.
How Data Is Organized
Partitioning physically separates data into distinct storage segments based on a column value or ingestion time. For example, a table partitioned by date stores January 1st data separately from January 2nd data.
Clustering works differently. Instead of splitting the table into separate segments, it sorts rows within existing storage blocks based on selected columns. BigQuery then uses metadata about those sorted blocks to skip scanning irrelevant data during queries.
In other words:
- Partitioning divides data into buckets
- Clustering organizes data within those buckets
Supported Column Types
Partitioning is more restrictive when it comes to column selection. In BigQuery, you can only partition on a single DATE, TIMESTAMP, DATETIME, or INTEGER column (or use ingestion-time partitioning).
Clustering is much more flexible. You can cluster on up to four columns and support a wider range of data types, including common types like:
- STRING
- INT64
- DATE
- TIMESTAMP
- BOOL
- NUMERIC
This means that clustering is usually better suited for tables frequently filtered by multiple dimensions like region, customer ID, product category, or event type.
Cardinality Limits
Partitioning comes with hard limits. BigQuery caps partitioned tables at 4,000 partitions, which can become a problem if you’re partitioning highly granular datasets.
Clustering doesn’t have this limitation. It sorts rather than separates data into physical partitions. That means it can handle columns with high cardinality much more effectively.
Query Cost Predictability
Partition pruning happens before query execution begins. When a query filters on a partition column, BigQuery can immediately determine which partitions need to be scanned. That means users can estimate costs before running the query.
Clustering is less predictable. Because clustering data pruning happens during query execution, BigQuery doesn’t know exactly how much data can be skipped until the query runs. As a result, final query costs are only visible after execution finishes.
Partition-Level Management
Partition tables offer more granular administrative controls, allowing you to:
- Set expiration dates for individual partitions
- Load data into specific partitions
- Delete individual partitions without affecting the rest of the table
- Manage retention policies more precisely
These features are especially helpful for time-series datasets with strict retention requirements.
Clustering doesn’t provide partition-level administrative capabilities because the data remains stored in a single table structure.
Table Size Threshold
BigQuery often recommends partitioning and clustering for larger tables where scan reduction becomes meaningful, but exact thresholds depend on workload patterns and query behavior.
When to Use Partitioning
Partitioning is the better fit when:
- Your data has a clear time dimension and most queries filter by date or time range – event logs, transactions, page views, etc.
- You want predictable cost estimates before queries run. With require_partition_filter = TRUE, you can even enforce that every query must specify a partition filter to prevent accidental full-table scans.
- You need to manage data by time period – expiring old partitions, loading a backfill into a specific date range, or auditing how much data lives in each segment.
- Your filter column has a relatively low cardinality and fits within the 4,000-partition limit.
If your table doesn’t have a strong time component and you’re filtering on things like user IDs, product categories, or geographic regions – partitioning alone probably isn’t the right tool.
When to Use Clustering
Clustering is the more natural fit when:
- Your queries frequently filter or group by high-cardinality columns – user_id, product_sku, session_id, etc.
- You want the flexibility to optimize across multiple filter columns without being locked into a single partition key.
- Your table doesn’t have a natural date column to partition on, but you do have columns that queries consistently filter against.
- Your data is under 4,000 logical segments but still large enough (10+ GB) that block pruning would meaningfully reduce scan volume.
Clustering also handles something partitioning can’t: when your access patterns vary query to query. If half your queries filter by region and the other half filter by product_category, clustering on both columns captures a wider range of query patterns than a single partition key ever could.
When to Use Both Together
This is where things get interesting – and the strategy that most production tables should use.
Partitioning and clustering are designed to work together:
- Partitioning first reduces your table into time-bound segments
- Clustering then organizes data within each partition so individual queries only touch the relevant blocks
The two strategies stack for added benefits.
Take a practical example. Here’s an e-commerce events table that gets queried by both date range and by customer attributes:
CREATE TABLE analytics.ecommerce_events (
event_id STRING,
user_id STRING,
event_ts TIMESTAMP,
event_type STRING,
country STRING,
revenue FLOAT64
)
PARTITION BY DATE(event_ts)
CLUSTER BY country, event_type, user_id;
A query like “show me all U.S. purchases in Q1” benefits from both layers simultaneously – partition pruning skips every day outside Q1, and block pruning within each remaining partition skips rows where country != ‘US’ and event_type != ‘purchase’.
Pro tip: In some cases, teams cluster on a more granular timestamp or related filter column within a partitioned table – but this should be tested against real query behavior before implementation.
An important thing to remember: Google Cloud has a partitioning and clustering recommender built into BigQuery that analyzes your workload over the past 30 days and surfaces optimization opportunities. It runs a candidate generation step, a read pattern analysis, and a write pattern analysis and then generates net savings estimates per column.
This is worth checking, especially if you’ve got a lot of tables that grew organically and never get proper optimization attention. You can get export recommendations through Google Cloud’s recommender APIs or review them directly in BigQuery UI.
This recommender is good at catching the obvious wins on larger tables. It’s less useful when it comes to understanding the interplay between your query billing model (on-demand vs. reservations) and how optimization choices affect your actual monthly spend.
Table-Level Optimization Only Gets You So Far
The main issue when it comes to partitioning vs. clustering is that, even if you’ve structured every table perfectly, you can still end up with an eye-wateringly high BigQuery bill.
Why? Because table-level optimization only controls how much data gets scanned per query. It doesn’t do anything about how your jobs are billed – specifically whether a workload runs on expensive on-demand compute or against reservation capacity you’ve already committed to. It won’t help with:
- Underutilized reservations
- Rerouting query spikes to avoid slot contention
- Visibility into which dbt models are eating the most compute
That’s the gap that Yuki Data fills. Yuki sits between your workload and BigQuery’s compute layer, automatically routing each job to the most cost-effective billing option in real-time – on-demand or reservation. It does this all without requiring any:
- Code changes
- Query rewrites
- Infrastructure migrations
Teams using Yuki see costs drop by up to 60%, including workloads that were already partitioned and clustered.
If you’ve done the table optimization work and your bill is still higher than it should be, the issue is almost certainly at the compute routing layer – not the schema layer.
See how Yuki handles BigQuery cost optimization and find out exactly where your spend is going.


