The Second Blog Post in Our Snowflake Data Ingestion Tips Series Is Here!
In the first post, we explored the various data ingestion methods Snowflake provides. In this post, we’ll focus on tips for optimizing your Snowflake data ingestion and reducing its cost.
Let’s dive right in!
Monitoring Usage and Costs on an Ongoing Basis
While it may seem like the most traditional and boring approach, consistently monitoring your Snowflake usage and costs is crucial for staying efficient. The following SQL query gives you a detailed view of your Snowpipe usage and credits consumed over the last month, broken out by day and pipe:
SELECT TO_DATE(start_time) AS date,
       pipe_name,
       SUM(credits_used) AS credits
FROM snowflake.account_usage.pipe_usage_history
WHERE start_time >= DATEADD(month, -1, CURRENT_TIMESTAMP())
GROUP BY date, pipe_name
ORDER BY credits DESC;
To get a broader view of your consumption patterns, you can also run this query to see the average daily credits used by Snowpipe over the last year, grouped by week:
WITH credits_by_day AS (
    SELECT TO_DATE(start_time) AS date,
           SUM(credits_used) AS credits_used
    FROM snowflake.account_usage.pipe_usage_history
    WHERE start_time >= DATEADD(year, -1, CURRENT_TIMESTAMP())
    GROUP BY 1
)
SELECT DATE_TRUNC('week', date) AS week,
       AVG(credits_used) AS avg_daily_credits
FROM credits_by_day
GROUP BY 1
ORDER BY 1;
This will help you identify any anomalies or unexpected changes in your daily Snowpipe usage so you can investigate further.
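If you want to go a step further and surface those anomalies automatically, here’s a minimal sketch of one way to do it. The 2x-the-trailing-28-day-average threshold is purely an illustrative assumption, so tune it to your own workload:

WITH credits_by_day AS (
    SELECT TO_DATE(start_time) AS date,
           SUM(credits_used) AS daily_credits
    FROM snowflake.account_usage.pipe_usage_history
    WHERE start_time >= DATEADD(month, -3, CURRENT_TIMESTAMP())
    GROUP BY 1
),
with_baseline AS (
    SELECT date,
           daily_credits,
           -- trailing 28-day average, excluding the current day
           AVG(daily_credits) OVER (
               ORDER BY date
               ROWS BETWEEN 28 PRECEDING AND 1 PRECEDING
           ) AS trailing_28d_avg
    FROM credits_by_day
)
SELECT date, daily_credits, trailing_28d_avg
FROM with_baseline
WHERE daily_credits > 2 * trailing_28d_avg
ORDER BY date;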
Optimization Tips for Snowpipe
When it comes to optimizing your Snowpipe usage, the devil is in the details.
By aligning your file characteristics and ingestion strategies with Snowpipe’s strengths, you can see significant improvements in the cost-effectiveness and efficiency of your data pipelines.
Aligning File Characteristics with Snowpipe’s Capabilities
- Keep file sizes in the 100-250 MiB range for the best price-performance with Snowpipe
- Aggregate smaller data files into larger batches so each load lands in that optimal size range
- Avoid files larger than 5 GiB, where a single erroneous record can cause an entire ingestion job to fail
- Use the most explicit file paths possible so COPY operations can list and load data without traversing the entire bucket, saving compute and network resources (see the sketch after this list)
- Consolidate events into fewer, larger files to make the most of Snowpipe’s per-file overhead charge of 0.06 credits per 1,000 files
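To make the file-path point concrete, here’s a minimal sketch of a COPY scoped to an explicit prefix rather than the root of the stage; the stage, table, and prefix names are hypothetical placeholders, and the same idea applies to the location you reference in a pipe’s COPY statement:

COPY INTO orders_raw
FROM @orders_stage/orders/2024/06/15/  -- explicit prefix: only this day's files get listed
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP);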
Our Snowpipe Optimization Journey
About three years ago, we started using Snowflake to power a critical business application that required near real-time data freshness.
We initially leveraged Snowpipe to ingest data from Kafka into Snowflake, but soon discovered that this approach was costing us $4,000 per month – an unsustainable figure.
After some analysis, we identified key optimization strategies that allowed us to dramatically reduce our Snowpipe costs.
By adjusting our file sizes, batching events, and tuning our refresh rates, we were able to bring our Snowpipe costs down to just $3,000 per year.
You can read the full details of our Snowpipe optimization journey in this blog post.
Choosing Snowpipe vs. COPY INTO
In most cases, we recommend using Snowpipe over COPY INTO for its simplicity and convenience.
Snowpipe provides a serverless experience, with Snowflake automatically managing and scaling the compute resources, which reduces the overhead compared to the more hands-on approach required by COPY INTO.
That said, COPY INTO offers advantages like file-level transaction granularity and the ability to apply transformations during ingestion.
So if you require that level of control, COPY INTO may be the better choice.
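For reference, here’s a minimal sketch of what a transformation during ingestion looks like with COPY INTO; the table, stage, and column layout are hypothetical, and the query simply reorders and casts the staged columns:

COPY INTO events_clean (event_id, event_time, source)
FROM (
    SELECT t.$1::NUMBER,
           t.$3::TIMESTAMP_NTZ,  -- the timestamp is the third column in the staged files
           t.$2
    FROM @events_stage/clean/ t
)
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);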
Overall, Snowpipe is the preferred option for most use cases thanks to its ease of use and automatic scaling, but evaluate your specific requirements to determine the best approach for your workload.
Refining File Structure for Enhanced Snowflake Ingestion Efficiency
The structure of your data files can have a significant impact on ingestion speeds into Snowflake, often more so than the size of the files themselves.
Optimizing file configuration can markedly improve both the performance and the efficiency of your data loading processes.
Key strategies include:
- Flattening nested data formats like JSON or XML to simplify ingestion (see the sketch after this list)
- Limiting the number of columns in your data files to reduce metadata processing
- Ensuring consistent data types across all files to avoid unnecessary conversions during ingestion
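As a sketch of the flattening idea (the stage, table, and field names are hypothetical), nested JSON fields can be pulled up into flat, typed columns at load time instead of landing the raw documents:

COPY INTO user_events_flat (user_id, event_type, occurred_at)
FROM (
    SELECT $1:user.id::NUMBER,
           $1:event.type::STRING,
           $1:event.ts::TIMESTAMP_NTZ
    FROM @json_events_stage/events/
)
FILE_FORMAT = (TYPE = JSON);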
By strategically adjusting the file structure, you can streamline your data workflows in Snowflake, making them not only faster but also more cost-effective in the long run.
Leveraging Compression for Optimal Snowflake Ingestion
For Snowflake users, embracing file compression before ingestion isn’t just a good practice—it’s essential for maximizing efficiency and reducing costs.
Snowflake supports a wide range of compression formats, including GZIP, BZ2, BROTLI, ZSTD, SNAPPY, DEFLATE, and RAW_DEFLATE.
The main benefit of compression is that it significantly reduces the volume of data that must be transferred over the network, addressing a common bottleneck in data ingestion.
By minimizing data size, you not only speed up the transfer but also reduce the cost associated with data movement, especially when dealing across cloud regions or providers.
Based on experience and testing, GZIP-compressed CSV files (.csv.gz) strike the best balance, providing a high compression ratio while being widely used and supported by Snowflake.
Whenever possible, stick with the compression scheme provided by your source systems to avoid unnecessary processing overhead. Only consider re-encoding data for cross-region transfers where the benefits of reduced data transfer costs outweigh the computational expense.
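To tie this together, here’s a minimal sketch of declaring the compression explicitly so Snowflake doesn’t have to rely on auto-detection; the file format, stage, and table names are hypothetical:

CREATE OR REPLACE FILE FORMAT csv_gzip_format
    TYPE = CSV
    COMPRESSION = GZIP  -- the staged files are already gzip-compressed .csv.gz
    SKIP_HEADER = 1;

COPY INTO raw_events
FROM @raw_events_stage/events/
FILE_FORMAT = (FORMAT_NAME = 'csv_gzip_format')
PATTERN = '.*[.]csv[.]gz';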
I hope these tips help you optimize your Snowflake data ingestion and reduce your costs. Stay tuned for more in-depth content on Snowflake best practices in the weeks to come!