Data recovery

There is an additional 7-day Fail-safe period for data recovery after the Time Travel retention period ends (1 day by default, up to 90 days on Enterprise Edition). (Understanding & Viewing Fail-safe — Snowflake Documentation)
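A minimal sketch of recovery within the Time Travel window (the orders table name is hypothetical); data that has moved into Fail-safe, by contrast, can only be recovered by contacting Snowflake Support:

    -- Query the table as it was one hour ago using Time Travel.
    SELECT * FROM orders AT (OFFSET => -60*60);

    -- Restore a table that was dropped within the retention period.
    UNDROP TABLE orders;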

Cost

Cost is time-based, unlike BigQuery (where cost depends on the amount of data scanned). (Understanding Virtual Warehouse, Storage, and Cloud Services Usage — Snowflake Documentation)

Some cloud data platforms charge based on how many terabytes are scanned or how many terabytes are returned from a query, rather than based on the time that queries are running. While this is flexible, it’s problematic because users or applications can’t predict how much data would be scanned or returned by a given query, which results in highly variable and unpredictable pricing each month. (How Usage-Based Pricing Delivers a Budget-Friendly Data Cloud, snowflake.com)

There is auto-suspend and auto-resume of warehouses (not of individual queries). Note that each time a warehouse starts you are billed for a minimum of one whole minute, and per second after that. The cache is dropped when the warehouse is suspended. (Warehouse Considerations — Snowflake Documentation)
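As a sketch, assuming a hypothetical warehouse named demo_wh, auto-suspend and auto-resume are configured on the warehouse itself:

    -- Suspend after 60 seconds of inactivity (which drops the local cache),
    -- and resume automatically when a new query arrives.
    CREATE WAREHOUSE IF NOT EXISTS demo_wh
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 60
      AUTO_RESUME = TRUE
      INITIALLY_SUSPENDED = TRUE;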

On the other hand, this makes it hard to compute the cost of individual queries: it’s hard to know whether the warehouse was already running (and thus whether to add the one-minute minimum, or whether the query adds any cost at all, since it may run in parallel with other queries on the same warehouse). Snowflake provides the WAREHOUSE_METERING_HISTORY view to get the credits consumed by a given warehouse, but drilling down to per-query costs within a warehouse is difficult.
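For example, a sketch of pulling per-warehouse credits from the ACCOUNT_USAGE schema (the warehouse name DEMO_WH is hypothetical); note that this reports hourly credits per warehouse, not per query:

    -- Hourly credit consumption for one warehouse over the last 7 days.
    SELECT start_time,
           end_time,
           credits_used,
           credits_used_compute,
           credits_used_cloud_services
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE warehouse_name = 'DEMO_WH'
      AND start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
    ORDER BY start_time;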

Partitioning / clustering

(Micro-)partitioning is entirely handled by Snowflake. Explicit clustering keys should only be defined occasionally, typically for very large tables: Micro-partitions & Data Clustering — Snowflake Documentation and Clustering Keys & Clustered Tables — Snowflake Documentation.
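When an explicit clustering key is warranted, it looks roughly like this (the events table and its columns are hypothetical):

    -- Define a clustering key on a large table...
    ALTER TABLE events CLUSTER BY (event_date, customer_id);

    -- ...and check how well clustered it currently is.
    SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_date, customer_id)');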

Streams and Tasks

From https://docs.snowflake.com/en/user-guide/data-pipelines-intro.html#workflow:

Note that if we use this with CDC over an external database, the staging table would contain the CDC events of the external table. It is a bit odd that a stream over a table is itself a CDC mechanism on its source table, so in this configuration we are effectively doing CDC on top of CDC.

Note that a stream itself does not contain any table data. A stream only stores an offset for the source object and returns CDC records by leveraging the versioning history for the source object.

Querying a stream alone does not advance its offset, even within an explicit transaction; the stream contents must be consumed in a DML statement [e.g., an INSERT into another table].

The main cost associated with a stream is the processing time used by a virtual warehouse to query the stream. These charges appear on your bill as familiar Snowflake credits.
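A minimal sketch of the pattern described above (table and stream names are hypothetical):

    -- A stream on the staging table; it stores only an offset, not data.
    CREATE OR REPLACE STREAM raw_events_stream ON TABLE raw_events;

    -- SELECTing from the stream returns CDC records but does not advance the offset.
    SELECT * FROM raw_events_stream;

    -- Consuming the stream in a DML statement does advance the offset.
    INSERT INTO events_clean
      SELECT id, payload
      FROM raw_events_stream
      WHERE metadata$action = 'INSERT';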

Note that a task runs on a schedule: there is no live, continuously maintained materialized view of the stream.
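Continuing the hypothetical example above, the scheduled consumption would be a task along these lines (it only executes when the stream has data, and tasks are created suspended):

    CREATE OR REPLACE TASK load_events_task
      WAREHOUSE = demo_wh
      SCHEDULE = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('RAW_EVENTS_STREAM')
    AS
      INSERT INTO events_clean
        SELECT id, payload
        FROM raw_events_stream
        WHERE metadata$action = 'INSERT';

    -- Start the schedule.
    ALTER TASK load_events_task RESUME;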

Other

Lots of Query Plan Rewriting! https://twitter.com/teej_m/status/1546591465706438657?s=20&t=ojxPPojcodLvMX9lIeEe2w