📖 The Data Warehouse Toolkit 3rd Edition

Tags:: 📚Books , Data Modeling
Author:: Ralph Kimball
Liked:: 6
Link:: The Data Warehouse Toolkit, 3rd Edition - Kimball Group
Source date:: 2013-07-21
Finished date:: 2019-02-01
Cover::

1. Data Warehousing, Business Intelligence, and Dimensional Modeling Primer

Data warehousing / BI

These recurring themes have existed for more than three decades:

“We collect tons of data, but we can’t access it.”

“We need to slice and dice the data every which way.”

“Business people need to get at the data easily.”

“Just show me what is important.”

“We spend entire meetings arguing about who has the right numbers rather than making decisions.”

“We want people to use information to support more fact-based decision making.” (p. 3)

It includes the one that I though came from Benn Stancil in 🗞️ The Modern Data Experience, which I have used several times as Data team vision and mission:

Our goal should be to spend more time debating what to do because of a number on a dashboard than we spend verifying if that number’s right

Link to original

We must adapt to change without causing disruptions, which is something hard to accept for DE teams:

The DW/BI system must adapt to change. User needs, business conditions, data, and technology are all subject to change. The DW/BI system must be designed to handle this inevitable change gracefully so that it doesn’t invalidate existing data or applications. Existing data and applications should not be changed or disrupted when the business community asks new questions or new data is added to the warehouse. (p. 4)

Kimball Dimensional modeling

(Nowadays we don’t really need dimensional modeling for query performance. See also: OBT - One Big Table).

Fact tables

The fact table in a dimensional model stores the performance measurements resulting from an organization’s business process events. (p. 10)

The most useful facts are numeric and additive, such as dollar sales amount. (…) You will see that facts are sometimes semi-additive or even non-additive. Semi-additive facts, such as account balances, cannot be summed across the time dimension. Non-additive facts, such as unit prices, can never be added. (p. 11)

Notice how “fact” denotes a column (a measurement), not a row (which could de different things. Normally, a transaction).

Dimension tables

The dimension tables contain the textual context associated with a business process measurement event. They describe the “who, what, where, when, how, and why” associated with the event. (p. 13)

Dimension attributes serve as the primary source of query constraints, group- ings, and report labels. In a query or report request, attributes are identified as the by words. (p. 13)

Because they are the source of virtually all constraints and report labels, dimension attributes are critical to making the DW/BI system usable and understandable. Attributes should consist of real words rather than cryptic abbreviations. You should strive to mini-mize the use of codes in dimension tables by replacing them with more verbose textual attributes (p. 14)

When triaging operational source data, it is sometimes unclear whether a numeric data element is a fact or dimension attribute. You often make the decision by asking whether the column is a measurement that takes on lots of values and participates in calculations (making it a fact) or is a discretely valued description that is more or less constant and participates in constraints and row labels (making it a dimensional attribute). For example, the standard cost for a product seems like a constant attribute of the product but may be changed so often that you decide it is more like a measured fact. Occasionally, you can’t be certain of the classification; it is possible to model the data element either way (or both ways) as a matter of the designer’s prerogative. (p. 14)

Wow, mind blown:

Contrary to popular folklore, Ralph Kimball didn’t invent the terms fact and dimension. As best as can be determined, the dimension and fact terminology originated from a joint research project conducted by General Mills and Dartmouth University in the 1960s. In the 1970s, both AC Nielsen and IRI used the terms consistently to describe their syndicated data offerings and gravitated to dimensional models for simplifying the presentation of their analytic information. (p. 15)

Layered warehouse

In the marts layer, aggregates are not enough:

…the presentation area is that it must contain detailed, atomic data. Atomic data is required to withstand assaults from unpredictable ad hoc user queries. Although the presentation area also may contain performance-enhancing aggregated data, it is not sufficient to deliver these summaries without the underlying granular data in a dimensional form. (p. 21)

Because users’ requirements are unpredictable and constantly changing, you must provide access to the exquisite details so they can roll up to address the questions of the moment. (p. 22)

Dimensional models that deliver only summary data are bound to be problematic; users run into analytic brick walls when they try to drill down into details not available in the summary tables. Developers also run into brick walls because they can’t easily accommodate new dimensions, attributes, or facts with these prematurely summarized tables. (…) Remember, when you pre-suppose the business question, you’ll likely pre-summarize the data, which can be fatal in the long run. (p. 32)

Both dimensions and facts should be shared by the marts:

We strongly discourage the independent data mart approach. However, often these independent data marts have embraced dimensional modeling because they’re interested in delivering data that’s easy for the business to understand and highly responsive to queries. So our concepts of dimensional modeling are often applied in this architecture, despite the complete disregard for some of our core tenets, such as focusing on atomic details, building by business process instead of department, and leveraging conformed dimensions for enterprise consistency and integration. (p. 27)

Rather than drawing boundaries based on organizational departments, dimensional models should be organized around business processes, such as orders, invoices, and service calls. Multiple business functions often want to analyze the same metrics resulting from a single business process. (p. 31)

Kimball Dimensonal Modeling Techniques Overview

Four-Step Dimensional Design Process

Select the business process.
Declare the grain.
Identify the dimensions.
Identify the facts.

Note that a business process could be “snapshotting every account each month”: that has a lot to do with the different types of fact tables we can find.

Types of fact tables

(p. 119)

You may even need all three at the same time:

Each model tells a different story. For some analytic requirements, two or even all three models may be appropriate simultaneously. (p. 112)

Transaction tables

The most typical ones:

A row in a transaction fact table corresponds to a measurement event at a point in space and time. Atomic transaction grain fact tables are the most dimensional and expressive fact tables; this robust dimensionality enables the maximum slicing and dicing of transaction data. (p. 43)

Periodic Snapshot Fact Tables

Periodic snapshot tables are common for accounts balances, inventory status, subscriptions.

A row in a periodic snapshot fact table summarizes many measurement events occurring over a standard period, such as a day, a week, or a month. The grain is the period, not the individual transaction. (p. 43)

These periodic snapshot rows appear over time as a series of data layers in the dimensional model, much like geologic layers represent the accumulation of sediment over long periods of time. (p. 112)

…the periodic snapshot fact table often is the only place to easily retrieve a regular, predictable view of longitudinal performance trends. (p. 120)

In many businesses, however, transaction details are not easily summarized to present management performance metrics. As you saw in this inventory case study, crawling through the transactions would be extremely time-consuming, plus the logic required to interpret the effect of different kinds of transactions on inventory levels could be horrendously complicated, presuming you even have access to the required historical data. The periodic snapshot again comes to the rescue to provide management with a quick, flexible view of inventory levels. (p. 121)

Accumulated Snapshot Fact Tables

Finally, in the third model, we’ll describe the inventory accumulating snapshot where a fact table row is inserted for each product delivery and then the row is updated as the product moves through the warehouse. (p. 121)

Slowly Changing Dimensions

To this point, we have pretended dimensions are independent of time. Unfortunately, this is not the case in the real world. Although dimension table attributes are relatively static, they aren’t fixed forever; attribute values change, albeit rather slowly, over time. (p. 148)

When change tracking is needed, it might be tempting to put every changing attribute into the fact table on the assumption that dimension tables are static. This is unacceptable and unrealistic. Instead you need strategies to deal with slowly changing attributes within dimension tables. (p. 148)

There are multiple ways of dealing with them, where the most popular are type 2: (p. 151)

Dr. Mario's 2nd 🧠

Explorer