starschemar: Obtaining Star Schemas from Flat Tables

Introduction

The multidimensional data model was defined in the 1990s with the aim of supporting data analysis. Data in multidimensional systems is obtained from operational systems and is transformed to fit the new structure.

Transformations can be carried out using professional ETL (extract, transform and load) tools. Recently, tools aimed at end users have emerged that are also designed to perform transformation operations. All these tools are very useful for carrying out the transformation process: they provide a development environment to define transformation operations in a general way.

Frequently, the operations to be performed aim to transform a flat table (with data that comes from operational systems) into a star schema (which implements a multidimensional system). The tools mentioned above can carry out this transformation, but it requires considerable work.

This is the main objective of the starschemar package: to define transformations that make it easy to obtain star schemas from flat tables. In addition, it includes basic data cleaning operations, incremental data refresh operations and query operations, adapted to this context.

The rest of this document is structured as follows: First, basic concepts of dimensional modelling and star schemas are summarized. Then an illustrative example shows how the package works. Finally, the document ends with conclusions and bibliography.

Dimensional modelling and star schemas

The content of this section is a summary based mainly on Adamson (2010) and Kimball and Ross (2013). It is intended to present the fundamental concepts of the area that are relevant to understand the functionality, naming and use of the starschemar package.

Dimensional modelling

Dimensional modelling aims to obtain simple data models. Simplicity is sought for two reasons: so that decision-makers can easily understand the data, and so that the data can be easily queried.

In dimensional modelling, a business process is analysed by modelling how it is measured. The measures are called facts, and the descriptors of the context of the facts are dimensions. Facts are numerical data, and decision-makers want to see them at various levels of detail, defined by dimensions.

Not all numerical data is a fact (although some tools treat it that way). In dimensional modelling the designer has to differentiate between facts and dimensions. Several criteria can be used to distinguish between them, for example:

  • If it can be defined at different levels of detail, then it is a fact.
  • If it is quantitative and takes continuous values, then it is a fact.
  • If it provides context, then it is a dimension.

Sometimes there are no measures associated with the business process: it is simply recorded that the combination of dimensions has occurred. This situation is often called factless facts; Jensen, Pedersen, and Thomsen (2010) prefer to call it measureless facts. In any case, even when no other measures are available, a measure can be defined that represents the number of times the combination of dimension values occurs.
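This count measure can be derived directly by aggregation. A minimal base R sketch, with made-up data:

```r
# Each row records that a combination of dimension values occurred
events <- data.frame(
  city = c("Boston", "Boston", "Bridgeport", "Boston"),
  week = c(1, 1, 1, 2)
)

# Derive a count measure: occurrences per combination of dimension values
counts <- aggregate(list(n = rep(1L, nrow(events))),
                    by = events[, c("city", "week")], FUN = sum)
counts
```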

Dimensions and dimension attributes

Attributes considered by the designer as dimensions can be grouped taking into account the natural affinities between them. In particular, they can be grouped according to the “who, what, where, when, how and why” associated with the modelled business process. Two attributes share a natural affinity when they are related in only one context. When their relationships are determined by transactions or activities, they can occur in multiple contexts; if this occurs, they must be located in different dimensions.

In this way, a dimension is made up of a set of naturally related dimension attributes that describe the context of facts. Dimensions are used for two purposes: fact selection and fact grouping with the desired level of detail.

Additionally, hierarchies with levels and descriptors can be defined in the dimensions. More details can be found in Jensen, Pedersen, and Thomsen (2010). These concepts are not used in the current version of the package.

Facts and measures

A fact has a granularity, which is determined by the attributes of the dimensions that are considered at each moment. Thus, a measure in a fact has two components: the numerical property of the fact and a formula, frequently the SUM aggregation function, that allows several values of the measure to be combined to obtain a new value of the same measure with a coarser granularity (Jensen, Pedersen, and Thomsen 2010).

According to their behaviour when obtaining a coarser granularity, three types of measures are distinguished: additive, semi-additive and non-additive. For additive measures, SUM is always a valid formula that maintains the meaning of the measure when the granularity changes. For semi-additive measures, using SUM when changing the level of detail in some of the dimensions is not meaningful because the meaning of the measure changes; this frequently occurs in dimensions that represent time, with measures representing inventory levels. For non-additive measures, values cannot be combined across any dimension using SUM because the result has a different meaning from the original measure (this generally occurs with ratios, percentages or unit amounts such as unit cost or unit price).

The most useful measures are additive. If we have non-additive measures, they can generally be redefined from other additive measures.
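The difference between additive and non-additive measures can be illustrated with a minimal base R sketch (the figures are made up):

```r
# Weekly figures for one city
pneumonia_deaths <- c(11, 11)    # additive measure
all_deaths       <- c(262, 270)  # additive measure
ratio <- pneumonia_deaths / all_deaths  # non-additive measure (a rate)

sum(pneumonia_deaths)  # valid: total deaths over both weeks
sum(ratio)             # meaningless: rates cannot be summed
sum(pneumonia_deaths) / sum(all_deaths)  # correct: recompute the rate
                                         # from its additive components
```

This is why a non-additive measure such as a rate is best stored, or at least recomputed, in terms of its additive components.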

Star schemas

Dimensional models implemented in RDBMS (Relational Database Management Systems) using a table for each dimension are called star schemas because of their resemblance to a star-like structure: A fact table in the centre and dimension tables around it. Thus, dimension attributes are columns of the respective dimension tables, and measures are columns of the fact table.

Other possible implementations on RDBMS normalize dimensions and are known as snowflake schemas. More details can be found in Jensen, Pedersen, and Thomsen (2010). This is not considered in this package.

Dimension tables

Dimension tables contain the context associated with business process measures. Although they can contain any type of data, numerical data is generally not used for dimension attributes because some query tools consider any numeric data as a measure.

Dimension attributes with NULL values are a source of problems when querying, since DBMSs and query tools sometimes handle them inconsistently: the result depends on the product. It is recommended to avoid NULL values and replace them with a descriptive text. In the case of dates, it is recommended to replace NULL values with an arbitrary date in the very far future.
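These recommendations can be sketched in base R (the replacement values are arbitrary choices):

```r
# Replace NA in a character attribute with a descriptive text
city <- c("Boston", NA, "Hartford")
city[is.na(city)] <- "Unknown"

# Replace NA in a date attribute with a date in the very far future
week_ending_date <- as.Date(c("1962-01-06", NA, "1962-01-13"))
week_ending_date[is.na(week_ending_date)] <- as.Date("9999-12-31")
```

The character_dimensions function, used later in this document, applies this kind of NA replacement to dimension attributes.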

Surrogate keys

A dimension table contains dimension attributes and also a surrogate key column. This column is a unique identifier with no intrinsic meaning: It is generally an integer and is the primary key of the dimension table. In Adamson (2010), surrogate keys are easily identifiable by the suffix “_key” in the column name (this criterion has also been applied in the starschemar package).

Dimension tables also contain natural key columns that uniquely identify associated entities in an operational system. The separation of surrogate keys and natural keys allows the star schema to store changes in dimensions. Therefore, the use of surrogate keys in dimensions is a solution to the SCD (slowly changing dimensions) problem. This problem is not specifically addressed in this version of the package.
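The idea behind a surrogate key can be sketched in base R: it is just a sequential integer assigned to each distinct combination of dimension attributes, independent of any natural key.

```r
flat <- data.frame(
  state = c("MA", "MA", "CT"),
  city  = c("Boston", "Boston", "Bridgeport")
)

# One row per distinct combination of dimension attributes
where_dim <- unique(flat)

# Surrogate key: a meaningless sequential integer, named with the
# "_key" suffix used throughout the package
where_dim <- data.frame(where_key = seq_len(nrow(where_dim)), where_dim)
```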

Special dimensions

In some cases, for the sake of simplicity, it is helpful to create a table that contains dimension attributes with no natural affinities to each other; generally these are low-cardinality flags and indicators. The result is what is known as a junk dimension. Junk dimensions do not require any special support, only the designer’s will to define them.

Sometimes a dimension attribute, usually a transaction identifier, is left in the fact table. It is considered the primary key of a dimension that has no associated table; for this reason it is known as a degenerate dimension. Degenerate dimensions are not allowed in this package.

A single dimension can be referenced multiple times in a fact table, with each reference linked to a different logical role. These separate dimension views, with unique attribute column names, are called role dimensions, and the common dimension is called a role-playing dimension.

Associated with multiple star schemas are conformed dimensions, which are presented in the section Conformed dimensions.

Fact table

At the centre of a star schema is the fact table. In addition to containing measures, the fact table includes foreign keys that refer to each of the surrogate keys in the dimension tables.

Primary key

A subset of the foreign keys, together with any degenerate dimensions, forms the primary key of the fact table.

In starschemar package, since degenerate dimensions are not allowed, the primary key is made up of a subset of foreign keys.

Grain

The subset of dimensions that forms the primary key defines the level of detail stored in the fact table, which is known as the fact table’s grain. In the design process, it is very important for the designer to clearly define the grain of the fact table (it is usually defined by listing the dimensions whose surrogate keys form its primary key): it is a way to ensure that all the facts are stored at the same level of detail.

At the finest grain, a row in the fact table corresponds to the measures of an event and vice versa; the grain is not influenced by the possible reports that may be obtained. When two facts have different grains, they should be stored in different fact tables.

Multiple fact tables

The need for several fact tables frequently arises, for various reasons:

  • We find measures with different grain.

  • There are measures that do not occur simultaneously: for example, when one occurs, we have no value for the others, and vice versa.

In reality these are different business processes: each one has to have its own fact table, but they have dimensions in common. This is known as a fact constellation, which corresponds to the Kimball enterprise data warehouse bus architecture.

Conformed dimensions

When star schemas share a set of common dimensions, these dimensions are called conformed dimensions.

There are several ways to obtain conformed dimensions; the most obvious is for the dimension tables to share structure and content, that is, to be identical dimensions. This is the form considered in this version of the starschemar package.

Additional operations

Cleaning and conforming data

When data is loaded into a star schema, errors or inconsistencies may be discovered. In some cases, it is best to make corrections at the source of the data, in the operational systems. Sometimes this is not possible and there is no other option but to modify the data before loading it into the star schema, or even when it is already loaded.

Inconsistencies are often found when dimensions are integrated into one, for example to generate a role-playing dimension or conformed dimensions. Support for modifying dimension data is provided in the package.

Dimension enrichment

The more attributes the dimensions have, the more possibilities they offer to filter or aggregate data. For this reason, it is convenient to enrich the dimensions with new attributes, whenever possible. New attributes are often defined based on existing ones. There must be a way to associate the new attributes with the dimensions.

Operations have been defined in the package to export dimension attributes so that their values are not repeated, and also to import them, once the new attributes have been added.
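The export/enrich/import cycle can be sketched in base R (the age_group attribute and its values are hypothetical):

```r
who <- data.frame(who_key = 1:2,
                  age_range = c("<1 year", "65+ years"))

# Export: distinct values of the attribute to be enriched
export <- unique(who["age_range"])

# ... the new attribute is filled in externally, e.g. in a spreadsheet ...
export$age_group <- c("child", "senior")

# Import: associate the new attribute with the dimension again
who <- merge(who, export, by = "age_range")
```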

Incremental refresh

When a star schema is built, an initial load is performed with all available data from a moment in time onwards.

Operational systems continue to operate and produce data. If we want to incorporate these data into the star schema, we have two possibilities:

  • Carry out a new load with all the data available at that time.

  • Perform an incremental refresh with only the new data.

To carry out this second option, a CDC (change data capture) system makes it possible to obtain only the newly produced data.

Sometimes it is also convenient to delete data that is considered to be no longer necessary, generally data from the more distant past.

In this package, it is assumed that we can obtain the new data, possibly mixed with updates to data already incorporated into the star schema, in order to carry out an incremental refresh of star schemas with it. Operations are also offered to select data and delete it when it is no longer considered necessary.
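The core of such a refresh is an upsert: incoming rows whose key already exists replace the stored rows, and the rest are appended. A base R sketch of the idea (keys and values are made up):

```r
facts    <- data.frame(when_key = c(1, 2), deaths = c(262, 270))
incoming <- data.frame(when_key = c(2, 3), deaths = c(275, 237))

# Drop stored rows that the incoming data replaces, then append
kept  <- facts[!facts$when_key %in% incoming$when_key, ]
facts <- rbind(kept, incoming)
```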

An illustrative example

Starting data sets

To illustrate how the package works, we will use a small part of the Deaths in 122 U.S. cities - 1962-2016. 122 Cities Mortality Reporting System data set1. Specifically, only the data for the first 11 weeks of 1962 are considered.

Year WEEK Week Ending Date REGION State City Pneumonia and Influenza Deaths All Deaths <1 year (all cause deaths) 1-24 years (all cause deaths) 25-44 years 45-64 years (all cause deaths) 65+ years (all cause deaths)
1962 1 1962-01-06 1 MA Boston 11 262 10 8 11 87 146
1962 2 1962-01-13 1 MA Boston 11 270 14 8 11 70 167
1962 3 1962-01-20 1 MA Boston 5 237 11 8 10 66 142
1962 4 1962-01-27 1 MA Boston 12 285 22 7 8 73 175
1962 5 1962-02-03 1 MA Boston 5 245 15 9 17 62 142
1962 6 1962-02-10 1 MA Boston 11 284 16 8 16 79 165
1962 7 1962-02-17 1 MA Boston 8 227 11 7 13 71 125
1962 8 1962-02-24 1 MA Boston 10 270 14 9 10 85 152
1962 9 1962-03-03 1 MA Boston 16 278 15 11 15 79 158
1962 10 1962-03-10 1 MA Boston 11 297 11 7 18 80 181
1962 11 1962-03-17 1 MA Boston 16 290 16 7 21 74 172
1962 1 1962-01-06 1 CT Bridgeport 3 46 3 0 0 16 27

In the table above, the first rows of the original data are shown. For each week and city, mortality figures by age group and by cause, considered separately, are included (i.e., the combination of age group and cause is not included). As for the cause, a distinction is only made between pneumonia or influenza and all others.

It can be seen that there is only one measure, deaths, defined at two different granularities: week-city-cause (data in columns “Pneumonia and Influenza Deaths” and “All Deaths”) and week-city-age bracket (data from the 5 columns on the right). This means that we will need two fact tables to include all the data in star schemas.

Since the data is partially in the form of a pivot table, two tables have been generated from it2, one for each granularity. Additionally, the following operations have been carried out:

  • To show the incremental refresh, the data has been selected: On the one hand, the first 9 weeks were considered, and on the other, each of the following weeks separately.

  • To show the possibilities of data modification operations to correct errors, some data has been modified to generate missing data or errors.

Year WEEK Week Ending Date REGION State City Age Range Deaths
1962 1 1962-01-06 1 MA Boston <1 year 10
1962 1 1962-01-06 1 MA Boston 1-24 years 8
1962 1 1962-01-06 1 MA Boston 25-44 years 11
1962 1 1962-01-06 1 MA Boston 45-64 years 87
1962 1 1962-01-06 1 MA Boston 65+ years 146
1962 2 1962-01-13 1 MA Boston <1 year 14

In the table above, the first rows of the flat table that contains the data according to the age bracket are shown.

The following table shows the first 18 rows of the flat table containing data based on cause of death. The calculated column Other Deaths has been added. Here you can see some missing data (columns Year and WEEK), errors (value “Bridgepor” in column City) and how there are only data from the first 9 weeks.

Year WEEK Week Ending Date REGION State City Pneumonia and Influenza Deaths All Deaths Other Deaths
1962 1 1962-01-06 1 MA Boston 11 262 251
1962 2 1962-01-13 1 MA Boston 11 270 259
1962 3 1962-01-20 1 MA Boston 5 237 232
1962 4 1962-01-27 1 MA Boston 12 285 273
1962 5 1962-02-03 1 MA Boston 5 245 240
NA NA 1962-02-10 1 MA Boston 11 284 273
1962 7 1962-02-17 1 MA Boston 8 227 219
1962 8 1962-02-24 1 MA Boston 10 270 260
1962 9 1962-03-03 1 MA Boston 16 278 262
1962 1 1962-01-06 1 CT Bridgeport 3 46 43
1962 2 1962-01-13 1 CT Bridgeport 2 43 41
1962 3 1962-01-20 1 CT Bridgeport 2 40 38
1962 4 1962-01-27 1 CT Bridgeport 4 46 42
1962 5 1962-02-03 1 CT Bridgeport 5 46 41
1962 6 1962-02-10 1 CT Bridgeport 3 42 39
1962 7 1962-02-17 1 CT Bridgeport 1 35 34
1962 8 1962-02-24 1 CT Bridgeport 2 45 43
1962 9 1962-03-03 1 CT Bridgepor 3 46 43

To provide more layout possibilities, new date-related columns have been generated. The new date columns added to both tables are shown below.

Reception Year Reception Week Reception Date Data Availability Year Data Availability Week Data Availability Date Year WEEK Week Ending Date REGION State City Age Range Deaths
1962 2 1962-01-10 1962 3 1962-01-19 1962 1 1962-01-06 1 MA Boston <1 year 10
1962 2 1962-01-10 1962 3 1962-01-19 1962 1 1962-01-06 1 MA Boston 1-24 years 8
1962 2 1962-01-10 1962 3 1962-01-19 1962 1 1962-01-06 1 MA Boston 25-44 years 11
1962 2 1962-01-10 1962 3 1962-01-19 1962 1 1962-01-06 1 MA Boston 45-64 years 87
1962 2 1962-01-10 1962 3 1962-01-19 1962 1 1962-01-06 1 MA Boston 65+ years 146
1962 3 1962-01-16 1962 3 1962-01-18 1962 2 1962-01-13 1 MA Boston <1 year 14

Reception Year Reception Week Reception Date Data Availability Year Data Availability Week Data Availability Date Year WEEK Week Ending Date REGION State City Pneumonia and Influenza Deaths All Deaths Other Deaths
1962 2 1962-01-10 1962 3 1962-01-19 1962 1 1962-01-06 1 MA Boston 11 262 251
1962 3 1962-01-16 1962 3 1962-01-18 1962 2 1962-01-13 1 MA Boston 11 270 259
1962 4 1962-01-23 1962 4 1962-01-25 1962 3 1962-01-20 1 MA Boston 5 237 232
1962 4 1962-01-27 1962 6 1962-02-07 1962 4 1962-01-27 1 MA Boston 12 285 273
1962 6 1962-02-08 1962 7 1962-02-14 1962 5 1962-02-03 1 MA Boston 5 245 240
1962 6 1962-02-10 1962 7 1962-02-12 NA NA 1962-02-10 1 MA Boston 11 284 273

These are the flat tables that will be considered as a starting point to obtain star schemas from them in this example. They are available in the package: mrs_age and mrs_cause respectively.

Dimensional modelling

For each flat table, the goal is to define the attributes that correspond to facts and those that are dimensions. For facts, measures and their aggregation functions have to be defined. For dimensions, attributes with natural affinity must be grouped. Each attribute can only appear once in the definition.

Dimensional modelling data according to age range

To avoid having to type the attribute names of the table, the following function returns them in the form of a string. Thus, we can copy and paste each name as needed.

dput(colnames(mrs_age))
#> c("Reception Year", "Reception Week", "Reception Date", "Data Availability Year", 
#> "Data Availability Week", "Data Availability Date", "Year", "WEEK", 
#> "Week Ending Date", "REGION", "State", "City", "Age Range", "Deaths"
#> )

The definition of the dimensional model for the data considered is shown below.

library(starschemar)

dm_mrs_age <- dimensional_model() |>
  define_fact(
    name = "mrs_age",
    measures = c(
      "Deaths"
    ),
    agg_functions = c(
      "SUM"
    ),
    nrow_agg = "nrow_agg"
  ) |>
  define_dimension(
    name = "when",
    attributes = c(
      "Week Ending Date",
      "WEEK",
      "Year"
    )
  ) |>
  define_dimension(
    name = "when_available",
    attributes = c(
      "Data Availability Date",
      "Data Availability Week",
      "Data Availability Year"
    )
  ) |>
  define_dimension(
    name = "where",
    attributes = c(
      "REGION",
      "State",
      "City"
    )
  ) |>
  define_dimension(
    name = "who",
    attributes = c(
      "Age Range"
    )
  )

In this case, all the elements have been explicitly defined, including aggregation functions and the name of an additional measure representing the number of rows aggregated, which is always included. Only data from two of the three possible time-related dimensions have been considered.

Dimensional modelling data according to cause

In the case of data according to cause of death, the definition of the model is shown below.

dm_mrs_cause <- dimensional_model() |>
  define_fact(
    name = "mrs_cause",
    measures = c(
      "Pneumonia and Influenza Deaths",
      "Other Deaths"
    )
  ) |>
  define_dimension(
    name = "when",
    attributes = c(
      "Week Ending Date",
      "WEEK",
      "Year"
    )
  ) |>
  define_dimension(
    name = "when_received",
    attributes = c(
      "Reception Date",
      "Reception Week",
      "Reception Year"
    )
  ) |>
  define_dimension(
    name = "when_available",
    attributes = c(
      "Data Availability Date",
      "Data Availability Week",
      "Data Availability Year"
    )
  ) |>
  define_dimension(
    name = "where",
    attributes = c(
      "REGION",
      "State",
      "City"
    )
  )

If no aggregation function is indicated, SUM is considered by default. Although not explicitly stated, the measure for the number of rows aggregated is also included by default. In this case, the three dimensions related to dates have been defined.

Star schema definition and transformation: Constellations

To define a star schema, we need a flat table and a dimensional model defined from it. Once defined, we can apply format modification operations to it.

Star schema definition for data according to age range

The basic definition operation of a star schema is shown below3.

st_mrs_age <- star_schema(mrs_age, dm_mrs_age)

The first rows of the obtained dimension and fact tables are shown below.

when_key Week Ending Date WEEK Year
1 1962-01-06 1 1962
2 1962-01-13 2 1962
3 1962-01-20 3 1962
4 1962-01-27 4 1962
5 1962-02-03 5 1962
6 1962-02-10 6 1962
when_available_key Data Availability Date Data Availability Week Data Availability Year
1 1962-01-06 1 1962
2 1962-01-07 1 1962
3 1962-01-08 2 1962
4 1962-01-09 2 1962
5 1962-01-10 2 1962
6 1962-01-11 2 1962
where_key REGION State City
1 1 CT Bridgepor
2 1 CT Bridgeport
3 1 CT Hartford
4 1 CT New Haven
5 1 CT Waterbury
6 1 MA Boston
who_key Age Range
1 1-24 years
2 25-44 years
3 45-64 years
4 65+ years
5 <1 year
when_key when_available_key where_key who_key Deaths nrow_agg
1 1 7 2 3 1
1 1 7 3 13 1
1 1 7 4 20 1
1 1 7 5 2 1
1 1 21 1 2 1
1 1 21 2 3 1

The data from the original flat table has been structured in the form of dimension tables and a fact table. Data in the columns of the original table included in the dimensions is not repeated. A surrogate key has been added to each of the dimension tables; these surrogate keys appear as foreign keys in the fact table.

Next, we will apply format modification operations to the original structure obtained.

st_mrs_age <- st_mrs_age |>
  role_playing_dimension(
    dim_names = c("when", "when_available"),
    name = "When Common",
    attributes = c("date", "week", "year")
  ) |>
  snake_case() |>
  character_dimensions(NA_replacement_value = "Unknown",
                       length_integers = list(week = 2))

First, a role-playing dimension has been defined based on the dimensions related to dates. Then, to work with databases, the names have been adapted to the snake case convention. Finally, the data types of the dimension attributes have been transformed so that all columns except the date columns are of the character data type; for numerical data, the length of the field can be indicated so that it is padded with leading zeros, and undefined values have been replaced by the indicated value.

The first rows of the new dimension and fact tables are shown below.

when_key week_ending_date week year
when_available_key data_availability_date data_availability_week data_availability_year
where_key region state city
1 1 CT Bridgepor
2 1 CT Bridgeport
3 1 CT Hartford
4 1 CT New Haven
5 1 CT Waterbury
6 1 MA Boston
who_key age_range
1 1-24 years
2 25-44 years
3 45-64 years
4 65+ years
5 <1 year
when_common_key date week year
1 1962-01-06 01 1962
2 1962-01-07 01 1962
3 1962-01-08 02 1962
4 1962-01-09 02 1962
5 1962-01-10 02 1962
6 1962-01-11 02 1962
when_available_key when_key where_key who_key deaths nrow_agg
1 1 7 2 3 1
1 1 7 3 13 1
1 1 7 4 20 1
1 1 7 5 2 1
1 1 21 1 2 1
1 1 21 2 3 1

In the result, it can be seen that the dimensions related to dates are now role dimensions and do not have their own data; a role-playing dimension has been generated with the integrated data. The fact table continues to refer to the role dimensions, and the values of its foreign keys have been adapted to the new values of the surrogate keys. Additionally, it can be seen that the week field now has length 2, padded with a leading zero (this is useful to sort numbers stored in text format).
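The effect of the leading zeros on sorting can be checked directly:

```r
sort(c("10", "2", "1"))    # text sort misorders plain numbers
sort(c("10", "02", "01"))  # leading zeros restore the numeric order
```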

Star schema definition for data according to cause

We are going to define the star schema and apply similar transformations to the other flat table. The transformations can be applied in any order.

st_mrs_cause <- star_schema(mrs_cause, dm_mrs_cause) |>
  snake_case() |>
  character_dimensions(
    NA_replacement_value = "Unknown",
    length_integers = list(
      week = 2,
      data_availability_week = 2,
      reception_week = 2
    )
  ) |>
  role_playing_dimension(
    dim_names = c("when", "when_received", "when_available"),
    name = "when_common",
    attributes = c("date", "week", "year")
  )

In this case, since the role-playing dimension definition is the last transformation applied, the format of the week column had to be defined in all three date dimensions to obtain an equivalent result. The result obtained is shown below.

when_key week_ending_date week year
when_received_key reception_date reception_week reception_year
when_available_key data_availability_date data_availability_week data_availability_year
where_key region state city
1 1 CT Bridgepor
2 1 CT Bridgeport
3 1 CT Hartford
4 1 CT New Haven
5 1 CT Waterbury
6 1 MA Boston
when_common_key date week year
1 1962-01-06 01 1962
2 1962-01-07 01 1962
3 1962-01-08 02 1962
4 1962-01-09 02 1962
5 1962-01-10 02 1962
6 1962-01-11 02 1962
when_available_key when_received_key when_key where_key pneumonia_and_influenza_deaths other_deaths nrow_agg
1 1 1 7 1 37 1
1 1 1 21 5 32 1
1 1 1 28 2 24 1
1 1 1 50 0 35 1
1 1 1 61 12 207 1
1 1 1 66 2 81 1

In this case we have three role dimensions defined on a role playing dimension.

Star schema transformation, cleaning and conforming data

Star schemas are defined from flat table fields. In some cases it may be interesting to rename elements of the schema, especially attributes of dimensions and measures. On the other hand, dimensions can be enriched by adding additional attributes, generally derived from the rest of the attributes. These are the transformations considered in this section.

We can perform data cleaning and conforming operations on star schema dimensions. Updates defined in a star schema can be applied on another with common dimensions.

Star schema rename

If necessary, the elements of a star schema can be renamed. These functions are especially useful for renaming dimension attributes and fact measures, because the definition operations only allowed us to select columns of a flat table, not to modify their names.

We are going to modify the attribute names of the when dimension for the two design cases. The values of the attribute called region actually correspond to the concept of division in the organization of the US territory; we are going to rename it accordingly in both design cases.

Data according to age:

st_mrs_age <-
  st_mrs_age |> rename_dimension_attributes(
    name = "when",
    attributes = c("week_ending_date", "week", "year"),
    new_names = c(
      "when_happened_date",
      "when_happened_week",
      "when_happened_year"
    )
  ) |>
  rename_dimension_attributes(
    name = "where",
    attributes = c("region"),
    new_names = c("division")
  )

And data according to cause:

st_mrs_cause <-
  st_mrs_cause |> rename_dimension_attributes(
    name = "when",
    attributes = c("week_ending_date", "week", "year"),
    new_names = c(
      "when_happened_date",
      "when_happened_week",
      "when_happened_year"
    )
  ) |>
  rename_dimension_attributes(
    name = "where",
    attributes = c("region"),
    new_names = c("division")
  )

We are also going to modify the name of the measure for data according to age:

st_mrs_age <-
  st_mrs_age |> rename_measures(measures = c("deaths"),
                                 new_names = c("n_deaths"))

Additionally, functions are available to modify the names of dimensions and facts. In this case it is not necessary to modify them because they do not depend on the names of the source data and we defined them as we wanted.

Definition of updates

Using the following code, first we get the names of the dimensions of a star schema; then we retrieve each dimension by its name (we can view the rows using the utils::View function). If there are several role dimensions, it is enough to consult one of them: updates defined on it will be propagated to the rest.

dim_names <- st_mrs_age |>
    get_dimension_names()

where <- st_mrs_age |>
  get_dimension("where")

# View(where)
# where[where$where_key %in% c(1, 2, 62), ]

when <- st_mrs_age |>
  get_dimension("when")

# View(when)
# when[when$when_key %in% c(36, 37, 73), ]

who <- st_mrs_age |>
  get_dimension("who")

# View(who)

Reviewing the dimensions we can detect the need for updates. Errors are also sometimes detected by comparing the values with equivalent values obtained from other data sources.

Updates in the where dimension

In the where dimension we find the wrong value “Bridgepor”, generated by ourselves; the correct value is “Bridgeport”, also included in the dimension. Additionally, when trying to find additional data associated with cities, we found another error for the city of “Wilimington”, whose correct name is “Wilmington”. The instances mentioned are shown below.

where_key division state city
1 1 CT Bridgepor
2 1 CT Bridgeport
62 5 DE Wilimington

Updates can be defined in several ways. One of them is by referencing the surrogate key. Although some of the functions refer to the dimension surrogate key, updates only consider the values of the rest of the columns of the corresponding dimension. In this way, updates can be applied to equivalent dimensions of other star schemas, where the values of the surrogate key do not necessarily coincide with those of the dimension where the updates were originally defined.

Next, we define the update that indicates that the values of record “1” (“Bridgepor”) of the where dimension must match those of record “2” (“Bridgeport”).

updates_st_mrs_age <- record_update_set() |>
  match_records(dimension = where,
                old = 1,
                new = 2) 

For the case of the other city, we define an update that replaces the value in the indicated fields for the records that meet the defined condition, as shown below.

updates_st_mrs_age <- updates_st_mrs_age |>
  update_selection_general(
    dimension = where,
    columns_old = c("state", "city"),
    old_values = c("DE", "Wilimington"),
    columns_new = c("city"),
    new_values = c("Wilmington")
  )

Updates in the when dimension

Below is a selection of rows of the when dimension involved in update operations; they are referred to by their surrogate key.

when_key when_happened_date when_happened_week when_happened_year
36 1962-02-10 06 1962
37 1962-02-10 Unknown Unknown
73 9999-12-31 07 1962

The corresponding updates are shown below.

updates_st_mrs_age <- updates_st_mrs_age |>
  match_records(dimension = when,
                old = 37,
                new = 36) |>
  update_record(
    dimension = when,
    old = 73,
    values = c("1962-02-17", "07", "1962")
  )

The values of record “37” must match those of record “36” (both records will be unified). The values in record “73” will be replaced by those provided.

Updates in the who dimension

In the case of the who dimension, we want to include a code in each value of the age_range attribute so that the values sort in increasing order of age.

updates_st_mrs_age <- updates_st_mrs_age |>
  update_selection(
    dimension = who,
    columns = c("age_range"),
    old_values = c("<1 year"),
    new_values = c("1: <1 year")
  ) |>
  update_selection(
    dimension = who,
    columns = c("age_range"),
    old_values = c("1-24 years"),
    new_values = c("2: 1-24 years")
  ) |>
  update_selection(
    dimension = who,
    columns = c("age_range"),
    old_values = c("25-44 years"),
    new_values = c("3: 25-44 years")
  ) |>
  update_selection(
    dimension = who,
    columns = c("age_range"),
    old_values = c("45-64 years"),
    new_values = c("4: 45-64 years")
  ) |>
  update_selection(
    dimension = who,
    columns = c("age_range"),
    old_values = c("65+ years"),
    new_values = c("5: 65+ years")
  )

In each case, we substitute the old value for the new one in the indicated attribute.

Updates application

Once the updates are defined, they can be applied to the star schema for which they were defined, as shown below.

st_mrs_age <- st_mrs_age |>
  modify_dimension_records(updates_st_mrs_age)

The result obtained for the first star schema is shown below.

when_key when_happened_date when_happened_week when_happened_year
when_available_key data_availability_date data_availability_week data_availability_year
where_key division state city
1 1 CT Bridgeport
2 1 CT Hartford
3 1 CT New Haven
4 1 CT Waterbury
5 1 MA Boston
6 1 MA Cambridge
who_key age_range
1 1: <1 year
2 2: 1-24 years
3 3: 25-44 years
4 4: 45-64 years
5 5: 65+ years
when_common_key date week year
1 1962-01-06 01 1962
2 1962-01-07 01 1962
3 1962-01-08 02 1962
4 1962-01-09 02 1962
5 1962-01-10 02 1962
6 1962-01-11 02 1962
who_key where_key when_available_key when_key n_deaths nrow_agg
1 1 10 1 3 1
1 1 12 8 1 1
1 1 18 15 5 1
1 1 23 22 3 1
1 1 31 29 6 1
1 1 49 36 3 1

It can be seen that the row with the value “Bridgepor” in the city column has disappeared: it has been merged with the row containing the correct value. This update has also been propagated to the fact table. Although it cannot be seen in the tables, the same has happened with the dates unified by the update.

The same updates can also be applied to other star schemas with dimensions in common with the original star schema, as shown below.

st_mrs_cause <- st_mrs_cause |>
  modify_dimension_records(updates_st_mrs_age)

Updates defined on dimensions not included in the star schema are ignored; in this case, those referring to the who dimension. Likewise, updates whose old values match no records are also ignored.

Dimension enrichment

From the original attributes included in the dimensions, we can obtain new attributes that facilitate queries or offer new query possibilities.

Enrich the who dimension

Suppose that we are interested in defining some broader age ranges than the existing ones. This operation can be done by enriching the corresponding dimension.

First, we export the attributes to consider in table form, in this case only the age range.

tb_who <-
  enrich_dimension_export(st_mrs_age,
                          name = "who",
                          attributes = c("age_range"))

Next, we can see the result of the export operation. It is a table with the selected attributes in which duplicate values, if any, have been eliminated (in this case there are no repeated values).

age_range
1: <1 year
2: 1-24 years
3: 25-44 years
4: 45-64 years
5: 65+ years

To this table we add the columns we want; in this case, a new column defining the broader age range.

v <-
  c("0-24 years", "0-24 years", "25+ years", "25+ years", "25+ years")
tb_who <-
  tibble::add_column(tb_who,
                     wide_age_range = v)

The new table can be seen below.

age_range wide_age_range
1: <1 year 0-24 years
2: 1-24 years 0-24 years
3: 25-44 years 25+ years
4: 45-64 years 25+ years
5: 65+ years 25+ years

We enrich the dimension considering the new data in the table.

st_mrs_age <-
  st_mrs_age |>
  enrich_dimension_import(name = "who", tb_who)

We can see the result below, where the dimension has the new defined attribute.

who_key age_range wide_age_range
1 1: <1 year 0-24 years
2 2: 1-24 years 0-24 years
3 3: 25-44 years 25+ years
4 4: 45-64 years 25+ years
5 5: 65+ years 25+ years

Enrich the where dimension

For the where dimension we can proceed in the same way as for the who dimension: export the data, complete it manually and import it again, as shown below.

tb_where <-
  enrich_dimension_export(st_mrs_age,
                          name = "where",
                          attributes = c("division"))

The new table for division data can be seen below.

division
1
2
3
4
5
6
7
8
9

We look for the names of the divisions and add the data of the regions to which they belong.

tb_where <-
  tibble::add_column(
    tb_where,
    division_name = c(
      "New England",
      "Middle Atlantic",
      "East North Central",
      "West North Central",
      "South Atlantic",
      "East South Central",
      "West South Central",
      "Mountain",
      "Pacific"
    ),
    region = c('1',
               '1',
               '2',
               '2',
               '3',
               '3',
               '3',
               '4',
               '4'),
    region_name = c(
      "Northeast",
      "Northeast",
      "Midwest",
      "Midwest",
      "South",
      "South",
      "South",
      "West",
      "West"
    )
  )

st_mrs_age <-
  st_mrs_age |>
  enrich_dimension_import(name = "where", tb_where)

st_mrs_cause <-
  st_mrs_cause |>
  enrich_dimension_import(name = "where", tb_where)

To add the names of the states and the county to which each city belongs, we could proceed in the same way. However, it is easier to locate this data elsewhere and use it directly. These data are available in the ft_usa_states and ft_usa_city_county data sets, respectively.

However, if we proceed in the same way, an error occurs when importing the data: not all the rows in the dimension match the data in the imported table. We can determine the missing data using the following function.

tb_missing <-
  st_mrs_age |>
  enrich_dimension_import_test(name = "where", ft_usa_states)

The result obtained is shown below.

where_key division state city division_name region region_name
48 3 Unknown Unknown East North Central 2 Midwest
78 6 Unknown Unknown East South Central 3 South
91 7 Unknown Unknown West South Central 3 South
111 9 Unknown Unknown Pacific 4 West

In all cases, the problem occurs for the value “Unknown” in the state attribute. We must add a row to the data before importing it.

tb_where_state <- ft_usa_states |>
  tibble::add_row(state = "Unknown", state_name = "Unknown")

st_mrs_age <-
  st_mrs_age |>
  enrich_dimension_import(name = "where", tb_where_state)

st_mrs_cause <-
  st_mrs_cause |>
  enrich_dimension_import(name = "where", tb_where_state)

The same problem occurs and we apply the same solution to add the county data⁴.

tb_where_county <- ft_usa_city_county |>
  tibble::add_row(city = "Unknown",
                  state = "Unknown",
                  county = "Unknown")

st_mrs_age <-
  st_mrs_age |>
  enrich_dimension_import(name = "where", tb_where_county)

st_mrs_cause <-
  st_mrs_cause |>
  enrich_dimension_import(name = "where", tb_where_county)

We can see the first rows of the final result below.

where_key division state city division_name region region_name state_name county
1 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield
2 1 CT Hartford New England 1 Northeast Connecticut Hartford
3 1 CT New Haven New England 1 Northeast Connecticut New Haven
4 1 CT Waterbury New England 1 Northeast Connecticut New Haven
5 1 MA Boston New England 1 Northeast Massachusetts Suffolk
6 1 MA Cambridge New England 1 Northeast Massachusetts Middlesex
7 1 MA Fall River New England 1 Northeast Massachusetts Bristol
8 1 MA Lowell New England 1 Northeast Massachusetts Middlesex
9 1 MA Lynn New England 1 Northeast Massachusetts Essex
10 1 MA New Bedford New England 1 Northeast Massachusetts Bristol

Constellation definition

A constellation is defined from a list of star schemas, as shown below.

ct_mrs <- constellation(list(st_mrs_age, st_mrs_cause), name = "mrs")

All dimensions with the same name in the star schemas must be compatible in structure and column types; they are defined as conformed dimensions. The conformed dimensions share all the instances of the original dimensions. This implies possible modifications to the surrogate keys, which are propagated to the foreign keys of the component fact tables.

The tables of the obtained conformed dimensions are shown below.

when_key when_happened_date when_happened_week when_happened_year
1 1962-01-06 01 1962
2 1962-01-07 01 1962
3 1962-01-08 02 1962
4 1962-01-09 02 1962
5 1962-01-10 02 1962
6 1962-01-11 02 1962
when_available_key data_availability_date data_availability_week data_availability_year
1 1962-01-06 01 1962
2 1962-01-07 01 1962
3 1962-01-08 02 1962
4 1962-01-09 02 1962
5 1962-01-10 02 1962
6 1962-01-11 02 1962
where_key division state city division_name region region_name state_name county
1 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield
2 1 CT Hartford New England 1 Northeast Connecticut Hartford
3 1 CT New Haven New England 1 Northeast Connecticut New Haven
4 1 CT Waterbury New England 1 Northeast Connecticut New Haven
5 1 MA Boston New England 1 Northeast Massachusetts Suffolk
6 1 MA Cambridge New England 1 Northeast Massachusetts Middlesex

It can be seen that dimensions are conformed regardless of their type in the star schema. The definition of conformed dimensions does not imply any change in the definition of the dimensions in the star schemas: role and role-playing dimensions remain the same; only their rows may have changed.

In this particular case there are no discrepancies in the values of the dimension instances of the star schemas, since the data source is the same for both.

If discrepancies are detected once the integration has been carried out, new modification operations can be defined on the conformed dimensions (as was done for the dimensions of the star schemas). These are applied at the constellation level, using modify_conformed_dimension_records, and are automatically propagated to the component star schemas.
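
As a sketch of this operation: assuming a hypothetical update set, here called updates_ct, built with record_update_set() and the update functions shown earlier, it could be applied to a conformed dimension as follows (in our example no discrepancies exist, so this is purely illustrative).

```r
# Sketch only: "updates_ct" is a hypothetical update set built with
# record_update_set() and the update functions shown earlier.
# Applied at the constellation level, the changes are propagated to
# all star schemas that share the conformed dimension.
ct_mrs <- ct_mrs |>
  modify_conformed_dimension_records(updates_ct)
```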

Incremental refresh

Once we have star schemas built with the data available at a given moment, we may obtain additional data with the same structure as the initial data but from a later period. Sometimes the new data also includes revised data from previous periods.

Under these conditions, suppose we get the data sets mrs_age_w10, mrs_age_w11, mrs_cause_w10 and mrs_cause_w11, for weeks 10 and 11 (the data in the star schemas runs through week 9). In all cases, some data from previous periods is included.

To perform an incremental refresh of a star schema, we must have the new data in the same star schema format. Additionally, if we have done data cleaning, it is likely that some of the errors already corrected will have to be corrected again in the new data. For this reason, it is best to package all the transformations carried out on the original data as a function, so that they can easily be applied to new data.

Refresh operations for data according to age range

Below you can see the function that groups the transformations defined for the data according to the age range.

mrs_age_definition <-
  function(ft,
           dm,
           updates,
           tb_who,
           tb_where,
           tb_where_state,
           tb_where_county) {
    star_schema(ft, dm) |>
      role_playing_dimension(
        dim_names = c("when", "when_available"),
        name = "When Common",
        attributes = c("date", "week", "year")
      ) |>
      snake_case() |>
      character_dimensions(NA_replacement_value = "Unknown",
                           length_integers = list(week = 2)) |>
      rename_dimension_attributes(
        name = "when",
        attributes = c("week_ending_date", "week", "year"),
        new_names = c(
          "when_happened_date",
          "when_happened_week",
          "when_happened_year"
        )
      ) |>
      rename_dimension_attributes(
        name = "where",
        attributes = c("region"),
        new_names = c("division")
      ) |>
      rename_measures(measures = c("deaths"),
                      new_names = c("n_deaths")) |>
      modify_dimension_records(updates) |>
      enrich_dimension_import(name = "who", tb_who) |>
      enrich_dimension_import(name = "where", tb_where) |>
      enrich_dimension_import(name = "where", tb_where_state) |>
      enrich_dimension_import(name = "where", tb_where_county)
  }

We apply this function to new data sets, as shown below.

st_mrs_age_w10 <-
  mrs_age_definition(
    mrs_age_w10,
    dm_mrs_age,
    updates_st_mrs_age,
    tb_who,
    tb_where,
    tb_where_state,
    tb_where_county
  )

st_mrs_age_w11 <-
  mrs_age_definition(
    mrs_age_w11,
    dm_mrs_age,
    updates_st_mrs_age,
    tb_who,
    tb_where,
    tb_where_state,
    tb_where_county
  )

Errors may occur because the data differs from the original, especially in the dimension enrichment part, if values are missing. In that case, we must remove the enrichment lines from the function and then enrich the dimension separately, checking for errors with enrich_dimension_import_test.

Once we have the data in the same format, we can apply the incremental refresh to the original star schema, as follows.

st_mrs_age <- st_mrs_age |>
  incremental_refresh_star_schema(st_mrs_age_w10, existing = "replace") |>
  incremental_refresh_star_schema(st_mrs_age_w11, existing = "replace")

In this case, it has been assumed that if data from previous periods appears among the new data, the new data has to replace the previously stored data (value “replace” for the existing parameter).

If the star schema has been integrated into a constellation, the incremental refresh can be performed on it, as follows.

ct_mrs <- ct_mrs |>
  incremental_refresh_constellation(st_mrs_age_w10, existing = "replace") |>
  incremental_refresh_constellation(st_mrs_age_w11, existing = "replace")

In this case, the corresponding star schema, the conformed dimensions and all the star schemas that share them are updated.

Refresh operations for data according to cause

The same can be done for the cause data, similarly to the age data, as shown below.

mrs_cause_definition <-
  function(ft,
           dm,
           updates,
           tb_where,
           tb_where_state,
           tb_where_county) {
    star_schema(ft, dm) |>
      snake_case() |>
      character_dimensions(
        NA_replacement_value = "Unknown",
        length_integers = list(
          week = 2,
          data_availability_week = 2,
          reception_week = 2
        )
      ) |>
      role_playing_dimension(
        dim_names = c("when", "when_received", "when_available"),
        name = "when_common",
        attributes = c("date", "week", "year")
      ) |>
      rename_dimension_attributes(
        name = "when",
        attributes = c("week_ending_date", "week", "year"),
        new_names = c(
          "when_happened_date",
          "when_happened_week",
          "when_happened_year"
        )
      ) |>
      rename_dimension_attributes(
        name = "where",
        attributes = c("region"),
        new_names = c("division")
      ) |>
      modify_dimension_records(updates) |>
      enrich_dimension_import(name = "where", tb_where) |>
      enrich_dimension_import(name = "where", tb_where_state) |>
      enrich_dimension_import(name = "where", tb_where_county)
  }

st_mrs_cause_w10 <-
  mrs_cause_definition(
    mrs_cause_w10,
    dm_mrs_cause,
    updates_st_mrs_age,
    tb_where,
    tb_where_state,
    tb_where_county
  )

st_mrs_cause_w11 <-
  mrs_cause_definition(
    mrs_cause_w11,
    dm_mrs_cause,
    updates_st_mrs_age,
    tb_where,
    tb_where_state,
    tb_where_county
  )

st_mrs_cause <- st_mrs_cause |>
  incremental_refresh_star_schema(st_mrs_cause_w10, existing = "group") |>
  incremental_refresh_star_schema(st_mrs_cause_w11, existing = "group")

ct_mrs <- ct_mrs |>
  incremental_refresh_constellation(st_mrs_cause_w10, existing = "group") |>
  incremental_refresh_constellation(st_mrs_cause_w11, existing = "group")

In this case, the previously existing data is treated differently than in the previous case: it is grouped with the new data using the defined aggregation functions, on the assumption that the new data is additional data that had not been recorded before (value “group” for the existing parameter).

Filter and purge operations

Sometimes the data refresh consists of eliminating data that is no longer necessary, generally because it corresponds to a period that is no longer analysed, although there may be other reasons. In general, the data to delete can be selected using any combination of dimensions, not just the time dimension.

Suppose we want to delete the Boston data, working at the constellation level. First, we select the rows to delete in each of the star schemas that make up the constellation.

st1 <- ct_mrs |>
  get_star_schema("mrs_age") |>
  filter_fact_rows(name = "where", city == "Boston")

st2 <- ct_mrs |>
  get_star_schema("mrs_cause") |>
  filter_fact_rows(name = "where", city == "Boston")

We can work at either the star schema or the constellation level. In this example, we do it at the constellation level, as we have been doing, working on a temporary variable so as not to lose the data.

ct_tmp <- ct_mrs |>
  incremental_refresh_constellation(st1, existing = "delete") |>
  incremental_refresh_constellation(st2, existing = "delete")

These operations have only removed the fact data. The first records of the where dimension are shown below.

where_key division state city division_name region region_name state_name county
1 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield
2 1 CT Hartford New England 1 Northeast Connecticut Hartford
3 1 CT New Haven New England 1 Northeast Connecticut New Haven
4 1 CT Waterbury New England 1 Northeast Connecticut New Haven
5 1 MA Boston New England 1 Northeast Massachusetts Suffolk
6 1 MA Cambridge New England 1 Northeast Massachusetts Middlesex

Once the fact data is removed, we can remove the dimension records that are no longer referenced by any fact, using the following function.

ct_tmp <- ct_tmp |>
  purge_dimensions_constellation()

The result in the where dimension is shown below.

where_key division state city division_name region region_name state_name county
1 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield
2 1 CT Hartford New England 1 Northeast Connecticut Hartford
3 1 CT New Haven New England 1 Northeast Connecticut New Haven
4 1 CT Waterbury New England 1 Northeast Connecticut New Haven
5 1 MA Cambridge New England 1 Northeast Massachusetts Middlesex
6 1 MA Fall River New England 1 Northeast Massachusetts Bristol

The Boston data has disappeared and the surrogate keys have been reassigned (in the temporary variable).

Exporting results

Once we have made the necessary definitions and transformations, we can export the data to work in a database or with a query tool.

Instead of exporting data in the specific format of a particular tool, it is exported as tibble-based structures that can be easily handled.

Star schema

Various export possibilities are offered. For a star schema, one of them is to export the data as a flat table. The main difference from the initial data is that it has been cleaned and conformed. This operation is offered for completeness: if the goal is to work only with flat tables, this package is not the appropriate tool.
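
A sketch of this export, assuming the function is named star_schema_as_flat_table, by analogy with the other export functions of the package:

```r
# Sketch: export the cleaned and conformed star schema as a single
# flat table (function name assumed by analogy with the other
# export operations of the package).
ft_age <- st_mrs_age |>
  star_schema_as_flat_table()
```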

To work with databases, it is useful to be able to export a star schema as a list of tibbles containing the dimension and fact tables, as shown below.

tl <- st_mrs_age |>
  star_schema_as_tibble_list()

Optionally, the export function allows the role playing dimensions to be included.
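
A sketch of this option, assuming the parameter is named include_role_playing:

```r
# Sketch: also export the role-playing dimension tables
# (parameter name assumed).
tl_rp <- st_mrs_age |>
  star_schema_as_tibble_list(include_role_playing = TRUE)
```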

Constellation

To export constellation data, in addition to a tibble list, the multistar format may be of interest: it contains one list of tibbles for the fact tables and another for the dimension tables.

ms_mrs <- ct_mrs |>
  constellation_as_multistar()

multistar

We can obtain a flat table, implemented as a tibble, from a multistar (which can be the result of a query). If the multistar only has one fact table, it is not necessary to provide its name.

ft <- ms_mrs |>
  multistar_as_flat_table(fact = "mrs_age")

The first rows of the flat table obtained as a result are shown below.

age_range wide_age_range division state city division_name region region_name state_name county data_availability_date data_availability_week data_availability_year when_happened_date when_happened_week when_happened_year n_deaths nrow_agg
1: <1 year 0-24 years 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield 1962-01-15 03 1962 1962-01-06 01 1962 3 1
1: <1 year 0-24 years 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield 1962-01-17 03 1962 1962-01-13 02 1962 1 1
1: <1 year 0-24 years 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield 1962-01-23 04 1962 1962-01-20 03 1962 5 1
1: <1 year 0-24 years 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield 1962-01-28 04 1962 1962-01-27 04 1962 3 1
1: <1 year 0-24 years 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield 1962-02-05 06 1962 1962-02-03 05 1962 6 1
1: <1 year 0-24 years 1 CT Bridgeport New England 1 Northeast Connecticut Fairfield 1962-02-23 08 1962 1962-02-10 06 1962 3 1

Query functions

The main motivation of the query functions of this package is to have the possibility to select subsets of data to be exported or presented through other packages.

We can define queries on data in multistar format. Data in this format can be obtained from a star or a constellation.

ms_mrs <- ct_mrs |>
  constellation_as_multistar()

The query is created using dimensional_query and executed using run_query. We can refine it using select_dimension, select_fact and filter_dimension as much as we deem appropriate.

ms <- dimensional_query(ms_mrs) |>
  select_dimension(name = "where",
                   attributes = c("city", "state")) |>
  select_dimension(name = "when",
                   attributes = c("when_happened_year")) |>
  select_fact(name = "mrs_age",
              measures = c("n_deaths")) |>
  select_fact(
    name = "mrs_cause",
    measures = c("pneumonia_and_influenza_deaths", "other_deaths")
  ) |>
  filter_dimension(name = "when", when_happened_week <= "03") |>
  filter_dimension(name = "where", city == "Bridgeport") |>
  run_query()

The result of a query is a multistar structure, so we can define queries on the result of other queries. The result can also be transformed into a flat table.

ft <- ms |>
  multistar_as_flat_table()

The content of the flat table is shown below.

when_happened_year city state n_deaths nrow_agg mrs_cause_pneumonia_and_influenza_deaths mrs_cause_other_deaths mrs_cause_nrow_agg
1962 Bridgeport CT 129 12 7 122 3

In the result it can be seen that, as the query has been defined at the year and city level, for a single city with data available for a single year, there is only one row. The nrow_agg and mrs_cause_nrow_agg columns show the number of original rows aggregated into each result row.
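
Since the result of run_query is itself a multistar, it can be used as the input of a new query. A minimal sketch, reusing the measure selected above:

```r
# Sketch: a second query defined on the result of the previous one,
# keeping only the n_deaths measure of the mrs_age facts.
ms2 <- dimensional_query(ms) |>
  select_fact(name = "mrs_age", measures = c("n_deaths")) |>
  run_query()
```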

From the result in the form of a flat table, the pivottabler package can be used to present it as a pivot table.
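
For instance, a quick pivot table could be sketched with pivottabler's qpvt function (the column names are taken from the flat table above; this is illustrative and not part of starschemar):

```r
library(pivottabler)

# Sketch: cities in rows, years in columns, total deaths as the measure.
qpvt(ft, "city", "when_happened_year", "sum(n_deaths)")
```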

Conclusions

The starschemar package offers a set of operations that allow us to transform flat tables into star schemas. Star schemas support the definition of role-playing and role dimensions. Additional transformation operations can be applied to each star schema to adapt the format of the data. From several star schemas, a fact constellation with conformed dimensions can be defined.

Cleaning and conforming operations can be defined on star schemas and fact constellations. To update the data, incremental refresh operations are offered, also applicable to these structures. In addition, there are several possibilities for exporting the results as easily manageable tibble-based structures. Operations are also provided to query the multidimensional structure, mainly for selecting the data to export.

Operations have been designed to be intuitive and easy to use. The result greatly facilitates the data transformation process for the exposed situation.

References

Adamson, Christopher. 2010. Star Schema: The Complete Reference. McGraw Hill Professional.
Jensen, Christian S., Torben Bach Pedersen, and Christian Thomsen. 2010. Multidimensional Databases and Data Warehousing. Morgan & Claypool Publishers.
Kimball, Ralph, and Margy Ross. 2013. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling. John Wiley & Sons.

  1. I learned of this data set thanks to Alberto J. Duran, a student of Multidimensional Systems (a subject I teach at the University of Granada, Spain) during the 2019-2020 academic year, through a project he developed based on it under my supervision.↩︎

  2. The transformation has been carried out with the tidyverse and flattabler packages.↩︎

  3. The saveRDS and readRDS functions can be used to save and retrieve star schemas or any other defined data structure.↩︎

  4. This is how I have located the problem with the value “Wilimington” in the city attribute, which we have already corrected, so it no longer appears.↩︎