--- title: "starschemar: Obtaining Star Schemas from Flat Tables" author: "Jose Samos (jsamos@ugr.es)" date: "2020-09-24" output: rmarkdown::html_vignette bibliography: bibliography.bib vignette: > %\VignetteIndexEntry{starschemar: Obtaining Star Schemas from Flat Tables} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, echo = FALSE} library(starschemar) ``` # Introduction The *multidimensional data model* was defined in the 1990s with the aim of supporting data analysis. Data in multidimensional systems is obtained from operational systems and is transformed to adapt it to the new structure. Transformations can be carried out using professional ETL (extract, transform and load) tools. Recently, tools aimed at end users have emerged, which are also aimed at performing transformation operations. All these tools are very useful to carry out the transformation process, they provide a development environment to define the transformation operations in a general way. Frequently, the operations to be performed aim to transform a flat table (with data that comes from operational systems) into a star schema (which implements a multidimensional system). With the tools mentioned above, this transformation can be carried out, but it requires a lot of work. This is the main objective of `starschemar` package: Define **transformations that allow obtaining star schemas from flat tables** easily. In addition, it includes basic data cleaning operations, incremental data refresh operations and query operations, adapted to this context. The rest of this document is structured as follows: First, basic concepts of dimensional modelling and star schemas are summarized. The following is an illustrative example of how the package works. Finally, the document ends with conclusions and bibliography. # Dimensional modelling and star schemas The content of this section is a summary based mainly on @adamson2010star and @kimball2013data. It is intended to present the fundamental concepts of the area that are relevant to understand the functionality, naming and use of the `starschemar` package. ## Dimensional modelling Dimensional modelling aims to obtain simple data models. Simplicity is sought for two reasons: so that decision-makers can easily understand the data, and also so that they can be easily queried. In dimensional modelling, the analysis of a business process is performed *modelling how it is measured*. The measures are called *facts*, and the descriptors of the context of the facts are *dimensions*. Facts are numerical data, and decision makers want to see them at various levels of detail, defined by dimensions. Not all numerical data is a fact (some tools consider it that way). In dimensional modelling the designer has to differentiate between facts and dimensions. Some criteria are considered to distinguish between them, for example: - If it can be defined at different levels of detail then it is a fact. - If it is quantitative and takes continuous values, then it is a fact. - If it provides context then it is a dimension. Sometimes there are no measures associated with the business process, it is simply recorded that the combination of dimensions has occurred. This situation is often called *factless facts*, @jensen2010multi prefer to call it *measureless facts*. In any case, including when no other measures are available, a measure can be considered that represents the number of times the combination of dimension values occurs. ### Dimensions and dimension attributes Attributes considered by the designer as dimensions can be grouped taking into account the natural affinities between them. In particular, they can be grouped as they describe the "who, what, where, when, how and why" associated with the modelled business process. Two attributes share a natural affinity when they are only related in one context. When their relationships are determined by transactions or activities, they can occur in multiple contexts, if this occurs, they must be located in different dimensions. In this way, a *dimension* is made up of a set of naturally related *dimension attributes* that describe the context of facts. Dimensions are used for two purposes: fact *selection* and fact *grouping* with the desired level of detail. Additionally, in the dimensions *hierarchies* with *levels* and *descriptors* can be defined. More details can be found at @jensen2010multi. These concepts are not used in the current version of the package. ### Facts and measures A fact has a *granularity*, which is determined by the attributes of the dimensions that are considered at each moment. Thus, a measure in a fact has two components, the *numerical property* of the fact and an *formula*, frequently the SUM aggregation function, that allows combining several values of this measure to obtain a new value of the same measure with a coarser granularity [@jensen2010multi]. According to their behaviour to obtain a coarser granularity, three types of measures are distinguished: additive, semi-additive and non-additive. For *additive measures*, SUM is always a valid formula that maintains the meaning of the measure when the granularity changes. For *semi-additive measures*, there is no point in using SUM when changing the level of detail in any of the dimensions because the meaning of the measure changes, this frequently occurs in dimensions that represents time and measures representing inventory level. For *non-additive measures*, values cannot be combined across any dimension using SUM because the result obtained has a different meaning from the original measure (generally occurs with ratios, percentages or unit amounts such as unit cost or unit price). The most useful measures are additive. If we have non-additive measures, they can generally be redefined from other additive measures. ## Star schemas Dimensional models implemented in RDBMS (Relational Database Management Systems) using a table for each dimension are called *star schemas* because of their resemblance to a star-like structure: A *fact table* in the centre and *dimension tables* around it. Thus, dimension attributes are columns of the respective dimension tables, and measures are columns of the fact table. Other possible implementations on RDBMS normalize dimensions and are known as *snowflake schema*. More details can be found at @jensen2010multi. This is not considered in this package. ### Dimension tables Dimension tables contain the context associated with business process measures. Although they can contain any type of data, numerical data is generally not used for dimension attributes because some query tools consider any numeric data as a measure. Dimension attributes with NULL value are a source of problems when querying since DBMS and query tools sometimes handle them inconsistently, the result depends on the product. It is recommended to avoid the use of NULL and replace them with a descriptive text. In the case of dates, it is recommended to replace the NULL values with an arbitrary date in the very far future. #### Surrogate keys A dimension table contains dimension attributes and also a *surrogate key* column. This column is a unique identifier that has no intrinsic meaning: It is generally an integer and is the primary key for the dimension table. In @adamson2010star surrogate keys are easily identifiable by the suffix "_key" in the column name (and this criterion has also been applied in `starschemar` package). Dimension tables also contain key columns that uniquely identify associated entities in an operational system. The separation of surrogate keys and natural keys allows the star schema to store changes in dimensions. Therefore, the use of surrogate keys in dimensions is a solution to the SCD (slowly changing dimensions) problem. This problem is not specifically addressed in this version of this package. #### Special dimensions In some cases, for the sake of simplicity, it is helpful to create a table that contains dimension attributes that have no natural affinities to each other, generally these are low-cardinality flags and indicators. The result is what is known as a *junk dimension*. They do not require any special support, only the designer's will to define them. Sometimes some dimension attributes are left in the fact table, usually transaction identifiers. It is considered as the primary key of a dimension that does not have an associated table, for this reason it is known as a *degenerate dimension*. Degenerate dimensions are not allowed in this package. A single dimension can be referenced multiple times in a fact table, with each reference linked to a different logical role for each dimension. These separate dimension views, with unique attribute column names, are called *role dimensions* and the common dimension is called a *role-playing dimension*. Associated with multiple star schemas we have the *conformed dimensions* that are presented in section [Conformed dimensions]. ### Fact table At the centre of a star schema is the fact table. In addition to containing measures, the fact table includes foreign keys that refer to each of the surrogate keys in the dimension tables. #### Primary key A subset of foreign keys, along with possibly degenerate dimensions, is considered to form the primary key of the fact table. In `starschemar` package, since degenerate dimensions are not allowed, the primary key is made up of a subset of foreign keys. #### Grain The subset of dimensions that forms the primary key defines the level of detail stored in the fact table, which is known as the fact table's *grain*. In the design process, it is very important for the designer to clearly define the grain of the fact table (it is usually defined by listing the dimensions whose surrogate keys form its primary key): it is a way to ensure that all the facts are stored at the same level of detail. At the finest grain, a row in the fact table corresponds to the measures of an event and vice versa, it is not influenced by the possible reports that may be obtained. When two facts have different grains, they should be set on different fact tables. ### Multiple fact tables It is frequent the need to have several fact tables for various reasons: - We find measurements with different grain. - There are measures that do not occur simultaneously, for example, when one occurs, we have no value for others and vice versa. In reality it is about different business processes, each one has to have its own fact table but they have dimensions in common. This is known as a [*fact constellation*](https://www.geeksforgeeks.org/fact-constellation-in-data-warehouse-modelling/) which corresponds to the Kimball *enterprise data warehouse bus architecture*. ### Conformed dimensions When star schemas share a set of common dimensions, these dimensions are called *conformed dimensions*. There are several possibilities to have conformed dimensions, the most obvious form is that the dimension tables share *structure* and *content*, that is, they are identical dimensions. This is the one considered in this version of the `starschemar` package. ## Additional operations ### Cleaning and conforming data When data is loaded into a star schema, errors or inconsistencies can be discovered in some of them. In some cases, it is best to make corrections at the source of the data, in operational systems. Sometimes this is not possible and there is no other option but to modify the data before loading it into the star schema or even when it is already loaded. Inconsistencies are often found in dimensions, when dimensions are integrated into one, for example to generate a role-playing dimension or conformed dimensions. Support for modifying dimension data is provided in the package. ### Dimension enrichment The more attributes the dimensions have, the more possibilities they offer to filter or aggregate data. For this reason, it is convenient to enrich the dimensions with new attributes, whenever possible. New attributes are often defined based on existing ones. There must be a way to associate the new attributes with the dimensions. The more attributes the dimensions have, the more possibilities they offer to filter or aggregate data. For this reason, it is convenient to enrich the dimensions with new attributes, whenever possible. New attributes are often defined based on existing ones. There must be a way to associate the new attributes with the dimensions. Operations have been defined in the package to export dimension attributes so that their values are not repeated, and also to import them, once the new attributes have been added. ### Incremental refresh When a star schema is built, an initial load is performed with all available data from a moment in time onwards. Operational systems continue to operate and produce data. If we want to incorporate these data into the star schema, we have two possibilities: - Carry out a new load with all the data available at that time. - Perform an incremental refresh with only the new data. In order to carry out this second option, the CDC (change data capture) system allows to exclusively obtain the new data produced. Sometimes it is also convenient to delete data that is considered to be no longer necessary, generally data from the more distant past. In this package, it has been considered that we can obtain the new data, possibly mixed with updates to data already incorporated into the star schema, in order to carry out an incremental refresh of star schemas with them. Operations are also offered to select data and delete it if it is not considered necessary. # An illustrative example ## Starting data sets To illustrate how the package works we will use a small part of the [Deaths in 122 U.S. cities - 1962-2016. 122 Cities Mortality Reporting System](https://catalog.data.gov/dataset/deaths-in-122-u-s-cities-1962-2016-122-cities-mortality-reporting-system) data set[^1]. Specifically, only the data for the first 11 weeks of 1962 are considered. [^1]: I know this data set thanks to Alberto J. Duran, student of *Multidimensional Systems* during the 2019-2020 academic year, a subject I teach at the University of Granada (Spain), for a work he developed based on it, tutored by me. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(head(mrs, 12), split.table = Inf) ``` In the table above, the first rows of the original data are shown. For each week and city, mortality figures by age group and cause, considered separately, are included (i.e., the combination of age group and cause is not included). In the cause, only a distinction is made between pneumonia or influenza and others. It can be seen that there is only one measure, *deaths*, defined at two different granularities: *week-city-cause* (data in columns "Pneumonia and Influenza Deaths" and "All Deaths") and *week-city-age bracket* (data from the 5 columns on the right). This means that we will need **two fact tables** to include all the data in star schemas. As the data is partially in the form of a pivot table, from these data two tables have been generated[^2], one for each granularity. Additionally, the following operations have been carried out: - To show the incremental refresh, the data has been selected: On the one hand, the first 9 weeks were considered, and on the other, each of the following weeks separately. - To show the possibilities of data modification operations to correct errors, some data has been modified to generate missing data or errors. [^2]: The transformation has been carried out with [`tidyverse`](https://CRAN.R-project.org/package=tidyverse) and [`flattabler`](https://CRAN.R-project.org/package=flattabler) packages. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(head(mrs_age[,-c(1:6)]), split.table = Inf) ``` In the table above, the first rows of the flat table that contains the data according to the age bracket are shown. The following table shows the first 18 rows of the flat table containing data based on cause of death. The calculated column *Other Deaths* has been added. Here you can see some missing data (columns *Year* and *WEEK*), errors (value "Bridgepor" in column *City*) and how there are only data from the first 9 weeks. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(head(mrs_cause[,-c(1:6)], 18), split.table = Inf) ``` To have more layout possibilities to display, new columns with dates have been generated. Next, in both tables, the new columns related to the date that have been added can be seen. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(head(mrs_age), split.table = Inf) ``` ```{r, results = "asis", echo = FALSE} pander::pandoc.table(head(mrs_cause), split.table = Inf) ``` These are the flat tables that will be considered as a starting point to obtain star schemas from them in this example. They are available in the package: `mrs_age` and `mrs_cause` respectively. ## Dimensional modelling For each flat table, the goal is to define the attributes that correspond to facts and those that are dimensions. For facts, measures and their aggregation functions have to be defined. For dimensions, attributes with natural affinity must be grouped. Each attribute can only appear once in the definition. ### Dimensional modelling data according to age range To avoid having to write the name of the attributes of the table, with the following function we can have them in the form of a string. Thus, we can copy and paste each name as needed. ```{r} dput(colnames(mrs_age)) ``` The definition of the dimensional model for the data considered is shown below. ```{r} library(starschemar) dm_mrs_age <- dimensional_model() |> define_fact( name = "mrs_age", measures = c( "Deaths" ), agg_functions = c( "SUM" ), nrow_agg = "nrow_agg" ) |> define_dimension( name = "when", attributes = c( "Week Ending Date", "WEEK", "Year" ) ) |> define_dimension( name = "when_available", attributes = c( "Data Availability Date", "Data Availability Week", "Data Availability Year" ) ) |> define_dimension( name = "where", attributes = c( "REGION", "State", "City" ) ) |> define_dimension( name = "who", attributes = c( "Age Range" ) ) ``` In this case, all the elements have been explicitly defined, including aggregation functions and the name of an additional measure representing the number of rows aggregated, which is always included. Only data from two of the three possible time-related dimensions have been considered. ### Dimensional modelling data according to cause In the case of data according to cause of death, the definition of the model is shown below. ```{r} dm_mrs_cause <- dimensional_model() |> define_fact( name = "mrs_cause", measures = c( "Pneumonia and Influenza Deaths", "Other Deaths" ), ) |> define_dimension( name = "when", attributes = c( "Week Ending Date", "WEEK", "Year" ) ) |> define_dimension( name = "when_received", attributes = c( "Reception Date", "Reception Week", "Reception Year" ) ) |> define_dimension( name = "when_available", attributes = c( "Data Availability Date", "Data Availability Week", "Data Availability Year" ) ) |> define_dimension( name = "where", attributes = c( "REGION", "State", "City" ) ) ``` If no aggregation function is indicated, by default, SUM is considered. Although not explicitly stated, it also includes by default the measure relative to the number of rows aggregated. In this case, the three dimensions related to the date have been defined. ## Star schema definition and transformation: Constellations To define a star schema, we need a flat table and a dimensional model defined from it. Once defined, we can apply format modification operations to it. ### Star schema definition for data according to age range The basic definition operation of a star schema is shown below[^3]. [^3]: `SaveRDS` and `readRDS` functions can be used to save and retrieve star schemas or any other defined data structure. ```{r} st_mrs_age <- star_schema(mrs_age, dm_mrs_age) ``` The first rows of the obtained dimension and fact tables are shown below. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(head(st_mrs_age$dimension$when), split.table = Inf) pander::pandoc.table(head(st_mrs_age$dimension$when_available), split.table = Inf) pander::pandoc.table(head(st_mrs_age$dimension$where), split.table = Inf) pander::pandoc.table(head(st_mrs_age$dimension$who), split.table = Inf) pander::pandoc.table(head(st_mrs_age$fact$mrs_age), split.table = Inf) ``` The data from the original flat table has been structured in the form of dimension tables and fact tables. Data in the columns of the original table included in the dimensions is not repeated. A surrogate key has been added to each of the dimension tables that are foreign keys in the fact table. Next, we will apply format modification operations to the original structure obtained. ```{r} st_mrs_age <- st_mrs_age |> role_playing_dimension( dim_names = c("when", "when_available"), name = "When Common", attributes = c("date", "week", "year") ) |> snake_case() |> character_dimensions(NA_replacement_value = "Unknown", length_integers = list(week = 2)) ``` First, a role playing dimension has been defined based on the dimensions related to dates. Then, to work with databases, the names have been adapted to the snake case criterion. Finally, the data type of the attributes of the dimensions has been transformed so that all columns except the date columns are of the character data type, in the case of numerical data, it is allowed to indicate the length of the field to fill with leading zeros, and undefined values have been replaced by the indicated value. The first rows of the new dimension and fact tables are shown below. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(head(st_mrs_age$dimension$when), split.table = Inf) pander::pandoc.table(head(st_mrs_age$dimension$when_available), split.table = Inf) pander::pandoc.table(head(st_mrs_age$dimension$where), split.table = Inf) pander::pandoc.table(head(st_mrs_age$dimension$who), split.table = Inf) pander::pandoc.table(head(st_mrs_age$dimension$when_common), split.table = Inf) pander::pandoc.table(head(st_mrs_age$fact$mrs_age), split.table = Inf) ``` In the result, it can be seen that the dimensions related to date are now role dimensions and do not have their own data, a role playing dimension has been generated with the integrated data. The fact table continues to refer to role dimensions, the value of foreign keys has been adapted to the possible new values of the surrogate keys. Additionally it can be seen that the *week* field now has length 2 and has 0 on the left (this is useful to sort numbers in text format). ### Star schema definition for data according to cause We are going to define the star schema and apply similar transformations to the other flat table. The transformations can be applied in any order. ```{r} st_mrs_cause <- star_schema(mrs_cause, dm_mrs_cause) |> snake_case() |> character_dimensions( NA_replacement_value = "Unknown", length_integers = list( week = 2, data_availability_week = 2, reception_week = 2 ) ) |> role_playing_dimension( dim_names = c("when", "when_received", "when_available"), name = "when_common", attributes = c("date", "week", "year") ) ``` In this case, since the role playing dimension definition is the last transformation defined, the format of the *week* column had to be defined in the three date dimensions to obtain an equivalent result. The result obtained is shown below. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(head(st_mrs_cause$dimension$when), split.table = Inf) pander::pandoc.table(head(st_mrs_cause$dimension$when_received), split.table = Inf) pander::pandoc.table(head(st_mrs_cause$dimension$when_available), split.table = Inf) pander::pandoc.table(head(st_mrs_cause$dimension$where), split.table = Inf) pander::pandoc.table(head(st_mrs_cause$dimension$when_common), split.table = Inf) pander::pandoc.table(head(st_mrs_cause$fact$mrs_cause), split.table = Inf) ``` In this case we have three role dimensions defined on a role playing dimension. ### Star schema transformation, cleaning and conforming data Star schemas are defined from flat table fields. In some cases it may be interesting to rename elements of the schema, especially attributes of dimensions and measures. On the other hand, dimensions can be enriched by adding additional attributes, generally derived from the rest of the attributes. These are the transformations considered in this section. We can perform data cleaning and conforming operations on star schema dimensions. Updates defined in a star schema can be applied on another with common dimensions. #### Star schema rename If necessary, the elements of a star schema can be renamed. These functions are especially useful for renaming attributes of dimensions and measures of facts because the definition operations only allowed us to select columns of a flat table, not modifying their names. We are going to modify the attribute names of the *when* dimension for the two design cases. The values of the attribute called *region* actually correspond to the concept of *division* in the organization of the US territory, we are going to rename it this way for the two design cases. Data according to age: ```{r} st_mrs_age <- st_mrs_age |> rename_dimension_attributes( name = "when", attributes = c("week_ending_date", "week", "year"), new_names = c( "when_happened_date", "when_happened_week", "when_happened_year" ) ) |> rename_dimension_attributes( name = "where", attributes = c("region"), new_names = c("division") ) ``` And data according to cause: ```{r} st_mrs_cause <- st_mrs_cause |> rename_dimension_attributes( name = "when", attributes = c("week_ending_date", "week", "year"), new_names = c( "when_happened_date", "when_happened_week", "when_happened_year" ) ) |> rename_dimension_attributes( name = "where", attributes = c("region"), new_names = c("division") ) ``` We are also going to modify the name of the measure for data according to age: ```{r} st_mrs_age <- st_mrs_age |> rename_measures(measures = c("deaths"), new_names = c("n_deaths")) ``` Additionally, functions are available to modify the names of dimensions and facts. In this case it is not necessary to modify them because they do not depend on the names of the starting base data and we have defined them as we wanted. #### Definition of updates Using the following code, first of all, we get the names of the dimensions of a star schema, then, we get them by their name (we can see the rows using the `utils::View` function). If there are several role dimensions, it is enough to consult one of them, updates defined on it will be propagated to the rest. ```{r} dim_names <- st_mrs_age |> get_dimension_names() where <- st_mrs_age |> get_dimension("where") # View(where) # where[where$where_key %in% c(1, 2, 62), ] when <- st_mrs_age |> get_dimension("when") # View(when) # when[when$when_key %in% c(36, 37, 73), ] who <- st_mrs_age |> get_dimension("who") # View(who) ``` Reviewing the dimensions we can detect the need for updates. Errors are also sometimes detected by comparing the values with equivalent values obtained from other data sources. ##### Updates in the *where* dimension In the *where* dimension we find the wrong value of "Bridgepor", generated by ourselves, the correct value is "Bridgeport", also included in the dimension. Additionally, trying to find additional data associated with cities, we found another error for the city of "Wilimington", whose correct name is "Wilmington". Below are the mentioned instances. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(where[where$where_key %in% c(1, 2, 62), ], split.table = Inf) ``` Updates can be defined in several ways. One of them is referencing the surrogate key. Although in some of the functions the dimension surrogate key is referred to, updates only consider the values of the rest of the columns of the corresponding dimension. In this way, it is achieved that updates can be applied to equivalent dimensions of other star schemas, where the values of the surrogate key do not necessarily coincide with those of the dimension where updates were originally defined. Next, we define the update that indicates that the values of record "1" ("Bridgepor") of the *where* dimension must match those of record "2" ("Bridgeport"). ```{r} updates_st_mrs_age <- record_update_set() |> match_records(dimension = where, old = 1, new = 2) ``` For the case of the other city, we define an update that replaces the value in the indicated fields for the records that meet the defined condition, as shown below. ```{r} updates_st_mrs_age <- updates_st_mrs_age |> update_selection_general( dimension = where, columns_old = c("state", "city"), old_values = c("DE", "Wilimington"), columns_new = c("city"), new_values = c("Wilmington") ) ``` ##### Updates in the *when* dimension Below is a selection of rows involved in update operations that are referred by their surrogate key in the *when* dimension. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(when[when$when_key %in% c(36, 37, 73), ], split.table = Inf) ``` The corresponding updates are shown below. ```{r} updates_st_mrs_age <- updates_st_mrs_age |> match_records(dimension = when, old = 37, new = 36) |> update_record( dimension = when, old = 73, values = c("1962-02-17", "07", "1962") ) ``` The values of record "37" must match those of record "36" (both registers will be unified). The values in register "73" will be replaced by those provided. ##### Updates in the *who* dimension In the case of the *who* dimension, we want to include a code in each value of `age_range` attribute so that they are sorted increasingly based on their value. ```{r} updates_st_mrs_age <- updates_st_mrs_age |> update_selection( dimension = who, columns = c("age_range"), old_values = c("<1 year"), new_values = c("1: <1 year") ) |> update_selection( dimension = who, columns = c("age_range"), old_values = c("1-24 years"), new_values = c("2: 1-24 years") ) |> update_selection( dimension = who, columns = c("age_range"), old_values = c("25-44 years"), new_values = c("3: 25-44 years") ) |> update_selection( dimension = who, columns = c("age_range"), old_values = c("45-64 years"), new_values = c("4: 45-64 years") ) |> update_selection( dimension = who, columns = c("age_range"), old_values = c("65+ years"), new_values = c("5: 65+ years") ) ``` In each case, we substitute the old value for the new one in the indicated attribute. #### Updates application Once updates are defined, they can be applied on the star schema from which they have been defined, as shown below. ```{r} st_mrs_age <- st_mrs_age |> modify_dimension_records(updates_st_mrs_age) ``` The result obtained for the first star schema is shown below. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(head(st_mrs_age$dimension$when), split.table = Inf) pander::pandoc.table(head(st_mrs_age$dimension$when_available), split.table = Inf) pander::pandoc.table(head(st_mrs_age$dimension$where), split.table = Inf) pander::pandoc.table(head(st_mrs_age$dimension$who), split.table = Inf) pander::pandoc.table(head(st_mrs_age$dimension$when_common), split.table = Inf) pander::pandoc.table(head(st_mrs_age$fact$mrs_age), split.table = Inf) ``` It can be seen that the row with the value "Bridgepor" in the *city* column has disappeared: it has been merged with the row with the correct value. This update has also been transmitted to the fact table. Although it is not seen in the tables, the same has happened with the dates that have been unified by the update. The same updates can also be applied to other star schemas with dimensions in common with the original star schema, as shown below. ```{r} st_mrs_cause <- st_mrs_cause |> modify_dimension_records(updates_st_mrs_age) ``` Updates defined on dimensions not included in the star schema are ignored. In this case those referring to the who dimension. Likewise, if there are no records that meet the conditions to get the old value, they are also ignored. #### Dimension enrichment From the original attributes included in the dimensions, we can obtain new attributes that facilitate queries or offer new query possibilities. ##### Enrich the *who* dimension Suppose that we are interested in defining some broader age ranges than the existing ones. This operation can be done by enriching the corresponding dimension. First, we export the attributes to consider in table form, in this case only the age range. ```{r} tb_who <- enrich_dimension_export(st_mrs_age, name = "who", attributes = c("age_range")) ``` Next, we can see the result of the export operation. It is a table with the selected attributes where duplicate values have been eliminated if there are any (in this case there are no repeated values). ```{r, results = "asis", echo = FALSE} pander::pandoc.table(tb_who, split.table = Inf) ``` In the table we add the columns that we want. In this case a new column to define the new broader age range. ```{r} v <- c("0-24 years", "0-24 years", "25+ years", "25+ years", "25+ years") tb_who <- tibble::add_column(tb_who, wide_age_range = v) ``` The new table can be seen below. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(tb_who, split.table = Inf) ``` We enrich the dimension considering the new data in the table. ```{r} st_mrs_age <- st_mrs_age |> enrich_dimension_import(name = "who", tb_who) ``` We can see the result below, where the dimension has the new defined attribute. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(st_mrs_age$dimension$who, split.table = Inf) ``` ##### Enrich the *where* dimension For the *where* dimension we can proceed in the same way as we have done for the *who* dimension: Export the data, complete it manually and import it again, as shown below. ```{r} tb_where <- enrich_dimension_export(st_mrs_age, name = "where", attributes = c("division")) ``` The new table for *division* data can be seen below. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(tb_where, split.table = Inf) ``` We look for the names of the divisions and add the data of the regions to which they belong. ```{r} tb_where <- tibble::add_column( tb_where, division_name = c( "New England", "Middle Atlantic", "East North Central", "West North Central", "South Atlantic", "East South Central", "West South Central", "Mountain", "Pacific" ), region = c('1', '1', '2', '2', '3', '3', '3', '4', '4'), region_name = c( "Northeast", "Northeast", "Midwest", "Midwest", "South", "South", "South", "West", "West" ) ) st_mrs_age <- st_mrs_age |> enrich_dimension_import(name = "where", tb_where) st_mrs_cause <- st_mrs_cause |> enrich_dimension_import(name = "where", tb_where) ``` To add the name of the states and the county to which each city belongs, we could proceed in the same way. However, it is easier if we try to locate this data and use it directly. These data are available in the `ft_usa_states` and `ft_usa_city_county` data sets, respectively. However, if we operate in the same way, when importing the data an error occurs. The reason is that not all the data in the dimension matches the data in the imported table. We can determine the missing data using the following function. ```{r} tb_missing <- st_mrs_age |> enrich_dimension_import_test(name = "where", ft_usa_states) ``` The result obtained is shown below. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(tb_missing, split.table = Inf) ``` In all cases, the problem occurs for the value "Unknown" in the *state* attribute. We must add a row to the data before importing it. ```{r} tb_where_state <- ft_usa_states |> tibble::add_row(state = "Unknown", state_name = "Unknown") st_mrs_age <- st_mrs_age |> enrich_dimension_import(name = "where", tb_where_state) st_mrs_cause <- st_mrs_cause |> enrich_dimension_import(name = "where", tb_where_state) ``` The same problem occurs and we apply the same solution to add the county data[^4]. [^4]: This is how I have located the problem with the value "Wilimington" in the *city* attribute, which we have already corrected, so it no longer appears. ```{r} tb_where_county <- ft_usa_city_county |> tibble::add_row(city = "Unknown", state = "Unknown", county = "Unknown") st_mrs_age <- st_mrs_age |> enrich_dimension_import(name = "where", tb_where_county) st_mrs_cause <- st_mrs_cause |> enrich_dimension_import(name = "where", tb_where_county) ``` We can see the first rows of the final result below. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(head(st_mrs_age$dimension$where, 10), split.table = Inf) ``` ### Constellation definition A constellation is defined from a list of star schemas, as shown below. ```{r} ct_mrs <- constellation(list(st_mrs_age, st_mrs_cause), name = "mrs") ``` All dimensions of the same name in star schemas must be compatible in structure and type of columns, and are defined as conformed dimensions. The conformed dimensions share all the instances of the original dimensions, this implies possible modifications in the surrogate keys that are transmitted to foreign keys of the component fact tables. The tables of the obtained conformed dimensions are shown below. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(head(ct_mrs$dimension$when), split.table = Inf) pander::pandoc.table(head(ct_mrs$dimension$when_available), split.table = Inf) pander::pandoc.table(head(ct_mrs$dimension$where), split.table = Inf) ``` It can be seen that the conformed dimensions are considered regardless of the type of dimension in the star schema. Definition of conformed dimensions does not imply any change in the definition of dimensions in star schemas: Role and role playing dimensions remain the same, only their rows may have changed. In this particular case there are no discrepancies in the values of the star schemas dimension instances, since the data source is the same for both. If discrepancies are detected once the integration has been carried out, new modification operations can be defined on conformed dimensions (as it has been done for the dimensions of star schemas) that are applied at the constellation level and are automatically transmitted to the component star schemas (`modify_conformed_dimension_records`). ## Incremental refresh Once we have star schemas built with the data available at the moment, we may obtain additional data, with the same structure as the initial data but from a later time. Sometimes the new data also includes data from previous periods to operate on them. Under these conditions, suppose we get the data sets `mrs_age_w10`, `mrs_age_w11`, `mrs_cause_w10`, and `mrs_cause_w11`, for weeks 10 and 11 (data in star schemas runs through week 9). In all cases, some data from previous periods are included. To perform an incremental refresh of a star schema, we must have the new data in the same star schema format. Additionally, if we have done data cleaning, it is likely that we will have to correct part of the corrected errors again over the new data. For this reason, it is best to package all the transformations carried out on the original data in function form so that they can be easily applied to new data. ### Refresh operations for data according to age range Below you can see the function that groups the transformations defined for the data according to the age range. ```{r} mrs_age_definition <- function(ft, dm, updates, tb_who, tb_where, tb_where_state, tb_where_county) { star_schema(ft, dm) |> role_playing_dimension( dim_names = c("when", "when_available"), name = "When Common", attributes = c("date", "week", "year") ) |> snake_case() |> character_dimensions(NA_replacement_value = "Unknown", length_integers = list(week = 2)) |> rename_dimension_attributes( name = "when", attributes = c("week_ending_date", "week", "year"), new_names = c( "when_happened_date", "when_happened_week", "when_happened_year" ) ) |> rename_dimension_attributes( name = "where", attributes = c("region"), new_names = c("division") ) |> rename_measures(measures = c("deaths"), new_names = c("n_deaths")) |> modify_dimension_records(updates) |> enrich_dimension_import(name = "who", tb_who) |> enrich_dimension_import(name = "where", tb_where) |> enrich_dimension_import(name = "where", tb_where_state) |> enrich_dimension_import(name = "where", tb_where_county) } ``` We apply this function to new data sets, as shown below. ```{r} st_mrs_age_w10 <- mrs_age_definition( mrs_age_w10, dm_mrs_age, updates_st_mrs_age, tb_who, tb_where, tb_where_state, tb_where_county ) st_mrs_age_w11 <- mrs_age_definition( mrs_age_w11, dm_mrs_age, updates_st_mrs_age, tb_who, tb_where, tb_where_state, tb_where_county ) ``` Errors may occur because the data is different from the original, especially in the dimension enrichment part, if data is missing. In this case, we must eliminate from the function the lines corresponding to enrichment and later try to enrich the dimension by checking the errors with `enrich_dimension_import_test`. Once we have the data in the same format, we can apply the incremental refresh to the original star schema, as follows. ```{r} st_mrs_age <- st_mrs_age |> incremental_refresh_star_schema(st_mrs_age_w10, existing = "replace") |> incremental_refresh_star_schema(st_mrs_age_w11, existing = "replace") ``` In this case, it has been assumed that if data from previous periods appears among the new data, the new data has to replace the previous data (value "replace" in `existing` parameter). If the star schema has been integrated into a constellation, the incremental refresh can be performed on it, as follows. ```{r} ct_mrs <- ct_mrs |> incremental_refresh_constellation(st_mrs_age_w10, existing = "replace") |> incremental_refresh_constellation(st_mrs_age_w11, existing = "replace") ``` In this case, the corresponding star schema, the conformed dimensions and all the star schemas that share them are updated. ### Refresh operations for data according to cause Similar to how it has been done for age data, it can be done for cause data, as shown below. ```{r} mrs_cause_definition <- function(ft, dm, updates, tb_where, tb_where_state, tb_where_county) { star_schema(ft, dm) |> snake_case() |> character_dimensions( NA_replacement_value = "Unknown", length_integers = list( week = 2, data_availability_week = 2, reception_week = 2 ) ) |> role_playing_dimension( dim_names = c("when", "when_received", "when_available"), name = "when_common", attributes = c("date", "week", "year") ) |> rename_dimension_attributes( name = "when", attributes = c("week_ending_date", "week", "year"), new_names = c( "when_happened_date", "when_happened_week", "when_happened_year" ) ) |> rename_dimension_attributes( name = "where", attributes = c("region"), new_names = c("division") ) |> modify_dimension_records(updates) |> enrich_dimension_import(name = "where", tb_where) |> enrich_dimension_import(name = "where", tb_where_state) |> enrich_dimension_import(name = "where", tb_where_county) } st_mrs_cause_w10 <- mrs_cause_definition( mrs_cause_w10, dm_mrs_cause, updates_st_mrs_age, tb_where, tb_where_state, tb_where_county ) st_mrs_cause_w11 <- mrs_cause_definition( mrs_cause_w11, dm_mrs_cause, updates_st_mrs_age, tb_where, tb_where_state, tb_where_county ) st_mrs_cause <- st_mrs_cause |> incremental_refresh_star_schema(st_mrs_cause_w10, existing = "group") |> incremental_refresh_star_schema(st_mrs_cause_w11, existing = "group") ct_mrs <- ct_mrs |> incremental_refresh_constellation(st_mrs_cause_w10, existing = "group") |> incremental_refresh_constellation(st_mrs_cause_w11, existing = "group") ``` In this case, the previously existing data is treated differently than it was in the previous case, now what is done is grouping it using the defined aggregation functions, assuming it is additional data that has not been entered before (value "group" in `existing` parameter). ### Filter and purge operations Sometimes the data refresh consists of eliminating data that is no longer necessary, generally because it corresponds to a period that has stopped being analysed but it can also be for other reasons. In general, they can be selected considering any combination of dimensions, not just the time dimension. Suppose we want to delete the *Boston* data working at the constellation level. First, we select them considering the stars that make up the constellation. ```{r} st1 <- ct_mrs |> get_star_schema("mrs_age") |> filter_fact_rows(name = "where", city == "Boston") st2 <- ct_mrs |> get_star_schema("mrs_cause") |> filter_fact_rows(name = "where", city == "Boston") ``` We can work both at the star or constellation level. In this example, we are going to do it at the constellation level (working on a temporary variable so as not to lose data), as we have been doing. ```{r} ct_tmp <- ct_mrs |> incremental_refresh_constellation(st1, existing = "delete") |> incremental_refresh_constellation(st2, existing = "delete") ``` These operations have only removed the fact data. The first records of the `where` dimension are shown below. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(head(ct_tmp$dimension$where), split.table = Inf) ``` Once the fact data is removed, we can remove the data for the dimensions that are no longer needed using the following function. ```{r} ct_tmp <- ct_tmp |> purge_dimensions_constellation() ``` The result in the `where` dimension is shown below. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(head(ct_tmp$dimension$where), split.table = Inf) ``` The *Boston* data has disappeared and the surrogate keys have been reassigned (on the temporary variable). ## Exporting results Once we have made the necessary definitions and transformations, we can export the data to work in a database or with a query tool. Instead of exporting data in the specific format of a particular tool, it is exported as `tibble`-based structures that can be easily handled. ### Star schema Various export possibilities are offered. Specifically, for a star schema one of them is to export the data as a flat table. The main difference from the initial data is that we have cleaned and conformed it. This operation is offered for completeness. To work only with flat tables, this package is not suitable. To work with databases, it is useful to be able to export a star schema as a list of `tibble` with dimension and fact tables, as shown below. ```{r} tl <- st_mrs_age |> star_schema_as_tibble_list() ``` Optionally, the export function allows the role playing dimensions to be included. ### Constellation To export constellation data, as well as a `tibble` list, the `multistar` format may be interesting, where you have a list of `tibble` for fact tables and another for dimension tables. ```{r} ms_mrs <- ct_mrs |> constellation_as_multistar() ``` ### `multistar` We can obtain a flat table, implemented using a `tibble`, from a `multistar` (which can be the result of a query). If it only has one fact table, it is not necessary to provide its name. ```{r} ft <- ms_mrs |> multistar_as_flat_table(fact = "mrs_age") ``` The first rows of the flat table obtained as a result are shown below. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(head(ft), split.table = Inf) ``` ## Query functions The main motivation of the query functions of this package is to have the possibility to select subsets of data to be exported or presented through other packages. We can define queries on data in `multistar` format. Data in this format can be obtained from a star or a constellation. ```{r} ms_mrs <- ct_mrs |> constellation_as_multistar() ``` The query is created using `dimensional_query` and executed using `run_query.` We can refine it using `select_dimension`, `select_fact` and `filter_dimension` as much as we deem appropriate. ```{r} ms <- dimensional_query(ms_mrs) |> select_dimension(name = "where", attributes = c("city", "state")) |> select_dimension(name = "when", attributes = c("when_happened_year")) |> select_fact(name = "mrs_age", measures = c("n_deaths")) |> select_fact( name = "mrs_cause", measures = c("pneumonia_and_influenza_deaths", "other_deaths") ) |> filter_dimension(name = "when", when_happened_week <= "03") |> filter_dimension(name = "where", city == "Bridgeport") |> run_query() ``` The result of a query is a `multistar` structure. In this way, we can define queries on the result of others. The result can also be transformed into a flat table. ```{r} ft <- ms |> multistar_as_flat_table() ``` The content of the flat table is shown below. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(head(ft), split.table = Inf) ``` In the result it can be seen that, as the query has been defined at the year and city level, for a single city, with data available for a single year, there is only one row. Columns `nrow_agg` and `mrs_cause_nrow_agg` show the number of original rows that make up the result row. From the result in the form of a flat table, [`pivottabler`](https://CRAN.R-project.org/package=pivottabler) package can be used to present it in the form of a pivot table. # Conclusions `starschemar` package offers a set of operations that allow us to transform flat tables into star schemas. Star schemas support the definition of role playing and role dimensions. Additional transformation operations can be applied to each star schema to adapt the format of the data. From several star schemas you can define fact constellation with conformed dimensions. Cleaning and conforming data operations can be defined on the star schemas and fact constellation. To update the data, incremental refresh operations are offered, also applicable on said structures. In addition, there are several possibilities to export the results obtained in the form of easily treatable `tibble`-based structures. Operations are also provided to query the multidimensional structure, mainly for selecting the data to export. Operations have been designed to be intuitive and easy to use. The result greatly facilitates the data transformation process for the exposed situation. # References