--- title: "Basics and getting started with the `rolap` package" author: "Jose Samos (jsamos@ugr.es)" date: "2023-07-28" output: rmarkdown::html_vignette bibliography: bibliography.bib vignette: > %\VignetteIndexEntry{Basics and getting started with the `rolap` package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, echo = FALSE, message=FALSE} library(rolap) ``` # Introduction The *multidimensional data model* was defined in the 1990s with the aim of supporting data analysis. Data in multidimensional systems is obtained from operational systems and is transformed to adapt it to the new structure. Transformations can be carried out using professional ETL (*Extract, Transform and Load*) tools. Recently, tools aimed at end users have emerged, which are also aimed at performing transformation operations. All these tools are very useful to carry out the transformation process, they provide a development environment to define the transformation operations in a general way. Frequently, the operations to be performed aim to transform tables (with data that comes from operational systems) into a ROLAP (*Relational On-Line Analytical Processing*) star database, made up of fact and dimension tables, which implements a multidimensional system. With the tools mentioned above, this transformation can be carried out, but it requires a lot of work. The goal of `rolap` is to define transformations that allow you to easily obtain ROLAP star databases, composed by fact and dimension tables, from operational tables and to be able to export them in various formats to be used by OLAP query tools. The rest of this document is structured as follows: First, basic concepts of dimensional modelling and star databases are summarized. The following is an illustrative example of how the package works, to show usage of features it offers. Finally, the document ends with conclusions and bibliography. # Dimensional modelling and star databases The content of this section is a summary based mainly on @adamson2010star and @kimball2013data. It is intended to present the fundamental concepts of the area that are relevant to understand the functionality, naming and use of the `rolap` package. ## Dimensional modelling Dimensional modelling aims to obtain simple data models. Simplicity is sought for two reasons: so that decision-makers can easily understand the data, and also so that they can be easily queried. In dimensional modelling, the analysis of a business process is performed *modelling how it is measured*. The measures are called *facts*, and the descriptors of the context of the facts are *dimensions*. Facts are numerical data, contain *measures*, and decision makers want to see them at various levels of detail, defined by dimensions. Not all numerical data is a measure in facts (some tools consider it that way). In dimensional modelling the designer has to differentiate between facts and dimensions. Some criteria are considered to distinguish between them, for example: - If it can be defined at different levels of detail then it is a fact. - If it is quantitative and takes continuous values, then it is a fact. - If it provides context then it is a dimension. Sometimes there are no measures associated with the business process, it is simply recorded that the combination of dimensions has occurred. This situation is often called *factless facts*, @jensen2010multi prefer to call it *measureless facts*. In any case, including when no other measures are available, a measure can be considered that represents the number of times the combination of dimension values occurs. ### Dimensions and dimension attributes Attributes considered by the designer as dimensions can be grouped taking into account the natural affinities between them. In particular, they can be grouped as they describe the "who, what, where, when, how and why" associated with the modelled business process. Two attributes share a natural affinity when they are only related in one context. When their relationships are determined by transactions or activities, they can occur in multiple contexts, if this occurs, they must be located in different dimensions. In this way, a *dimension* is made up of a set of naturally related *dimension attributes* that describe the context of facts. Dimensions are used for two purposes: fact *selection* and fact *grouping* with the desired level of detail. Additionally, in the dimensions *hierarchies* with *levels* and *descriptors* can be defined. More details can be found at @jensen2010multi. These concepts are not used in the current version of the package. ### Facts and measures Facts have a *granularity*, which is determined by the attributes of the dimensions that are considered at each moment. Thus, a measure in a fact has two components, the *numerical property* and an *formula*, frequently the *SUM* aggregation function, that allows combining several values of this measure to obtain a new value of the same measure with a coarser granularity [@jensen2010multi]. According to their behaviour to obtain a coarser granularity, three types of measures are distinguished: *additive*, *semi-additive* and *non-additive*. For *additive measures*, SUM is always a valid formula that maintains the meaning of the measure when the granularity changes. For *semi-additive measures*, there is no point in using SUM when changing the level of detail in any of the dimensions because the meaning of the measure changes, this frequently occurs in dimensions that represents time and measures representing inventory level. For *non-additive measures*, values cannot be combined across any dimension using SUM because the result obtained has a different meaning from the original measure (generally occurs with ratios, percentages or unit amounts such as unit cost or unit price). The most useful measures are additive. If we have non-additive measures, they can generally be redefined from other additive measures. ## Star databases Dimensional models implemented in RDBMS (*Relational Database Management Systems*) are known as ROLAP (*Relational On-Line Analytical Processing*), if they use a table for each dimension are called *ROLAP star databases* because of their resemblance to a star-like structure: A *fact table* in the centre and *dimension tables* around it. Thus, dimension attributes are columns of the respective dimension tables, and measures are columns of the fact table. Other possible implementations on RDBMS normalize dimensions and are known as *ROLAP snowflake databases*. More details can be found at @jensen2010multi. This is not considered in this package. ### Dimension tables Dimension tables contain the context associated with business process measures. Although they can contain any type of data, numerical data is generally not used for dimension attributes because some end user query tools consider any numeric data as a measure. Dimension attributes with NULL value are a source of problems when querying since DBMS and query tools sometimes handle them inconsistently, the result depends on the product. It is recommended to avoid the use of NULL and replace them with a descriptive text. In the case of dates, it is recommended to replace the NULL values with an arbitrary date in the very far future. #### Surrogate keys A dimension table contains dimension attributes and also a *surrogate key* column. This column is a unique identifier that has no intrinsic meaning: It is generally an integer and is the primary key for the dimension table. In @adamson2010star surrogate keys are easily identifiable by the suffix "_key" in the column name (and this criterion has also been applied in `rolap` package). Dimension tables also contain *natural key* columns that uniquely identify associated entities in an operational system. The separation of surrogate keys and natural keys allows the star schema to store changes in dimensions. Therefore, the use of surrogate keys in dimensions is a solution to the SCD (*Slowly Changing Dimensions*) problem. #### Special dimensions In some cases, for the sake of simplicity, it is helpful to create a table that contains dimension attributes that have no natural affinities to each other, generally these are low-cardinality flags and indicators. The result is what is known as a *junk dimension*. They do not require any special support, only the designer's will to define them. Sometimes some dimension attributes are left in the fact table, usually transaction identifiers. It is considered as the primary key of a dimension that does not have an associated table, for this reason it is known as a *degenerate dimension*. In this package, each dimension is represented by a table, for a degenerate dimension this table will have a single attribute (for example, the transaction identifier). If the dimension table is subsequently deleted, we will have a degenerate dimension. Associated with multiple star databases we have the *conformed dimensions* that are presented in section [Conformed dimensions]. ### Fact table At the centre of a star database is the fact table. In addition to containing measures, the fact table includes *foreign keys* that refer to each of the surrogate keys in the dimension tables. #### Primary key A subset of foreign keys (along with possibly keys of degenerate dimensions) is considered to form the *primary key* of the fact table. #### Grain The subset of dimensions that forms the primary key defines the level of detail stored in the fact table, which is known as the fact table's *grain*. In the design process, it is very important for the designer to clearly define the grain of the fact table (it is usually defined by listing the dimensions whose surrogate keys form its primary key): It is a way to ensure that all the facts are stored at the same level of detail. At the finest grain, a row in the fact table corresponds to the measures of an event and vice versa, it is not influenced by the possible reports that may be obtained. When two facts have different grains, they should be set on different fact tables. ### Constellation It is frequent the need to have several fact tables for various reasons: - We find measures with different grain. - There are measures that do not occur simultaneously, for example, when one occurs, we have no value for others and vice versa. In reality it is about different business processes, each one has to have its own fact table but they have dimensions in common. This is known as a [*constellation*](https://www.geeksforgeeks.org/fact-constellation-in-data-warehouse-modelling/) which corresponds to the Kimball *enterprise data warehouse bus architecture*. ### Conformed dimensions When star databases share a set of common dimensions in a constellation, these dimensions are called *conformed dimensions*. There are several possibilities to have conformed dimensions, the most obvious form is that the dimension tables share *structure* and *content*, that is, they are identical dimensions. This is the one considered in this version of the `rolap` package. # An illustrative example In this section an illustrative example is developed to show how the package works. ## Data sets To illustrate how the package works we will use a small part of the [Deaths in 122 U.S. cities - 1962-2016. 122 Cities Mortality Reporting System](https://catalog.data.gov/dataset/deaths-in-122-u-s-cities-1962-2016-122-cities-mortality-reporting-system) dataset. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(ft, split.table = Inf) ``` In the table above, all the rows of the `ft` dataset are shown. For each week and city, mortality figures by **age** group and **cause**, considered separately, are included (i.e., the combination of age group and cause is not included). In the cause, only a distinction is made between pneumonia or influenza and others. Regarding the data types, all the columns are of type character, they have been imported directly as a `tibble`, without indicating a specific type. From this initial dataset, two additional datasets, `ft_num` and `ft_age`, have been defined with the transformations shown below. ```{r} library(rolap) ft_num <- ft |> dplyr::mutate(`Pneumonia and Influenza Deaths` = as.integer(`Pneumonia and Influenza Deaths`)) |> dplyr::mutate(`All Deaths` = as.integer(`All Deaths`)) ft_age <- ft |> dplyr::select(-`Pneumonia and Influenza Deaths`,-`All Deaths`) |> tidyr::gather("Age", "All Deaths", 7:11) |> dplyr::mutate(`All Deaths` = as.integer(`All Deaths`)) |> dplyr::mutate(Age = stringr::str_replace(Age, " \\(all cause deaths\\)", "")) ``` `ft_num` has the same structure and data as `ft`, only the data types of the two cause measures change. For `ft_age`, the columns with cause measures are removed and it is transformed into a tidy table (through `tidyr::gather`), with the only measure of numeric type and removing redundant descriptions. Below are the first 15 rows of `ft_age`. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(head(ft_age, 15), split.table = Inf) ``` `ft_num` and `ft_age` are the tables that will be considered as a starting point to obtain star databases from them in this example. ## Dimensional modelling For each table, the goal is to define the attributes that correspond to measures in facts and those that are in dimensions. For facts, measures and their aggregation functions have to be defined. For dimensions, attributes with natural affinity must be grouped. Each attribute should only appear once in the definition. To avoid having to write the name of the attributes of the table, with the following function we can have them in the form of a string. Thus, we can copy and paste each name as needed. ```{r} dput(colnames(ft_num)) dput(colnames(ft_age)) ``` ### Star Schema We create an `star_schema` object using the `star_schema()` function, as shown below. ```{r} s <- star_schema() ``` We already have an empty schema, we can add facts and dimensions to it, as shown in the sections that follow. ### Dimensions We can define the dimensions through variables, using the `dimension_schema()` function and then the `define_dimension()` function in a `star_schema` object, or directly in a `star_schema` object , using the `define_dimension()` function. In the example below, the result in both cases is the same. The advantage is that by using variables we save defining them several times if the same dimension appears in several star schemes. ```{r} # definition 1 when <- dimension_schema(name = "When", attributes = c("Year", "WEEK")) s_age <- star_schema() |> define_dimension(when) # definition 2 s_age <- star_schema() |> define_dimension(name = "When", attributes = c("Year", "WEEK")) ``` ### Facts For completeness, facts can also be defined as variables or directly associated with the star_schema object. As for the dimensions, in the example below, the result in both cases is the same. ```{r} # definition 1 mrs_cause <- fact_schema(name = "MRS Cause", measures = c("Pneumonia and Influenza Deaths", "All Deaths")) s_cause <- star_schema() |> define_facts(mrs_cause) # definition 2 s_cause <- star_schema() |> define_facts(name = "MRS Cause", measures = c("Pneumonia and Influenza Deaths", "All Deaths")) ``` Each measure has an aggregation function associated with it, by default it is *SUM*. Alternatively, the *MAX* or *MIN* functions can be used. *AVG* cannot be used because the average of averages is not the total average. Instead, a field called `nrow_agg` (an alternative name can be specified) is always added where the number of records added in the result row is stored. With that value and the result of *SUM*, the value of the *AVG* function can be obtained. The aggregation functions are needed to get the data at the granularity defined in the star schema, which can be equal to or coarser than the granularity defined in the table. ```{r} s_cause <- star_schema() |> define_facts(name = "MRS Cause", measures = c("Pneumonia and Influenza Deaths", "All Deaths"), agg_functions = c("MAX", "SUM"), nrow_agg = "Num" ) ``` In the above example, aggregation functions have been explicitly defined for the measures and the default name of the measure corresponding to the number of records added has been changed. ### Star databases From a `star_schema` object and a table (`tibble`) that contains all the fields that appear in it, using the `star_database()` function, we obtain a `star_database` object. As an example we will consider data on causes. The definition of the star schema for the data considered for the cause case is shown below. ```{r} s_cause <- star_schema() |> define_facts(name = "MRS Cause", measures = c("Pneumonia and Influenza Deaths", "All Deaths"), nrow_agg = "Num") |> define_dimension(name = "When", attributes = c("Year", "WEEK", "Week Ending Date")) |> define_dimension(name = "Where", attributes = c("REGION", "State", "City")) ``` Next we get the star database. ```{r} db_cause <- star_database(s_cause, ft_num) ``` For now the objective of this database is to export its tables to be used in OLAP query tools. Two functions are offered for this purpose, firstly `snake_case()` to transform the column names to snake case and secondly `as_tibble_list()` to generate a `tibble` list with the database tables, as shown in the following example. ```{r} l_cause <- db_cause |> snake_case() |> as_tibble_list() ``` The tables of dimensions and facts of the obtained star database are shown below. ```{r, results = "asis", echo = FALSE} for (i in 1:length(l_cause)) { pander::pandoc.table(l_cause[[i]], split.table = Inf) } ``` It can be seen that the data in the fact table has the same granularity as the data in the original table (they have the same number of records). ```{r} nrow(ft_num) nrow(l_cause[[3]]) ``` The same can be seen in the `num` field in the fact table which shows the number of records added to get each record, in this case it is always 1. Dimensions contain only the distinct value combinations for the attribute sets they contain. #### Rename fact measures and dimension attributes In this case it has not been considered necessary, but if desired, it is possible to rename both the dimension attributes and the fact measures. To do this, the functions `get_attribute_names()` and `get_measure_names()` are available to consult the current names, and `set_attribute_names()` and `set_measure_names()` to change them if necessary. ### Constellations From two or more star databases we can define a constellation. #### Definition of star databases To more effectively demonstrate the functionality, we are going to define the star databases with a coarser granularity than the original tables; additionally we are going to select only some of the rows from the tables. For the cause case is shown below. ```{r} when <- dimension_schema(name = "When", attributes = c("Year")) where <- dimension_schema(name = "Where", attributes = c("State", "City")) s_cause <- star_schema() |> define_facts(name = "MRS Cause", measures = c("Pneumonia and Influenza Deaths", "All Deaths")) |> define_dimension(when) |> define_dimension(where) ft_num2 <- ft_num |> dplyr::filter(Year > "1962") |> dplyr::filter(City == "Boston" | City == "Bridgeport") db_cause <- star_database(s_cause, ft_num2) |> snake_case() l_cause <- db_cause |> as_tibble_list() ``` The tables obtained as a result are shown below. ```{r, results = "asis", echo = FALSE} for (i in 1:length(l_cause)) { pander::pandoc.table(l_cause[[i]], split.table = Inf) } ``` Next, we define the star database and the rest of the operations in the case of age groups. ```{r} s_age <- star_schema() |> define_facts(name = "MRS Age", measures = c("All Deaths")) |> define_dimension(when) |> define_dimension(where) |> define_dimension(name = "Who", attributes = c("Age")) ft_age2 <- ft_age |> dplyr::filter(Year < "1964") |> dplyr::filter(City != "Boston" & City != "Bridgeport") db_age <- star_database(s_age, ft_age2) |> snake_case() l_age <- db_age |> as_tibble_list() ``` The tables obtained as a result are also shown below. ```{r, results = "asis", echo = FALSE} for (i in 1:length(l_age)) { pander::pandoc.table(l_age[[i]], split.table = Inf) } ``` As we have filtered the data, it can be seen that the dimension tables only contain the necessary data for each star database but, for common dimensions, they share the same structure. #### Constellation definition Next we define a constellation formed by the two star databases. ```{r example5} ct <- constellation("MRS", db_cause, db_age) ``` Constellation tables can also be exported as a tibble list. ```{r example6} lc <- ct |> as_tibble_list() ``` Below are the tables of the constellation's star databases. The instances of the dimensions have been integrated so that the tables are common to both databases. ```{r, results = "asis", echo = FALSE} for (i in 1:length(lc)) { pander::pandoc.table(lc[[i]], split.table = Inf) } ``` These tables can be directly exported in the format required by the OLAP query tool that we are going to use. Constellations can be defined from star databases and also from other constellations. For example, let's define a new star database based on the previous examples. ```{r} s_age2 <- star_schema() |> define_facts(name = "MRS Age 2", measures = c("All Deaths")) |> define_dimension(when) |> define_dimension(where) |> define_dimension(name = "Who", attributes = c("Age")) db_age2 <- star_database(s_age2, ft_age) |> snake_case() ``` Next we define a constellation, it is made up of the previously defined constellation and the new star database. ```{r} ct2 <- constellation("MRS2", ct, db_age2) ``` ## Export results To continue working with the result of the implementation of the multidimensional database (star database or constellation) it can be exported as a `tibble` list, suitable for example to pass the data to a spreadsheet, or as an object of the class `dm`, which allows passing the data to a RDBMS. In some cases, we may be interested in having the data in the form of a table again, especially if we want to carry out some statistical analysis with the integrated data. We can export them as a single `tibble` list. Functions are defined, applicable to objects of both star databases and constellations. ### `tibble` list The following shows the results that are obtained for a star database. ```{r} db_cause |> as_tibble_list() ``` Also for a constellation. ```{r} ct |> as_tibble_list() ``` ### `dm` object The [`dm`](https://cran.r-project.org/package=dm) package allows you to link tables implemented using the `tibble` class in R with tables in RDBMS. The following example shows the transformation of a star database to an object of class `dm`. ```{r} # star database db_cause_dm <- db_cause |> as_dm_class() class(db_cause_dm) db_cause_dm ``` Also for a constellation. ```{r} # constellation ct_dm <- ct |> as_dm_class(pk_facts = TRUE) class(ct_dm) ct_dm ``` For example, the `dm` class object can be used to represent the constellation tables, as it's shown in the following (it can also be used to store them in a DBMS). ```{r} ct_dm |> dm::dm_draw(rankdir = "LR", view_type = "all") ``` In the example below, a local *SQLite* database is used to demonstrate how to store the model in a database. `dm` works with any RDBMS supported by `DBI`. ```{r} db <- DBI::dbConnect(RSQLite::SQLite()) ct_dm_db <- dm::copy_dm_to(db, ct_dm) ct_dm_db DBI::dbDisconnect(db) ``` ### Single `tibble` list The following shows the results that are obtained for a star database seen as a single `tibble` list. Each single `tibble` is the result of performing an *inner join* between the fact table and the dimension tables. ```{r} db_cause |> as_single_tibble_list() ``` Also for a constellation. ```{r} ct |> as_single_tibble_list() ``` # Conclusions `rolap` package offers a set of operations that allow us to transform tables into star databases. From several star databases you can define a constellation with conformed dimensions. Operations have been designed to be intuitive and easy to use. The result greatly facilitates the data transformation process for the exposed situation. The implementation of the multidimensional database obtained can be exported to work with multidimensional analysis tools on spreadsheets or RDBMS. # References