Basics and getting started with the rolap package

Introduction

The multidimensional data model was defined in the 1990s with the aim of supporting data analysis. Data in multidimensional systems is obtained from operational systems and is transformed to adapt it to the new structure.

Transformations can be carried out using professional ETL (Extract, Transform and Load) tools. Recently, tools aimed at end users have emerged that also perform transformation operations. All these tools are very useful for carrying out the transformation process: they provide a development environment to define the transformation operations in a general way.

Frequently, the operations to be performed aim to transform tables (with data that comes from operational systems) into a ROLAP (Relational On-Line Analytical Processing) star database, made up of fact and dimension tables, which implements a multidimensional system. With the tools mentioned above, this transformation can be carried out, but it requires a lot of work.

The goal of rolap is to define transformations that allow you to easily obtain ROLAP star databases, composed of fact and dimension tables, from operational tables, and to export them in various formats to be used by OLAP query tools.

The rest of this document is structured as follows: first, basic concepts of dimensional modelling and star databases are summarized; then, an illustrative example shows how the package works and the features it offers; finally, the document ends with conclusions and references.

Dimensional modelling and star databases

The content of this section is a summary based mainly on Adamson (2010) and Kimball and Ross (2013). It is intended to present the fundamental concepts of the area that are relevant to understand the functionality, naming and use of the rolap package.

Dimensional modelling

Dimensional modelling aims to obtain simple data models. Simplicity is sought for two reasons: so that decision-makers can easily understand the data, and so that the data can be easily queried.

In dimensional modelling, a business process is analysed by modelling how it is measured. The measures are called facts, and the descriptors of the context of the facts are dimensions. Facts are numerical data that contain measures, and decision-makers want to see them at various levels of detail, defined by dimensions.

Not all numerical data is a measure in facts (although some tools consider it that way). In dimensional modelling, the designer has to differentiate between facts and dimensions. Some criteria can be used to distinguish between them, for example:

  • If it can be defined at different levels of detail then it is a fact.
  • If it is quantitative and takes continuous values, then it is a fact.
  • If it provides context then it is a dimension.

Sometimes there are no measures associated with the business process: it is simply recorded that the combination of dimension values has occurred. This situation is often called factless facts; Jensen, Pedersen, and Thomsen (2010) prefer to call it measureless facts. In any case, even when no other measures are available, a measure can be considered that represents the number of times the combination of dimension values occurs.

Dimensions and dimension attributes

Attributes considered by the designer as dimensions can be grouped taking into account the natural affinities between them. In particular, they can be grouped according to how they describe the “who, what, where, when, how and why” associated with the modelled business process. Two attributes share a natural affinity when they are related in only one context. When their relationships are determined by transactions or activities, they can occur in multiple contexts; if this happens, they must be located in different dimensions.

In this way, a dimension is made up of a set of naturally related dimension attributes that describe the context of facts. Dimensions are used for two purposes: fact selection and fact grouping with the desired level of detail.

Additionally, hierarchies with levels and descriptors can be defined in the dimensions. More details can be found in Jensen, Pedersen, and Thomsen (2010). These concepts are not used in the current version of the package.

Facts and measures

Facts have a granularity, which is determined by the dimension attributes that are considered at each moment. Thus, a measure in a fact has two components: the numerical property and a formula, frequently the SUM aggregation function, that allows several values of the measure to be combined to obtain a new value of the same measure with a coarser granularity (Jensen, Pedersen, and Thomsen 2010).

According to their behaviour when obtaining a coarser granularity, three types of measures are distinguished: additive, semi-additive and non-additive. For additive measures, SUM is always a valid formula that maintains the meaning of the measure when the granularity changes. For semi-additive measures, there is no point in using SUM when changing the level of detail in some of the dimensions because the meaning of the measure changes; this frequently occurs with dimensions that represent time and measures that represent inventory levels. For non-additive measures, values cannot be combined across any dimension using SUM because the result obtained has a different meaning from the original measure (this generally occurs with ratios, percentages or unit amounts such as unit cost or unit price).

The most useful measures are additive. If we have non-additive measures, they can generally be redefined from other additive measures.
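
For example, a non-additive measure such as a unit price can be derived at any granularity from two additive measures, the total amount and the quantity. Below is a minimal sketch with hypothetical data (not related to the example developed later in this document):

# Hypothetical sales data: amount and quantity are additive measures.
sales <- tibble::tibble(
  product  = c("A", "A", "B"),
  amount   = c(100, 300, 50),
  quantity = c(10, 20, 5)
)

# The unit price at product granularity is derived from the sums of the
# additive measures, not by averaging the row-level unit prices:
# for product A it is (100 + 300) / (10 + 20), not the mean of 10 and 15.
sales |>
  dplyr::group_by(product) |>
  dplyr::summarise(unit_price = sum(amount) / sum(quantity))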

Star databases

Dimensional models implemented in RDBMS (Relational Database Management Systems) are known as ROLAP (Relational On-Line Analytical Processing). If they use a table for each dimension, they are called ROLAP star databases because of their resemblance to a star-like structure: a fact table in the centre and dimension tables around it. Thus, dimension attributes are columns of the respective dimension tables, and measures are columns of the fact table.

Other possible implementations in RDBMS normalize dimensions and are known as ROLAP snowflake databases. More details can be found in Jensen, Pedersen, and Thomsen (2010). This approach is not considered in this package.

Dimension tables

Dimension tables contain the context associated with business process measures. Although they can contain any type of data, numerical data is generally not used for dimension attributes because some end user query tools consider any numeric data as a measure.

Dimension attributes with NULL values are a source of problems when querying, since DBMS and query tools sometimes handle them inconsistently: the result depends on the product. It is recommended to avoid NULL values and replace them with a descriptive text. In the case of dates, it is recommended to replace NULL values with an arbitrary date in the very far future.
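
A possible way to apply these recommendations with tidyverse functions is shown below, on a hypothetical dimension table (the column names and values are made up for the illustration):

# Hypothetical dimension data with missing values.
customer <- tibble::tibble(
  name        = c("Smith", NA),
  segment     = c("Retail", NA),
  expiry_date = as.Date(c("2024-05-01", NA))
)

customer |>
  dplyr::mutate(
    name        = dplyr::coalesce(name, "Unknown"),
    segment     = dplyr::coalesce(segment, "Unknown"),
    # an arbitrary date in the very far future instead of NULL/NA
    expiry_date = dplyr::coalesce(expiry_date, as.Date("9999-12-31"))
  )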

Surrogate keys

A dimension table contains dimension attributes and also a surrogate key column. This column is a unique identifier that has no intrinsic meaning: it is generally an integer and is the primary key of the dimension table. In Adamson (2010), surrogate keys are easily identifiable by the suffix “_key” in the column name (this criterion has also been applied in the rolap package).

Dimension tables also contain natural key columns that uniquely identify associated entities in an operational system. The separation of surrogate keys and natural keys allows the star schema to store changes in dimensions. Therefore, the use of surrogate keys in dimensions is a solution to the SCD (Slowly Changing Dimensions) problem.

Special dimensions

In some cases, for the sake of simplicity, it is helpful to create a table that contains dimension attributes that have no natural affinity to each other; generally these are low-cardinality flags and indicators. The result is what is known as a junk dimension. Junk dimensions do not require any special support, only the designer’s decision to define them.

Sometimes a dimension attribute, usually a transaction identifier, is left in the fact table. It is considered the primary key of a dimension that does not have an associated table; for this reason it is known as a degenerate dimension. In this package, each dimension is represented by a table, so for a degenerate dimension this table will have a single attribute (for example, the transaction identifier). If the dimension table is subsequently deleted, we will have a degenerate dimension.

Associated with multiple star databases we have conformed dimensions, which are presented in the section Conformed dimensions.

Fact table

At the centre of a star database is the fact table. In addition to containing measures, the fact table includes foreign keys that refer to each of the surrogate keys in the dimension tables.

Primary key

A subset of the foreign keys (possibly along with the keys of degenerate dimensions) is considered to form the primary key of the fact table.

Grain

The subset of dimensions that forms the primary key defines the level of detail stored in the fact table, which is known as the fact table’s grain. In the design process, it is very important for the designer to clearly define the grain of the fact table (it is usually defined by listing the dimensions whose surrogate keys form its primary key): It is a way to ensure that all the facts are stored at the same level of detail.

At the finest grain, a row in the fact table corresponds to the measures of an event and vice versa; it is not influenced by the possible reports that may be obtained. When two facts have different grains, they should be placed in different fact tables.

Constellation

The need to have several fact tables is frequent, for various reasons:

  • We find measures with different grain.

  • There are measures that do not occur simultaneously, for example, when one occurs, we have no value for others and vice versa.

In reality, these are different business processes: each one has to have its own fact table, but they have dimensions in common. This is known as a constellation, which corresponds to the Kimball enterprise data warehouse bus architecture.

Conformed dimensions

When star databases share a set of common dimensions in a constellation, these dimensions are called conformed dimensions.

There are several ways to obtain conformed dimensions; the most obvious is that the dimension tables share structure and content, that is, they are identical dimensions. This is the case considered in this version of the rolap package.

An illustrative example

In this section an illustrative example is developed to show how the package works.

Data sets

To illustrate how the package works we will use a small part of the “Deaths in 122 U.S. cities - 1962-2016. 122 Cities Mortality Reporting System” dataset.

Year WEEK Week Ending Date REGION State City Pneumonia and Influenza Deaths All Deaths <1 year (all cause deaths) 1-24 years (all cause deaths) 25-44 years 45-64 years (all cause deaths) 65+ years (all cause deaths)
1962 2 01/13/1962 1 MA Boston 11 270 14 8 11 70 167
1962 4 01/27/1962 1 MA Boston 12 285 22 7 8 73 175
1963 4 01/26/1963 1 MA Boston 10 276 11 14 17 67 167
1964 3 01/18/1964 1 MA Boston 13 325 17 7 24 90 187
1964 6 02/08/1964 1 MA Boston 9 244 13 9 14 61 147
1962 3 01/20/1962 1 CT Bridgeport 2 40 5 1 3 10 21
1962 5 02/03/1962 1 CT Bridgeport 5 46 6 0 3 15 22
1962 8 02/24/1962 1 CT Bridgeport 2 45 2 0 2 16 25
1963 4 01/26/1963 1 CT Bridgeport 2 46 4 0 3 10 29
1964 5 02/01/1964 1 CT Bridgeport 8 45 3 1 2 11 28
1962 9 03/03/1962 1 MA Cambridge 4 39 1 0 2 7 29
1964 2 01/11/1964 1 MA Cambridge 7 31 1 0 2 9 19
1964 5 02/01/1964 1 MA Cambridge 6 27 2 0 0 8 17
1964 9 02/29/1964 1 MA Cambridge 0 26 0 0 2 8 16
1962 4 01/27/1962 1 CT Hartford 1 47 7 1 0 14 25
1962 7 02/17/1962 1 CT Hartford 4 57 3 1 3 21 29
1963 3 01/19/1963 1 CT Hartford 2 66 7 2 3 18 36
1963 7 02/16/1963 1 CT Hartford 4 77 6 1 7 19 44
1963 8 02/23/1963 1 CT Hartford 6 49 3 2 3 14 27
1964 2 01/11/1964 1 CT Hartford 3 53 7 0 2 16 28

In the table above, all the rows of the ft dataset are shown. For each week and city, mortality figures by age group and cause, considered separately, are included (i.e., the combination of age group and cause is not included). For the cause, a distinction is only made between pneumonia or influenza and other causes. Regarding the data types, all the columns are of type character: they have been imported directly as a tibble, without indicating a specific type.

From this initial dataset, two additional datasets, ft_num and ft_age, have been defined with the transformations shown below.

library(rolap)

ft_num <- ft |>
  dplyr::mutate(`Pneumonia and Influenza Deaths` = as.integer(`Pneumonia and Influenza Deaths`)) |>
  dplyr::mutate(`All Deaths` = as.integer(`All Deaths`))

ft_age <- ft |>
  dplyr::select(-`Pneumonia and Influenza Deaths`,-`All Deaths`) |>
  tidyr::gather("Age", "All Deaths", 7:11) |>
  dplyr::mutate(`All Deaths` = as.integer(`All Deaths`)) |>
  dplyr::mutate(Age = stringr::str_replace(Age, " \\(all cause deaths\\)", ""))

ft_num has the same structure and data as ft; only the data types of the two cause measures change. For ft_age, the columns with cause measures are removed and the table is transformed into a tidy table (through tidyr::gather), with a single measure of numeric type and with redundant text removed from the age group descriptions. Below are the first 15 rows of ft_age.

Year WEEK Week Ending Date REGION State City Age All Deaths
1962 2 01/13/1962 1 MA Boston <1 year 14
1962 4 01/27/1962 1 MA Boston <1 year 22
1963 4 01/26/1963 1 MA Boston <1 year 11
1964 3 01/18/1964 1 MA Boston <1 year 17
1964 6 02/08/1964 1 MA Boston <1 year 13
1962 3 01/20/1962 1 CT Bridgeport <1 year 5
1962 5 02/03/1962 1 CT Bridgeport <1 year 6
1962 8 02/24/1962 1 CT Bridgeport <1 year 2
1963 4 01/26/1963 1 CT Bridgeport <1 year 4
1964 5 02/01/1964 1 CT Bridgeport <1 year 3
1962 9 03/03/1962 1 MA Cambridge <1 year 1
1964 2 01/11/1964 1 MA Cambridge <1 year 1
1964 5 02/01/1964 1 MA Cambridge <1 year 2
1964 9 02/29/1964 1 MA Cambridge <1 year 0
1962 4 01/27/1962 1 CT Hartford <1 year 7

ft_num and ft_age are the tables that will be considered as a starting point to obtain star databases from them in this example.

Dimensional modelling

For each table, the goal is to define the attributes that correspond to measures in facts and those that are in dimensions. For facts, measures and their aggregation functions have to be defined. For dimensions, attributes with natural affinity must be grouped. Each attribute should only appear once in the definition.

To avoid having to type the names of the table attributes, the following function returns them in the form of a character vector definition, so we can copy and paste each name as needed.

dput(colnames(ft_num))
#> c("Year", "WEEK", "Week Ending Date", "REGION", "State", "City", 
#> "Pneumonia and Influenza Deaths", "All Deaths", "<1 year (all cause deaths)", 
#> "1-24 years (all cause deaths)", "25-44 years", "45-64 years (all cause deaths)", 
#> "65+ years (all cause deaths)")

dput(colnames(ft_age))
#> c("Year", "WEEK", "Week Ending Date", "REGION", "State", "City", 
#> "Age", "All Deaths")

Star Schema

We create a star_schema object using the star_schema() function, as shown below.

s <- star_schema()

We now have an empty schema to which we can add facts and dimensions, as shown in the sections that follow.

Dimensions

We can define dimensions through variables, using the dimension_schema() function and then the define_dimension() function on a star_schema object, or directly on a star_schema object using the define_dimension() function. In the example below, the result is the same in both cases. The advantage of using variables is that we avoid repeating the definition if the same dimension appears in several star schemas.

# definition 1
when <- dimension_schema(name = "When",
                         attributes = c("Year",
                                        "WEEK"))
s_age <- star_schema() |>
  define_dimension(when)

# definition 2
s_age <- star_schema() |>
  define_dimension(name = "When",
                         attributes = c("Year",
                                        "WEEK"))

Facts

Likewise, facts can be defined as variables or directly associated with the star_schema object. As with dimensions, in the example below the result is the same in both cases.

# definition 1
mrs_cause <- fact_schema(name = "MRS Cause",
               measures = c("Pneumonia and Influenza Deaths",
                            "All Deaths"))
s_cause <- star_schema() |>
  define_facts(mrs_cause)

# definition 2
s_cause <- star_schema() |>
  define_facts(name = "MRS Cause",
               measures = c("Pneumonia and Influenza Deaths",
                            "All Deaths"))

Each measure has an aggregation function associated with it; by default it is SUM. Alternatively, the MAX or MIN functions can be used. AVG cannot be used because the average of averages is not the total average. Instead, a field called nrow_agg (an alternative name can be specified) is always added, storing the number of records aggregated into each result row. With that value and the result of SUM, the value of the AVG function can be obtained.

The aggregation functions are needed to get the data at the granularity defined in the star schema, which can be equal to or coarser than the granularity defined in the table.

s_cause <- star_schema() |>
  define_facts(name = "MRS Cause",
               measures = c("Pneumonia and Influenza Deaths",
                            "All Deaths"),
               agg_functions = c("MAX", "SUM"),
               nrow_agg = "Num" 
               )

In the above example, aggregation functions have been explicitly defined for the measures, and the default name of the measure corresponding to the number of aggregated records has been changed.
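
As noted above, with the SUM of a measure and the number of aggregated records it is possible to derive its average afterwards. A minimal sketch on a hypothetical fact tibble (not part of the example data; the column names are made up for the illustration):

# Hypothetical exported fact table: all_deaths aggregated with SUM,
# num = number of aggregated records.
avg_sketch <- tibble::tibble(
  year       = c("1963", "1964"),
  all_deaths = c(322L, 641L),
  num        = c(2L, 3L)
)

# AVG derived from SUM and the record counter.
avg_sketch |>
  dplyr::mutate(avg_deaths = all_deaths / num)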

Star databases

From a star_schema object and a table (tibble) that contains all the fields that appear in it, using the star_database() function, we obtain a star_database object. As an example we will consider data on causes.

The definition of the star schema for the cause case is shown below.

s_cause <- star_schema() |>
  define_facts(name = "MRS Cause",
               measures = c("Pneumonia and Influenza Deaths",
                            "All Deaths"),
               nrow_agg = "Num") |>
  define_dimension(name = "When",
                         attributes = c("Year",
                                        "WEEK",
                                        "Week Ending Date")) |>
  define_dimension(name = "Where",
                          attributes = c("REGION",
                                         "State",
                                         "City"))

Next we get the star database.

db_cause <- star_database(s_cause, ft_num)

For now, the objective of this database is to export its tables to be used in OLAP query tools. Two functions are offered for this purpose: snake_case(), to transform the column names to snake case, and as_tibble_list(), to generate a tibble list with the database tables, as shown in the following example.

l_cause <- db_cause |>
  snake_case() |>
  as_tibble_list()

The tables of dimensions and facts of the obtained star database are shown below.

when_key year week week_ending_date
1 1962 2 01/13/1962
2 1962 3 01/20/1962
3 1962 4 01/27/1962
4 1962 5 02/03/1962
5 1962 7 02/17/1962
6 1962 8 02/24/1962
7 1962 9 03/03/1962
8 1963 3 01/19/1963
9 1963 4 01/26/1963
10 1963 7 02/16/1963
11 1963 8 02/23/1963
12 1964 2 01/11/1964
13 1964 3 01/18/1964
14 1964 5 02/01/1964
15 1964 6 02/08/1964
16 1964 9 02/29/1964
where_key region state city
1 1 CT Bridgeport
2 1 CT Hartford
3 1 MA Boston
4 1 MA Cambridge
when_key where_key pneumonia_and_influenza_deaths all_deaths num
1 3 11 270 1
2 1 2 40 1
3 2 1 47 1
3 3 12 285 1
4 1 5 46 1
5 2 4 57 1
6 1 2 45 1
7 4 4 39 1
8 2 2 66 1
9 1 2 46 1
9 3 10 276 1
10 2 4 77 1
11 2 6 49 1
12 2 3 53 1
12 4 7 31 1
13 3 13 325 1
14 1 8 45 1
14 4 6 27 1
15 3 9 244 1
16 4 0 26 1

It can be seen that the data in the fact table has the same granularity as the data in the original table (they have the same number of records).

nrow(ft_num)
#> [1] 20

nrow(l_cause[[3]])
#> [1] 20

The same can be seen in the num field of the fact table, which shows the number of records aggregated to obtain each row; in this case it is always 1. Dimensions contain only the distinct value combinations of their attributes.
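
This can also be checked against the source table: the number of rows of the when dimension (the first tibble in the list obtained above) should match the number of distinct combinations of its attributes in ft_num.

nrow(l_cause[[1]])
nrow(dplyr::distinct(ft_num, Year, WEEK, `Week Ending Date`))
# both should be 16 for the data shown above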

Rename fact measures and dimension attributes

In this case it has not been considered necessary, but if desired, it is possible to rename both the dimension attributes and the fact measures.

To do this, the functions get_attribute_names() and get_measure_names() are available to consult the current names, and set_attribute_names() and set_measure_names() to change them if necessary.
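
A hypothetical sketch of their use is shown below; the exact parameters (here, selecting the dimension by its name and providing a vector of new names) are assumptions that should be checked in the package documentation:

# Assumed signatures: consult the attribute names of a dimension and rename them.
db_cause |>
  get_attribute_names(name = "Where")
db_cause |>
  set_attribute_names(name = "Where",
                      new = c("Region", "State", "City"))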

Constellations

From two or more star databases we can define a constellation.

Definition of star databases

To more effectively demonstrate the functionality, we are going to define the star databases with a coarser granularity than the original tables; additionally we are going to select only some of the rows from the tables.

The definition for the cause case is shown below.

when <- dimension_schema(name = "When",
                         attributes = c("Year"))

where <- dimension_schema(name = "Where",
                          attributes = c("State",
                                         "City"))
s_cause <- star_schema() |>
  define_facts(name = "MRS Cause",
               measures = c("Pneumonia and Influenza Deaths",
                            "All Deaths")) |>
  define_dimension(when) |>
  define_dimension(where)

ft_num2 <- ft_num |>
  dplyr::filter(Year > "1962") |>
  dplyr::filter(City == "Boston" | City == "Bridgeport")

db_cause <- star_database(s_cause, ft_num2) |>
  snake_case()

l_cause <- db_cause |>
  as_tibble_list()

The tables obtained as a result are shown below.

when_key year
1 1963
2 1964
where_key state city
1 CT Bridgeport
2 MA Boston
when_key where_key pneumonia_and_influenza_deaths all_deaths nrow_agg
1 1 2 46 1
1 2 10 276 1
2 1 8 45 1
2 2 22 569 2

Next, we define the star database and the rest of the operations in the case of age groups.

s_age <- star_schema() |>
  define_facts(name = "MRS Age",
               measures = c("All Deaths")) |>
  define_dimension(when) |>
  define_dimension(where) |>
  define_dimension(name = "Who",
                         attributes = c("Age"))

ft_age2 <- ft_age |>
  dplyr::filter(Year < "1964") |>
  dplyr::filter(City != "Boston" & City != "Bridgeport")

db_age <- star_database(s_age, ft_age2) |>
  snake_case()

l_age <- db_age |>
  as_tibble_list()

The tables obtained as a result are also shown below.

when_key year
1 1962
2 1963
where_key state city
1 CT Hartford
2 MA Cambridge
who_key age
1 1-24 years
2 25-44 years
3 45-64 years
4 65+ years
5 <1 year
when_key where_key who_key all_deaths nrow_agg
1 1 1 2 2
1 1 2 3 2
1 1 3 35 2
1 1 4 54 2
1 1 5 10 2
1 2 1 0 1
1 2 2 2 1
1 2 3 7 1
1 2 4 29 1
1 2 5 1 1
2 1 1 5 3
2 1 2 13 3
2 1 3 51 3
2 1 4 107 3
2 1 5 16 3

As we have filtered the data, it can be seen that the dimension tables contain only the data needed by each star database, but, for the common dimensions, they share the same structure.
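
This can be checked on the exported lists (assuming the element names of the tibble lists, which correspond to the table names shown in the outputs later in this document):

# same columns in the common Where dimension of both star databases...
identical(colnames(l_cause[["where"]]), colnames(l_age[["where"]]))
# ...but, after filtering, different contents
identical(l_cause[["where"]], l_age[["where"]])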

Constellation definition

Next we define a constellation formed by the two star databases.

ct <- constellation("MRS", db_cause, db_age)

Constellation tables can also be exported as a tibble list.

lc <- ct |>
  as_tibble_list()

Below are the tables of the constellation’s star databases. The instances of the dimensions have been integrated so that the tables are common to both databases.

when_key year
1 1962
2 1963
3 1964
where_key state city
1 CT Bridgeport
2 CT Hartford
3 MA Boston
4 MA Cambridge
who_key age
1 1-24 years
2 25-44 years
3 45-64 years
4 65+ years
5 <1 year
when_key where_key pneumonia_and_influenza_deaths all_deaths nrow_agg
2 1 2 46 1
2 3 10 276 1
3 1 8 45 1
3 3 22 569 2
when_key where_key who_key all_deaths nrow_agg
1 2 1 2 2
1 2 2 3 2
1 2 3 35 2
1 2 4 54 2
1 2 5 10 2
1 4 1 0 1
1 4 2 2 1
1 4 3 7 1
1 4 4 29 1
1 4 5 1 1
2 2 1 5 3
2 2 2 13 3
2 2 3 51 3
2 2 4 107 3
2 2 5 16 3

These tables can be directly exported in the format required by the OLAP query tool that we are going to use.
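
For example, if the target tool reads CSV files, each table in the list could be written to a file named after it. This is a sketch using tidyverse functions, not a feature of the package:

# Write each constellation table to a CSV file named after it.
purrr::iwalk(lc, \(table, name) readr::write_csv(table, paste0(name, ".csv")))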

Constellations can be defined from star databases and also from other constellations. For example, let’s define a new star database based on the previous examples.

s_age2 <- star_schema() |>
  define_facts(name = "MRS Age 2",
               measures = c("All Deaths")) |>
  define_dimension(when) |>
  define_dimension(where) |>
  define_dimension(name = "Who",
                         attributes = c("Age"))

db_age2 <- star_database(s_age2, ft_age) |>
  snake_case()

Next, we define a constellation made up of the previously defined constellation and the new star database.

ct2 <- constellation("MRS2", ct, db_age2)
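
As a quick check (output not shown here), the tables of the new constellation can be listed; the list should contain one tibble per conformed dimension and one per fact table.

ct2 |>
  as_tibble_list() |>
  names()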

Export results

To continue working with the implementation of the multidimensional database (star database or constellation), it can be exported as a tibble list, suitable for example for passing the data to a spreadsheet, or as an object of class dm, which allows the data to be passed to an RDBMS.

In some cases, we may be interested in having the data in the form of a table again, especially if we want to carry out some statistical analysis with the integrated data. In that case, we can export them as a single tibble list: a list with one flat table (tibble) per fact table.

These functions are defined for both star database and constellation objects.

tibble list

The following shows the results that are obtained for a star database.

db_cause |>
  as_tibble_list()
#> $when
#> # A tibble: 2 × 2
#>   when_key year 
#>      <int> <chr>
#> 1        1 1963 
#> 2        2 1964 
#> 
#> $where
#> # A tibble: 2 × 3
#>   where_key state city      
#>       <int> <chr> <chr>     
#> 1         1 CT    Bridgeport
#> 2         2 MA    Boston    
#> 
#> $mrs_cause
#> # A tibble: 4 × 5
#>   when_key where_key pneumonia_and_influenza_deaths all_deaths nrow_agg
#>      <int>     <int>                          <int>      <int>    <int>
#> 1        1         1                              2         46        1
#> 2        1         2                             10        276        1
#> 3        2         1                              8         45        1
#> 4        2         2                             22        569        2

Also for a constellation.

ct |>
  as_tibble_list()
#> $when
#> # A tibble: 3 × 2
#>   when_key year 
#>      <int> <chr>
#> 1        1 1962 
#> 2        2 1963 
#> 3        3 1964 
#> 
#> $where
#> # A tibble: 4 × 3
#>   where_key state city      
#>       <int> <chr> <chr>     
#> 1         1 CT    Bridgeport
#> 2         2 CT    Hartford  
#> 3         3 MA    Boston    
#> 4         4 MA    Cambridge 
#> 
#> $who
#> # A tibble: 5 × 2
#>   who_key age        
#>     <int> <chr>      
#> 1       1 1-24 years 
#> 2       2 25-44 years
#> 3       3 45-64 years
#> 4       4 65+ years  
#> 5       5 <1 year    
#> 
#> $mrs_cause
#> # A tibble: 4 × 5
#>   when_key where_key pneumonia_and_influenza_deaths all_deaths nrow_agg
#>      <int>     <int>                          <int>      <int>    <int>
#> 1        2         1                              2         46        1
#> 2        2         3                             10        276        1
#> 3        3         1                              8         45        1
#> 4        3         3                             22        569        2
#> 
#> $mrs_age
#> # A tibble: 15 × 5
#>    when_key where_key who_key all_deaths nrow_agg
#>       <int>     <int>   <int>      <int>    <int>
#>  1        1         2       1          2        2
#>  2        1         2       2          3        2
#>  3        1         2       3         35        2
#>  4        1         2       4         54        2
#>  5        1         2       5         10        2
#>  6        1         4       1          0        1
#>  7        1         4       2          2        1
#>  8        1         4       3          7        1
#>  9        1         4       4         29        1
#> 10        1         4       5          1        1
#> 11        2         2       1          5        3
#> 12        2         2       2         13        3
#> 13        2         2       3         51        3
#> 14        2         2       4        107        3
#> 15        2         2       5         16        3

dm object

The dm package allows you to link tables implemented using the tibble class in R with tables in an RDBMS.

The following example shows the transformation of a star database to an object of class dm.

# star database
db_cause_dm <- db_cause |>
  as_dm_class()
class(db_cause_dm)
#> [1] "dm"
db_cause_dm
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `when`, `where`, `mrs_cause`
#> Columns: 10
#> Primary keys: 3
#> Foreign keys: 2

Also for a constellation.

# constellation
ct_dm <- ct |>
  as_dm_class(pk_facts = TRUE)
class(ct_dm)
#> [1] "dm"
ct_dm
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `when`, `where`, `who`, `mrs_cause`, `mrs_age`
#> Columns: 17
#> Primary keys: 5
#> Foreign keys: 5

For example, the dm class object can be used to represent the constellation tables graphically, as shown below (it can also be used to store them in a DBMS).

ct_dm |> 
  dm::dm_draw(rankdir = "LR", view_type = "all")

In the example below, a local SQLite database is used to demonstrate how to store the model in a database. dm works with any RDBMS supported by DBI.

db <- DBI::dbConnect(RSQLite::SQLite())
ct_dm_db <- dm::copy_dm_to(db, ct_dm)
ct_dm_db
#> ── Table source ────────────────────────────────────────────────────────────────
#> src:  sqlite 3.46.0 []
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `when`, `where`, `who`, `mrs_cause`, `mrs_age`
#> Columns: 17
#> Primary keys: 5
#> Foreign keys: 5
DBI::dbDisconnect(db)

Single tibble list

The following shows the results that are obtained for a star database seen as a single tibble list. Each single tibble is the result of performing an inner join between the fact table and the dimension tables.

db_cause |>
  as_single_tibble_list()
#> $mrs_cause
#> # A tibble: 4 × 6
#>   year  state city       pneumonia_and_influenza_deaths all_deaths nrow_agg
#>   <chr> <chr> <chr>                               <int>      <int>    <int>
#> 1 1963  CT    Bridgeport                              2         46        1
#> 2 1963  MA    Boston                                 10        276        1
#> 3 1964  CT    Bridgeport                              8         45        1
#> 4 1964  MA    Boston                                 22        569        2

Also for a constellation.

ct |>
  as_single_tibble_list()
#> $mrs_cause
#> # A tibble: 4 × 6
#>   year  state city       pneumonia_and_influenza_deaths all_deaths nrow_agg
#>   <chr> <chr> <chr>                               <int>      <int>    <int>
#> 1 1963  CT    Bridgeport                              2         46        1
#> 2 1963  MA    Boston                                 10        276        1
#> 3 1964  CT    Bridgeport                              8         45        1
#> 4 1964  MA    Boston                                 22        569        2
#> 
#> $mrs_age
#> # A tibble: 15 × 6
#>    year  state city      age         all_deaths nrow_agg
#>    <chr> <chr> <chr>     <chr>            <int>    <int>
#>  1 1962  CT    Hartford  1-24 years           2        2
#>  2 1962  CT    Hartford  25-44 years          3        2
#>  3 1962  CT    Hartford  45-64 years         35        2
#>  4 1962  CT    Hartford  65+ years           54        2
#>  5 1962  CT    Hartford  <1 year             10        2
#>  6 1962  MA    Cambridge 1-24 years           0        1
#>  7 1962  MA    Cambridge 25-44 years          2        1
#>  8 1962  MA    Cambridge 45-64 years          7        1
#>  9 1962  MA    Cambridge 65+ years           29        1
#> 10 1962  MA    Cambridge <1 year              1        1
#> 11 1963  CT    Hartford  1-24 years           5        3
#> 12 1963  CT    Hartford  25-44 years         13        3
#> 13 1963  CT    Hartford  45-64 years         51        3
#> 14 1963  CT    Hartford  65+ years          107        3
#> 15 1963  CT    Hartford  <1 year             16        3
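
As an illustration of the inner joins mentioned at the beginning of this section, the result for db_cause could also be reproduced (except possibly for column order) by joining the tibbles exported earlier. This is a sketch of the equivalent dplyr operations, not how the package implements it:

l_cause[["mrs_cause"]] |>
  dplyr::inner_join(l_cause[["when"]], by = "when_key") |>
  dplyr::inner_join(l_cause[["where"]], by = "where_key") |>
  dplyr::select(-when_key, -where_key)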

Conclusions

The rolap package offers a set of operations that allow us to transform tables into star databases. From several star databases, we can define a constellation with conformed dimensions.

The operations have been designed to be intuitive and easy to use. The result greatly facilitates the data transformation process for the situations described.

The implementation of the multidimensional database obtained can be exported to work with multidimensional analysis tools on spreadsheets or RDBMS.

References

Adamson, Christopher. 2010. Star Schema: The Complete Reference. McGraw Hill Professional.
Jensen, Christian S., Torben Bach Pedersen, and Christian Thomsen. 2010. Multidimensional Databases and Data Warehousing. Morgan & Claypool Publishers.
Kimball, Ralph, and Margy Ross. 2013. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling. John Wiley & Sons.