--- title: "Generating the When Dimension based on date and time tables" date: "2024-01-08" output: rmarkdown::html_vignette bibliography: bibliography.bib vignette: > %\VignetteIndexEntry{Generating the When Dimension based on date and time tables} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction The *When Dimension* plays a fundamental role in *Multidimensional Systems*, it allows us to express **when** the analysed focus of attention have occurred. This dimension corresponds to the generically called “Time Dimension”, as named in @kimball1996data and @adamson2010star. Later it has also been called "Date Dimension" (date-based dimension) and "Time Dimension" (time-of-day-based dimension) to express granularity [@kimball2013data]. We prefer to call it "When Dimension" because granularity is not involved in that term. Although conceptually the date and time can be represented together in a single dimension, in ROLAP (*Relational On-Line Analytical Processing*) star database systems, it is common to implement these concepts separately, sometimes in separate tables "to avoid a row count explosion in the date dimension" [@kimball2013data] by adding the time for each day. In @kimball2013data these dimensions are considered *static dimensions*: they are built at the beginning of the project for the period under consideration and are referenced when adding the instances to the fact table. As mentioned there: "Typically, these dimensions are built in an afternoon with a spreadsheet." The purpose of the `when` package is to assist in the implementation of the ***When Dimension*** in Multidimensional Systems implemented on a ROLAP star database, regardless of its granularity, with the following features: - It supports the generation of tables with the granularity needed (second, minute, hour, date, week, month, quarter, semester or year) according to the relational star schema design criteria [@kimball2013data]. - We consider separate tables for **time-of-day** data and data obtained from **date** to avoid a row count explosion if a single table were used. - It can be used to generate tables **statically**, indicating a period, or **dynamically**, indicating data series. - In addition to attributes at the selected granularity, other attributes can be added at coarser granularity to facilitate querying. We can also add a surrogate key. - We can obtain the tables in `tibble` format and in other formats (*xlsx*, *csv* or *Relational DBMS* table). It relies on the functions offered by the [`lubridate`](https://CRAN.R-project.org/package=lubridate) package to obtain the components from the date and time. The rest of this document is structured as follows: First, the general process of defining these dimensions is presented. Next, we present dimension-specific aspects that we can define, both for date-based and time-based dimensions. Finally, the document ends with conclusions and bibliography. # Definition process The definition process consists of the following steps: 1. Creating an object. 1. Definition of general aspects. 1. Definition of instances. 1. Selection of levels and attributes. 1. Generation of the result. 1. Getting the dimension table. Steps 2, 3 and 4 can be done in any order, they can even be included in step 1. Below we discuss each of them. ## Creating an object The definition process begins by creating an object of the `when` class. ```{r setup} library(when) w_date <- when() ``` By default, an object is created based on date, but we can also create it based on time by indicating it using the `type` parameter. ```{r} w_time <- when(type = 'time') ``` Virtually all configuration options can be defined using the object creation function. They can also be defined later using specific functions, as shown in the following section. ## Definition of general aspects {#general_aspects} For example, using the following combination of functions we get the same result as defining the type when we create the object. ```{r} w_time_2 <- when() |> define_characteristics(type = 'time') identical(w_time, w_time_2) ``` By default it considers the system language. Since our system is Spanish, for the literals of day and month names to appear in English, we must configure it explicitly. ```{r, eval=FALSE} w_date <- w_date |> define_characteristics(locale = Sys.setlocale("LC_TIME", "English")) ``` In section [Other configuration options](#other_config_options) we present the rest of the configuration options for dimension generation. For the examples that we are going to use, the default options of the package are enough. Next we have to select and configure the levels and define the instances. These operations, along with the current one, can be performed in any order. ## Definition of instances We can define the instances by giving a range of values or by indicating the specific instances. The values in both cases will depend on the type of dimension (date-based or time-based). We can indicate these values at the time of creating the class object or through the `define_instances()` function. - In the case of date, we can indicate a year, year and month (with the format `yyyy-mm`), or year month and day (`yyyy-mm-dd`). - In the case of time, we can indicate an hour, hour and minute (with the format `hh:mm`), or hour, minute and second (`hh:mm:ss`). In both cases we can use the string format, the one provided by the [`lubridate`](https://CRAN.R-project.org/package=lubridate) package or, when it is a single number (year or hour), we can use an integer value. For example, below they are used with the format of `lubridate`. ```{r} w_date <- w_date |> define_instances(start = lubridate::today(), end = lubridate::today() + lubridate::years(5)) ``` In the following definitions we configure the dimensions so that they consider only the indicated dates. The second will be appropriate if we want to obtain the dimension at the year level, because we are indicating the specific values to consider (not a period). ```{r} w_date_2_1 <- when( values = c( "2023-12-31", "2023-01-01", "2022-12-31", "2022-01-01", "2021-12-31", "2021-01-01" ) ) w_date_2_2 <- w_date |> define_instances(values = 2020:2030) ``` The following definition will be appropriate at any level of detail because we are giving the range of values to consider (dates between those two years). ```{r} w_date_3 <- w_date |> define_instances(start = 2020, end = 2030) ``` When only the year is indicated, it is considered its first day. Therefore the previous definition is equivalent to the following one. ```{r} w_date_4 <- w_date |> define_instances(start = "2020-01-01", end = "2030-01-01") identical(w_date_3, w_date_4) ``` In the case of time, by default, it is considered at second level and all seconds of the day. ```{r} w_time_3 <- w_time |> define_instances(start = "00:00:00", end = "23:59:59") identical(w_time, w_time_3) ``` We can also indicate a period or a set of times according to the indicated criteria. In the following example, the two definitions are equivalent. ```{r} w_time_4 <- w_time |> define_instances(start = 8, end = 17) w_time_5 <- w_time |> define_instances(start = "08:00:00", end = "17:00:00") identical(w_time_4, w_time_5) ``` ## Selection of levels and attributes In the definition, by default, the attributes that we have considered to be most frequently used have been included. We can consult them for each level using the `get_level_attribute_names()` function with the parameter `selected = TRUE`. If we define the parameter as `selected = FALSE` (default option), it will return all available attributes. We can indicate the name of the level or, if we do not indicate any, it will obtain the attributes of all the levels. We obtain the level names using the `get_level_names()` function. Below are examples of using these functions for the date-based dimension. ```{r} w_date |> get_level_attribute_names(selected = TRUE) w_date |> get_level_names() w_date |> get_level_attribute_names(name = 'month', selected = TRUE) w_date |> get_level_attribute_names(name = 'month') ``` Also for the time-based dimension. ```{r} w_time |> get_level_attribute_names() w_time |> get_level_names() ``` There is a function to configure each level. We can select or deselect each attribute individually using the attribute name as a boolean parameter of the function. We can also select or deselect all attributes defined at the level using the special `include_all` and `exclude_all` parameters: Once all the attributes have been updated with these special parameters, in the same function call, we can select or deselect the attributes we want. For example, if we do not want to include the name of the month, we can configure it using the `select_month_level()` function or at the time of object creation with a parameter of the same name, as shown below. ```{r} w_date_5 <- w_date |> select_month_level(month_name = FALSE) w_date_6 <- when( start = lubridate::today(), end = lubridate::today() + lubridate::years(5), month_name = FALSE ) identical(w_date_5, w_date_6) w_date_5 |> get_level_attribute_names(name = 'month', selected = TRUE) ``` If we only want to include the name of the month, we can do it as follows. ```{r} w_date_7 <- w_date |> select_month_level(exclude_all = TRUE, month_name = TRUE) w_date_7 |> get_level_attribute_names(name = 'month', selected = TRUE) ``` There is available a similar function for each level: time, date, week, month, quarter, semester and year. In the case of the date-based dimension levels, we can select or deselect them directly using the `select_date_levels()` function. It allows us to treat the levels globally in the same way that we have been treating the attributes. For example, to exclude the month level we can do the following. ```{r} w_date_8 <- w_date |> select_date_levels(month_level = FALSE) w_date_8 |> get_level_attribute_names(name = 'month', selected = TRUE) ``` We can also perform that same operation at object creation time, as shown below. ```{r} w_date_9 <- when( start = lubridate::today(), end = lubridate::today() + lubridate::years(5), month_level = FALSE ) ``` In the case of a time-based dimension, as we have seen, we only consider one level called "time", as shown again below. ```{r} w_time |> get_level_names() ``` We can configure the attributes it includes using the `select_time_level()` function. The attribute called "hour" must always be included. If we do not include the "minute" attribute then "second" will not be included either. Below are some examples of definition and the result obtained. ```{r} w_time_6 <- w_time |> select_time_level(exclude_all = TRUE) w_time_6 |> get_level_attribute_names(selected = TRUE) w_time_7 <- w_time |> select_time_level(minute = FALSE) w_time_7 |> get_level_attribute_names(selected = TRUE) ``` ## Generation of the result Once the characteristics of the dimension are defined, we can generate its table. Although we have not yet generated the dimension table, the following function returns the names of the attributes it will contain. They can be obtained as a string or as a vector of strings. ```{r} w_date |> get_table_attribute_names(as_string = FALSE) ``` If it is generated in the form of a string (`as_string = TRUE`, default option), the result can be used to easily change the name of the attributes, when the table is generated, using the `set_table_attribute_names()` function, if deemed necessary, as shown in section [Renaming the attributes](#rename_att). When we consider that the dimension is already properly defined, we can generate its corresponding table using the `generate_table()` function. We update the definition of the object. Any errors found will be notified to us. ```{r} w_date <- w_date |> generate_table() w_time <- w_time |> generate_table() ``` Once the table is generated, we can export it. If we want to access the table and it is not yet generated, it will send us an error message. If we make changes to the configuration, it is our responsibility to generate the table again. ## Getting the dimension table {#getting_dim_table} We can access the dimension table to work with it in R (for example using the [`rolap`](https://CRAN.R-project.org/package=rolap) package), or export it in the format that is most convenient for us to work with other tools. Next we access the generated tables and display them. First of all for the date. ```{r} t_date <- w_date |> get_table() ``` The first and last rows of the obtained result are shown below. ```{r, results = "asis"} pander::pandoc.table(rbind(head(t_date, 5), tail(t_date, 5)), split.table = Inf) ``` Let's generate the dimension at the week level including all its attributes (if we change the definition, we have to generate the table again). ```{r} t_date <- w_date |> select_date_levels(day_level = FALSE) |> select_week_level(include_all = TRUE) |> generate_table() |> get_table() ``` The first and last rows of the obtained result are shown below. ```{r, results = "asis"} pander::pandoc.table(rbind(head(t_date, 5), tail(t_date, 5)), split.table = Inf) ``` Next for the time. ```{r} t_time <- w_time |> get_table() ``` The start and end of the result table is shown below. ```{r, results = "asis"} pander::pandoc.table(rbind(head(t_time, 5), tail(t_time, 5)), split.table = Inf) ``` `day_part` is predefined in English. Literals or associated hours can be accessed and changed using the `get_day_part()` and `set_day_part()` functions, as shown in section [Names of the parts of the day](#name_parts_day). Now we are going to generate it at the minute level. ```{r} t_time <- w_time |> select_time_level(second = FALSE) |> generate_table() |> get_table() ``` The start and end from the new result table is shown below. ```{r, results = "asis"} pander::pandoc.table(rbind(head(t_time, 5), tail(t_time, 5)), split.table = Inf) ``` We can take the generated tables to any Relational DBMS, as shown below, we just have to pass a connection to the database as a parameter. ```{r database} my_db <- DBI::dbConnect(RSQLite::SQLite()) w_date |> get_table_rdb(my_db) w_time |> get_table_rdb(my_db) DBI::dbListTables(my_db) DBI::dbDisconnect(my_db) ``` In the previous example, since we have not explicitly defined any names for the tables in the configuration, they have been assigned as a name the element on which they are based (date or time). We can define these names, as shown in section [Table name](#table_name). Functions to get the tables in *xlsx* and *csv* format are also available: `get_table_xlsx()` and `get_table_csv()`. # Other configuration options {#other_config_options} In section [Definition of general aspects](#general_aspects) we have presented the necessary configuration options for the examples we have used. In particular, we have only needed to define the type on which the dimension is based (time or date) and the language for naming days and months (because it is different from the language of our operating system). The default options have been adequate for the examples shown. In this section we are going to show other configuration options that may be necessary to define our dimensions. ## Dimension table features The result of a definition operation performed with the `where` package is a table of a dimension at a given level of granularity, with the attributes selected for that level or associated coarser levels. So far we have focused on level attributes and table instances. In this section we are going to configure other characteristics of it. ### Table name {#table_name} By default, the table is assigned the name of the base component used to generate the dimension: date or time. We have seen it in the tables that we have stored in the database, in the example of the section [Getting the dimension table](#getting_dim_table). We can define the name of the dimension at the time of object creation or through the `define_characteristics()` function, as shown below. ```{r} wd_1 <- when(name = 'dim_where') wd_2 <- when() |> define_characteristics(name = 'dim_where') ``` This name is the one that will be assigned to the database table where we store it or to the files we obtain. Below is the case for a relational database. ```{r} my_db <- DBI::dbConnect(RSQLite::SQLite()) wd_1 |> generate_table() |> get_table_rdb(my_db) DBI::dbListTables(my_db) DBI::dbDisconnect(my_db) ``` Also the case of obtaining it in *csv* format (if we do not indicate a location, it stores it in one by default). ```{r} wd_2 |> generate_table() |> get_table_csv() ``` ### Surrogate key By default the dimension table will include a surrogate key. In some situations we may be interested in the table having the date itself as its primary key (for example if we want to partition the fact table by the date). Through the `surrogate_key` parameter, included in the `define_characteristics()` function and also the `when()` function, we can indicate that a surrogate key is not created, as it's shown in the following. ```{r} when() |> get_table_attribute_names(as_string = FALSE) when(surrogate_key = FALSE) |> get_table_attribute_names(as_string = FALSE) ``` The name of the surrogate key is `id`, is the first attribute of the table. If necessary, we can rename the attributes, including the surrogate key as shown in the next section. ### Renaming the attributes {#rename_att} If we get a table in `tibble` format, we can rename its attributes using the functions of this class. But if we want to obtain it in a file or in a database, we may be interested in renaming its attributes before obtaining it. In order to rename the attributes, we first have to generate the table. ```{r} wd_3 <- when() |> generate_table() ``` We can rename them using the `set_table_attribute_names()` function, passing a name for each of the available attributes. To facilitate the operation, using the `get_table_attribute_names()` function, we can obtain the attributes in string format. With this we only have to replace the names that we want to change, as shown below. ```{r} wd_3 |> get_table_attribute_names() wd_3 <- wd_3 |> set_table_attribute_names( c( 'id_when', 'date', 'month_day', 'week_day', 'day_name', 'day_num_name', 'year_week', 'week', 'year_month', 'month', 'month_name', 'month_num_name', 'year' ) ) wd_3 |> get_table_attribute_names(as_string = FALSE) ``` ## Names of the parts of the day {#name_parts_day} We have associated each hour with the usual part of the day in English. We can consult them and change them if necessary (for example, to define them in another language) using the `get_day_part()` and `set_day_part()` functions. Below is an example where the definition is modified to consider only two parts in the day. ```{r} when() |> get_day_part() when() |> set_day_part(hour = c(20:23, 0:5), name = "Night") |> set_day_part(hour = c(6:19), name = "Day") |> get_day_part() ``` ## Configuration of day and week levels This section shows the configuration options available for the day and week levels. ### The first day of the week For days, in addition to the language of the name of the day of the week, we can define the start day of the week. There are two possibilities: Monday or Sunday. This determines the numbering of the days. We can define it using the parameter `week_starts_monday` (which by default has the value TRUE) in the functions `define_characteristics()` or `when()`, as shown below. ```{r} wd_1 <- when(week_starts_monday = FALSE) wd_2 <- when() |> define_characteristics(week_starts_monday = FALSE) ``` ### Numbering of weeks in the year Using the value of the `type` parameter (`type = "date"` or `type = "time"`) in the `define_characteristics()` or `when()` functions we can indicate whether the dimension is based on date or time. This parameter can additionally take two more values: "iso" and "epi". Using these values (together with the "date" value) we can indicate the type of numbering of the weeks in the year. That is, the week number associated with each date depends on the type of date dimension selected: - Default (`type = "date"`): It numbers blocks of 7 days beginning on January 1 (regardless of what day of the week it is). The last week of the year will last less than 7 days. - ISO 8601 (`type = "iso"`): It numbers blocks of 7 days from Monday to Sunday. The first and last week of the year can contain days from the previous or next year. - Epidemiological (`type = "epi"`): This week is like ISO 8601 only that it considers that the week begins on Sunday. For the first and last days of the year, the year associated with the week may be different from the year of the date, depending on the type selected ("iso" or "epi"). The "year" attribute (alone or in combination with other attributes) displays the year associated with the date, regardless of the type selected. Only the "year_week" attribute displays the year of the week according to the selected type. The next section shows how to obtain the attribute obtaining function and how to change it if we consider it necessary (for example, to get the different year types together). ### Get date range of weeks If we want to work with the dimension at the week level, we may be interested in having the start or end date for each week. To obtain them and be able to use them to generate the dimension, we have the `get_week_date_range()` function. ```{r} t <- get_week_date_range(start = "2024-01-01", end = "2029-12-31") ``` A period and the type of week numbering ("date", "iso" or "epi") are indicated (by default it is "date"). Returns the result in `tibble` form. The start and end from the new result is shown below. ```{r, results = "asis"} pander::pandoc.table(rbind(head(t, 5), tail(t, 5)), split.table = Inf) ``` We can use the column of the dates that interest us to generate the dimension. Additionally, we leave only the date and the levels that interest us (week and year). ```{r} tw <- when(values = t[["last"]], month_level = FALSE) |> select_day_level(exclude_all = TRUE, date = TRUE) |> generate_table() |> get_table() ``` The start and end from the new result table is shown below. ```{r, results = "asis"} pander::pandoc.table(rbind(head(tw, 5), tail(tw, 5)), split.table = Inf) ``` ## Change the definition function of an attribute {#change_def_function_att} Each attribute is defined by a specific function. All definition functions have the same structure: they receive a *table*, a column with the date or time *values*, possibly additional configuration parameters, and return the *table* to which a new column with the attribute name and values has been added, calculated from the values. We can get the definition function of any attribute using the `get_attribute_definition_function()` function. Below is the function for the "year" attribute and also the function of the year and week combination, "year_week" attribute (it uses one of the additional parameters available, the dimension type). ```{r} wd <- when() wd |> get_attribute_definition_function(name = "year") wd |> get_attribute_definition_function(name = "year_week") ``` In view of the functions, we can define one with the same structure and establish it as a new attribute definition function using the `set_attribute_definition_function()` function. For example, below shows how to change the year attribute definition function to get an additional attribute if the year associated with the week can be different from the year of the date. ```{r} f <- function(table, values, ...) { dots <- list(...) type <- dots[['type']] table[['year']] <- as.character(lubridate::year(values)) if (type == 'iso') { table[['week_year']] <- as.character(lubridate::isoyear(values)) } else if (type == 'epi') { table[['week_year']] <- as.character(lubridate::epiyear(values)) } table } wd <- wd |> set_attribute_definition_function(name = "year", f) ``` In this case, what we do is add two columns for the year: one with the year of the date and another with the year of the week if this can be different from the first. We can see the result by changing the type of the dimension and generating the table, as shown below (the two year attributes have been added). ```{r} t <- wd |> define_characteristics(type = 'iso') |> generate_table() |> get_table() names(t) ``` # Conclusions The `when` package offers a set of operations that allow us to generate the *When Dimension* based on date or time at the level of detail we need. We create an object of the `when` class and, from it, we configure the instances and characteristics of the levels and attributes of the dimension. To define the instances we can indicate a range of values or provide the specific values from which to generate the rest of the attributes. As a result, we can obtain tables for this dimension to work from R, for example with the [`rolap`](https://CRAN.R-project.org/package=rolap) package, or work in other environments, including Relational DBMSs. # Bibliography