--- title: "geodimension: Definition of Geographic Dimensions" author: "Jose Samos (jsamos@ugr.es)" date: "2020-11-26" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{geodimension: Definition of Geographic Dimensions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction The *multidimensional data model* was defined with the aim of supporting data analysis. In multidimensional systems, data is structured in facts and dimensions[^1]. The *geographic dimension* plays a fundamental role in multidimensional systems. Apart from the analysis possibilities it offers, like any other dimension, it is very interesting to have the possibility of representing the reports obtained from multidimensional systems, using their geographic dimensions on a map, or performing spatial analysis on them. This functionality is supported by packages [`rolap`](https://cran.r-project.org/package=rolap) and [`geomultistar`](https://cran.r-project.org/package=geomultistar). [^1]: Basic concepts of dimensional modelling and star schemas are presented in [`rolap`](https://CRAN.R-project.org/package=rolap) vignettes. To define a geographic dimension in a star schema, we need a table with attributes corresponding to the levels of the dimension. Additionally, we will also need one or more geographic layers to represent the data using this dimension. We can obtain this data from available vector layers of geographic information. In simple cases, one layer is enough. We often need several layers related to each other. The relationships can be defined by common attribute values or can be inferred from the respective geographic information. The main objective of this package is to support the definition of geographic dimensions from layers of geographic information that can be used in multidimensional systems. In particular, through packages [`rolap`](https://cran.r-project.org/package=rolap) and [`geomultistar`](https://cran.r-project.org/package=geomultistar). The rest of this document is structured as follows: First, an illustrative example of how the package works is developed. Then, the document ends with conclusions. # An illustrative example Suppose we have a multidimensional design on US data and the geographic dimension is defined at the *place* level. For each place we have its name and the code of the state in which it is located. It would be interesting to have other levels of detail in this dimension to be able to perform *roll-up* operations. With this objective, we look for layers of geographic information. In [United States Census Bureau](https://www.census.gov) we find layers at various levels of detail, including *place*, *county* and *state*. Furthermore we find the relationships in table form between *division*, *region* and *country*. For this example we obtain them from the package itself (we could read them from a *GeoPackage* or in any other format using the [`sf`](https://cran.r-project.org/package=sf) package). ```{r} library(geodimension) layer_us_place <- gd_us |> get_level_layer("place") layer_us_county <- dplyr::inner_join( get_level_data_geo(gd_us, "county"), get_level_layer(gd_us, "county"), by = c("geoid", "statefp", "name", "type") ) |> sf::st_as_sf() layer_us_state <- dplyr::inner_join( get_level_data_geo(gd_us, "state"), get_level_layer(gd_us, "state"), by = c("statefp", "division", "region", "stusps", "name") ) |> sf::st_as_sf() ``` The content of variable `us_division` is shown below. ```{r, results = "asis", echo = FALSE} pander::pandoc.table(geodimension::us_division, split.table = Inf) ``` The layers and the table are related to each other. In some cases they have attributes in common, in others, although there is a relationship, it may not be explicitly defined. We can use `geodimension` to support the definition of these relationships. Once defined, it will also offer us support to exploit them and obtain information from them. Thus, three phases can be distinguished: - Definition of levels. - Definition of relationships. - Obtaining information. ## Definition of levels In the package, each conceptual level of the geographical dimension is called `geolevel`. To define a `geolevel`, we need a layer and the set of attributes that make up the layer's key (which uniquely identify each of its instances). We can previously check if a set of attributes form a key of the layer using the `check_key()` function. ```{r} names(layer_us_place) check_key(layer_us_place, key = c("name", "statefp")) check_key(layer_us_place, key = "geoid") ``` We might expect the place name (`name`) and state code (`statefp`) to be sufficient to identify a place, however, they are not. We check that the `geoid` field is a valid key, therefore, it will be the one we use to define the `geolevel`. We can check the geometry that is considered for the definition of the level by means of the `get_geometry()` function (it simplifies the types into *point*, *line* and *polygon*). In addition, we give each level a name to be able to refer to it, as shown below. ```{r} get_geometry(layer_us_place) place <- geolevel(name = "place", layer = layer_us_place, key = "geoid") ``` For `county` it is the same as for `place`, the name and the code of the state do not compose a valid key. In this case, the geometry is polygon. Additionally, a layer with another geometry can be associated using the `add_geometry()` function. Since the layer includes longitude and latitude, we can generate a geographic layer of points using the `coordinates_to_geometry()` function. ```{r} check_key(layer_us_county, key = c("name", "statefp")) check_key(layer_us_county, key = "geoid") get_geometry(layer_us_county) county <- geolevel(name = "county", layer = layer_us_county, key = c("geoid")) |> add_geometry(coordinates_to_geometry(layer_us_county)) ``` For *state* the situation is similar to the previous cases. ```{r} us_state_point <- coordinates_to_geometry(layer_us_state, lon_lat = c("intptlon", "intptlat")) state <- geolevel(name = "state", layer = layer_us_state, key = "statefp") |> add_geometry(layer = us_state_point) ``` For both *county* and *state*, fields with longitude and latitude were available. If we only have the *polygon* geometry, we can obtain the *point* geometry using the `complete_point_geometry()` function. For the rest of the levels, we do not have a layer with specific geographic information, but we can obtain it from the previous layers. Below is only the definition of the levels. ```{r} division <- geolevel( name = "division", layer = us_division, attributes = c("country", "region_code", "division_name"), key = "division_code" ) |> add_geometry(layer = layer_us_state, layer_key = "division") |> complete_point_geometry() region <- geolevel( name = "region", layer = us_division, attributes = c("country", "region_name"), key = "region_code" ) |> add_geometry(layer = layer_us_state, layer_key = "region") |> complete_point_geometry() ``` For *division* and *region*, we define the level from the data table. Using the `add_geometry()` function, we add a layer of geographic information obtained from a finer granularity level of detail, which contains some field that relates them. Finally, we obtain the *point* type geometry from the previously defined *polygon* geometry. ```{r} country <- geolevel( name = "country", layer = get_level_layer(region), attributes = "country", key = "country" ) |> complete_point_geometry() ``` For *country*, since the state layer does not contain any field that directly relates to it, using the `get_level_layer()` function, we can obtain a new layer from any of the previously defined levels. Once the levels are defined, then we will define the dimension and the relationships between the levels. ## Definition of relationships To define a `geodimension`, we give it a name and start from any `geolevel`. If we want the names to follow the snake case criteria, we can indicate this using the `snake_case = TRUE` parameter and the conversion will be performed automatically for all operations. Next, we add the rest of the `geolevels` in any order. ```{r} gd <- geodimension(name = "gd_us", level = region, snake_case = TRUE) |> add_level(division) |> add_level(state) |> add_level(country) |> add_level(place) |> add_level(county) ``` Next, we can define the relationships that we want to consider between the levels. In a relationship there are two parts, the *lower level* and the *upper level*. To define the relationships, the following points must be taken into account: - There are no restrictions on the relationships we define, as long as the relationship can be established. - It is only necessary to define direct relationships (from them, indirect ones are derived). - Relationships can be defined using attributes with common values between the levels or through geographic relationships between their instances. - If we want to reference the upper level through attributes, these must form a valid key for the level (it does not necessarily have to be the key that was indicated when defining it, this is used by default). - To define a relationship using geographic properties, the upper level must be of type *polygon*. The relationships between *state*, *region*, *division* and *country* are defined below. ```{r} gd <- gd |> relate_levels( lower_level_name = "state", lower_level_attributes = "division", upper_level_name = "division" ) |> relate_levels( lower_level_name = "division", lower_level_attributes = "region_code", upper_level_name = "region" ) |> relate_levels( lower_level_name = "region", lower_level_attributes = "country", upper_level_name = "country" ) ``` The relationship between *state* and *division* is defined by a *state* attribute (`division`) that matches the *division* key. In the same way, the relationships between *division* and *region*, and *region* and *country* are defined. In addition to these relationships there is a relationship between *place* and *state* and also between *county* and *state*. In both cases it can be defined by attributes. ```{r} gd <- gd |> relate_levels( lower_level_name = "place", lower_level_attributes = "county_geoid", upper_level_name = "county" ) |> relate_levels( lower_level_name = "county", lower_level_attributes = "statefp", upper_level_name = "state" ) ``` In this case, we have attributes to establish the relationships. In some cases, we can resort to the geographical relationships that exist between the levels. For example, to relate *place* and *county*, using the `relate_levels()` function with the parameter `by_geography = TRUE`, a field is created at the lowest level that reflects the geographical relationship obtained. ```{r} gd_2 <- gd |> relate_levels(lower_level_name = "place", upper_level_name = "county", by_geography = TRUE) ``` We can check if all the instances have been related using the `get_unrelated_instances()` function: ```{r} nrow(get_unrelated_instances(gd_2, lower_level_name = "place", upper_level_name = "county")) ``` Since there are no unrelated instances, each instance of *place* has been linked to the *county* whose boundaries contain it. With these operations we have defined a `geodimension`. ## Obtaining information From a `geodimension` we can obtain information in table or layer format, to define a geographic dimension in a star schema. We can also define new versions of the dimension. We can consult the levels of the `geodimension` using the following function: ```{r } gd |> get_level_names() ``` A new *geodimension* is defined by selecting a subset of levels, which we want to take into account when obtaining information, or to define new dimensions. If necessary, relationships are generated between the selected levels: if there were indirect relationships defined between them that no longer exist when levels are deleted. ```{r } gds <- gd |> select_levels(level_names = c("state", "division", "region", "country")) gds |> get_level_names() ``` From any level of the `geodimension`, a data table can be obtained that includes only the data of the level or all the data inherited from higher levels. For each level we can indicate whether or not a prefix is added to identify the origin of the fields. By default it is added, as we can see below. ```{r} ld <- gd |> get_level_data(level_name = "state") names(ld) ld <- gd |> get_level_data(level_name = "state", inherited = TRUE) names(ld) ``` Previously, if we need it, we can obtain the name of the levels from which a level will inherit attributes. ```{r} gd |> get_higher_level_names(level_name = "state", indirect_levels = TRUE) ``` If we need only part of the data, for example, instead of all the places only the cities, we can get the table, modify it and set it again to modify the level. ```{r} ld_place <- gd |> get_level_data(level_name = "place") nrow(ld_place) ld_place <- ld_place |> dplyr::filter(type == "city") gd <- gd |> set_level_data(level_name = "place", data = ld_place) ld_place_2 <- gd |> get_level_data(level_name = "place") nrow(ld_place_2) ``` In this case we have only filtered it, we can also modify, delete or add attributes: it is checked that the key and foreign keys are still defined in the new table. For a level we can obtain the available geometries and a layer with the attribute configuration we want and the selected geometry. ```{r} gd |> get_level_geometries(level_name = "division") ll <- gd |> get_level_layer(level_name = "division", geometry = "polygon", only_key = TRUE) plot(sf::st_shift_longitude(ll)) ``` We can obtain a table with level data and geographic data represented in the form of points, with longitude and latitude, to be included in other tools that use this format. ```{r, results = "asis"} ld_geo <- gd |> get_level_data_geo(level_name = "division") pander::pandoc.table(ld_geo, split.table = Inf) ``` In addition to these functions, the package offers the possibility to change the CRS of all the layers of a *geodimension* using the `transform_crs()` function. # Conclusions The geographic dimension is very relevant for multidimensional systems. We can enrich a basic geographic dimension through information available in vector layers, generally, we will need several layers. Relationships between layers can be established through attributes or through the geographic relationships between their instances. The definition of these relationships can be systematized and is in part what is intended in this package. Additionally, once a `geodimension` has been defined, with the support of this package, we can easily obtain the attribute table with all the attributes of the levels that we want (if we need it), and also the layers with associated geographic information.