# DHSr: Advanced Regression, Compound Statistical Analysis, and Spatial Data Tools

The DHSr package offers advanced tools designed for repeated regression
analysis, particularly well-suited for demographic and health survey
datasets, though it can be applied to other datasets as well. It
includes capabilities for compound statistical analyses using Stein’s
estimate and features a unique approach to spatial data analysis,
enabling the identification of clusters within hot-hot and cold-cold
regions. The package also includes a basic map visualization feature,
primarily intended for visualizing calculations. For users seeking more
control over mapping, the derived variables from the package can be used
with other mapping packages for enhanced visualization options

## Installation

You can install the released version of `DHSr` from CRAN with:
install.packages(“DHSr”)

## Example: Replm2

### Running Regressions Across All Locations

The `Replm2` function allows you to run linear regressions across all
locations in a dataset. Below is a basic example using a simple formula.

here are the parameters :

`data`: The dataset containing the variables you want to analyze.

`formula`:This specifies the model you want to fit, using standard R
formula syntax (e.g., years\_education ~ gender\_female +
household\_wealth).

`location_var`: The name of the variable in your dataset that identifies
different locations or groups (e.g., district\_code). The function will
run a separate regression for each unique value in this variable.This
parameter is mandatory. If the data is not spatial data, but if the data
has other grouping varible that also can be passed here to perform
repeated regression for each group

`response_distribution` : Specifies the type of regression model to fit.
Commonly, “normal” is used for linear regression. For logistic or other
types of models, you would specify an appropriate distribution.

`family`: (Optional) Used when you are performing a generalized linear
model (GLM), such as logistic regression. For linear regression with a
normal distribution, this can be set to NULL.

    library(DHSr)
    #> DHSr package loaded with all dependencies.
    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    set.seed(123)
    dummy_data <- data.frame(
      years_education = rnorm(100, 12, 3),    # Represents years of education
      gender_female = rbinom(100, 1, 0.5),    # 1 = Female, 0 = Male
      household_wealth = sample(1:5, 100, replace = TRUE),  # Wealth index from 1 to 5
      district_code = sample(1:10, 100, replace = TRUE) # Represents district codes
    )%>% arrange(district_code)

    shapefile_path <- "C:/Users/91911/Documents/India%3A_State_Boundary_2021_.shp"
    # Define a simple regression formula
    formula <- years_education ~ gender_female + household_wealth + household_wealth:gender_female

    # Run the regression across all locations (districts)
    results1 <- Replm2(dummy_data, formula, "district_code", "normal")

    # Print the results
    print(head(results1))
    #> # A tibble: 6 × 16
    #>   location `estimate_(Intercept)` estimate_gender_female estimate_household_we…¹
    #>      <int>                  <dbl>                  <dbl>                   <dbl>
    #> 1        1                   22.4                -11.6                   -2.81  
    #> 2        2                   13.0                  2.29                  -0.229 
    #> 3        3                   12.4                  1.89                  -0.389 
    #> 4        4                   12.1                 -0.116                 -0.0620
    #> 5        5                   19.5                 -3.46                  -2.00  
    #> 6        6                   11.3                  1.67                   0.517 
    #> # ℹ abbreviated name: ¹​estimate_household_wealth
    #> # ℹ 12 more variables: `estimate_gender_female:household_wealth` <dbl>,
    #> #   `estimate_R-squared` <dbl>, `std_error_(Intercept)` <dbl>,
    #> #   std_error_gender_female <dbl>, std_error_household_wealth <dbl>,
    #> #   `std_error_gender_female:household_wealth` <dbl>,
    #> #   `std_error_R-squared` <dbl>, `p_value_(Intercept)` <dbl>,
    #> #   p_value_gender_female <dbl>, p_value_household_wealth <dbl>, …

## Example: Replmre2

The `Replmre2` function is designed to run mixed-effects models across
all locations defined by a categorical variable in your dataset. It
supports linear mixed-effects models (LMEMs), making it suitable for
continuous response variables and normally distributed errors.

`random_effect_var` : Specifies the grouping variable for random
effects. Note : one common issue ‘singularity error’ occurs with complex
random effects; to avoid try dropping that particular location from the
data.

    set.seed(123)
    # Create HHid (Household ID), grouping every 3-4 records, and convert to character
    dummy_data$HHid <- as.character(rep(1:20, each = 5, length.out = nrow(dummy_data)))

    # Test the function with full_rank dataset
    data <- dummy_data
    # Define a simple regression formula
    formula <- years_education ~ gender_female + household_wealth:gender_female

    location_var <- "district_code"
    random_effect_var <- "HHid"

    results <- DHSr::Replmre2(data, formula, location_var, random_effect_var)
    #> Warning: 'r.squaredGLMM' now calculates a revised statistic. See the help page.
    print(head(results))
    #> # A tibble: 6 × 16
    #>   location `estimate_(Intercept)` estimate_gender_female estimate_gender_femal…¹
    #>      <int>                  <dbl>                  <dbl>                   <dbl>
    #> 1        1                   12.6                -1.73                     0.522
    #> 2        2                   12.3                 3.03                    -1.08 
    #> 3        3                   11.1                 3.27                    -0.110
    #> 4        4                   11.9                 0.0698                   0.124
    #> 5        5                   12.0                 4.04                    -1.06 
    #> 6        6                   12.9                 0.906                   -0.660
    #> # ℹ abbreviated name: ¹​`estimate_gender_female:household_wealth`
    #> # ℹ 12 more variables: `estimate_Marginal R-squared` <dbl>,
    #> #   `estimate_Conditional R-squared` <dbl>, `std_error_(Intercept)` <dbl>,
    #> #   std_error_gender_female <dbl>,
    #> #   `std_error_gender_female:household_wealth` <dbl>,
    #> #   `std_error_Marginal R-squared` <dbl>,
    #> #   `std_error_Conditional R-squared` <dbl>, `p_value_(Intercept)` <dbl>, …

## Example: Repglmre2

`Repglmre2` : Extends Replmre2 to non-normal regressions like logistic
or Poisson

    # Create a binary outcome variable for years of education
    dummy_data$education_binary <- ifelse(dummy_data$years_education > 11, 1, 0)

    # Define a logistic regression formula
    formula <- education_binary ~ gender_female + household_wealth:gender_female

    location_var <- "district_code"
    random_effect_var <- "HHid"

    # Run the logistic mixed-effects model across all locations (districts)
    results <- DHSr::Repglmre2(data = dummy_data, formula = formula, 
                               location_var = location_var, random_effect_var = random_effect_var, 
                               family = binomial())

    # Print the results
    print(head(results))
    #> # A tibble: 6 × 16
    #>   location `estimate_(Intercept)` estimate_gender_female estimate_gender_femal…¹
    #>      <int>                  <dbl>                  <dbl>                   <dbl>
    #> 1        1                  0.75                  -0.228                2.17e- 2
    #> 2        2                  0.800                  0.200               -7.77e-12
    #> 3        3                  0.462                  0.449               -1.96e-17
    #> 4        4                  0.714                  0.514               -1.43e- 1
    #> 5        5                  0.750                  0.114               -4.55e- 2
    #> 6        6                  0.715                  0.131               -1.35e- 1
    #> # ℹ abbreviated name: ¹​`estimate_gender_female:household_wealth`
    #> # ℹ 12 more variables: `estimate_Marginal R-squared` <dbl>,
    #> #   `estimate_Conditional R-squared` <dbl>, `std_error_(Intercept)` <dbl>,
    #> #   std_error_gender_female <dbl>,
    #> #   `std_error_gender_female:household_wealth` <dbl>,
    #> #   `std_error_Marginal R-squared` <dbl>,
    #> #   `std_error_Conditional R-squared` <dbl>, `p_value_(Intercept)` <dbl>, …

## Example: listw, spdeplisa

`listw` : Essential for cluster map creation and Stein’s beta; optional
if weights exist.

`spdeplisa` : Computes LISA values using native spdep functions. It’s
recommended to run this before cluster\_map for consistency, though it
may not be needed if LISA signs and indicators are precomputed

`cluster_map` : Produces clusters from spatial data, with cluster IDs
and admin boundary listings (e.g., districts or states). The min\_
parameter sets the minimum number of admin boundaries per cluster, and
min\_area sets the minimum cluster area. The lisa\_cutoff adjusts the
threshold for clustering, useful for continuous clusters. The function
provides a basic map; for more dynamic mapping, users can use otyher
packages ( such as ggplot2::geom\_sf) after obtaining cluster IDs. The
lisa\_label and label specify the type of clusters (e.g., High-High).

\##cluster\_map parameter:

`dataset`: Input spatial dataset ( “sf” “data.frame” object)
`lisa_value`: Variable used for LISA analysis `lisa_label`: Variable
indicating LISA significance `label`: Specific label for clustering
(e.g., “High-High”) `lisa_cutoff`: Threshold for LISA values
`location_var`: Column name for location variable `location_name`:
Column name for location name `level2`: Optional second level of
geography `id_start`: Starting ID for cluster numbering `comparison`:
Comparison operator for cutoff `min_area`: Minimum area for valid
clusters `min_`: Minimum number of districts per cluster `title`: Title
for the resulting plot `subtitle`: Subtitle for the resulting plot
`footnote`: Footnote for the resulting plot `legend_position`: Position
of the legend on the plot `color_scheme`: Color scheme for the plot

    # Load the sf package if not already loaded
    library(sf)
    #> Warning: package 'sf' was built under R version 4.3.3
    #> Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE

    # Group by district_code and calculate mean household_wealth
    mean_wealth <- dummy_data %>%
      group_by(district_code) %>%
      summarize(mean_household_wealth = mean(household_wealth, na.rm = TRUE))




    # Use the custom listw function with loc_shape and loc_data parameters
    listw <- listw(
      shapefile_path = shapefile_path,  # Use the predefined shapefile path
      data = mean_wealth, 
      loc_shape = "objectid",            # The key column to be used for merging in the shapefile
      loc_data = "district_code",       # The column in data that will be used if loc_shape doesn't exist
      weight_function = function(d) exp(-d / 0.2)
    )
    #> Reading layer `India%3A_State_Boundary_2021_' from data source 
    #>   `C:\Users\91911\Documents\India%3A_State_Boundary_2021_.shp' 
    #>   using driver `ESRI Shapefile'
    #> Simple feature collection with 36 features and 19 fields
    #> Geometry type: MULTIPOLYGON
    #> Dimension:     XY
    #> Bounding box:  xmin: 68.12382 ymin: 6.754078 xmax: 97.4116 ymax: 37.08835
    #> Geodetic CRS:  WGS 84
    #> Warning in nb2listw(res$neighbours, glist = res$weights, style = style, : zero
    #> sum general weights


    # Read the shapefile
    shape_data<- st_read(shapefile_path)
    #> Reading layer `India%3A_State_Boundary_2021_' from data source 
    #>   `C:\Users\91911\Documents\India%3A_State_Boundary_2021_.shp' 
    #>   using driver `ESRI Shapefile'
    #> Simple feature collection with 36 features and 19 fields
    #> Geometry type: MULTIPOLYGON
    #> Dimension:     XY
    #> Bounding box:  xmin: 68.12382 ymin: 6.754078 xmax: 97.4116 ymax: 37.08835
    #> Geodetic CRS:  WGS 84

    # Ensure uniqueness by REGCODE
    shape_data <- shape_data %>% distinct(objectid, .keep_all = TRUE) 
    #objectid is the location-shape(loc_shape) in listw

    # Merge shapefile data with the provided dataset
    mean_wealth_geom<- merge(shape_data, mean_wealth, by.x = "objectid", by.y="district_code", all.y = TRUE)


    lisa2 <- Spdeplisa(mean_wealth_geom, "mean_household_wealth",listw)

    result <- cluster_map(lisa2, "lisa_I", "sign_combination3", "High-High", 0.4, "objectid", "name", "objectid", id_start = 0, comparison = ">", min_area = 0, min_ = 1,
                                 title = "Wealth Clusters ", subtitle = "State Level", footnote = NULL, legend_position = "right", color_scheme = "D")

    plot<- result$plot
    clusters <-result$dataset_with_clusters

    #merge the OLS beta's for each location 
    cluster_beta <- merge(clusters, results1, by.x = "objectid", by.y = "location")

    # Example usage
    results_with_stein_beta <- DHSr::stein_beta(
      data = cluster_beta,
      cluster_id = "cluster_id",
      beta = "estimate_gender_female"  # Replace with the actual column name for beta in your data
    )
