================================================================================
SUPPLEMENTARY MATERIALS
================================================================================

Paper:  "Inequality of Opportunity in Mexico: A Cultural Capital Approach
         Using Conditional Inference Trees and Shapley Decomposition"

Data:   ESRU Social Mobility Survey (ESRU-EMOVI) 2023
        Centro de Estudios Espinosa Yglesias (CEEY)

================================================================================
1. DATA AVAILABILITY
================================================================================

The ESRU-EMOVI 2023 microdata must be obtained directly from the Centro de
Estudios Espinosa Yglesias (CEEY) at:

    https://ceey.org.mx/emovi/

The survey includes three Stata (.dta) files:
  - entrevistado_2023.dta   (individual-level respondent data)
  - hogar_2023.dta          (household-level data)
  - inclusion_2023.dta      (financial inclusion module, optional)

After downloading, place the .dta files in:

    data/raw/emovi/Data/

The data dictionary (Excel) should be placed in:

    data/raw/emovi/Diccionario ESRU EMOVI 2023.xlsx

================================================================================
2. SOFTWARE REQUIREMENTS
================================================================================

R version: >= 4.3.0 (developed with R 4.5.2)

Required R packages (installed automatically by 00_setup.R):

  Data manipulation:
    - tidyverse (>= 2.0)     Data wrangling and visualization
    - haven (>= 2.5)         Reading Stata .dta files
    - readxl                  Reading Excel files
    - yaml                    Reading YAML configuration
    - janitor                 Cleaning variable names
    - labelled                Handling labelled Stata data

  Tree-based models:
    - partykit (>= 1.2)      Conditional inference trees (ctree)
    - party                   Conditional inference forests (cforest)

  Inequality metrics:
    - ineq                    Gini, Theil, entropy measures
    - acid                    Advanced inequality decomposition

  Interpretability:
    - pdp                     Partial dependence plots
    - iml                     ICE plots
    - vip                     Variable importance

  Survey analysis:
    - survey                  Complex survey design support

  Index construction:
    - FactoMineR              Multiple Correspondence Analysis (MCA)
    - factoextra              MCA visualization (diagnostics)
    - psych                   Cronbach's alpha (diagnostics)

  Reporting:
    - rmarkdown, knitr        Report generation
    - kableExtra, gt          Table formatting
    - patchwork               Combining multi-panel figures
    - scales                  Axis formatting

  Utilities:
    - here                    Project-relative paths
    - glue                    String interpolation
    - assertthat              Input validation
    - tictoc                  Execution timing
    - viridis                 Color palettes
    - gridExtra               Multi-plot layouts
    - rpart, rpart.plot       CART trees (comparison only)

================================================================================
3. PROJECT STRUCTURE
================================================================================

Inequality-of-Opportunity/
|-- config/
|   |-- config.yaml              Project configuration (paths, seeds, params)
|   |-- variable_roles.yaml      Variable classification and metadata
|
|-- data/
|   |-- raw/emovi/Data/          Raw EMOVI 2023 .dta files (NOT included)
|   |-- processed/               Generated RDS files from preprocessing
|
|-- src/R/
|   |-- 00_setup.R               Environment setup (packages, seeds, paths)
|   |-- 01_load_data.R           Load raw Stata data, save as RDS
|   |-- 02_preprocess.R          Data cleaning and index construction
|   |-- 03_ctree_model.R         Conditional inference tree estimation
|   |-- 05_iop_metrics.R         IOp metric calculations (Gini, MLD, R2)
|   |-- 06_sensitivity_analysis.R  Robustness checks
|   |-- 07_cohort_analysis.R     Birth cohort IOp trends
|   |-- 08_shapley_decomposition.R  Shapley value decomposition of IOp
|   |-- 09_run_full_analysis.R   Master pipeline script
|   |-- 11_mca_diagnostics.R     MCA diagnostic plots and Cronbach's alpha
|   |-- 12_descriptive_statistics.R  Descriptive statistics tables
|   |-- ctree_publication.R      Publication-quality ctree figure
|   |-- visualize_trees.R        Tree visualization and IOp computation
|   |-- regen_figures.R          Regenerate figures from saved CSV data
|   |-- utils/
|       |-- iop_functions.R      Reusable IOp helper functions
|       |-- data_loader.R        Data loading utilities
|
|-- outputs/
|   |-- figures/                 Generated figures (.png; 300 dpi by default)
|   |-- tables/                  All generated tables (.csv)
|   |-- models/                  Saved model objects (.rds)

================================================================================
4. SCRIPT DESCRIPTIONS
================================================================================

--- Core Pipeline Scripts (run in order) ---

00_setup.R
    Purpose:  Initialize the R environment for all subsequent scripts.
    Details:  Installs and loads all required packages. Reads project
              configuration from config/config.yaml and variable metadata
              from config/variable_roles.yaml. Sets the global random seed
              (42) and stores reproducibility seeds for cross-validation
              and bootstrap. Creates output directories. Defines a custom
              ggplot2 theme (theme_iop) and utility functions for saving
              figures, tables, and logging.

01_load_data.R
    Purpose:  Load raw ESRU-EMOVI 2023 Stata files and convert to RDS.
    Details:  Reads entrevistado_2023.dta and hogar_2023.dta using the
              haven package. Saves as RDS files in data/processed/ for
              faster subsequent loading. Validates that expected data
              files exist before proceeding.
    Depends:  00_setup.R

02_preprocess.R
    Purpose:  Data preprocessing, variable recoding, and index construction.
    Details:  Cleans variable names (janitor) and converts Stata-labelled
              variables to R factors. Constructs four circumstance indices
              and evaluates a fifth, which is dropped:
              (1) Household Economic Index via Multiple Correspondence
              Analysis (MCA) on 16 asset/service binary variables, with
              automatic orientation verification so higher values indicate
              wealthier households.
              (2) Neighborhood Quality Index from 8 community amenity
              variables (p33a-h), scored on a 0-100 scale.
              (3) Crowding Index as persons per bedroom (p22/p24), with
              winsorization at the 99th percentile.
              (4) Financial Inclusion Index as a binary indicator of
              whether the family had any formal financial services
              (savings, credit card, insurance) at age 14.
              (5) Cultural Capital Index was evaluated but dropped due to
              80% variable overlap with the Household Economic Index.
              Saves the preprocessed dataset as entrevistado_clean.rds.
    Depends:  00_setup.R, 01_load_data.R (output)
    Outputs:  data/processed/entrevistado_clean.rds
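
    Sketch:   The following base-R example illustrates indices (2) and (3)
              on made-up data. The helper names winsorize_upper() and
              score_0_100(), the toy columns, and the equal weighting of
              amenities are assumptions for illustration; the code in
              02_preprocess.R may differ.

```r
# Hypothetical helpers; not taken from 02_preprocess.R.

# Cap values above the p-th quantile at that quantile (upper-tail winsorization).
winsorize_upper <- function(x, p = 0.99) {
  cap <- quantile(x, probs = p, na.rm = TRUE, names = FALSE)
  pmin(x, cap)
}

# Share of amenities present, rescaled to 0-100 (equal weights assumed).
score_0_100 <- function(amenities) {
  100 * rowMeans(amenities, na.rm = TRUE)
}

set.seed(42)
n <- 1000
persons  <- sample(1:12, n, replace = TRUE)   # stands in for p22
bedrooms <- sample(1:4,  n, replace = TRUE)   # stands in for p24

crowding_raw <- persons / bedrooms
crowding     <- winsorize_upper(crowding_raw, p = 0.99)

# Eight 0/1 amenity indicators standing in for p33a-p33h
amenities <- matrix(rbinom(n * 8, size = 1, prob = 0.6), ncol = 8)
neighborhood_index <- score_0_100(amenities)

summary(crowding)
range(neighborhood_index)
```

              Winsorizing caps the extreme values instead of dropping the
              rows, so the sample size is unchanged.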

03_ctree_model.R
    Purpose:  Fit conditional inference trees to partition the sample into
              "types" defined by exogenous circumstances.
    Details:  Defines multiple circumstance sets: minimal (K=5), standard
              (K=9), extended, household, neighborhood, cultural, and
              maximum (K=14). The standard set includes: father's education
              (educp), mother's education (educm), father's occupational
              class (clasep), sex (sexo), indigenous language (p111), skin
              tone on the PERLA scale (p112), region at age 14 (region_14),
              birth cohort (cohorte), and rural/urban status at 14 (p21).
              Uses partykit::ctree with Bonferroni-adjusted quadratic test
              statistics. Default hyperparameters: mincriterion=0.95,
              minsplit=100, minbucket=50, maxdepth=6. Includes 5-fold
              cross-validation over a parameter grid for tuning. Exports
              type summary tables and tree visualizations.
    Key function: run_ctree_analysis(circumstance_set)
    Depends:  00_setup.R
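
    Sketch:   A minimal example of fitting a ctree with the default
              hyperparameters listed above, assuming the partykit package
              is installed. The data are simulated and only three of the
              circumstance names from the text are used; this is not the
              body of run_ctree_analysis().

```r
library(partykit)

set.seed(42)
n <- 2000
# Simulated stand-ins for three circumstances (names mirror the text,
# values are invented).
d <- data.frame(
  educp = factor(sample(c("none", "primary", "secondary", "tertiary"),
                        n, replace = TRUE)),
  sexo  = factor(sample(c("female", "male"), n, replace = TRUE)),
  p112  = sample(1:11, n, replace = TRUE)
)
# Toy outcome: log income depends on father's education and sex only.
d$log_income <- 8 + 0.4 * as.integer(d$educp) + 0.3 * (d$sexo == "male") +
  rnorm(n, sd = 0.8)

ctrl <- ctree_control(
  teststat     = "quadratic",
  testtype     = "Bonferroni",
  mincriterion = 0.95,
  minsplit     = 100,
  minbucket    = 50,
  maxdepth     = 6
)

fit <- ctree(log_income ~ educp + sexo + p112, data = d, control = ctrl)

# Each terminal node is one "type"; the node mean is the smoothed outcome.
d$type     <- predict(fit, type = "node")
d$smoothed <- predict(fit, type = "response")

table(d$type)
```

              The type and smoothed columns are the inputs that the IOp
              metrics need: a partition of the sample and the type mean
              assigned to each respondent.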

05_iop_metrics.R
    Purpose:  Compute Inequality of Opportunity indices from tree partitions.
    Details:  Implements the ex-ante approach following Roemer (1998) and
              Ferreira & Gignoux (2011), with types taken non-parametrically
              from the ctree terminal nodes: IOp = I(smoothed) / I(total),
              where the smoothed distribution replaces each individual's
              income with their type mean (the ctree prediction).
              Computes Gini-based IOp share, MLD-based IOp share (with
              exact additive between/within decomposition), R-squared
              based IOp share, Theil index, and coefficient of variation.
              Also provides bootstrap confidence intervals (percentile
              method, B=100 by default). Generates IOp decomposition bar
              charts and type distribution plots.
    Key functions: calculate_iop_share(), decompose_mld(), bootstrap_iop()
    Depends:  00_setup.R
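
    Sketch:   The ratio IOp = I(smoothed) / I(total) and the additive MLD
              decomposition can be illustrated in base R as follows. The
              functions gini(), mld(), and decompose_mld_sketch() are
              written for this example (unweighted, strictly positive
              incomes assumed) and are not the package's
              calculate_iop_share() or decompose_mld().

```r
# Gini coefficient (unweighted), via the sorted-rank formula.
gini <- function(y) {
  y <- sort(y)
  n <- length(y)
  sum((2 * seq_len(n) - n - 1) * y) / (n * sum(y))
}

# Mean log deviation: log of the mean minus the mean of the logs (y > 0).
mld <- function(y) log(mean(y)) - mean(log(y))

# Between/within MLD decomposition for a given partition into types.
decompose_mld_sketch <- function(y, type) {
  mu_type <- ave(y, type)                        # type mean for each person
  between <- mld(mu_type)                        # MLD of the smoothed dist.
  shares  <- as.numeric(table(type)) / length(y) # population shares
  within  <- sum(shares * as.numeric(tapply(y, type, mld)))
  c(total = mld(y), between = between, within = within)
}

set.seed(42)
n    <- 5000
type <- sample(1:6, n, replace = TRUE)                   # toy types
y    <- exp(rnorm(n, mean = 8 + 0.25 * type, sd = 0.7))  # toy incomes

smoothed <- ave(y, type)                 # replace income by the type mean
iop_gini <- gini(smoothed) / gini(y)
parts    <- decompose_mld_sketch(y, type)
iop_mld  <- parts[["between"]] / parts[["total"]]

round(c(iop_gini = iop_gini, iop_mld = iop_mld), 3)
```

              For the MLD, between + within equals the total exactly. The
              Gini has no such additive split, which is one reason the
              Gini-based and MLD-based shares can differ substantially.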

06_sensitivity_analysis.R
    Purpose:  Test the robustness of IOp estimates across specifications.
    Details:  Three dimensions of sensitivity:
              (1) Circumstance set sensitivity: compares IOp estimates across
              minimal, standard, extended, household, neighborhood, cultural,
              maximum, and no-parental specifications.
              (2) Subsample sensitivity: national, urban, rural, male,
              and female subsamples.
              (3) Hyperparameter sensitivity: varies mincriterion (0.90,
              0.95, 0.99) and maxdepth (4, 6, 8) for a 3x3 grid.
              Each specification fits a separate ctree and computes Gini,
              MLD, and R-squared IOp shares.
    Key function: run_full_sensitivity()
    Depends:  00_setup.R, 05_iop_metrics.R
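
    Sketch:   The 3x3 hyperparameter grid in (3) can be enumerated with
              expand.grid(). The helper run_grid() and the placeholder
              fitting function are hypothetical names for illustration;
              the placeholder returns NA so that no result is invented.

```r
# Hypothetical helper: apply fit_fun to every row of a parameter grid and
# collect one result row per specification.
run_grid <- function(grid, fit_fun) {
  rows <- lapply(seq_len(nrow(grid)), function(i) {
    params <- as.list(grid[i, , drop = FALSE])
    cbind(grid[i, , drop = FALSE], iop_share = do.call(fit_fun, params))
  })
  do.call(rbind, rows)
}

grid <- expand.grid(
  mincriterion = c(0.90, 0.95, 0.99),
  maxdepth     = c(4, 6, 8)
)

# Stand-in for "fit a ctree and compute an IOp share". Replace it with a
# function that fits the tree and returns a single number.
placeholder_fit <- function(mincriterion, maxdepth) NA_real_

sens_grid <- run_grid(grid, placeholder_fit)
sens_grid
```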

07_cohort_analysis.R
    Purpose:  Analyze trends in IOp across birth cohorts.
    Details:  Partitions the sample by birth cohort (4 age groups: 25-34,
              35-44, 45-54, 55-64) and estimates IOp separately within
              each cohort using the standard circumstance set minus the
              cohort variable (K=8). Uses bootstrap inference (B=1000
              by default) with stratified within-cohort resampling to
              construct sampling distributions of OLS trend slopes, since
              standard asymptotic inference is unreliable with only 4
              cohort observations. Generates 95% bootstrap confidence
              intervals for per-cohort IOp estimates and bootstrap p-values
              for the monotonic trend hypothesis. Produces two-panel
              figures: (a) IOp shares by cohort with bootstrap CIs, and
              (b) total Gini coefficient by cohort.
    Key function: run_cohort_analysis()
    Depends:  00_setup.R, 05_iop_metrics.R
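
    Sketch:   The stratified bootstrap of the trend slope can be
              illustrated in base R as below. The data are simulated, the
              IOp share is replaced by a simple between-type variance
              share, and the two-sided bootstrap p-value shown is one
              common convention; all three are assumptions and not
              necessarily what run_cohort_analysis() does.

```r
# Between-type share of the variance of y: a simple stand-in for an IOp share.
iop_share_toy <- function(y, type) var(ave(y, type)) / var(y)

# One stratified bootstrap replicate: resample rows within each cohort,
# recompute the per-cohort share, and fit an OLS line through the 4 points.
boot_slope_once <- function(d) {
  idx <- unlist(lapply(split(seq_len(nrow(d)), d$cohort),
                       function(i) i[sample.int(length(i), replace = TRUE)]))
  b   <- d[idx, ]
  iop <- sapply(split(b, b$cohort), function(g) iop_share_toy(g$y, g$type))
  unname(coef(lm(iop ~ seq_along(iop)))[2])
}

set.seed(1001)
cohorts <- c("25-34", "35-44", "45-54", "55-64")
d <- data.frame(
  cohort = factor(rep(cohorts, each = 500), levels = cohorts),
  type   = sample(1:5, 2000, replace = TRUE)
)
# Toy outcome: types matter more in older cohorts (by construction).
d$y <- 0.2 * as.integer(d$cohort) * d$type + rnorm(2000)

B      <- 200                                    # the text's default is 1000
slopes <- replicate(B, boot_slope_once(d))
ci     <- quantile(slopes, c(0.025, 0.975))      # percentile 95% interval
p_boot <- 2 * min(mean(slopes <= 0), mean(slopes >= 0))

ci
p_boot
```

              Resampling within cohorts keeps each cohort's sample size
              fixed across replicates, so the variation in the slope
              reflects sampling noise in the four IOp estimates.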

08_shapley_decomposition.R
    Purpose:  Formal Shapley value decomposition of aggregate IOp following
              Ferreira & Gignoux (2011) and Hufe et al. (2017).
    Details:  IMPORTANT: This is NOT SHAP (SHapley Additive exPlanations
              from machine learning). The Shapley IOp decomposition asks:
              "What is each circumstance's marginal contribution to total
              IOp, averaged over all possible orderings of circumstances?"
              For K circumstances, this requires computing IOp for all 2^K
              subsets (K=9: 512 subsets; K=12: 4,096 subsets). For each
              subset, a separate ctree is fitted and the IOp share computed.
              The Shapley value for each circumstance k is the weighted
              average of its marginal contributions over all subsets S of
              the other K-1 circumstances:
                  phi_k = Sum_{S} [|S|!(K-|S|-1)!/K!] * [IOp(S+k) - IOp(S)],
              which is equivalent to averaging over all K! orderings; the
              phi_k sum to the IOp of the full set. Generates bar plots of
              Shapley contributions and
              plots of IOp as a function of the number of included
              circumstances (demonstrating the lower bound property).
              Optionally compares with SHAP-based importance if available.
    Key function: run_shapley_analysis(circumstance_set)
    Depends:  00_setup.R, 05_iop_metrics.R
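
    Sketch:   The Shapley formula above can be illustrated by exact
              enumeration in base R. The function shapley_exact() and the
              value function iop_r2() are written for this example. To
              keep it free of package dependencies, the IOp of each subset
              is approximated by an OLS R-squared on simulated data,
              whereas the script fits a ctree per subset.

```r
# Exact Shapley decomposition of a set function v() over K players.
# v takes a character vector of names (possibly empty) and returns a number.
shapley_exact <- function(players, v) {
  K   <- length(players)
  phi <- setNames(numeric(K), players)
  for (k in players) {
    others <- setdiff(players, k)
    for (s in 0:(K - 1)) {
      w <- factorial(s) * factorial(K - s - 1) / factorial(K)
      subsets <- if (s == 0) list(character(0)) else
        combn(others, s, simplify = FALSE)
      for (S in subsets) {
        phi[k] <- phi[k] + w * (v(c(S, k)) - v(S))
      }
    }
  }
  phi
}

set.seed(42)
n <- 3000
d <- data.frame(
  educp = sample(1:4, n, replace = TRUE),
  sexo  = sample(0:1, n, replace = TRUE),
  p21   = sample(0:1, n, replace = TRUE)
)
d$y <- 0.5 * d$educp + 0.4 * d$sexo + 0.1 * d$p21 + rnorm(n)

# Stand-in value function: R-squared of an OLS regression on the subset.
iop_r2 <- function(S) {
  if (length(S) == 0) return(0)   # no circumstances: one type, zero IOp
  summary(lm(reformulate(S, response = "y"), data = d))$r.squared
}

players <- c("educp", "sexo", "p21")
phi <- shapley_exact(players, iop_r2)   # covers all 2^3 = 8 subsets
phi
sum(phi)   # matches iop_r2(players) up to floating point (efficiency)
```

              This version re-evaluates v() for every (k, S) pair. With
              K=9 and one tree per subset, caching the 512 subset values
              before computing the weights avoids refitting each model
              several times.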

09_run_full_analysis.R
    Purpose:  Master script that executes the complete Paper 1 pipeline.
    Details:  Sequentially runs: (1) data loading, (2) cohort analysis
              with bootstrap inference, (3) Shapley decomposition with
              the standard K=9 specification. Compiles a comprehensive
              summary (analysis_summary.rds) with all key results. Logs
              execution time for each stage.
    Depends:  00_setup.R, 05_iop_metrics.R, 06_sensitivity_analysis.R,
              07_cohort_analysis.R, 08_shapley_decomposition.R

--- Supplementary & Diagnostic Scripts ---

11_mca_diagnostics.R
    Purpose:  Generate diagnostic information for the MCA-based household
              economic index.
    Details:  Produces a scree plot of MCA eigenvalues showing the
              percentage of inertia explained by each dimension. Generates
              variable contribution plots for the first two MCA dimensions.
              Computes Cronbach's alpha for the neighborhood quality index
              (p33a-h). Uses factoextra and psych packages.
    Outputs:  outputs/figures/mca_scree_plot.png,
              outputs/figures/mca_variable_contributions.png,
              outputs/tables/mca_diagnostics.csv,
              outputs/tables/cronbach_alpha_neighborhood.csv

12_descriptive_statistics.R
    Purpose:  Generate descriptive statistics tables for the paper.
    Details:  Produces four panels: (A) outcome variable summary with
              Gini and MLD, (B) frequency tables for all categorical
              circumstance variables, (C) summary statistics for
              continuous/index variables, (D) mean income by key
              circumstance groups. All tables saved as CSV.
    Outputs:  outputs/tables/descriptive_*.csv

ctree_publication.R
    Purpose:  Generate a publication-quality ctree figure with English
              labels for the expanded specification (K=12).
    Details:  Renames all internal variable codes to readable English
              labels before fitting the ctree. Uses maxdepth=4 and
              minbucket=200 for visual clarity. Outputs at 1800x1000
              pixels, 120 dpi.
    Output:   outputs/figures/ctree_publication.png

visualize_trees.R
    Purpose:  Original tree estimation and visualization script.
    Details:  A self-contained script that loads raw EMOVI data, prepares
              the standard circumstance set (K=7, without skin tone or
              cohort), fits both a ctree (partykit) and an rpart tree,
              computes IOp (Gini-based) and R-squared, and generates
              four tree visualization figures. Also computes rpart
              variable importance. This was the initial analysis script;
              the modular pipeline (00-09) supersedes it for the final
              paper results, but it is retained for reproducibility.

regen_figures.R
    Purpose:  Regenerate publication figures from saved CSV output.
    Details:  Reads pre-computed results from outputs/tables/ and
              regenerates cohort trend plots and MCA scree plots
              without re-running the full analysis. Useful for
              adjusting figure aesthetics without recomputing.

--- Utility Modules ---

utils/iop_functions.R
    Purpose:  Reusable helper functions for IOp analysis.
    Details:  Input validation (outcome variables, circumstance variables),
              extended inequality metrics (Atkinson index, generalized
              entropy GE(alpha)), parametric IOp lower bound (OLS-based)
              and non-parametric IOp upper bound (fine partition), multi-
              set comparison functions, subgroup analysis functions, and
              LaTeX table formatting helpers.

utils/data_loader.R
    Purpose:  Functions for loading and merging EMOVI datasets from raw
              Stata files.

================================================================================
5. EXECUTION ORDER
================================================================================

To reproduce all results from scratch:

  Step 1: Place EMOVI 2023 .dta files in data/raw/emovi/Data/
  Step 2: Open R in the project root directory

  # Full pipeline (recommended):
  source("src/R/09_run_full_analysis.R")

  # Or step-by-step:
  source("src/R/00_setup.R")                   # ~5 sec
  source("src/R/01_load_data.R")               # ~10 sec
  source("src/R/02_preprocess.R")              # ~30 sec
  source("src/R/12_descriptive_statistics.R")   # ~10 sec
  source("src/R/11_mca_diagnostics.R")          # ~15 sec

  # Then run analysis interactively:
  source("src/R/03_ctree_model.R")
  results <- run_ctree_analysis("standard")     # ~15 sec

  source("src/R/05_iop_metrics.R")
  iop <- run_iop_analysis(results$model, ...)   # ~5 sec

  source("src/R/06_sensitivity_analysis.R")
  sens <- run_full_sensitivity()                # ~5 min

  source("src/R/07_cohort_analysis.R")
  cohorts <- run_cohort_analysis()              # ~15 min (with B=1000 bootstrap)

  source("src/R/08_shapley_decomposition.R")
  shapley <- run_shapley_analysis(
    circumstance_set = "standard")              # ~30-60 min (512 models)

  # Publication figure:
  source("src/R/ctree_publication.R")           # ~10 sec

Approximate total runtime: 50-90 minutes on a modern desktop
(the Shapley decomposition with K=9 is the bottleneck).

================================================================================
6. CONFIGURATION FILES
================================================================================

config/config.yaml
    Master configuration: file paths, random seeds (global=42, cv=123,
    bootstrap_start=1001), ctree hyperparameter grids, cross-validation
    settings, output format preferences (PNG at 300 dpi).

config/variable_roles.yaml
    Complete variable metadata: outcome variables (primary: log per-capita
    income; secondary: education, occupational class, subjective mobility),
    circumstance variables grouped by domain (parental, demographic,
    household, financial), constructed index definitions (MCA method,
    component variables, scoring schemes), and pre-defined circumstance
    sets (minimal through maximum).

================================================================================
7. REPRODUCIBILITY NOTES
================================================================================

- All random number generation uses fixed seeds (global=42, cv=123,
  bootstrap starting at 1001) defined in config/config.yaml.
- The skin tone variable (p112) in the raw EMOVI data is encoded as
  characters A through K. Scripts convert this to numeric 1-11 using
  an explicit mapping: A=1, B=2, ..., K=11.
- The MCA-based household economic index automatically checks the sign
  of the first dimension and flips it if needed so that higher values
  correspond to wealthier households.
- The crowding index is winsorized at the 99th percentile to handle
  implausible extreme values.
- "NS" (No Sabe / Don't Know) responses are recoded to NA before index
  construction.
- The financial inclusion index was converted from a continuous 0-100
  scale to a binary indicator because 79% of observations had zero
  services, making a continuous measure uninformative.
- The cultural capital index was evaluated but excluded from the final
  analysis due to 80% variable overlap with the household economic index
  (correlation = -0.80), which would introduce severe multicollinearity.
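
As an illustration of the skin tone mapping and the "NS" recoding described
above, a minimal base-R sketch follows. The object names tone_map and
recode_ns() and the example values are assumptions; the pipeline's own
recoding code may be organized differently.

```r
# Explicit lookup: letters A..K -> integers 1..11.
tone_map <- setNames(1:11, LETTERS[1:11])

p112_raw <- c("A", "F", "K", "NS", NA, "C")
p112_num <- unname(tone_map[p112_raw])   # unmatched codes and NA become NA
p112_num                                 # 1 6 11 NA NA 3

# Recode "NS" to NA in a vector; existing NA values are left as they are.
recode_ns <- function(x) {
  x[x %in% "NS"] <- NA
  x
}

# Toy amenity items stored as character codes.
items <- data.frame(
  p33a = c("1", "0", "NS"),
  p33b = c("NS", "1", "1"),
  stringsAsFactors = FALSE
)
items[] <- lapply(items, recode_ns)
items
```

A named lookup vector makes the mapping explicit and returns NA for any
unexpected code, rather than relying on factor level order.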

================================================================================
8. KEY RESULTS REFERENCE
================================================================================

The standard specification (K=9 circumstances, ctree) produces:
  - IOp (Gini-based):  ~52%
  - IOp (MLD-based):   ~27%
  - IOp (R-squared):   ~13%
  - Number of types:   varies by specification (~15-30 terminal nodes)
  - Sample size:       ~13,164 complete cases

These are lower-bound estimates of true IOp, as the set of observed
circumstances is necessarily incomplete.

================================================================================
9. CONTACT
================================================================================

For questions about the replication code or data access, please
contact the corresponding author.

================================================================================