================================================================================
SUPPLEMENTARY MATERIALS
================================================================================

Paper: "Inequality of Opportunity in Mexico: A Cultural Capital Approach
       Using Conditional Inference Trees and Shapley Decomposition"

Data:  ESRU Social Mobility Survey (ESRU-EMOVI) 2023
       Centro de Estudios Espinosa Yglesias (CEEY)

================================================================================
1. DATA AVAILABILITY
================================================================================

The ESRU-EMOVI 2023 microdata must be obtained directly from the Centro de
Estudios Espinosa Yglesias (CEEY) at:

    https://ceey.org.mx/emovi/

The survey includes three Stata (.dta) files:

  - entrevistado_2023.dta    (individual-level respondent data)
  - hogar_2023.dta           (household-level data)
  - inclusion_2023.dta       (financial inclusion module, optional)

After downloading, place the .dta files in:

    data/raw/emovi/Data/

The data dictionary (Excel) should be placed in:

    data/raw/emovi/Diccionario ESRU EMOVI 2023.xlsx

================================================================================
2.
SOFTWARE REQUIREMENTS
================================================================================

R version: >= 4.3.0 (developed with R 4.5.2)

Required R packages (installed automatically by 00_setup.R):

  Data manipulation:
    - tidyverse (>= 2.0)     Data wrangling and visualization
    - haven (>= 2.5)         Reading Stata .dta files
    - readxl                 Reading Excel files
    - yaml                   Reading YAML configuration
    - janitor                Cleaning variable names
    - labelled               Handling labelled Stata data

  Tree-based models:
    - partykit (>= 1.2)      Conditional inference trees (ctree)
    - party                  Conditional inference forests (cforest)

  Inequality metrics:
    - ineq                   Gini, Theil, entropy measures
    - acid                   Advanced inequality decomposition

  Interpretability:
    - pdp                    Partial dependence plots
    - iml                    ICE plots
    - vip                    Variable importance

  Survey analysis:
    - survey                 Complex survey design support

  Index construction:
    - FactoMineR             Multiple Correspondence Analysis (MCA)
    - factoextra             MCA visualization (diagnostics)
    - psych                  Cronbach's alpha (diagnostics)

  Reporting:
    - rmarkdown, knitr       Report generation
    - kableExtra, gt         Table formatting
    - patchwork              Combining multi-panel figures
    - scales                 Axis formatting

  Utilities:
    - here                   Project-relative paths
    - glue                   String interpolation
    - assertthat             Input validation
    - tictoc                 Execution timing
    - viridis                Color palettes
    - gridExtra              Multi-plot layouts
    - rpart, rpart.plot      CART trees (comparison only)

================================================================================
3.
PROJECT STRUCTURE
================================================================================

Inequality-of-Opportunity/
|-- config/
|   |-- config.yaml                   Project configuration (paths, seeds, params)
|   |-- variable_roles.yaml           Variable classification and metadata
|
|-- data/
|   |-- raw/emovi/Data/               Raw EMOVI 2023 .dta files (NOT included)
|   |-- processed/                    Generated RDS files from preprocessing
|
|-- src/R/
|   |-- 00_setup.R                    Environment setup (packages, seeds, paths)
|   |-- 01_load_data.R                Load raw Stata data, save as RDS
|   |-- 02_preprocess.R               Data cleaning and index construction
|   |-- 03_ctree_model.R              Conditional inference tree estimation
|   |-- 05_iop_metrics.R              IOp metric calculations (Gini, MLD, R2)
|   |-- 06_sensitivity_analysis.R     Robustness checks
|   |-- 07_cohort_analysis.R          Birth cohort IOp trends
|   |-- 08_shapley_decomposition.R    Shapley value decomposition of IOp
|   |-- 09_run_full_analysis.R        Master pipeline script
|   |-- 11_mca_diagnostics.R          MCA diagnostic plots and Cronbach's alpha
|   |-- 12_descriptive_statistics.R   Descriptive statistics tables
|   |-- ctree_publication.R           Publication-quality ctree figure
|   |-- visualize_trees.R             Tree visualization and IOp computation
|   |-- regen_figures.R               Regenerate figures from saved CSV data
|   |-- utils/
|       |-- iop_functions.R           Reusable IOp helper functions
|       |-- data_loader.R             Data loading utilities
|
|-- outputs/
|   |-- figures/                      All generated figures (.png, 300 dpi)
|   |-- tables/                       All generated tables (.csv)
|   |-- models/                       Saved model objects (.rds)

================================================================================
4. SCRIPT DESCRIPTIONS
================================================================================

--- Core Pipeline Scripts (run in order) ---

00_setup.R
  Purpose: Initialize the R environment for all subsequent scripts.
  Details: Installs and loads all required packages. Reads project
           configuration from config/config.yaml and variable metadata from
           config/variable_roles.yaml.
           Sets the global random seed (42) and stores reproducibility seeds
           for cross-validation and bootstrap. Creates output directories.
           Defines a custom ggplot2 theme (theme_iop) and utility functions
           for saving figures, tables, and logging.

01_load_data.R
  Purpose: Load raw ESRU-EMOVI 2023 Stata files and convert to RDS.
  Details: Reads entrevistado_2023.dta and hogar_2023.dta using the haven
           package. Saves as RDS files in data/processed/ for faster
           subsequent loading. Validates that expected data files exist
           before proceeding.
  Depends: 00_setup.R

02_preprocess.R
  Purpose: Data preprocessing, variable recoding, and index construction.
  Details: Cleans variable names (janitor) and converts Stata-labelled
           variables to R factors. Constructs four circumstance indices and
           evaluates a fifth:
           (1) Household Economic Index via Multiple Correspondence
               Analysis (MCA) on 16 asset/service binary variables, with
               automatic orientation verification so higher values indicate
               wealthier households.
           (2) Neighborhood Quality Index from 8 community amenity
               variables (p33a-h), scored on a 0-100 scale.
           (3) Crowding Index as persons per bedroom (p22/p24), with
               winsorization at the 99th percentile.
           (4) Financial Inclusion Index as a binary indicator of whether
               the family had any formal financial services (savings,
               credit card, insurance) at age 14.
           (5) Cultural Capital Index, which was evaluated but dropped due
               to 80% variable overlap with the Household Economic Index.
           Saves the preprocessed dataset as entrevistado_clean.rds.
  Depends: 00_setup.R, 01_load_data.R (output)
  Outputs: data/processed/entrevistado_clean.rds

03_ctree_model.R
  Purpose: Fit conditional inference trees to partition the sample into
           "types" defined by exogenous circumstances.
  Details: Defines multiple circumstance sets: minimal (K=5), standard
           (K=9), extended, household, neighborhood, cultural, and maximum
           (K=14).
           The standard set includes: father's education (educp), mother's
           education (educm), father's occupational class (clasep), sex
           (sexo), indigenous language (p111), skin tone on the PERLA scale
           (p112), region at age 14 (region_14), birth cohort (cohorte),
           and rural/urban status at 14 (p21). Uses partykit::ctree with
           Bonferroni-adjusted quadratic test statistics. Default
           hyperparameters: mincriterion=0.95, minsplit=100, minbucket=50,
           maxdepth=6. Includes 5-fold cross-validation over a parameter
           grid for tuning. Exports type summary tables and tree
           visualizations.
  Key function: run_ctree_analysis(circumstance_set)
  Depends: 00_setup.R

05_iop_metrics.R
  Purpose: Compute Inequality of Opportunity indices from tree partitions.
  Details: Implements the ex-ante parametric approach following Roemer
           (1998) and Ferreira & Gignoux (2011):

               IOp = I(smoothed) / I(total)

           where the smoothed distribution replaces each individual's
           income with their type mean (ctree prediction). Computes
           Gini-based IOp share, MLD-based IOp share (with exact additive
           between/within decomposition), R-squared based IOp share, Theil
           index, and coefficient of variation. Also provides bootstrap
           confidence intervals (percentile method, B=100 by default).
           Generates IOp decomposition bar charts and type distribution
           plots.
  Key functions: calculate_iop_share(), decompose_mld(), bootstrap_iop()
  Depends: 00_setup.R

06_sensitivity_analysis.R
  Purpose: Test the robustness of IOp estimates across specifications.
  Details: Three dimensions of sensitivity:
           (1) Circumstance set sensitivity: compares IOp estimates across
               minimal, standard, extended, household, neighborhood,
               cultural, maximum, and no-parental specifications.
           (2) Subsample sensitivity: national, urban, rural, male, and
               female subsamples.
           (3) Hyperparameter sensitivity: varies mincriterion (0.90, 0.95,
               0.99) and maxdepth (4, 6, 8) for a 3x3 grid.
           Each specification fits a separate ctree and computes Gini, MLD,
           and R-squared IOp shares.
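
As a rough illustration of the three IOp shares computed for each
specification, the sketch below works in base R on simulated data. It is not
the code in 05_iop_metrics.R: the types are assigned directly instead of coming
from a fitted ctree, and the helper names (gini, mld, iop_shares) and the use of
income levels for the R-squared share are assumptions made for this example.

```r
# Gini coefficient from the sorted-rank formula.
gini <- function(y) {
  y <- sort(y)
  n <- length(y)
  sum((2 * seq_len(n) - n - 1) * y) / (n * sum(y))
}

# Mean log deviation (requires strictly positive values).
mld <- function(y) log(mean(y)) - mean(log(y))

# IOp shares: inequality of the smoothed distribution over total inequality.
iop_shares <- function(y, type) {
  smoothed <- ave(y, type)  # replace each income with its type mean
  c(gini = gini(smoothed) / gini(y),
    mld  = mld(smoothed) / mld(y),
    r2   = var(smoothed) / var(y))
}

# Simulated incomes for three hypothetical types (not EMOVI data).
set.seed(42)
type <- rep(c("A", "B", "C"), each = 200)
log_means <- unname(c(A = 8, B = 8.5, C = 9)[type])
y <- exp(rnorm(length(type), mean = log_means, sd = 0.6))

shares <- iop_shares(y, type)
print(round(shares, 3))

# MLD decomposes exactly: total = between (smoothed) + weighted within.
smoothed <- ave(y, type)
within <- sum(as.numeric(tapply(y, type, mld)) *
              as.numeric(table(type)) / length(y))
```

Because smoothing removes all within-type variation, each share lies between
0 and 1, and the MLD of the smoothed distribution plus the population-weighted
within-type MLD recovers the total MLD.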
  Key function: run_full_sensitivity()
  Depends: 00_setup.R, 05_iop_metrics.R

07_cohort_analysis.R
  Purpose: Analyze trends in IOp across birth cohorts.
  Details: Partitions the sample by birth cohort (4 age groups: 25-34,
           35-44, 45-54, 55-64) and estimates IOp separately within each
           cohort using the standard circumstance set minus the cohort
           variable (K=8). Uses bootstrap inference (B=1000 by default)
           with stratified within-cohort resampling to construct sampling
           distributions of OLS trend slopes, since standard asymptotic
           inference is unreliable with only 4 cohort observations.
           Generates 95% bootstrap confidence intervals for per-cohort IOp
           estimates and bootstrap p-values for the monotonic trend
           hypothesis. Produces two-panel figures: (a) IOp shares by cohort
           with bootstrap CIs, and (b) total Gini coefficient by cohort.
  Key function: run_cohort_analysis()
  Depends: 00_setup.R, 05_iop_metrics.R

08_shapley_decomposition.R
  Purpose: Formal Shapley value decomposition of aggregate IOp following
           Ferreira & Gignoux (2011) and Hufe et al. (2017).
  Details: IMPORTANT: This is NOT SHAP (SHapley Additive exPlanations from
           machine learning). The Shapley IOp decomposition asks: "What is
           each circumstance's marginal contribution to total IOp, averaged
           over all possible orderings of circumstances?"
           For K circumstances, this requires computing IOp for all 2^K
           subsets (K=9: 512 subsets; K=12: 4,096 subsets). For each
           subset, a separate ctree is fitted and the IOp share computed.
           The Shapley value for each circumstance k is the weighted
           average of its marginal contributions across all coalition
           orderings:

               phi_k = Sum_{S} [ |S|! (K-|S|-1)! / K! ] * [ IOp(S+k) - IOp(S) ]

           where the sum runs over all subsets S of the other K-1
           circumstances (those excluding k).
           Generates bar plots of Shapley contributions and plots of IOp as
           a function of the number of included circumstances
           (demonstrating the lower bound property). Optionally compares
           with SHAP-based importance if available.
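
The Shapley formula above can be sketched in a few lines of base R. This is a
minimal illustration, not the implementation in 08_shapley_decomposition.R:
toy_iop is a hypothetical stand-in for "fit a ctree on subset S and return its
IOp share", its numbers are invented, and shapley_iop is a name chosen for this
example.

```r
# Hypothetical value function: IOp share for a subset S of circumstances.
# The real pipeline would fit a ctree on S; these values are made up.
toy_iop <- function(S) {
  base <- c(educp = 0.20, educm = 0.15, region_14 = 0.10)
  v <- sum(base[S])
  # Parental education variables overlap, so their joint share is reduced.
  if (all(c("educp", "educm") %in% S)) v <- v - 0.08
  v
}

# Exact Shapley values by enumerating every subset of the other K-1 variables.
shapley_iop <- function(vars, value_fn) {
  K <- length(vars)
  phi <- setNames(numeric(K), vars)
  for (k in vars) {
    others <- setdiff(vars, k)
    for (mask in 0:(2^(K - 1) - 1)) {
      # Decode the binary mask into a subset S of the other circumstances.
      in_S <- (mask %/% 2^(seq_along(others) - 1)) %% 2 == 1
      S <- others[in_S]
      w <- factorial(length(S)) * factorial(K - length(S) - 1) / factorial(K)
      phi[k] <- phi[k] + w * (value_fn(c(S, k)) - value_fn(S))
    }
  }
  phi
}

vars <- c("educp", "educm", "region_14")
phi <- shapley_iop(vars, toy_iop)
print(phi)
print(sum(phi))  # equals toy_iop(vars) - toy_iop(character(0))
```

The contributions sum to the IOp of the full set, which is the property that
makes the decomposition additive. With a real value function each call is a
model fit, so the enumeration cost grows as 2^K.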
  Key function: run_shapley_analysis(circumstance_set)
  Depends: 00_setup.R, 05_iop_metrics.R

09_run_full_analysis.R
  Purpose: Master script that executes the complete Paper 1 pipeline.
  Details: Sequentially runs: (1) data loading, (2) cohort analysis with
           bootstrap inference, (3) Shapley decomposition with the standard
           K=9 specification. Compiles a comprehensive summary
           (analysis_summary.rds) with all key results. Logs execution time
           for each stage.
  Depends: 00_setup.R, 05_iop_metrics.R, 06_sensitivity_analysis.R,
           07_cohort_analysis.R, 08_shapley_decomposition.R

--- Supplementary & Diagnostic Scripts ---

11_mca_diagnostics.R
  Purpose: Generate diagnostic information for the MCA-based household
           economic index.
  Details: Produces a scree plot of MCA eigenvalues showing the percentage
           of inertia explained by each dimension. Generates variable
           contribution plots for the first two MCA dimensions. Computes
           Cronbach's alpha for the neighborhood quality index (p33a-h).
           Uses factoextra and psych packages.
  Outputs: outputs/figures/mca_scree_plot.png,
           outputs/figures/mca_variable_contributions.png,
           outputs/tables/mca_diagnostics.csv,
           outputs/tables/cronbach_alpha_neighborhood.csv

12_descriptive_statistics.R
  Purpose: Generate descriptive statistics tables for the paper.
  Details: Produces four panels: (A) outcome variable summary with Gini and
           MLD, (B) frequency tables for all categorical circumstance
           variables, (C) summary statistics for continuous/index
           variables, (D) mean income by key circumstance groups. All
           tables saved as CSV.
  Outputs: outputs/tables/descriptive_*.csv

ctree_publication.R
  Purpose: Generate a publication-quality ctree figure with English labels
           for the expanded specification (K=12).
  Details: Renames all internal variable codes to readable English labels
           before fitting the ctree. Uses maxdepth=4 and minbucket=200 for
           visual clarity. Outputs at 1800x1000 pixels, 120 dpi.
  Output:  outputs/figures/ctree_publication.png

visualize_trees.R
  Purpose: Original tree estimation and visualization script.
  Details: A self-contained script that loads raw EMOVI data, prepares the
           standard circumstance set (K=7, without skin tone or cohort),
           fits both a ctree (partykit) and an rpart tree, computes IOp
           (Gini-based) and R-squared, and generates four tree
           visualization figures. Also computes rpart variable importance.
           This was the initial analysis script; the modular pipeline
           (00-09) supersedes it for the final paper results, but it is
           retained for reproducibility.

regen_figures.R
  Purpose: Regenerate publication figures from saved CSV output.
  Details: Reads pre-computed results from outputs/tables/ and regenerates
           cohort trend plots and MCA scree plots without re-running the
           full analysis. Useful for adjusting figure aesthetics without
           recomputing.

--- Utility Modules ---

utils/iop_functions.R
  Purpose: Reusable helper functions for IOp analysis.
  Details: Input validation (outcome variables, circumstance variables),
           extended inequality metrics (Atkinson index, generalized entropy
           GE(alpha)), parametric IOp lower bound (OLS-based) and
           non-parametric IOp upper bound (fine partition), multi-set
           comparison functions, subgroup analysis functions, and LaTeX
           table formatting helpers.

utils/data_loader.R
  Purpose: Functions for loading and merging EMOVI datasets from raw Stata
           files.

================================================================================
5.
EXECUTION ORDER
================================================================================

To reproduce all results from scratch:

  Step 1: Place EMOVI 2023 .dta files in data/raw/emovi/Data/

  Step 2: Open R in the project root directory

    # Full pipeline (recommended):
    source("src/R/09_run_full_analysis.R")

    # Or step-by-step:
    source("src/R/00_setup.R")                     # ~5 sec
    source("src/R/01_load_data.R")                 # ~10 sec
    source("src/R/02_preprocess.R")                # ~30 sec
    source("src/R/12_descriptive_statistics.R")    # ~10 sec
    source("src/R/11_mca_diagnostics.R")           # ~15 sec

    # Then run analysis interactively:
    source("src/R/03_ctree_model.R")
    results <- run_ctree_analysis("standard")      # ~15 sec

    source("src/R/05_iop_metrics.R")
    iop <- run_iop_analysis(results$model, ...)    # ~5 sec

    source("src/R/06_sensitivity_analysis.R")
    sens <- run_full_sensitivity()                 # ~5 min

    source("src/R/07_cohort_analysis.R")
    cohorts <- run_cohort_analysis()               # ~15 min (with B=1000 bootstrap)

    source("src/R/08_shapley_decomposition.R")
    shapley <- run_shapley_analysis(
      circumstance_set = "standard")               # ~30-60 min (512 models)

    # Publication figure:
    source("src/R/ctree_publication.R")            # ~10 sec

Approximate total runtime: 50-90 minutes on a modern desktop (the Shapley
decomposition with K=9 is the bottleneck).

================================================================================
6. CONFIGURATION FILES
================================================================================

config/config.yaml
  Master configuration: file paths, random seeds (global=42, cv=123,
  bootstrap_start=1001), ctree hyperparameter grids, cross-validation
  settings, output format preferences (PNG at 300 dpi).
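
To illustrate how a fixed bootstrap starting seed can make percentile
confidence intervals repeatable, here is a small base R sketch. The rule that
replicate b uses seed bootstrap_start + b - 1 is an assumption made for this
example, not something confirmed by the project scripts, and percentile_ci is a
hypothetical helper name.

```r
# Percentile bootstrap CI with one deterministic seed per replicate.
# Assumption: replicate b is seeded with bootstrap_start + b - 1.
percentile_ci <- function(y, stat, B = 100, bootstrap_start = 1001,
                          level = 0.95) {
  est <- vapply(seq_len(B), function(b) {
    set.seed(bootstrap_start + b - 1)
    stat(sample(y, replace = TRUE))  # resample with replacement
  }, numeric(1))
  alpha <- (1 - level) / 2
  quantile(est, c(alpha, 1 - alpha), names = FALSE)
}

# Simulated data (not EMOVI data), drawn with the global seed.
set.seed(42)
y <- rlnorm(500)

ci_first  <- percentile_ci(y, median)
ci_second <- percentile_ci(y, median)
print(ci_first)
print(identical(ci_first, ci_second))
```

Seeding each replicate separately means a single replicate can be rerun in
isolation, which is one reason to store a starting seed instead of a single
bootstrap seed.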

config/variable_roles.yaml
  Complete variable metadata: outcome variables (primary: log per-capita
  income; secondary: education, occupational class, subjective mobility),
  circumstance variables grouped by domain (parental, demographic,
  household, financial), constructed index definitions (MCA method,
  component variables, scoring schemes), and pre-defined circumstance sets
  (minimal through maximum).

================================================================================
7. REPRODUCIBILITY NOTES
================================================================================

- All random number generation uses fixed seeds (global=42, cv=123,
  bootstrap starting at 1001) defined in config/config.yaml.

- The skin tone variable (p112) in the raw EMOVI data is encoded as
  characters A through K. Scripts convert this to numeric 1-11 using an
  explicit mapping: A=1, B=2, ..., K=11.

- The MCA-based household economic index automatically checks the sign of
  the first dimension and flips it if needed so that higher values
  correspond to wealthier households.

- The crowding index is winsorized at the 99th percentile to handle
  implausible extreme values.

- "NS" (No Sabe / Don't Know) responses are recoded to NA before index
  construction.

- The financial inclusion index was converted from a continuous 0-100 scale
  to a binary indicator because 79% of observations had zero services,
  making a continuous measure uninformative.

- The cultural capital index was evaluated but excluded from the final
  analysis due to 80% variable overlap with the household economic index
  (correlation = -0.80), which would introduce severe multicollinearity.

================================================================================
8.
KEY RESULTS REFERENCE
================================================================================

The standard specification (K=9 circumstances, ctree) produces:

  - IOp (Gini-based):   ~52%
  - IOp (MLD-based):    ~27%
  - IOp (R-squared):    ~13%
  - Number of types:    varies by specification (~15-30 terminal nodes)
  - Sample size:        ~13,164 complete cases

These are lower-bound estimates of true IOp, as the set of observed
circumstances is necessarily incomplete.

================================================================================
9. CONTACT
================================================================================

For questions about the replication code or data access, please contact the
corresponding author.

================================================================================