# Narrative Dynamics in Korean Flash Fiction: Computational Analysis Pipeline

This repository provides the code and pre-computed numerical signals needed to replicate the findings of our research.

## 📢 Important Note on Reproducibility & Data Availability
To comply with copyright restrictions regarding the original literary texts, the raw dataset (`flash_fiction_merged.csv`) is not included in this repository. 

- **For Transparency:** Scripts for data preprocessing and LLM-based signal extraction (Phase 1) are provided to document our methodology. These scripts require a high-performance computing environment (e.g., for the Solar-10.7B LLM) and the restricted raw corpus.
- **For Replication:** We provide the **pre-computed master file** (`flash_fiction_with_surprisal_coherence_semantic.csv`), which contains the numerical signals for all analyzed stories. Users can immediately replicate the statistical analysis, trajectory clustering, and peak dynamics by running the `FlashFiction_Analysis.ipynb` notebook.

---

## 🚀 Quick Start (Analysis Only)

To replicate the statistical analysis and visualization:
1. Ensure the **`flash_fiction_with_surprisal_coherence_semantic.csv`** (Final Master File) is in your data directory.
2. Run the **`FlashFiction_Analysis.ipynb`** notebook. 
3. This notebook utilizes pre-calculated numerical signals to generate research findings without requiring the restricted raw texts.
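
As an illustration, the master file can be loaded and inspected with pandas. The column names below (`story_id`, `sent_idx`, `surprisal`, `coherence`) are placeholders for this sketch, not necessarily the actual schema — check the CSV header for the real column names.

```python
import pandas as pd

# Toy frame standing in for the master file; in practice you would run:
#   df = pd.read_csv("flash_fiction_with_surprisal_coherence_semantic.csv")
df = pd.DataFrame({
    "story_id":  [1, 1, 1, 2, 2],      # assumed column names for illustration
    "sent_idx":  [0, 1, 2, 0, 1],
    "surprisal": [3.2, 4.1, 2.8, 3.9, 3.5],
    "coherence": [None, 0.62, 0.55, None, 0.71],
})

# Per-story mean surprisal: a typical first look at the signals
per_story = df.groupby("story_id")["surprisal"].mean()
```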

---

## 🛠 Installation Guide

The environment setup is optimized for the analysis and visualization phase. You can set up the environment using either Conda or Pip.

### Method A: Conda (Recommended)
```bash
# Create the environment from environment.yaml
conda env create -f environment.yaml
conda activate flashfiction_analysis
```

### Method B: Pip
```bash
# Create a fresh environment (conda shown; any virtual environment works)
conda create -n flashfiction_analysis python=3.10 -y
conda activate flashfiction_analysis
pip install -r requirements.txt
```

*Note: Dependencies for heavy LLMs (Solar-10.7B) and raw text processing have been excluded from the current setup files to maintain a lightweight analysis environment.*

---

## 📂 Data Inventory

To comply with copyright laws, we provide the calculated numerical signals while restricting access to the original literary texts.

| File Name | Provided? | Description |
| :--- | :---: | :--- |
| `book_list_summary.csv` | **Yes** | Metadata (ISBN, Title, Author, Publisher, etc.) |
| **`flash_fiction_with_surprisal_coherence_semantic.csv`** | **Yes** | **Final Master File** (Numerical signals only) |
| `flash_fiction_merged.csv` | No | Initial raw dataset with full texts (Restricted) |
| `flash_fiction_merged_filtered.csv` | No | Refined dataset after outlier removal (Restricted) |

---

## ⚙️ Research Pipeline (Methodological Reference)

The following stages describe how the narrative signals were generated. These scripts are provided to ensure the transparency of our research methodology.

### **[Phase 1] Data Generation (Reference Only)**
*These steps require access to the restricted raw dataset and a high-compute environment.*
1. **Preprocessing & Filtering (`check_sent_stats.py`)**
   - Sentence segmentation using `KSS` and `Mecab`.
   - Outlier Removal: Stories in the bottom 5% and top 5% of the sentence count distribution are excluded.
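
The 5%/95% sentence-count filter above can be sketched with pandas quantiles. The column name `n_sentences` is an assumption for illustration; `check_sent_stats.py` may use a different one.

```python
import pandas as pd

# Toy per-story sentence counts, including two extreme stories (3 and 90)
df = pd.DataFrame({"n_sentences": [3, 10, 12, 15, 14, 11, 13, 16, 12, 90]})

# Keep only stories within the [5th, 95th] percentile of sentence count
lo, hi = df["n_sentences"].quantile([0.05, 0.95])
filtered = df[(df["n_sentences"] >= lo) & (df["n_sentences"] <= hi)]
```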
2. **Surprisal Extraction (`calculate_surprisal.py`)**
   - Model: `Solar-10.7B` (LLM) utilizing a 3,500 token sliding window.
   - Metric: Sentence-level Surprisal (Negative Log-Likelihood).
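
Sentence-level surprisal reduces to an average negative log-likelihood over the sentence's tokens. A toy numeric sketch, with made-up probabilities standing in for Solar-10.7B outputs over the 3,500-token context window:

```python
import math

# Hypothetical conditional token probabilities P(token | preceding context);
# the real values come from the LLM, these are illustrative only.
token_probs = [0.25, 0.10, 0.50]

nll = [-math.log(p) for p in token_probs]      # per-token surprisal (nats)
sentence_surprisal = sum(nll) / len(nll)       # mean NLL for the sentence
```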
3. **Discourse Signal Calculation (`coherence_topic_calc.py`)**
   - Model: `ko-sroberta-multitask` (SBERT).
   - **Local Coherence**: $\text{Coherence}_t = \cos(v_t, v_{t-1})$
   - **Semantic Shift**: $\text{Semantic Shift}_t = 1 - \cos(v_t, \mu_{1..t-1})$
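
The two formulas above can be checked numerically on toy embeddings; here 2-D vectors stand in for the ko-sroberta-multitask sentence vectors.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy sentence embeddings v_1, v_2, v_3 (real ones are SBERT vectors)
V = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])

coherence_t = cos(V[2], V[1])          # cos(v_t, v_{t-1})
mu = V[:2].mean(axis=0)                # mean of v_1 .. v_{t-1}
semantic_shift_t = 1.0 - cos(V[2], mu)
```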

### **[Phase 2] Final Analysis (Executable)**
*This stage is performed using the provided numerical signals.*
* **Input:** `flash_fiction_with_surprisal_coherence_semantic.csv`

The master file is analyzed in **`FlashFiction_Analysis.ipynb`**, which performs:
1. **Stability Diagnostics**: Detection and removal of initial 'burn-in' noise.
2. **Trajectory Clustering**: Identification of narrative archetypes and structural patterns.
3. **Peak Dynamics**: Point-wise and dynamic recovery analysis (TTR, Slope) following narrative shocks.
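
As an illustration of the kind of preprocessing trajectory clustering typically requires, variable-length per-story signal series can be resampled to a fixed grid so they become comparable vectors. The grid size (20) below is an arbitrary choice for this sketch, not necessarily the notebook's setting.

```python
import numpy as np

def resample(series, n_points=20):
    """Linearly resample a per-story signal series to a fixed length."""
    x_old = np.linspace(0.0, 1.0, num=len(series))
    x_new = np.linspace(0.0, 1.0, num=n_points)
    return np.interp(x_new, x_old, series)

short = resample([1.0, 2.0, 3.0])       # 3-sentence story -> 20 points
long_ = resample(list(range(50)))       # 50-sentence story -> 20 points
```

After this step, every story is a fixed-length vector, ready for a standard clustering routine.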

Upon execution, all statistical reports and numerical summaries are saved to the `statistical_outputs/` directory, and all plots and figures are exported to the `figure_outputs/` directory.
