2Educational Neuroimaging Group, Department of Education in Science and Technology, Technion - IIT, Haifa, Israel
Corresponding author:
Educational Neuroimaging Group, Faculty of Education in Science and Technology, Faculty of Biomedical Engineering, Technion – Israel Institute of Technology, Haifa, Israel
Abstract
Storytelling is a key part of early childhood development, especially when it is interactive and expressive. Research shows that dynamic reading styles help activate brain areas linked to attention, imagination, and language, yet most current tools still rely on subjective observation rather than objective, measurable features. Here, we introduce a tool that analyzes the quality of storytelling and provides a score based on three main aspects: how expressive the storyteller is (i.e., the level of vocal engagement, measured by how monotonous the speech is), how clear the speech is (i.e., pronunciation accuracy), and how natural the storytelling feels (in terms of speech rhythm and flow). One hundred and nineteen recordings were used, compiled by manually segmenting longer audio files into focused speech segments. These segments were scored by the project authors on three dimensions—Expressiveness, Clarity, and Naturalness—on a 1-to-5 scale. Acoustic features such as pitch variability, amplitude dynamics, formant dispersion, and various spectral descriptors were extracted using Python libraries including Librosa and pyAudioAnalysis. Two complementary approaches were applied. First, a GUI-based app was developed to extract features and visualize them against labeled benchmarks. Second, a machine learning analysis was performed using Random Forest regression with Leave-One-Out Cross-Validation to explore predictive patterns and identify key acoustic indicators. Feature selection improved predictive performance significantly, with pitch-related features consistently emerging as the most informative. Results revealed that expressive storytelling was characterized by higher pitch and amplitude variability and clearer articulation, while naturalness features showed weaker correlations. These findings support the feasibility of using automated acoustic analysis to evaluate storytelling quality.
Keywords:
Machine Learning
Speech Signal Processing
Acoustic Feature Analysis
Introduction
Storytelling is a cornerstone of children's language development, impacting cognitive and neurobiological growth. Interactive reading, such as dialogic reading, engages brain areas like the angular gyrus for comprehension and the precuneus for imagination and attention [1][2]. Such reading not only boosts brain connectivity in networks associated with attention and imagination, but also strengthens white matter pathways important for language processing and reading readiness, fostering long-term academic success [3]. Early book reading activates the right temporal-parietal junction (TPJ), supporting joint attention and mentalizing, while screen-based storytelling engages these areas less effectively and is associated with reduced activation in regions linked to comprehension and narrative integration [4].
Current tools for evaluating storytelling quality focus primarily on traditional or observational approaches, often overlooking objective measures such as voice dynamics or engagement metrics. Research highlights the importance of analyzing storytelling parameters, including pitch variation, pacing, and vocal emotion, to enhance interactivity and engagement [5], [6], [7]. A rich and dynamic vocal delivery has been associated with increased listener focus and stronger activation of brain areas involved in imagery and emotional processing, emphasizing the potential benefits of more interactive storytelling.
Quantifying storytelling quality is critical for assessing the richness of linguistic exposure in parent-child or caregiver-child dynamics. While traditional evaluations rely on subjective observation, acoustic analysis and machine learning can offer more objective insight into how stories are delivered, reducing bias and enabling consistent evaluation across contexts.
The goal of the project is to create a tool that analyzes speech characteristics—such as pitch variation, prosodic dynamics, and vocal clarity—based on subjective scores of expressiveness, clarity, and naturalness [8], [9], [10]. These scores serve as labels to explore which acoustic features best reflect storytelling quality. By addressing current gaps, the tool provides a quantitative approach to evaluating how a story sounds, with potential applications in research and early education.
Methods
Datasets
The dataset for the current study comprised sixty-nine full-length Hebrew-language storytelling recordings, obtained from parent-child joint reading sessions. Each recording lasted approximately 2–5 minutes and featured an adult reading a story aloud.
Data segmentation
To increase the number of samples and isolate adult speech, the recordings were manually segmented into shorter clips containing clear storytelling segments with only adult speech. Noisy recordings were excluded prior to segmentation. This process resulted in a final dataset of 119 labeled segments, each treated as a separate sample for analysis.
Labeling
Each of the 119 segments was manually assigned a score by the project team on three perceptual dimensions: 1) Expressiveness – degree of vocal variation and engagement; 2) Clarity – articulation and speech intelligibility; 3) Naturalness – smoothness and rhythm of delivery. Scores ranged from 1 (low) to 5 (high) and were based solely on auditory impression, without consideration of linguistic content or transcripts.
Preprocessing
Prior to feature extraction, all recordings underwent the following preprocessing steps: 1) Silence trimming – long segments of silence were removed to focus the analysis on active speech; 2) Resampling – all recordings were resampled to a consistent sampling rate to ensure uniform frequency analysis; 3) RMS (root mean square) normalization – applied to standardize volume across all samples.
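For illustration, the three preprocessing steps can be sketched with NumPy/SciPy (the study used Librosa's equivalents; the 16 kHz target rate, 30 dB trim threshold, and 0.1 RMS reference level are assumptions, as the exact values are not reported):

```python
import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16000  # assumed target rate; the value actually used is not reported

def preprocess(y, sr, top_db=30.0, frame=1024):
    """Silence trimming, resampling, and RMS normalization (sketch)."""
    # 1) Silence trimming: drop leading/trailing frames whose RMS energy
    #    is more than `top_db` below the loudest frame.
    n = len(y) // frame
    rms = np.sqrt(np.mean(y[:n * frame].reshape(n, frame) ** 2, axis=1)) + 1e-12
    db = 20 * np.log10(rms / rms.max())
    keep = np.where(db > -top_db)[0]
    if keep.size:
        y = y[keep[0] * frame:(keep[-1] + 1) * frame]
    # 2) Resampling to a consistent rate for uniform frequency analysis.
    if sr != TARGET_SR:
        y = resample_poly(y, TARGET_SR, sr)
    # 3) RMS normalization to a common (arbitrary) reference level of 0.1.
    r = np.sqrt(np.mean(y ** 2))
    return (y * (0.1 / r) if r > 0 else y), TARGET_SR
```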
Phase 1: App-based feature analysis
Feature Extraction and Interface Integration
For the first phase, a small set of simple, interpretable acoustic features was extracted from each recording: pitch change rate, pitch variability, amplitude range, amplitude variability, F1_std and F2_std (formant variability), and speech rate, estimated by generating a transcript with the Whisper library and counting the number of words per unit time. These features were selected for their intuitive relevance to storytelling quality and ease of visualization.
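To illustrate how the two pitch descriptors can be computed, the following self-contained sketch estimates per-frame F0 with a simple autocorrelation tracker (the app's actual pitch tracker, frame, hop, and voicing-threshold values are not specified here and are assumptions). Pitch variability is taken as the standard deviation of voiced F0, and pitch change rate as the mean absolute F0 change per second:

```python
import numpy as np

def frame_pitch(y, sr, frame=2048, hop=512, fmin=75.0, fmax=300.0):
    """Per-frame F0 (Hz) via autocorrelation; unvoiced frames become NaN."""
    f0 = []
    for start in range(0, len(y) - frame + 1, hop):
        w = y[start:start + frame] * np.hanning(frame)
        ac = np.correlate(w, w, mode="full")[frame - 1:]
        lo, hi = int(sr / fmax), int(sr / fmin)
        lag = lo + int(np.argmax(ac[lo:hi]))
        # Simple voicing decision: periodicity peak must be strong enough.
        f0.append(sr / lag if ac[lag] > 0.3 * ac[0] else np.nan)
    return np.asarray(f0)

def pitch_features(y, sr, hop=512):
    """Pitch variability (Hz) and pitch change rate (Hz/sec)."""
    f0 = frame_pitch(y, sr, hop=hop)
    voiced = f0[~np.isnan(f0)]
    pitch_variability = float(np.std(voiced))
    dt = hop / sr  # seconds between consecutive frames
    pitch_change_rate = float(np.nanmean(np.abs(np.diff(f0))) / dt)
    return pitch_variability, pitch_change_rate
```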
To enable user interaction, upon uploading a new recording, the tool automatically extracted key acoustic features, including pitch variability, pitch change rate, fluency (words per minute), total words, and duration. Afterwards, it displayed the numerical values of these features for the uploaded recording, and plotted the new recording's feature values on scatter plots with pre-labeled examples, using fitted trend curves to indicate estimated Expressiveness and Clarity scores. The tool also marked the new recording on the plots with red dashed lines to benchmark its location relative to manually rated samples visually.
Phase 2: Machine learning analysis
Extended feature extraction
To support predictive modeling, over 200 acoustic components were extracted per recording, using Python libraries such as Librosa and pyAudioAnalysis [11]. These included high-dimensional spectral features and their temporal derivatives, such as: MFCCs, along with Delta and Delta-Delta MFCCs, Mel-spectrogram descriptors, spectral contrast, centroid, roll-off, and bandwidth.
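The frame-level matrices produced by Librosa (e.g., `librosa.feature.mfcc`) were summarized into fixed-length descriptors per recording. A minimal NumPy sketch of this aggregation step, including the delta and delta-delta derivatives (the regression-based delta formula mirrors `librosa.feature.delta`; the feature-naming scheme is illustrative):

```python
import numpy as np

def delta(feat, width=2):
    """First-order temporal derivative of a (n_coeffs, n_frames) matrix
    via the standard regression formula (cf. librosa.feature.delta)."""
    n = feat.shape[1]
    pad = np.pad(feat, ((0, 0), (width, width)), mode="edge")
    num = np.zeros_like(feat, dtype=float)
    for w in range(1, width + 1):
        num += w * (pad[:, width + w:width + w + n] - pad[:, width - w:width - w + n])
    return num / (2 * sum(w * w for w in range(1, width + 1)))

def summarize(feat, name):
    """Collapse frames into per-coefficient mean/std descriptors for the
    raw, delta, and delta-delta matrices."""
    d = delta(feat)
    out = {}
    for tag, mat in (("", feat), ("delta_", d), ("delta2_", delta(d))):
        for i, row in enumerate(mat):
            out[f"{name}_{tag}{i}_mean"] = float(row.mean())
            out[f"{name}_{tag}{i}_std"] = float(row.std())
    return out
```

Applying this to each Librosa feature matrix (MFCCs, Mel-spectrogram bands, spectral contrast, centroid, roll-off, bandwidth) yields the > 200 components per recording.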
Modeling and Validation
A Random Forest Regressor was employed for each of the three quality dimensions (Expressiveness, Clarity, Naturalness), leveraging ensemble methods to enhance predictive accuracy and control overfitting. Model performance was evaluated using Leave-One-Out Cross-Validation (LOOCV) to maximize use of the limited dataset and assess generalization reliability. Subsequently, a Random Forest Classifier was explored to better align with the categorical nature of binned scores [12].
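The regression setup can be reproduced in a few lines with scikit-learn; a sketch, assuming default forest hyperparameters since none are reported:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def loocv_r2(X, y, seed=0):
    """LOOCV R^2 for one quality dimension: each sample is predicted by a
    forest trained on the remaining n-1 samples."""
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    preds = cross_val_predict(model, X, y, cv=LeaveOneOut())
    return r2_score(y, preds)
```

The same loop is run once per dimension (Expressiveness, Clarity, Naturalness), with `RandomForestClassifier` substituted for the regressor in the classification variant.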
Dimensionality reduction and feature selection
Due to the disproportion between the number of features (> 200) and the number of training samples (119), feature selection was applied to reduce dimensionality and mitigate the risk of overfitting. A high feature-to-sample ratio increases model complexity, which can lead to poor generalization. To address this, subsets of features were selected based on feature importance scores from the trained Random Forest model. The model was retrained multiple times using features that surpassed different importance thresholds.
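The thresholded retraining can be sketched as follows (threshold values such as 0.012 appear in the Results; the forest settings here are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def select_by_importance(X, y, threshold, seed=0):
    """Fit a forest, then keep only features whose impurity-based
    importance meets the threshold; the model is then retrained on
    the reduced matrix X[:, mask]."""
    rf = RandomForestRegressor(n_estimators=100, random_state=seed).fit(X, y)
    mask = rf.feature_importances_ >= threshold
    return X[:, mask], mask
```

Sweeping the threshold and re-evaluating with LOOCV at each value produces the performance-versus-threshold curves reported in the Results.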
Feature importance evaluation
Following training, feature importance scores were used to identify the features that contributed most to the predictions. This served as the basis for interpreting model behavior.
Results
Phase 1: App-based feature analysis
In the first phase, the relationship between individual acoustic features and storytelling quality scores assigned by human raters was examined. Pitch change rate showed correlations of 0.45 with Clarity, 0.29 with Expressiveness, and 0.27 with Naturalness. Pitch variability correlated with Clarity (0.25), Expressiveness (0.06), and Naturalness (0.07). Amplitude variability, amplitude range, and amplitude change rate showed correlations between − 0.09 and 0.17 across all labels. F1_std and F2_std showed low correlations ranging from − 0.10 to 0.18. The correlation values are shown in Fig. 1.
To complement the correlation analysis, pitch change rate values were also plotted against human-rated scores. Figure 2 displays the distribution of pitch change rate values across Expressiveness scores, while Fig. 3 shows the same distribution across Clarity scores. In both figures, group means are shown alongside linear fits for each score level.
The application (https://audiowebtest.onrender.com/) can be used by uploading an audio file in MP3, MP4, or WAV format through a simple web interface. After selecting a file, users click the "Analyze" button to initiate the process. The app automatically converts the file to WAV format if necessary, extracts key acoustic features such as pitch variability, pitch change rate, fluency, word count, and duration, and displays the results directly on the screen. The analyzed recording is then visualized on two scatter plots - Expressiveness versus pitch variability, and Clarity versus pitch change rate - allowing users to see where their recording falls relative to human-rated examples. The information from the uploaded recording is marked with red dashed lines to enable clear visual benchmarking. See Fig. 4.
Figure 4 shows an example of the interface output from the app. The uploaded recording had a pitch variability of 100.7 Hz and a pitch change rate of 7.87 Hz/sec. These values are displayed both numerically and visually on the scatter plots, where the new recording is marked with red dashed lines. The plots position the new sample in relation to the labeled reference dataset. This allows users to observe how their recording compares to existing samples.
Phase 2: Machine learning analysis results
Feature selection resulted in 13 features for Expressiveness and Clarity, and 30 for Naturalness. The machine learning models trained to predict storytelling quality scores from the larger set of acoustic features achieved R² scores of 0.2971 for Expressiveness (threshold: 0.012), 0.3076 for Clarity (threshold: 0.016), and 0.2085 for Naturalness (threshold: 0.008). Figure 5 illustrates R² scores versus feature importance thresholds for the Expressiveness and Clarity dimensions, showing peaks at the optimal thresholds followed by declines as fewer features are retained; the trend for the Naturalness dimension was similar.
These low R² values indicate limited predictive power, with the model explaining only 20–30% of the variance in scores. Selected features included pitch-related metrics (e.g., pitch_variability for Expressiveness) and spectral descriptors (e.g., mfcc_3_mean for Clarity), aligning with expectations from Phase 1. However, the regression approach penalized small deviations harshly. This oversensitivity, combined with the inherent noise from subjective labeling and the small dataset, contributed to the underwhelming performance.
To better align with the categorical nature of quality perception, where distinctions between broad levels (e.g., low, mediocre, high) are more relevant than precise numerical differences, we binned the 1–5 scores into three classes: 1 ([1, 1.5, 2]), 2 ([2.5, 3, 3.5]), and 3 ([4, 4.5, 5]). Random Forest classification was then applied, also with LOOCV and nested feature selection to avoid data leakage. This yielded improved accuracies: 0.6723 for Expressiveness (65 features), 0.8571 for Clarity (28 features), and 0.6303 for Naturalness (32 features). Figure 6 shows accuracy versus feature importance thresholds, with clear peaks at the best thresholds, validating the selection process.
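The binning described above maps the half-point rating scale onto three classes; a minimal sketch:

```python
import numpy as np

def bin_scores(scores):
    """Map 1-5 ratings (0.5 steps) to three classes:
    {1, 1.5, 2} -> 1, {2.5, 3, 3.5} -> 2, {4, 4.5, 5} -> 3."""
    s = np.asarray(scores, dtype=float)
    return np.where(s <= 2, 1, np.where(s <= 3.5, 2, 3))
```

The classifier is then trained exactly as in the regression setup, with `RandomForestClassifier` replacing the regressor inside the LOOCV loop and feature selection nested within each fold.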
The confusion matrices (Fig. 7) further illustrate performance. For Expressiveness, the model correctly classified 59/72 class 3 samples but struggled with class 1 (0/2 correct) and class 2 (24/45 correct), reflecting moderate imbalance. Clarity showed strong diagonal dominance (91/93 class 3 correct), but poor minority class performance (0/1 class 1, 11/25 class 2). Naturalness followed a similar pattern, with 71/71 class 3 correct but only 16/41 class 2 and 0/7 class 1. Selected features emphasized pitch and spectral dynamics (e.g., pitch_variability and mfcc_delta_2_mean for Expressiveness; pitch_change_rate and mel_spec_8_std for Clarity), consistent with Phase 1 correlations.
Discussion
The goal of this study was to determine whether measurable acoustic features, particularly pitch variation, prosodic dynamics, and vocal clarity, could reliably reflect human-rated storytelling quality in terms of Expressiveness, Clarity, and Naturalness. In line with our hypotheses, the results showed that changes in pitch rate were the most informative feature, demonstrating the strongest correlations with both Expressiveness (r = 0.29) and Clarity (r = 0.45). Other features, such as pitch variability and amplitude dynamics, showed weaker or more variable associations with the rated dimensions.
This study demonstrates that acoustic features, particularly pitch and spectral dynamics, can quantify storytelling quality dimensions across two complementary phases. In Phase 1, box plots of simple features like pitch change rate revealed systematic trends with Expressiveness and Clarity scores, supported by fitted functions, while a GUI enabled users to upload recordings and visualize their placement against pre-labeled benchmarks, offering an intuitive tool for real-time assessment. Phase 2 advanced this through machine learning, where the shift from Random Forest regression to classification provided more actionable insights, achieving accuracies up to 0.8571 despite dataset constraints. These accuracies represent a marked improvement over regression, as binning reduced sensitivity to minor score variations and better captured perceptual categories. The high Clarity score (0.8571) aligns with its relatively objective nature—listeners can more consistently identify clear speech—leading to stronger model alignment with ratings. However, class imbalance (e.g., 93/119 class 3 for Clarity) likely inflated accuracies, as the majority-class baseline (predicting class 3 always) yields 0.605, 0.782, and 0.597 for Expressiveness, Clarity, and Naturalness, respectively. The model's bias toward class 3, evident in the confusion matrices, underscores this effect, particularly with scarce class 1 samples (1–7 across categories).
Limitations
The results of the current study should be interpreted in light of the following limitations. These include a small dataset (119 samples), which risks overfitting in a high-dimensional feature space, and subjective labeling by only two raters, lacking a robust ground truth. The class imbalance, with few class 1 samples (1–7 across categories) and a dominance of class 3 (60.5%-78.2%), may bias the model toward the majority class, inflating accuracy metrics, particularly for Clarity (0.8571), where consistent subjective ratings likely contributed to its high predictability. Future work should prioritize expanding to a much larger dataset to improve generalizability and reduce overfitting risks. Incorporating multi-rater averages from hundreds or thousands of listeners would establish a more robust ground truth, mitigating subjectivity. To automate preprocessing, advanced tools for speaker diarization and separation (e.g., via libraries like pyannote.audio) could replace manual segmentation, enhancing efficiency in isolating storyteller speech [13]. Once a larger dataset is available, more advanced models such as neural networks (e.g., LSTM or Transformer-based architectures) could be explored to capture complex non-linear patterns beyond Random Forest capabilities. These enhancements would refine the tool for practical applications in early education and neuroimaging research.
Conclusion
This study sets out to develop and evaluate a tool for assessing storytelling quality using measurable acoustic features. To achieve this, we manually labeled 119 storytelling recordings on three perceptual dimensions—Expressiveness, Clarity, and Naturalness—and extracted a wide range of audio features for analysis. We implemented both an interactive app for visualizing individual recordings and a machine learning pipeline for predictive modeling.
The results indicate that pitch-based features, particularly pitch change rate, are strong indicators of perceived clarity and expressiveness. Classification models outperformed regression, highlighting the usefulness of binning perceptual scores into broader categories.
Key questions remain regarding how to improve the prediction of naturalness, reduce model bias introduced by class imbalance, and generalize findings to larger and more diverse populations. Future work should expand the dataset, incorporate ratings from multiple listeners to reduce subjectivity, and explore advanced models capable of capturing temporal and semantic context in storytelling. These steps would support broader application of the tool in educational and clinical settings.
Author Contribution
Ori, Michelle: analyzed the data, created the application, wrote the original version of the manuscript. Tzipi: data curation, supervision, wrote the manuscript.
References
1.Council on Early Childhood, et al. (Aug. 2014). Literacy Promotion: An Essential Component of Primary Care Pediatric Practice. Pediatrics, 134(2), 404–409. 10.1542/peds.2014-1384
2.Hutton, J. S., et al. (May 2017). Story time turbocharger? Child engagement during shared reading and cerebellar activation and connectivity in preschool-age children listening to stories. Plos One, 12(5), e0177398. 10.1371/journal.pone.0177398
3.Horowitz-Kraus, T., Magaliff, L. S., & Schlaggar, B. L. (2024). Neurobiological Evidence for the Benefit of Interactive Parent–Child Storytelling: Supporting Early Reading Exposure Policies, Policy Insights Behav. Brain Sci., vol. 11, no. 1, pp. 51–58, Mar. 10.1177/23727322231217461
4.Horowitz-Kraus, T., & Hutton, J. S. (2015). From emergent literacy to reading: how learning to read changes a child’s brain. Acta Paediatrica, 104(7), 648–656. 10.1111/apa.13018
5.Pecukonis, M., Yücel, M., Lee, H., Knox, C., Boas, D. A., & Tager-Flusberg, H. (Mar. 2025). Do Children’s Brains Function Differently During Book Reading and Screen Time? A fNIRS Study. Developmental Science, 28(2), e13615. 10.1111/desc.13615
6.Travis, K. E., Leitner, Y., Feldman, H. M., & Ben-Shachar, M. (2015). Cerebellar white matter pathways are associated with reading skills in children and adolescents, Hum. Brain Mapp., vol. 36, no. 4, pp. 1536–1553, Apr. 10.1002/hbm.22721
7.Hutton, J. S., Horowitz-Kraus, T., Mendelsohn, A. L., DeWitt, T., Holland, S. K., Authorship, C-M-I-N-D., & Consortium (2015). Home Reading Environment and Brain Activation in Preschool Children Listening to Stories, Pediatrics, vol. 136, no. 3, pp. 466–478, Sept. 10.1542/peds.2015-0359
8.Garzotto, F., Paolini, P., & Sabiescu, A. (2010). Interactive storytelling for children, in Proceedings of the 9th International Conference on Interaction Design and Children, in IDC ’10. New York, NY, USA: Association for Computing Machinery, pp. 356–359. 10.1145/1810543.1810613
9.Kirsch, C. (2016). Developing language skills through collaborative storytelling on iTEO, Lit. Inf. Comput. Educ. J., vol. 7, no. 2, June 10.20533/licej.2040.2589.2016.0298
10.Singh, J. (2022). pyAudioProcessing: Audio Processing, Feature Extraction, and Machine Learning Modeling, scipy, June 10.25080/majora-212e5952-017
11.Giannakopoulos, T. (2015). pyAudioAnalysis: An Open–Source Python Library for Audio Signal Analysis, PLOS ONE, vol. 10, no. 12, e0144610, Dec. 10.1371/journal.pone.0144610
12.sklearn.ensemble.RandomForestRegressor — scikit-learn documentation.
13.Bredin, H. (2019). pyannote.audio: neural building blocks for speaker diarization, Nov. 04, arXiv: arXiv:1911.01255. 10.48550/arXiv.1911.01255