Hybrid Architecture for Automatic Video-Based Fall Detection Using YOLOv11, MediaPipe Pose, and LSTM Networks

Juan M. Triviño (1), Email: jmtrivinop@correo.usbcali.edu.co

Andrés F. Lasso (1), Email: aflassop@correo.usbcali.edu.co

Carlos M. Paredes (1, ✉), Email: cmparedesv@usbcali.edu.co

Victor M. Peñeñory (1), Email: vmpeneno@usbcali.edu.co

(1) LIDIS, Faculty of Engineering, Universidad de San Buenaventura, 760035 Cali, Colombia

Abstract

Falls represent one of the leading causes of injury and loss of autonomy among older adults worldwide. This work proposes a lightweight hybrid deep learning architecture for automatic fall detection, combining person detection with YOLOv11m, human pose estimation with MediaPipe, and temporal analysis using a long short-term memory network. Evaluated on the Le2i dataset, the model classified frames into normal activity, fall in progress, and person on the floor, achieving an overall accuracy of 99.23% and a weighted F1-score of 97.38%. The system matches or outperforms recent hybrid and transformer-based approaches while requiring lower computational resources, demonstrating its suitability for real-time embedded or home monitoring applications. Future work will focus on performance in uncontrolled environments and optimization for edge computing.

Keywords

Video-based fall detection

spatio-temporal analysis

YOLOv11

MediaPipe

LSTM

elderly monitoring

Juan M. Triviño, Andrés F. Lasso, Carlos M. Paredes, and Victor M. Peñeñory: These authors contributed equally to this work.

Introduction

Falls are a major public health concern among older adults, affecting a significant proportion of those over 65 and increasing with age. They lead to health complications, hospitalizations, loss of independence, and reduced quality of life, while placing a substantial burden on healthcare systems [Mercy2024].

Given the high prevalence and serious consequences of falls among older adults, there is a need for systems that allow continuous monitoring to help mitigate the risk of severe incidents [Kaur2025]. Fall detection devices, which can be wearable or non-wearable, offer a practical solution. Wearable devices provide high accuracy but may cause discomfort and require proper placement. Non-wearable systems reduce dependence on the user and include vision-based and environment-based approaches, each with limitations related to lighting, privacy, multiple users, and adaptation to new environments [10.3389/frobt.2020.00071].

Both wearable and non-wearable fall detection devices rely on analytical methods to determine whether an event is a fall. Early solutions used simple threshold-based criteria, while machine learning classifiers such as support vector machines, k-nearest neighbors, decision trees, and random forests improved accuracy at the cost of higher computational demands. Recently, deep learning approaches have gained relevance for their robustness and adaptability. Among these, long short-term memory networks capture temporal dependencies, convolutional neural networks enable efficient object and video analysis, and transformer-based architectures integrate spatial and temporal information with strong performance [inproceedings, Liu2025, Nez-Marcos2024].

Many existing fall detection approaches rely on complex architectures, multiple sensors, or limited datasets, revealing the need for solutions that balance accuracy, generalization, computational efficiency, and practical applicability. To address this, a hybrid system was developed combining person detection with YOLOv11m, pose estimation with MediaPipe, and temporal analysis using a long short-term memory network. The system was evaluated on the Le2i dataset, with video frames labeled as normal activity, fall in progress, or person on the floor and organized into sequences of 30 frames. The proposed model achieved performance comparable to recent hybrid approaches.

The main contributions of this work are summarized as follows:

The study proposes a lightweight hybrid architecture that combines person detection (YOLOv11m), human pose estimation (MediaPipe), and temporal modeling with LSTM networks, enabling the capture of spatial, postural, and sequential patterns associated with fall events.

A structured preprocessing and labeling strategy is introduced for the Le2i Coffee Room dataset, converting video-level annotations into frame-level labels and generating fixed 30-frame sequences that strengthen temporal learning and improve model robustness.

Experimental evaluation shows that the proposed approach achieves high accuracy and weighted F1-scores while requiring substantially lower computational resources compared to recent transformer-based or hybrid architectures.

The document is organized as follows: Sect. 2 reviews the methods investigated so far to address the described issue, Sect. 3 presents the proposed methodology, Sect. 11 reports the experimental results, Sect. 12 provides the discussion and comparison with state-of-the-art works, and Sect. 13 presents conclusions and future research directions.

Related works

Along with deep learning algorithms, a recent trend in studies is the use of multiple datasets for the training process. Among the most notable are UP-Fall, UR-Fall, and Le2i. Table 1 summarizes the most representative methodologies published between 2024 and 2025, detailing the technologies employed, their architectural approaches, and the main results obtained on each dataset. In general, hybrid methods that integrate convolutional networks with transformers or attention modules achieve accuracy values above 96%, highlighting the potential of deep learning in this area.

\begin{table}[htpb]
\centering
\caption{Summary of fall detection methodologies with their technologies and results}
\label{tab:fall_detection_methods}
\begin{tabular}{p{1cm} p{4.5cm} p{4.2cm} p{2.2cm}}
\toprule
Study & Methodology & Technologies & Accuracy (\%) \\
\midrule
Liu2025 & Hybrid model (Modified YOLOv8s + AlphaPose + BCIoU loss) & Object detection, pose estimation, sparse convolutions, BCIoU optimization & +4.3\% accuracy / +4.5\% F1 / +37.5\% speed (Le2i) \\
\citep{Nez-Marcos2024} & Deep Learning (Transformer + CNN + GRU with feature fusion) & Video, spatiotemporal extraction, pyramidal network & 95.45\% (UR-Fall) / 99.17\% (UP-Fall) \\
Khan2025 & Deep CNN (RBNet with self-attention + TSA optimization) & Residual networks with self-attention, Tree Seed algorithm & 93.2–92.5\% (Soonchunhyang Univ. Dataset) \\
Raza2025 & Pose estimation with ML and Vision Transformers & OpenPose, AlphaPose, HRNet, Vision Transformers & 98.90\% (Le2i), 96.44\% (UP-Fall), 98.43\% (UR-Fall) \\
Cai2025 & Deep Learning (Video Swin Transformer with hierarchical self-attention) & Vision Transformer, spatiotemporal attention, multiresolution & 96.1\% (Le2i) / 97.0\% (UR-Fall) / F1=96.4\% / Recall=95.8\% \\
Shin2025 & GCN + Sep-TCN (three spatiotemporal streams) & Skeleton-based learning, body graphs, separable temporal convolution & 99.68\% (ImViA) / 99.97\% (UP-Fall) / 99.47\% (FU-Kinect) / 98.97\% (UR-Fall) \\
Fu2025 & 3D-SCNN with sparse convolutions for spatiotemporal optimization & Sparse CNN, video analysis, computational load reduction & 99.82\% (UR-Fall) / 96.59\% (Multi-Camera) \\
Kibet2024 & Transformer encoder-decoder on MediaPipe Pose joint sequences & Pose estimation, temporal analysis, Transformer & 97.6\% (own dataset) \\
Cai2025b & Vision Transformer (ViT) for global fall detection in video & Global self-attention, action recognition, motion analysis & 95.8\% (Le2i, UR-Fall) / Sens.=94.6\% / F1=95.2\% \\
Dutt2024 & Modular DL (STFT + 1D-CNN + OpenPose + GradCAM) & Pose estimation, Fourier Transform, interpretability (GradCAM) & 96–98\% (UR-Fall, NTU RGB+D, MCFD) \\
Ma2025 & Deep Learning (YOLOv11 + STGCN) integrated with Edge Computing & YOLOv11, AlphaPose, STGCN, deployment on Jetson devices (NX and AGX Orin) & Acc., Rec., F1 $>$0.98 (own dataset); FP 12–16\%, FN 15–18\% (edge test) \\
Li2024 & Deep Learning (Pyramid Network + Transformer + Feature Fusion + GRU) & CNN for image reduction, Transformer with pooling for spatial feature extraction, feature fusion module, GRU for temporal extraction & 99.61\% (UR-Fall), 99.33\% (Le2i) \\
\bottomrule
\end{tabular}
\end{table}

Based on the review presented in Table 1, it is observed that most recent studies combine CNNs with attention-based or time-series models, such as LSTM or Transformers. This trend reflects a shift toward hybrid models capable of integrating spatial and temporal information more efficiently.

Regarding datasets, the most commonly used remain UR-Fall, UP-Fall, and Le2i, as they are publicly available and facilitate the comparison of results among different studies. Various works [Gaya-Morey2024, Capodici2025] highlight their role as benchmarks for validating new fall detection models. However, they also point out that these datasets have limitations, as they were recorded in controlled environments with few participants and conditions that are not representative of real-world scenarios. For this reason, new and more diverse datasets are being developed, including people of different ages, contexts, and environments, aiming to improve model generalization and performance in real-life situations.

Based on the limitations observed in the available datasets and current methodologies, there is a need to explore new combinations of architectures that efficiently integrate detection, pose estimation, and temporal analysis. The contribution of this work lies in proposing and evaluating a system that combines three main modules: person detection using YOLOv11m, pose estimation with MediaPipe, and temporal analysis using an LSTM network, to determine its effectiveness in the automatic detection of falls in older adults.

Methodology

The general approach is grounded in the sequential integration of three main processes: person detection, pose estimation, and temporal analysis of body movement. From video sequences, the system identifies the individuals present in the scene, extracts the structural information of their joints, and analyzes the temporal patterns associated with postural transitions that characterize a fall event. The following subsections describe in detail the system components, the dataset used, the preprocessing stages, the model architecture, and the training procedure.

Data sources and preprocessing

For the development and evaluation of the model, a selection from the public Le2i Fall Detection Dataset was used, which is widely employed in computer vision–based fall detection research [Charfi2013]. This dataset contains video recordings in indoor environments such as living rooms, offices, and corridors, capturing both simulated falls and daily activities. The sequences were recorded with RGB and RGB-D cameras and include variations in lighting, camera position, and fall direction, providing diversity and realism to the analyzed scenarios [Charfi2013].

In this work, the Coffee Room 1 and Coffee Room 2 subsets of the dataset were used, comprising a total of 70 videos recorded in a controlled environment simulating a typical living room. The recordings have a uniform resolution of 320×180 pixels and were captured with RGB cameras under different lighting and perspective conditions.

Each video includes one or more sequences in which participants perform daily activities (such as walking, sitting, or picking up objects) and simulated falls in different directions (frontal, backward, and lateral). Figures 1–3 show representative examples of the three main scenarios considered in the dataset: a daily activity (1), a fall in progress (2), and a person on the floor after the fall (3).

Fig. 1

Example frame corresponding to a daily activity.

Fig. 2

Example frame corresponding to a fall in progress.

Fig. 3

Example frame corresponding to a person on the floor after the fall.

The dataset does not originally provide an action label for each frame; it only provides temporal markers indicating the start and end of each fall within the videos. Based on this information, a frame-by-frame labeling process was developed to assign a class to each frame. The defined labels were: 0, corresponding to routine or non-fall activities; 1, for frames in which the person is in the process of falling; and 2, for those in which the person is already on the floor.

This procedure made it possible to transform the temporal information of the dataset into a fully labeled dataset suitable for training and evaluating the supervised classification model. Table 2 presents an example of the structure of the labels assigned to each frame within a video.

\begin{table}[h!]
\caption{Example of the tags for each frame within a video.}
\label{tab:tabla_ejemplo_anotacion}
\centering
\begin{tabular}{lcc}
\hline
Video & Frame & Label \\
\hline
Coffee\_room\_01-fall-01 & 1 & 0 \\
Coffee\_room\_01-fall-01 & 2 & 0 \\
Coffee\_room\_01-fall-01 & 3 & 0 \\
\vdots & & \\
Coffee\_room\_01-fall-01 & $n-1$ & 0 \\
\hline
Coffee\_room\_01-fall-01 & $n$ & 1 \\
Coffee\_room\_01-fall-01 & $n+1$ & 1 \\
Coffee\_room\_01-fall-01 & $n+2$ & 1 \\
\vdots & & \\
Coffee\_room\_01-fall-01 & $h$ & 1 \\
\hline
Coffee\_room\_01-fall-01 & $h+1$ & 2 \\
Coffee\_room\_01-fall-01 & $h+2$ & 2 \\
Coffee\_room\_01-fall-01 & $h+3$ & 2 \\
\vdots & & \\
Coffee\_room\_01-fall-01 & $x$ & 2 \\
\hline
\end{tabular}
\end{table}

In this table, $n$ represents the frame number where the fall begins, $h$ corresponds to the frame where the fall ends, and $x$ indicates the total number of frames in the video. In this way, the temporal sequence of each video is segmented into three clearly differentiated phases, allowing the model to progressively learn the transition between normal activities and fall events.
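The labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `label_frames` and its arguments are hypothetical, frames are assumed to be numbered from 1, and videos without a fall are assumed to be labeled 0 throughout.

```python
def label_frames(total_frames, fall_start=None, fall_end=None):
    """Return one class label per frame (frames numbered from 1).

    0 = normal activity, 1 = fall in progress, 2 = person on the floor.
    fall_start (n) and fall_end (h) are the dataset's temporal markers;
    passing None for both is an assumed convention for videos with no fall.
    """
    if fall_start is None or fall_end is None:
        return [0] * total_frames
    if not 1 <= fall_start <= fall_end <= total_frames:
        raise ValueError("expected 1 <= fall_start <= fall_end <= total_frames")

    labels = []
    for frame in range(1, total_frames + 1):
        if frame < fall_start:
            labels.append(0)   # frames 1 .. n-1
        elif frame <= fall_end:
            labels.append(1)   # frames n .. h
        else:
            labels.append(2)   # frames h+1 .. x
    return labels


# Example: a 10-frame clip whose fall spans frames 4 to 6.
print(label_frames(10, fall_start=4, fall_end=6))
```

With $x = 10$, $n = 4$, and $h = 6$, the first three frames receive label 0, frames 4 to 6 receive label 1, and the remaining frames receive label 2, mirroring the three blocks of Table 2.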

An important limitation of the dataset is that it does not include keypoint or joint annotations, which represent the skeletal position of the person in each frame. These coordinates are essential for training the LSTM network and performing classification based on body postures. Therefore, a complementary pose estimation process was developed using the MediaPipe Pose model to generate the required joint coordinates and enable the training of the proposed model.

During preprocessing, the videos were converted to RGB format and split into frames at a rate of 30 frames per second (fps). The YOLOv11m person detection model was then applied to each frame to delimit the regions of interest (bounding boxes) corresponding to the individuals present in the scene. The detected regions were subsequently processed with the MediaPipe Pose model, which estimates the coordinates of 33 human body keypoints. Figure 4 shows the different stages of this process.

Fig. 4

Processing flow of each frame.
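The step between person detection and pose estimation amounts to turning each bounding box into a crop region that lies inside the frame. The sketch below illustrates only that geometric step and does not call YOLOv11 or MediaPipe. The detection format `(x1, y1, x2, y2, confidence)`, the function name `boxes_to_rois`, and the `min_conf` threshold are assumptions made for illustration; the paper does not state which confidence threshold was used.

```python
import math


def boxes_to_rois(detections, frame_width, frame_height, min_conf=0.5):
    """Convert detector boxes into integer crop regions inside the frame.

    detections: iterable of (x1, y1, x2, y2, confidence) in pixel units
    (a hypothetical format chosen for this sketch).
    Returns a list of (left, top, right, bottom) with exclusive right/bottom,
    so a crop would be frame[top:bottom, left:right].
    """
    rois = []
    for x1, y1, x2, y2, conf in detections:
        if conf < min_conf:
            continue  # skip low-confidence detections (assumed threshold)
        # Clamp the box to the frame so the crop never indexes out of bounds.
        left = max(0, math.floor(x1))
        top = max(0, math.floor(y1))
        right = min(frame_width, math.ceil(x2))
        bottom = min(frame_height, math.ceil(y2))
        if right > left and bottom > top:
            rois.append((left, top, right, bottom))
    return rois


# Example on a 320x180 frame: the first box spills over the frame edges,
# the second falls below the confidence threshold.
detections = [(-5.0, 10.2, 100.7, 200.0, 0.9), (0.0, 0.0, 10.0, 10.0, 0.2)]
print(boxes_to_rois(detections, frame_width=320, frame_height=180))
```

Each returned region would then be cropped from the frame and passed to the pose estimator.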

Since MediaPipe internally performs image resizing and normalization before inference, no manual spatial normalization was applied beforehand. The resulting coordinates were stored as two-dimensional vectors $(x, y)$. These coordinates were associated with the corresponding label for each frame, generating a dataset composed of 66 features, formed by the 33 pairs of coordinates (one per keypoint), plus the target variable.
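As a small illustration of this layout, the following sketch flattens 33 $(x, y)$ pairs into a 66-value feature vector and appends the frame label. The function name `build_feature_row` and the column order $(x_0, y_0, \ldots, x_{32}, y_{32}, \text{label})$ are assumptions for the example, not a description of the authors' storage format.

```python
NUM_KEYPOINTS = 33  # number of keypoints estimated by MediaPipe Pose


def build_feature_row(keypoints, label):
    """Flatten 33 (x, y) pairs into 66 features followed by the label."""
    if len(keypoints) != NUM_KEYPOINTS:
        raise ValueError(f"expected {NUM_KEYPOINTS} keypoints, got {len(keypoints)}")
    features = []
    for x, y in keypoints:
        features.extend((float(x), float(y)))  # x_i then y_i
    return features + [label]


# Example with placeholder coordinates (not real pose estimates).
dummy_keypoints = [(i, i + 100) for i in range(NUM_KEYPOINTS)]
row = build_feature_row(dummy_keypoints, label=1)
print(len(row))  # 66 features + 1 label column
```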

It is important to note that both YOLOv11 and MediaPipe Pose, being computer vision models based on deep neural networks, were not able to reliably detect the person and their joint points in all frames. Therefore, frames in which no valid detection was obtained were discarded from the final dataset to maintain the quality and consistency of the sequences used for training.
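A minimal sketch of this filtering step, followed by the grouping into 30-frame sequences mentioned in the Introduction, is shown below. The record layout (a dict with `keypoints` and `label`), the function names, and the non-overlapping stride are assumptions for illustration; the text does not specify whether consecutive sequences overlap.

```python
def drop_invalid_frames(frames, num_keypoints=33):
    """Keep only frames with a complete set of keypoints.

    frames: list of dicts with 'keypoints' (list of (x, y) pairs, or None
    when detection or pose estimation failed) and 'label'.
    """
    return [
        f for f in frames
        if f["keypoints"] is not None and len(f["keypoints"]) == num_keypoints
    ]


def make_sequences(frames, length=30, stride=30):
    """Group frames into fixed-length sequences.

    stride == length gives non-overlapping windows (an assumption);
    a smaller stride would give overlapping windows. Trailing frames that
    do not fill a whole window are left out.
    """
    return [
        frames[start:start + length]
        for start in range(0, len(frames) - length + 1, stride)
    ]


# Example: 70 frames, two of which have no valid detection.
frames = [{"keypoints": [(0.0, 0.0)] * 33, "label": 0} for _ in range(70)]
frames[5]["keypoints"] = None
frames[40]["keypoints"] = None

valid = drop_invalid_frames(frames)
sequences = make_sequences(valid)
print(len(valid), len(sequences))
```

Note that removing frames before windowing means a sequence may span a short temporal gap where detections were missing.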

Table 3 presents a simplified example of the resulting dataset structure corresponding to a single video, showing the assigned labels, estimated joint coordinates, and an additional field indicating the validity of each detection.

\begin{table}[h!]
\caption{Summary structure of the dataset with labels, keypoint coordinates, and the \texttt{keypoints\_info} field for each frame.}
\label{tab:keypoints}
\centering
\begin{tabular}{lccc}
\hline
Frame & Label & $(x_0, y_0, \ldots, x_{32}, y_{32})$ keypoints \\
\hline
1 & 0 &