Soft Active EMG Interface for Machine Learning-Enabled Silent Speech Recognition

Yuta Kurotaki¹,², Shunsuke Yamakoshi¹, Reitaro Yoshida², Yutaka Isoda¹, Tamami Takano¹, Yuji Isano¹, Yusuke Miyake², Kentaro Kuribayashi², Hiroki Ota¹*

¹Department of Mechanical Engineering, Yokohama National University, 79-5 Tokiwadai, Hodogaya-Ku, Yokohama, Kanagawa 240-8501, Japan

²Pepabo Research and Development Institute, GMO Pepabo, Inc., 26-1 Sakuragaoka, Shibuya Ward, Tokyo 150-8512, Japan

*Corresponding author: Hiroki Ota, ota-hiroki-xm@ynu.ac.jp

Abstract

Silent speech recognition (SSR) provides an alternative communication pathway where spoken sound is absent. However, conventional approaches are limited by the requirement of constant facial attachment, privacy concerns, and unstable signal acquisition. Herein, we propose a soft active electromyography (EMG) interface that enables word-level SSR via machine learning. Worn on the hand, the device uses a fingertip electrode that can be placed near the lips to acquire EMG signals only when desired. The interface integrates liquid-metal (LM) interconnects, transparent flexible printed circuit (FPC) electrodes, and elastomer encapsulation to ensure high mechanical stability during finger motion. A deep neural network, trained on these stable signals, achieved 94.3% accuracy in classifying a 30-word vocabulary, demonstrating robust linguistic discrimination. Furthermore, real-time drone control validated the practicality of this approach in noisy and privacy-sensitive environments where conventional voice recognition fails. This study highlights the potential of soft, wearable EMG systems as secure, intuitive human–machine interfaces.

Introduction

SSR has attracted attention as a human–machine interface (HMI) technology that conveys human intent to machines without audible speech¹. It offers a new communication pathway in noisy environments and situations requiring confidentiality, as well as for individuals with speech impairments. In the future, it is expected to serve as a foundation for intuitive and secure interactions that transcend dependence on audible speech.

To realize SSR, conventional approaches have explored optical lip-reading^2,3,4, ultrasonic detection of tongue and lip motion^5,6,7, electroencephalography (EEG)-based language estimation^8,9,10, and infrared sensors¹¹. However, optical and ultrasonic methods are constrained by environmental conditions, privacy concerns, and device size. Brain–machine interfaces (BMIs) often capture abundant non-speech information, reducing controllability, while accelerometer/gyroscope sensors¹² impose a significant burden due to their large size. In other words, conventional methods have not simultaneously achieved high-precision signal acquisition and the wearability and social acceptability needed for daily use.

Despite the development of multiple SSR approaches, there remains a pressing demand for devices that enable daily use of silent speech. To achieve this, it is essential to improve wearability around the face and head while providing a system that minimizes user burden yet ensures sufficient recognition accuracy.

As a possible solution to the aforementioned problem, flexible devices employing soft materials have attracted attention. Reported examples include surface EMG (sEMG) sensors based on PEDOT:PSS/polyurethane composites¹³, strain sensors comprising metal thin films and polymer films¹⁴, and triboelectric sensors made of PVC/nylon/sponge¹⁵, all of which provide low invasiveness and skin compatibility. Accordingly, SSR devices based on soft materials are promising for daily use.

However, previous soft-material-based methods require sensors to be permanently attached to the face or throat, which compromises user comfort and affects aesthetics and social acceptance^16,17,18. More critically, such continuously worn devices may acquire biosignals without the user’s intent, leading to risks of privacy invasion and security concerns¹⁹. Thus, while continuous sensing has often been considered an advantage for wearables, it becomes a bottleneck in the context of SSR.

To address challenges in aesthetics, privacy, and security, SSR requires not only flexibility and wearability but also a novel operation modality that functions on-demand under user control. This concept is relevant not only to SSR but also to smartphones that exploit location histories and healthcare devices that process biometric information—both contexts in which the handling of sensitive personal data requires strict consideration^20,21.

In this study, we developed a wearable soft active electromyography (EMG) interface that operates based on intentional user input. The device is worn on the hand, and by placing a fingertip electrode near the lips, EMG signals are acquired only when necessary. Structurally, the system integrates a flexible elastomer (Ecoflex), a transparent FPC substrate, and highly conductive yet stretchable LM interconnects, enabling stable electrical connectivity even under repeated complex finger movements. This high mechanical stability ensures consistent and reproducible EMG signals, supporting accurate speech recognition through machine-learning algorithms. Moreover, by eliminating the need for continuous facial attachment, the device gives users control over when signals are acquired, simultaneously ensuring privacy and security. Furthermore, we demonstrated its utility and applicability through SSR and drone operation in noisy environments. The developed soft EMG interface can simultaneously realize machine-learning-compatible stable signal acquisition and user-driven on-demand operation, establishing a new paradigm for practical SSR.

Results

Design for Intent-Driven SSR

The developed wearable smart silent speech interface, mounted on the hand, enables language recognition and machine control through mouth movements alone, without audible speech. By placing the fingertip near the mouth only when intended (Fig. 1a), the exposed electrode on the fingertip pad makes contact with the area around the lips, enabling the acquisition of EMG signals from the perioral muscles. These signals are then processed by machine-learning algorithms to enable character recognition and control of smart devices such as drones (Fig. 1b).

Fig. 1

Intent-driven SSR. a Conceptual diagram of the proposed wearable silent voice interface. Electrodes are placed on the hand, and the hand is brought near the mouth only when necessary to acquire EMG signals. b EMG signals are acquired by placing the hand around the mouth to perform silent speech. A DNN recognizes the silent speech to operate the device. c Device can be worn on a finger. It integrates LM wiring, Ecoflex, and transparent FPC electrodes to track hand and mouth movements. d Device is used by placing the fingertip around the mouth. It features four electrodes for EMG acquisition, positioned around the mouth.

In contrast to devices that require sensors to be continuously attached to the face, this device operates only when the user intentionally touches the mouth area with the fingertip, performing language recognition or machine input only at that moment. As the device is physically separated from the perioral skin except during input, no silent speech is sensed outside of intended usage, and external attempts to trigger sensing by third-party attacks are impossible. This addresses privacy and security concerns. Moreover, it mitigates aesthetic issues for users in cultural contexts where attaching sensors to the face or throat is undesirable. In addition, as a feature of silent speech, words can be recognized in situations where one wishes to remain quiet and avoid being overheard. No audible sound is used; hence, device operation remains unaffected even in noisy environments.

The EMG electrode attached to the fingertip is composed of a transparent FPC substrate and Ecoflex, with the electrode arranged on the inner surface of a tubular fingertip structure. An EMG amplification circuit is mounted on the dorsum of the hand, connected to the fingertip electrode via LM interconnects. As the finger is structurally complex and highly stretchable, LM wiring is used; this wiring offers superior electrical conductivity, minimal resistance fluctuation under deformation, and stable resistance under repeated strain compared with conventional silver paste, PEDOT:PSS, and carbon-based conductors. By employing LM wiring between the fingertip and the back of the hand, EMG signals can be stably transmitted to the circuit even during repeated finger bending. This technical feature is crucial when incorporating machine learning, which requires large amounts of training data, with stretchable devices²² (Fig. 1c).

An example of device usage is shown in Fig. 1d. The device is worn on the hand, the fingertip is placed near four muscles surrounding the lips, and EMG signals are acquired from these perioral muscles. From these signal patterns, silent speech is recognized via machine learning.

Device Design and Characterization

An active EMG interface was fabricated that combines high skin conformability with mechanical flexibility. The device body employs Ecoflex—a soft elastomer material—as the packaging layer, forming a structure that can be mounted on the dorsum of the hand (Fig. 2a). An EMG circuit for amplification and signal processing (Supplementary Fig. 1) is implemented on the Ecoflex substrate, while surface electrodes (EMG electrodes) for direct skin contact are placed on the inner side of the fingertip (Supplementary Fig. 2).

Fig. 2

Device structure of the developed soft active EMG interface and data processing. a Electrodes are placed on the Ecoflex substrate to enable attachment to the fingertip. LM wiring connects to the EMG circuit, which is then covered with Ecoflex. b Cross section of the device showing the electrodes placed on the fingertip for skin contact during non-vocal speech, along with the EMG amplification circuit on the wrist side. c Flowchart showing the process of placing four electrodes on the skin, amplifying the EMG signal, passing it through low- and high-pass filters, and capturing the movement around the lips via analog-to-digital conversion.

The EMG electrodes were fabricated as a bipolar pair on a transparent FPC substrate. The reference electrode was placed on the circuit side on the back of the hand. The inter-electrode distance was set to 5 mm to enable EMG measurement within the limited fingertip area. The electrodes were soldered onto copper foils on the transparent FPC substrate to facilitate skin contact. They were adhered to the Ecoflex substrate using Sil-Poxy and then encapsulated with Ecoflex. The EMG electrodes and circuit were interconnected using LM (Supplementary Fig. 3). The device was designed for repeated use and removability; tensile testing of the fabricated electrodes confirmed that the resistance of the electrode–circuit interconnection via LM remained stable after repeated use (Supplementary Fig. 4). Finally, the entire structure was encapsulated by spray-coating Ecoflex. Owing to its planar structure, the device can be folded around the fingertip, positioning the EMG electrodes on the finger pad to form the interface (Fig. 2b).

The interface was designed to conform closely to the fingers, with fingertip electrodes functioning as dry electrodes. When placed on the perioral muscles, these electrodes detect weak EMG signals associated with speech articulation with high sensitivity. The acquired EMG signals are amplified within the EMG circuit, passed through low-pass and high-pass filters to remove noise, subjected to analog-to-digital conversion (ADC), and then transmitted to an external computing device for real-time processing (Fig. 2c).

The fingertip electrodes were positioned anatomically to correspond to four speech-related muscles: the buccinator, orbicularis oris, depressor labii inferioris, and mentalis. The buccinator stabilizes the cheeks during plosive and fricative production. The orbicularis oris is responsible for rounding and protrusion of the lips, while the depressor labii inferioris moves the lower lip, and the mentalis acts in coordination with movements that retract the mouth corners. By aligning the electrodes with these muscles, EMG activity associated with speech articulation could be effectively recorded (Fig. 3a).

Fig. 3

Electrode interfaces for EMG recording. a Placement of fingertip electrodes on four lip-related muscles: buccinator, orbicularis oris, depressor labii inferioris, and mentalis. b Representative EMG waveform of the mentalis muscle, indicating alternating rest and silent articulation activity. c Four-channel EMG signals recorded during the articulation of “Move” and “Turn,” highlighting distinct muscle activation patterns across phonetic gestures.

An example EMG waveform of the mentalis muscle is shown for a single channel during speech and rest (Fig. 3b). The signal indicates EMG activity during 1 s of silent speech, followed by 1 s of rest and another 1 s of silent speech. In the amplification circuit used in this study, muscle activity is captured within the range 0–3.3 V, centered at 1.6 V. This baseline facilitates analog-to-digital conversion for preprocessing EMG signals as inputs to machine-learning models.
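
As a rough illustration of this conversion step, the sketch below maps raw ADC counts onto the 0–3.3 V range and subtracts the 1.6 V baseline. The 10-bit resolution and the function names are assumptions made for illustration; the ADC resolution is not stated here.

```python
import numpy as np

V_REF = 3.3        # full-scale voltage of the amplifier output (from the text)
BASELINE_V = 1.6   # resting level of the amplified signal (from the text)
ADC_BITS = 10      # assumption: the ADC resolution is not given in the text


def counts_to_volts(counts, bits=ADC_BITS, v_ref=V_REF):
    """Map raw ADC counts (0 .. 2**bits - 1) onto 0 .. v_ref volts."""
    counts = np.asarray(counts, dtype=float)
    return counts * v_ref / (2 ** bits - 1)


def center_on_baseline(volts, baseline=BASELINE_V):
    """Subtract the resting level so muscle activity swings around zero."""
    return np.asarray(volts, dtype=float) - baseline
```

Centering the signal in this way means that rest periods sit near zero, so bursts during silent speech stand out before feature extraction.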

Using this device, we conducted experiments to measure muscle activity during silent speech. Figure 3c presents four-channel EMG signals recorded during silent articulation of the words “Move” and “Turn,” which were selected from a 30-word vocabulary. The signals correspond to the buccinator, orbicularis oris, depressor labii inferioris, and mentalis muscles, reflecting localized muscle activity associated with articulation. Distinct EMG patterns were observed between Move and Turn, with differences corresponding to coordinated movements of the respective muscles. These results suggest that the proposed device can capture input signals sufficient to distinguish between intended words, thereby functioning as an effective silent speech interface.

Implementation of Machine Learning-Integrated Silent Speech Interface

Using the high-conformability EMG device described in the previous section, we constructed a machine-learning model for word recognition during silent speech and achieved classification of voiceless articulation. During silent speech, fingertip electrodes were placed near four perioral muscles, and EMG signals were acquired at a sampling rate of 1 kHz (1-ms intervals). From the four-channel EMG signals, Mel-frequency cepstral coefficients (MFCCs) were computed and used as inputs to a deep neural network (DNN) trained on EMG-derived MFCC features (Fig. 4a). The architecture of the DNN (Supplementary Fig. 5) comprised multiple fully connected (Dense) layers following the input layer, with dropout and batch normalization applied to each layer. This design stabilized training and suppressed overfitting.
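
To make the layer structure concrete, the following is a minimal NumPy sketch of a forward pass through dense layers with batch normalization and dropout, ending in a 30-way softmax. It is not the authors' network: the input size (4 channels × 6 MFCCs, flattened), the layer widths, the dropout rate, the ReLU activation, and the order of operations within a layer are all assumptions, and the weights are random rather than trained.

```python
import numpy as np

N_FEATURES = 24      # assumption: 4 channels x 6 MFCCs, flattened
N_CLASSES = 30       # 30-word vocabulary (from the text)
HIDDEN = (128, 64)   # assumption: layer widths are not given in the text


def init_params(rng, sizes):
    """Create weights, biases, and batch-norm statistics for each layer."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        params.append({
            "W": rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)),
            "b": np.zeros(n_out),
            "gamma": np.ones(n_out), "beta": np.zeros(n_out),
            "mean": np.zeros(n_out), "var": np.ones(n_out),
        })
    return params


def forward(x, params, rng=None, drop_rate=0.3, eps=1e-5):
    """Dense -> batch norm -> ReLU -> dropout per hidden layer, then softmax.

    Dropout is applied only when an rng is supplied (training mode). Batch
    norm always uses the stored statistics here (inference-style).
    """
    h = np.asarray(x, dtype=float)
    for layer in params[:-1]:
        h = h @ layer["W"] + layer["b"]
        h = layer["gamma"] * (h - layer["mean"]) / np.sqrt(layer["var"] + eps) + layer["beta"]
        h = np.maximum(h, 0.0)
        if rng is not None:
            keep = rng.random(h.shape) >= drop_rate
            h = h * keep / (1.0 - drop_rate)   # inverted dropout
    logits = h @ params[-1]["W"] + params[-1]["b"]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
params = init_params(rng, (N_FEATURES, *HIDDEN, N_CLASSES))
probs = forward(rng.normal(size=(8, N_FEATURES)), params)  # shape (8, 30)
```

In practice a deep-learning framework would handle training and the batch-norm running statistics; the sketch only shows how the three ingredients named above fit together in one layer.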

Fig. 4

Machine learning-based recognition of silent speech. a Acquisition of multichannel EMG signals and feature extraction using MFCCs. b Preprocessing pipeline including framing, FFT, filter-bank analysis, and DCT for MFCC computation. c Classification performance of the DNN, which recognized a 30-word vocabulary with 94.3% accuracy. d Feature distribution before training, showing overlapping word clusters. e Feature distribution after DNN classification, demonstrating improved separability of silent speech words.

The EMG signals acquired from the device were preprocessed via framing into fixed time windows; a fast Fourier transform (FFT) was then applied to each frame. The resulting frequency components were processed through a Mel filter bank and subjected to a discrete cosine transform (DCT) to obtain MFCCs. The first- through sixth-order MFCCs were extracted as features and used as DNN inputs (Fig. 4b).
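
The pipeline above (framing, FFT, Mel filter bank, DCT) can be sketched with NumPy alone. The frame length, hop size, Hamming window, and number of Mel filters below are illustrative assumptions, since these settings are not given here; the sketch keeps orders 1 to 6 and skips the zeroth coefficient, which is one reading of "first- through sixth-order."

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters evenly spaced on the Mel scale, shape (n_filters, n_fft // 2 + 1)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2))
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def mfcc(signal, fs=1000, frame_len=256, hop=128, n_filters=20, n_coeffs=6):
    """Return an array of shape (n_frames, n_coeffs) for one EMG channel."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (signal.size - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)            # framing + window
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / frame_len  # FFT power spectrum
    log_mel = np.log(power @ mel_filterbank(n_filters, frame_len, fs).T + 1e-10)
    k = np.arange(n_filters)
    n = np.arange(1, n_coeffs + 1)                          # orders 1..6
    dct = np.cos(np.pi * n[:, None] * (k[None, :] + 0.5) / n_filters)  # DCT-II basis
    return log_mel @ dct.T
```

For a four-channel recording stored as an array `emg` of shape (samples, 4), the per-channel features could be stacked with `np.stack([mfcc(emg[:, c]) for c in range(4)])`.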

Model performance was maximized by introducing optimization techniques step-by-step and verifying their effects. Ultimately, the combination of input-channel selection, data splitting, cross-validation, callbacks, hyperparameter tuning, and normalization achieved the highest accuracy of 94.3% (Fig. 4c).

We examined whether adjusting the number of input channels allowed the model to learn features more effectively. The performance with three input channels is shown in Supplementary Fig. 6; the four-channel configuration achieved the best results overall. To evaluate generalization performance fairly, the dataset of 1,800 samples was randomly divided into 1,500 training samples and 300 test samples. Furthermore, five-fold cross-validation was applied to enhance the reliability of model evaluation. By repeatedly using the entire dataset for both training and testing, variance due to specific splits was reduced, enabling more robust estimation of model performance. The classification results were visualized by applying t-distributed stochastic neighbor embedding (t-SNE) to the feature space, which confirmed that separability between words improved after application of the machine-learning model (Figs. 4d and e). These results demonstrate that the proposed approach is effective for EMG-based silent speech interfaces and highlight its practicality as a non-audible communication method.
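
The two evaluation schemes can be sketched as index bookkeeping. The helper names and the random seed are hypothetical, and the sketch assumes that the five folds are drawn from all 1,800 samples (giving 1,440 training and 360 validation samples per fold), which is one reading of "using the entire dataset."

```python
import numpy as np


def holdout_split(n_samples, n_test, rng):
    """Shuffle indices once and reserve the last n_test for testing."""
    order = rng.permutation(n_samples)
    return order[:-n_test], order[-n_test:]


def kfold_indices(n_samples, k, rng):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]


rng = np.random.default_rng(0)
train_idx, test_idx = holdout_split(1800, 300, rng)   # 1,500 / 300 split
folds = list(kfold_indices(1800, 5, rng))              # five (train, val) pairs
```

Averaging the accuracy over the five validation folds gives an estimate that depends less on any single random split than the hold-out figure alone.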

Application to Human–Computer Interaction: Demonstration via Drone Control

To evaluate the applicability of the developed high-conformability EMG device, we conducted a proof-of-concept experiment in which silent speech was used for real-time word classification and subsequent human–computer interaction through drone control.

Participants wore the device and articulated operation commands without producing audible speech. The corresponding EMG signals were analyzed and classified in real time, enabling recognition of drone control commands (Fig. 5a). The device was connected to a microcontroller, while the trained model was executed on a PC to classify silent speech in real time.
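
One way to picture the PC-side loop is as a generator that groups incoming samples into fixed windows before classification. The sketch below assumes that the microcontroller sends one comma-separated line of four values per sample; that line format, the function name, and the 1-s window are assumptions for illustration, while the 1 kHz rate and four channels follow the text.

```python
import numpy as np

FS_HZ = 1000     # sampling rate (from the text)
N_CHANNELS = 4   # four perioral channels (from the text)


def windows_from_lines(lines, window_s=1.0, fs=FS_HZ, n_channels=N_CHANNELS):
    """Group sample lines into arrays of shape (fs * window_s, n_channels).

    Malformed lines are skipped; an incomplete trailing window is dropped.
    """
    size = int(round(fs * window_s))
    buf = []
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != n_channels:
            continue
        try:
            buf.append([float(p) for p in parts])
        except ValueError:
            continue
        if len(buf) == size:
            yield np.asarray(buf)
            buf = []
```

Because the function accepts any iterable of text lines, the same code can be fed from a live serial port or from a recorded log file when replaying a session offline.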

Fig. 5

Application to human–computer interaction through drone control demonstration. a Mechanism for speech recognition without vocalization for drone control. Words classified by a pretrained model are displayed on screen after the device is connected to an Arduino. b Example MFCC features derived from four-channel EMG signals when uttering “Stop” and “Start.” c Experiment simulating drone flight using four commands (“Start,” “Move,” “Turn,” and “Stop”). d Flight scenario: Recognized commands are transmitted to the quadcopter via Wi-Fi, enabling real-time execution of operations from 1 to 7 in sequence. e Demonstration of drone takeoff, movement, turning, and landing executed solely through SSR using the device.

Six-dimensional MFCCs were computed as features for each of the four EMG channels recorded during silent speech of the drone commands (Fig. 5b). MFCCs, which are designed to emulate the human auditory system, were calculated by segmenting the target signal into short time windows and performing the short-time Fourier transform (STFT) to obtain the power spectrum. A Mel filter bank was then applied to generate the Mel-frequency spectrum. Subsequently, a DCT was performed on the logarithmic Mel spectrum. The six lowest-order MFCCs were extracted from each channel and used as input features.
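
In a common textbook formulation, the last two steps can be written compactly. The constants 2595 and 700, the number of Mel filters K, and the indexing from n = 1 are the conventional choices and are assumed here; the exact settings used in this work are not stated. E_k denotes the energy at the output of the k-th Mel filter for one frame.

```latex
m(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right), \qquad
c_n = \sum_{k=1}^{K} \log(E_k)\, \cos\!\left[\frac{\pi n}{K}\left(k - \tfrac{1}{2}\right)\right], \quad n = 1, \dots, 6
```

The first expression maps a frequency f in Hz onto the Mel scale used to space the filters, and the second is the DCT that turns the log filter-bank energies into the cepstral coefficients c_n.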

Using these features, we implemented a simulation in which four silent speech commands—Start, Move, Turn, and Stop—controlled a drone’s flight sequence. To verify the feasibility of drone control via silent speech, we conducted a flight simulation using a small quadcopter (Tello), with the system classifying EMG-based MFCC features and transmitting commands over Wi-Fi (Fig. 5c). The drone executed a full sequence of operations—from takeoff to circling and landing—using only these four silent speech commands. Recognition of Start triggered autonomous takeoff from the desk and a transition to stable hovering. Once the drone was airborne, recognition of Move prompted it to fly straight over a preset distance, effectively replacing the forward and backward control of the Tello app’s virtual joystick. Recognition of Turn caused the drone to rotate by 45° either clockwise or counterclockwise, emulating joystick-based yaw control. Finally, recognition of Stop after the flight sequence instructed the drone to land automatically. All commands were classified in real time from EMG-derived MFCC features and transmitted to the drone via Wi-Fi (Fig. 5d).
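
The mapping from recognized words to drone actions can be sketched as a small lookup table. The sketch assumes the plain-text UDP interface of the Tello SDK (default address 192.168.10.1, port 8889); whether the authors used this interface or a wrapper library is not stated. The 50 cm step stands in for the unspecified preset distance, and the function names are hypothetical.

```python
import socket

TELLO_ADDR = ("192.168.10.1", 8889)  # default Tello SDK address; assumed for this setup
MOVE_CM = 50    # assumption: the "preset distance" is not given in the text
TURN_DEG = 45   # rotation per Turn command (from the text)


def to_sdk_command(word, clockwise=True):
    """Translate a recognized word into a Tello SDK text command."""
    table = {
        "Start": "takeoff",
        "Move": f"forward {MOVE_CM}",
        "Turn": f"{'cw' if clockwise else 'ccw'} {TURN_DEG}",
        "Stop": "land",
    }
    if word not in table:
        raise ValueError(f"unrecognized command: {word!r}")
    return table[word]


def send(command, addr=TELLO_ADDR):
    """Send one text command over UDP (requires being on the drone's Wi-Fi)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(command.encode("ascii"), addr)
```

With the SDK interface, `send("command")` would be issued once to enter SDK mode, after which each classified word is forwarded with `send(to_sdk_command(word))`. Keeping the translation separate from the network call lets the mapping be checked without a drone.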

As a result, complex drone maneuvers, including takeoff, translation, turning, and landing, were accurately executed using only EMG-based command inputs through silent speech, without producing audible sound (Fig. 5e). These findings demonstrate that the proposed device functions as a silent speech interface with both real-time capability and high classification accuracy. Moreover, the results highlight its utility as an intuitive, nonverbal method for interaction with machines and robots, particularly in environments where speech is impractical or confidentiality is required. A demonstration of this operation is shown in Supplementary Video 1.

Discussion

In EMG measurements for SSR, particularly with daily use in mind, electrode portability and reusability have been critical challenges. To address this, we developed EMG measurement electrodes and amplification circuits designed to prioritize portability and reusability. Both wet and dry electrodes are available for use²³ to acquire EMG signals from the skin around the mouth; an appropriate type has to be selected depending on the purpose. Gel-based electrodes²⁴ and microneedle electrodes^25,26 exhibit high adhesion. Although gel electrodes enable secure fixation, they require attachment of gel to the skin, which is unsuitable for a device designed for repeated attachment and removal. Meanwhile, microneedle electrodes are invasive and therefore deemed unsuitable for daily use, which would involve repeated skin stimulation. Hence, we adopted an active EMG mechanism to acquire EMG signals noninvasively from the skin surface.

Active EMG enables noninvasive measurement without gels. Typically, such electrodes require fixation using adhesive tape or bands; insufficient fixation increases susceptibility to noise, rendering stable EMG acquisition difficult²⁷. Electrodes based on PEDOT:PSS^13,28, carbon nanotubes or carbon black^29,30,31, and silver paste³² have been reported. Although reusable PEDOT:PSS electrodes have been investigated³³, their resistance fluctuates with strain and temperature. Thus, they were not suitable for this study, where the electrodes had to withstand pressure during manual fixation near the face and remain stable despite environmental variations when worn on the palm. Regarding carbon and silver, prior work compared cPDMS electrodes made of carbon-PDMS, AgPDMS electrodes made of silver-PDMS, and stainless-steel electrodes³⁴. The material characteristics of dry electrodes have also been evaluated³⁵. Considering these results, we adopted solder-coated electrodes that connect easily to copper foil on transparent FPC substrates and exhibit minimal resistance changes due to strain.

Ecoflex was applied around the electrode’s contact area with the skin to prevent slippage during fingertip-based manual fixation against the perioral muscles. In this study, by combining Ecoflex as a flexible base material with transparent FPC electrodes and employing LM wiring that followed finger motion, we designed a system that enabled users to intentionally press electrodes against the skin for stable placement on target muscles (Fig. 2b). This structure eliminated the need for adhesives or tape, while still achieving stable fixation by the user and overcoming the limitations of conventional dry electrodes. Thus, the system combined ease of use with reusability. The EMG amplification circuit incorporated both low-pass and high-pass filters, enabling extraction and amplification of weak EMG signals while removing environmental noise and unwanted frequency components (Fig. 2c).

Although the fingertip electrodes employ solder-coated metallic contacts, which exhibit conventional dry-electrode characteristics, the overall interface integrates soft-elastomer encapsulation and LM interconnects. Thus, while the electrode surface is not intrinsically soft, the device as a whole provides high mechanical compliance and wearability, justifying the designation of a “soft active EMG interface.”

For silent speech classification using machine learning, electrodes were placed on the perioral region without skin pretreatment.

Participants repeated a cycle of 1 s of rest followed by 1 s of silent speech, during which EMG signals were acquired. In a single-channel measurement of the mentalis muscle, although noise was present, voltage increases corresponding to muscle activation during speech movements were observed (Fig. 3b). During articulation of Move (characterized by lip protrusion), EMG signals were recorded in channel 3 (depressor labii inferioris) and channel 4 (mentalis). These muscles are closely involved in lip protrusion and closure³⁶. The depressor labii inferioris lowers the lower lip, while the mentalis protrudes it forward. Detecting their activity during the lip-pursing motion of Move supports the anatomical validity of the measurement. Similarly, during silent articulation of Turn, EMG signals were recorded for approximately 0.75 s in channel 2 (orbicularis oris) and channel 4 (mentalis) as the lips closed. The orbicularis oris functions in closing and rounding the lips, while the mentalis raises the lower lip and wrinkles the chin, moving the lower lip upward and forward³⁷. Therefore, the results indicate that relevant muscle activity during silent speech was successfully captured.

These findings demonstrate that distinctive features of words can be detected during silent speech. Articulatory gestures such as bilabial closure (plosives), constriction (labiodental fricatives), and rounding (approximants) are widely observed across languages. According to the UCLA Phonological Segment Inventory Database, 446 of 451 languages (99%) include bilabial plosives; many languages also feature labiodental fricatives and rounded approximants³⁸. This suggests that the results of the present study may be extended beyond English to other languages (Fig. 3c).

By integrating the EMG signals acquired using the proposed device into a machine-learning model, the word classification accuracy for silent speech was increased to 94.3% (Fig. 4c). This result suggests that speech can be sufficiently distinguished based on the activity of four perioral muscles. In contrast to previous studies that directly processed EMG signals, our classification model applied a feature transformation into MFCCs—a representation widely used in speech recognition—and used these as training data (Figs. 4a and b, Supplementary Fig. 7). This approach has proven highly effective in EMG-based SSR research^39,40,41 and similarly produced a significant improvement in classification accuracy in this study. The use of MFCCs with convolutional neural networks (CNNs) achieves higher accuracy than classifiers such as support vector machines (SVMs)⁴¹, and we adopted this perspective in selecting the model. We further attempted classification using a Conformer model⁴² based on Transformer architectures^42,43; however, we only achieved an accuracy of 90.6%. Ultimately, the MFCC + DNN model achieved the highest accuracy of 94.3% (Supplementary Fig. 8). This indicates that under the condition of limited data in our experiments, the simpler DNN had a better generalization capability than the more complex Conformer model.

The device is designed to be worn on the hand, with electrodes placed on the fingertips to measure perioral muscle activity. Owing to anatomical constraints of the hand, the maximum number of electrodes is five. In this study, we evaluated four electrodes placed from the thumb to the ring finger, which achieved the high classification accuracy of 94.3%. This demonstrates that high-performance recognition can be realized without adhesives and with a limited number of channels. Previous studies have indicated that increasing the number of electrodes (e.g., 7, 8, or 12) leads to more feature information and thus higher classification accuracy³⁹. In our system, the practical constraint is that users manually press the device against the mouth region. This results in a dynamic measurement environment rather than a completely static one, with noise and signal instability caused by electrode placement and hand movement. Despite these limitations—only four electrodes and dynamic operation—the device achieved a high classification accuracy of 94.3%. Comparisons using two- and three-electrode configurations also indicated that four electrodes yielded the highest accuracy, consistent with prior studies in which accuracy improved with additional electrodes (Supplementary Table 1).

Furthermore, the classification accuracy achieved in this study is considered sufficiently high compared with previous studies that assumed fixed electrodes under static measurement environments³⁹. Visualization using t-SNE confirmed that words articulated silently could be distinguished in the feature space. This result demonstrates the feasibility of SSR based on EMG signals, consistent with prior studies that used t-SNE to evaluate classification of movements⁴⁴. These findings support the effectiveness of the proposed device and preprocessing approach (Figs. 4d and e).

As a concrete application example of the proposed wearable silent speech interface, we conducted drone operation experiments. Conventional voice recognition systems are susceptible to interference from loud operational sounds such as drone startup and flight noise, as well as background noise in environments such as industrial or disaster sites, which makes it difficult to recognize soft or whispered speech input^45,46. In contrast, the present system performs SSR based on EMG signals, without requiring microphone input. As a result, control via command input remained feasible during the experiments, unaffected by environmental noise or the drone’s operational sounds. This characteristic enables the system to overcome the practical limitations faced by conventional speech recognition devices in noisy environments and function as an HMI with high robustness against environmental noise. Specifically, EMG signals acquired by the device were fed into a pretrained silent speech classification model, which recognized each command and enabled drone operation.

The speech commands adopted were Start, Move, Turn, and Stop, which are commonly used in smart-device operation. Silent speech-based drone control has exhibited promise in prior research⁴⁷; this study extended its applicability by mapping classification results to corresponding control commands and transmitting them to the drone in real time. Experimental results demonstrated that a full sequence of operations—including takeoff, movement, turning, and landing—could be executed continuously.

Supplementary Table 2 presents smart devices capable of SSR via surface sensing of the face, including EMG, as demonstrated in this study. While conventional systems assume continuous attachment and suffer from privacy issues caused by unintended acquisition of muscle activity, the proposed design initiates signal acquisition only when the user deliberately brings the hand close to the mouth. This approach transfers control of information acquisition to the user and introduces a new paradigm that addresses both privacy and security concerns.

Traditional silent speech interfaces have mainly been investigated with the goal of protecting privacy during communication. However, many patch-type EMG systems^48,49, while capable of stable measurement, require electrodes to be attached over wide areas of the face or neck. As a result, EMG signals are acquired continuously, even when the user does not intend it, which raises privacy concerns. Systems that detect laryngeal muscle motion using magnetoelasticity⁵⁰ have also been studied, but they still require continuous attachment using tape or similar fixation methods and therefore suffer from the same issue of constant signal acquisition. In addition, some studies have used cameras worn between the neck and chest to recognize mouth movements², but these raise privacy concerns because unwanted content may be captured.

In EEG- and neural network-based studies¹⁰, participants are required to wear caps, leading to constant acquisition of brain activity. SSR has also been attempted with everyday objects, such as glasses, masks, or earphones, integrated with accelerometers⁵¹, infrared sensors¹¹, TENGs¹⁴, and strain sensors⁴³. Although these devices achieve high wearability and a familiar appearance, they continuously capture mouth movement, posing a risk of privacy invasion. Additionally, personal information or behavioral patterns can be collected without user consent; thus, always-on wearable sensors inherently carry risks of unintended speech capture and security threats from external attacks. These issues stem from the fact that the devices are physically affixed to the body. Hence, studies have also been conducted on silent speech devices that do not require continuous facial attachment, with daily use in mind⁵².

Ultrasound echo-based silent speech devices⁶ have demonstrated intentional placement against the throat during measurement; however, they require gels and are not designed for daily use, and handheld operation reduces portability. In contrast to these prior works, our device is structured not as a standalone sensor but as an integrated system that does not require continuous facial attachment. By employing soft materials, it enables EMG acquisition only when deliberately pressed against the mouth, without the need for gels or adhesive tape.

This design enables data collection only when intended, mitigating privacy concerns. Although the measurement conditions are less favorable than those of adhesive-based electrodes, because no tape fixation or skin pretreatment is used, our device achieved 94.3% silent speech classification accuracy when combined with machine learning. This performance is comparable to that of similar devices. Drone control experiments further demonstrated the practicality of our device as an HMI.

In this study, we developed a user-controlled wearable soft EMG interface that provides stable signal acquisition and enables SSR. By integrating LM interconnects with elastomer encapsulation, the device maintained reliable electrical performance under repeated finger movements and delivered consistent biosignals suitable for machine learning. Using a DNN, the system classified a 30-word vocabulary with 94.3% accuracy, achieving word-level recognition without continuous facial attachment.

Furthermore, drone control experiments in noisy and privacy-sensitive environments verified its utility for intuitive human–machine interaction. The significance of this study lies in its integration of on-demand operation with machine learning-based linguistic recognition. It represents a new paradigm that addresses the challenges related to privacy, social acceptability, and signal stability inherent in previous wearable SSR systems. This work highlights the potential of soft and stretchable bioelectronics as safe and intuitive communication tools. Future challenges include systematically evaluating the impact of electrode placement variability on recognition accuracy, developing methods to mitigate these effects, expanding the vocabulary, and extending recognition to continuous speech. Overcoming these limitations would enable this technology to evolve into a broadly applicable platform for communication, assistive devices, and control systems in real-world environments.

Methods

Fabrication of EMG amplification circuit board

The circuit for EMG measurement comprised an instrumentation amplifier (AD627, Analog Devices) in the front stage and two operational amplifiers (AD8607, Analog Devices) in the subsequent stage to amplify weak potentials. High-pass and low-pass filters were incorporated to extract only the relevant EMG signals, following established designs for biological signal-processing circuits⁵³. The reference voltage was set to 1.6 V, enabling measurement of muscle potential changes within the range 0–3.3 V (Fig. S1).
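The relationship between the digitized readings and the stated voltage range can be illustrated with a short sketch. It assumes a 10-bit analog-to-digital converter whose full scale equals the 3.3 V upper limit; the converter resolution is not given in the text, and the function names are hypothetical. Only the 1.6 V reference and the 0–3.3 V range are taken from the description above.

```python
# Assumptions (not stated in the text): a 10-bit ADC with a 3.3 V full scale.
ADC_BITS = 10          # assumed converter resolution
V_FULL_SCALE = 3.3     # upper end of the 0-3.3 V range given in the text
V_REF = 1.6            # reference (baseline) voltage given in the text


def counts_to_volts(counts: int) -> float:
    """Convert a raw ADC reading to the voltage at the amplifier output."""
    max_count = (1 << ADC_BITS) - 1
    if not 0 <= counts <= max_count:
        raise ValueError(f"ADC reading out of range: {counts}")
    return counts * V_FULL_SCALE / max_count


def centered_emg(counts: int) -> float:
    """Return the signed deviation from the reference level, in volts."""
    return counts_to_volts(counts) - V_REF


if __name__ == "__main__":
    # Print raw count, output voltage, and deviation from the reference.
    for raw in (0, 496, 1023):
        print(raw, round(counts_to_volts(raw), 3), round(centered_emg(raw), 3))
```

Under these assumptions, readings above the reference level map to positive deviations and readings below it to negative ones, so both polarities of the amplified muscle potential fit within a single-supply range.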

The circuit board was fabricated using a transparent polyimide substrate covered on one side with copper foil (L71KTS 1012EDR T10, Arisawa Mfg. Co., Ltd.). AZ photoresist was spin-coated, dried on a hot plate at 100°C, and patterned via mask exposure. The circuit pattern was then developed through etching (Fig. S2), after which the necessary components were mounted. Two measurement electrodes and one reference electrode were applied to the skin for signal acquisition.

Preparation and wiring of LM paste

LM paste was prepared as follows. Nickel powder (3–7 µm, Alfa Aesar Co.) was dispersed into Galinstan (Maruya.com) at mass ratios of 2% and 5%. The mixture was sonicated using an ultrasonic probe (SFX 550, BRANSON) with a duty ratio of 30% and a total energy of 6 kJ. The mixture was then exposed to air overnight to promote oxidation. Oxidized LM paste was also prepared by stirring Galinstan under ambient conditions with a stirrer (Azone) at 750 rpm for 60 min to induce oxidation.

Fabrication of silent speech device

A mold produced via 3D printing was filled with silicone rubber (Ecoflex 00-20, Smooth-On; A:B = 1:1 weight ratio) and cured in an oven.

The EMG amplification circuit board was fixed onto the cured elastomer using a silicone adhesive (Sil-Poxy, Smooth-On). Fingertip electrodes were attached in the same manner. A polyimide film was cut using a laser marker (Keyence) to form a stencil mask for LM wiring. LM paste was applied through the mask, and removing the mask left the wiring connections between the fingertip electrodes and the EMG circuit.

The connection areas between the circuit board and the LM wiring were sealed with silicone rubber (KE-1606, Shin-Etsu Chemical) and cured. Finally, Ecoflex was coated over the entire device and cured in an oven to encapsulate the system.

Software for silent speech classification

A microcontroller (Arduino) was used to digitize the signals obtained from the electrodes and EMG amplification circuit. EMG signals from four perioral muscles were sampled at 1-ms intervals (1,000 Hz) via four analog input channels and transmitted to a computer (Fig. 2c). These data were used to generate test datasets for prediction with machine-learning models. The EMG signals were transformed into Mel-frequency cepstral coefficients (MFCCs), which served as the input features for machine learning. A five-layer DNN was trained on these features, and its final layer employed the softmax function to perform multiclass classification of silent speech. The classification results were validated using t-SNE, confirming that silent speech words could be distinguished in the feature space (Figs. 4d and e).
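The classification step can be sketched as a forward pass through a small fully connected network with a softmax output. This is a minimal illustration, not the trained model: the input size (52, for example 13 MFCCs for each of four channels), the hidden-layer widths, the ReLU activation, and the random weights are all assumptions, and how the five layers are counted is not specified in the text. Only the 30-class output and the softmax final layer follow the description above.

```python
import math
import random

N_FEATURES = 52              # assumption: e.g. 13 MFCCs x 4 EMG channels
HIDDEN = [128, 64, 32, 32]   # assumption: four hidden layers + output = five
N_CLASSES = 30               # vocabulary size reported in the paper


def init_layer(n_in, n_out, rng):
    """Create random weights and zero biases for one dense layer."""
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, biases):
    """Compute W @ x + b for one fully connected layer."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_network(rng):
    """Build the list of (weights, biases) pairs, one per layer."""
    sizes = [N_FEATURES] + HIDDEN + [N_CLASSES]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict(network, features):
    """Return (index of the most probable class, class probabilities)."""
    x = list(features)
    for weights, biases in network[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = network[-1]
    probs = softmax(dense(x, weights, biases))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


if __name__ == "__main__":
    rng = random.Random(0)
    net = build_network(rng)
    dummy_features = [rng.random() for _ in range(N_FEATURES)]
    word_index, probabilities = predict(net, dummy_features)
    print(word_index, round(sum(probabilities), 6))
```

Because the weights here are untrained, the predicted index is arbitrary; the sketch only shows how a feature vector is turned into a probability distribution over the 30 words.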

Drone operation application

A control program was implemented in Python. The PC receiving data from the device was linked to a drone (Tello, Ryze Technology) via Wi-Fi. During silent speech articulation with the device placed at the mouth, recognized words were transmitted via HTTP to control the drone (Fig. 5c). Recognized words were displayed on a web browser, and the drone executed movements within the room based on the recognized commands (Figs. 5d and e).
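The mapping from recognized words to drone actions can be sketched as follows. This is an illustrative example rather than the control program used in the experiments: it assumes that the four words map to Tello SDK text commands (`takeoff`, `forward`, `cw`, `land`), and the 50 cm distance, the 90° angle, and the function names are hypothetical. The HTTP relay and browser display described above are not reproduced here.

```python
# Assumed mapping from recognized words to Tello SDK text commands.
# The distance (cm) and angle (degrees) are illustrative values only.
COMMAND_MAP = {
    "start": "takeoff",
    "move": "forward 50",
    "turn": "cw 90",
    "stop": "land",
}


def to_drone_command(word):
    """Translate a recognized word into a command string, or None if unknown."""
    return COMMAND_MAP.get(word.strip().lower())


def dispatch(word, send):
    """Send the command for `word` through `send`; return True if one was sent.

    `send` is any callable that accepts the encoded command as bytes, so the
    transport (UDP socket, HTTP client, or a test stub) stays replaceable.
    """
    command = to_drone_command(word)
    if command is None:
        return False  # ignore words outside the command vocabulary
    send(command.encode("ascii"))
    return True


if __name__ == "__main__":
    # Dry run: print the commands instead of transmitting them.
    for recognized in ("Start", "Move", "Turn", "Stop"):
        dispatch(recognized, lambda payload: print(payload.decode("ascii")))
```

In a live setup, `send` could wrap a UDP socket addressed to the drone (the Tello SDK documents 192.168.10.1, port 8889, with an initial `command` message to enter SDK mode); that transport choice is likewise an assumption here.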

Ethical approval and participant consent

All participants gave informed consent after receiving a complete explanation of the study’s purpose and procedures.

The research protocol was approved by the Ethics Committee of the Yokohama National University Graduate School of Engineering Science (No. 2020-16, approved on February 12, 2021).

Electronic Supplementary Material

Below is the link to the electronic supplementary material

Supplementary Material 1

Supplementary Material 2

References

1. Denby, B. et al. Silent speech interfaces. Speech Commun. 52, 270–287 (2010).

2. Kimura, N., Hayashi, K. & Rekimoto, J. TieLent: a casual neck-mounted mouth capturing device for silent speech interaction. Proceedings of the 2020 International Conference on Advanced Visual Interfaces. 33, 1–8 (2020).

3. Sun, K. et al. Lip-Interact: improving mobile device interaction with silent speech commands. UIST '18: Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology. 581–593 (2018).

4. Wang, X., Su, Z., Rekimoto, J. & Zhang, Y. Watch your mouth: silent speech recognition with depth sensing. CHI '24: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 323, 1–15 (2024).

5. Hueber, T., Benaroya, E.-L., Denby, B. & Chollet, G. Statistical mapping between articulatory and acoustic data for an ultrasound-based silent speech interface. Proc. Interspeech 2011. 593–596 (2011).

6. Kimura, N., Kono, M. & Rekimoto, J. SottoVoce: an ultrasound imaging-based silent speech interaction using deep neural networks. CHI '19: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 146, 1–11 (2019).

7. Hueber, T. et al. Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips. Speech Commun. 52, 288–300 (2010).

8. Brigham, K. & Kumar, B. V. K. V. Imagined Speech Classification with EEG Signals for Silent Communication: A Preliminary Investigation into Synthetic Telepathy. 2010 4th International Conference on Bioinformatics and Biomedical Engineering. 1–10 (2010).

9. Zhang, D. et al. Making Sense of Spatio-Temporal Preserving Representations for EEG-Based Human Intention Recognition. IEEE Trans. Cybern. 50, 3033–3044 (2019).

10. Vorontsova, D. et al. Silent EEG-speech recognition using convolutional and recurrent neural network with 85% accuracy of 9 words classification. Sensors 21, 6744 (2021).

11. Igarashi, Y., Futami, K. & Murao, K. Silent speech eyewear interface: silent speech recognition method using eyewear and an ear-mounted microphone with infrared distance sensors. Sensors 24, 7368 (2024).

12. Rekimoto, J. & Nishimura, Y. Derma: Silent Speech Interaction Using Transcutaneous Motion Sensing. AHs '21: Proceedings of the Augmented Humans International Conference 2021. 91–100 (2021).

13. Zhang, L. et al. Fully organic compliant dry electrodes self-adhesive to skin for long-term motion-robust epidermal biopotential monitoring. Nat. Commun. 11, 4683 (2020).

14. Kim, T. et al. Ultrathin crystalline-silicon-based strain gauges with deep learning algorithms for silent speech interfaces. Nat. Commun. 13, 5815 (2022).

15. Lu, Y. et al. Decoding lip language using triboelectric sensors with deep learning. Nat. Commun. 13, 1401 (2022).

16. Liu, Y., De Mulatier, S. & Matsuhisa, N. Unperceivable designs of wearable electronics. Adv. Mater. 2502727; 10.1002/adma.202502727 (2025).

17. You, C.-W. et al. Understanding social perceptions towards interacting with on-skin interfaces in public. ISWC '19: Proceedings of the 2019 ACM International Symposium on Wearable Computers. 244–253 (2019).

18. Yang, H., Yu, J., Zo, H. & Choi, M. User acceptance of wearable devices: an extended perspective of perceived value. Telemat. Inform. 33, 256–269 (2016).

19. Arias, O., Wurm, J., Hoang, K. & Jin, Y. Privacy and security in internet of things and wearable devices. IEEE Trans. Multi-scale Comput. Syst. 1, 99–109 (2015).

20. Doherty, C. et al. Privacy in consumer wearable technologies: a living systematic analysis of data policies across leading manufacturers. npj Digit. Med. 8, 363 (2025).

21. Sivakumar, C. L. V., Mone, V. & Abdumukhtor, R. Addressing privacy concerns with wearable health monitoring technology. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 14, e1535; 10.1002/widm.1535 (2024).

22. Isano, Y. et al. Soft intelligent systems based on stretchable hybrid devices integrated with machine learning. Device 2, 100496 (2024).

23. Searle, A. & Kirkup, L. A direct comparison of wet, dry and insulating bioelectric recording electrodes. Physiol. Meas. 21, 271–283 (2000).

24. Bai, Y. et al. Liquid metal flexible EMG gel electrodes for gesture recognition. Biosensors 13, 692 (2023).

25. Singh, O. P. et al. Flexible, conductive fabric-backed, microneedle electrodes for electrophysiological monitoring. Adv. Mater. Technol. 9, 2301606; 10.1002/admt.202301606 (2023).

26. Kim, H. et al. Skin preparation-free, stretchable microneedle adhesive patches for reliable electrophysiological sensing and exoskeleton robot control. Sci. Adv. 10, eadk5260; 10.1126/sciadv.adk5260 (2024).

27. Roy, S. H. et al. Electro-mechanical stability of surface EMG sensors. Med. Biol. Eng. Comput. 45, 447–457 (2007).

28. Kateb, P. et al. Printable, adhesive, and self-healing dry epidermal electrodes based on PEDOT:PSS and polyurethane diol. Flex. Print. Electron. 8, 045006 (2023).

29. Hossain, M. M. et al. Adhesive free, conformable and washable carbon nanotube fabric electrodes for biosensing. npj Flex. Electron. 6, 1–9 (2022).

30. Jung, H.-C. et al. CNT/PDMS composite flexible dry electrodes for long-term ECG monitoring. IEEE Trans. Biomed. Eng. 59, 1472–1479 (2012).

31. Togo, S., Murai, Y., Jiang, Y. & Yokoi, H. Development of an sEMG sensor composed of two-layered conductive silicone with different carbon concentrations. Sci. Rep. 9, 13996 (2019).

32. Zhang, D. et al. Stretchable and durable HD-sEMG electrodes for accurate recognition of swallowing activities on complex epidermal surfaces. Microsyst. Nanoeng. 9, 115 (2023).

33. Zhao, R. et al. Mechanical tough, stretchable, and adhesive PEDOT:PSS-based hydrogel flexible electronics towards multi-modal wearable application. Chem. Eng. J. 510, 161645 (2025).

34. Lopes, P. A. et al. Soft bioelectronic stickers: selection and evaluation of skin-interfacing electrodes. Adv. Healthc. Mater. 8, e1900234 (2019).

35. Gandhi, N. et al. Properties of dry and non-contact electrodes for wearable physiological sensors. Proc. Int. Conf. Body Sensor Netw. 107–112 (2011).

36. Marur, T., Tuna, Y. & Demirci, S. Facial anatomy. Clin. Dermatol. 32, 14–23 (2014).

37. Stepp, C. E. Surface electromyography for speech and swallowing systems: measurement, analysis, and interpretation. J. Speech Lang. Hear. Res. 55, 1232–1246 (2012).

38. Gick, B. et al. Quantal biomechanical effects in speech postures of the lips. J. Neurophysiol. 124, 833–843 (2020).

39. Meltzner, G. S. et al. Development of sEMG sensors and algorithms for silent speech recognition. J. Neural Eng. 15, 046031 (2018).

40. Meltzner, G. S. et al. Silent speech recognition as an alternative communication device for persons with laryngectomy. IEEE/ACM Trans. Audio Speech Lang. Process. 25, 2386–2398 (2017).

41. Wu, J. et al. A novel silent speech recognition approach based on parallel inception convolutional neural network and Mel frequency spectral coefficient. Front. Neurorobot. 16, 971446 (2022).

42. Gulati, A. et al. Conformer: convolution-augmented Transformer for speech recognition. Proc. Interspeech 2020. 5036–5040 (2020).

43. Song, R. et al. Decoding silent speech from high-density surface electromyographic data using transformer. Biomed. Signal Process. Control 80, 104298 (2023).

44. Tang, C. et al. Ultrasensitive textile strain sensors redefine wearable silent speech interfaces with high machine learning efficiency. npj Flex. Electron. 8, 1–11 (2024).

45. Chen, X., Bi, H., Lai, W.-T. & Ma, F. Monaural speech enhancement on drone via adapter based transfer learning. Proc. 18th Int. Workshop Acoustic Signal Enhanc. 85–89 (2024).

46. Ming, J., Hazen, T. J., Glass, J. R. & Reynolds, D. A. Robust speaker recognition in noisy conditions. IEEE Trans. Audio Speech Lang. Process. 15, 1711–1723 (2007).

47. Dong, P. et al. Decoding silent speech commands from articulatory movements through soft magnetic skin and machine learning. Mater. Horiz. 10, 5607–5620 (2023).

48. Liu, H. et al. An epidermal sEMG tattoo-like patch as a new human–machine interface for patients with loss of voice. Microsyst. Nanoeng. 6, 16 (2020).

49. Wang, Y. et al. All-weather, natural silent speech recognition via machine-learning-assisted tattoo-like electronics. npj Flex. Electron. 5, 1–9 (2021).

50. Che, Z. et al. Speaking without vocal folds using a machine-learning-assisted wearable sensing-actuation system. Nat. Commun. 15, 1873 (2024).

51. Srivastava, T. et al. MuteIt: jaw motion based unvoiced command recognition using earable. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 6, 1–26 (2022).

52. Manabe, H., Hiraiwa, A. & Sugimura, T. Unvoiced speech recognition using EMG – mime speech recognition. CHI EA '03: CHI '03 Extended Abstracts on Human Factors in Computing Systems. 794–795 (2003).

53. Nagashima, Y. A platform for biological signal information processing. IPSJ SIG Tech. Rep. Entertainment Comput. 14, 1–6 (2015).

Author Contribution

YK conceived the study and was responsible for the overall design, device implementation, and experimental validation. SY contributed to the device fabrication processes. RY performed the machine learning analyses. YI (Yutaka Isoda) and TT contributed to the circuit design. YI (Yuji Isano) contributed to experimental planning for device fabrication. YM and KK contributed to machine learning analyses and experimental planning. HO supervised the project and provided overall guidance. All authors read and approved the final manuscript.

Acknowledgement

This work was supported by JST AIP Acceleration Research (JPMJCR22U2), Japan, JSPS KAKENHI Grant Number JP24H00296, and MEXT KAKENHI Grant Number JP24H00887. The funding bodies provided financial support but had no role in the design of the study, the collection, analysis, or interpretation of data, or in writing the manuscript.

Competing interests

All authors declare no financial or non-financial competing interests.

Data Availability

The datasets generated and/or analysed during the current study will be deposited in a publicly accessible repository with a persistent URL. Access details will be provided upon publication.
