Yuta Kurotaki1,2, Shunsuke Yamakoshi1, Reitaro Yoshida2, Yutaka Isoda1, Tamami Takano1, Yuji Isano1, Yusuke Miyake2, Kentaro Kuribayashi2, Hiroki Ota1*
1Department of Mechanical Engineering, Yokohama National University, 79-5 Tokiwadai, Hodogaya-Ku, Yokohama, Kanagawa 240-8501, Japan
2Pepabo Research and Development Institute, GMO Pepabo, Inc., 26-1 Sakuragaoka, Shibuya Ward, Tokyo 150-8512, Japan
*Corresponding author: Hiroki Ota, ota-hiroki-xm@ynu.ac.jp
Abstract
Silent speech recognition (SSR) provides an alternative communication pathway where spoken sound is absent. However, conventional approaches are limited by the requirement of constant facial attachment, privacy concerns, and unstable signal acquisition. Herein, we propose a soft active electromyography (EMG) interface that enables word-level SSR via machine learning. Worn on the hand, the device uses a fingertip electrode that can be placed near the lips to acquire EMG signals only when desired. The interface integrates liquid-metal (LM) interconnects, transparent flexible printed circuit (FPC) electrodes, and elastomer encapsulation to ensure high mechanical stability during finger motion. A deep neural network, trained on these stable signals, achieved 94.3% accuracy in classifying a 30-word vocabulary, demonstrating robust linguistic discrimination. Furthermore, real-time drone control validated the practicality of this approach in noisy and privacy-sensitive environments where conventional voice recognition fails. This study highlights the potential of soft, wearable EMG systems as secure, intuitive human–machine interfaces.
Introduction
SSR has attracted attention as a human–machine interface (HMI) technology that conveys human intent to machines without audible speech1. It offers a new communication pathway in noisy environments and situations requiring confidentiality, as well as for individuals with speech impairments. In the future, it is expected to serve as a foundation for intuitive and secure interactions that transcend dependence on audible speech.
To realize SSR, conventional approaches have explored optical lip-reading2,3,4, ultrasonic detection of tongue and lip motion5,6,7, electroencephalography (EEG)-based language estimation8,9,10, and infrared sensors11. However, optical and ultrasonic methods are constrained by environmental conditions, privacy concerns, and device size. Brain–machine interfaces (BMIs) often capture abundant non-speech information, reducing controllability, while accelerometer/gyroscope sensors12 impose a significant burden due to their large size. In other words, conventional methods have not simultaneously achieved high-precision signal acquisition and the wearability and social acceptability required for daily use.
Despite the development of multiple SSR approaches, there remains a pressing demand for devices that enable daily use of silent speech. To achieve this, it is essential to improve wearability around the face and head while providing a system that minimizes user burden yet ensures sufficient recognition accuracy.
As a possible solution to the aforementioned problem, flexible devices employing soft materials have attracted attention. Reported examples include surface EMG (sEMG) sensors based on PEDOT:PSS/polyurethane composites13, strain sensors comprising metal thin films and polymer films14, and triboelectric sensors made of PVC/nylon/sponge15, all of which provide low invasiveness and skin compatibility. Accordingly, SSR devices based on soft materials are promising for daily use.
However, previous soft-material-based methods require sensors to be permanently attached to the face or throat, which compromises user comfort and affects aesthetics and social acceptance16,17,18. More critically, such continuously worn devices may acquire biosignals without the user’s intent, leading to risks of privacy invasion and security concerns19. Thus, while continuous sensing has often been considered an advantage for wearables, it becomes a bottleneck in the context of SSR.
To address challenges in aesthetics, privacy, and security, SSR requires not only flexibility and wearability but also a novel operation modality that functions on-demand under user control. This concept is relevant not only to SSR but also to smartphones that exploit location histories and healthcare devices that process biometric information—both contexts in which the handling of sensitive personal data requires strict consideration20,21.
In this study, we developed a wearable soft active electromyography (EMG) interface that operates based on intentional user input. The device is worn on the hand, and by placing a fingertip electrode near the lips, EMG signals are acquired only when necessary. Structurally, the system integrates a flexible elastomer (Ecoflex), a transparent FPC substrate, and highly conductive yet stretchable LM interconnects, enabling stable electrical connectivity even under repeated complex finger movements. This high mechanical stability ensures consistent and reproducible EMG signals, supporting accurate speech recognition through machine-learning algorithms. Moreover, by eliminating the need for continuous facial attachment, the device gives users control over when signals are acquired, simultaneously ensuring privacy and security. Furthermore, we demonstrated its utility and applicability through SSR and drone operation in noisy environments. The developed soft EMG interface can simultaneously realize machine-learning-compatible stable signal acquisition and user-driven on-demand operation, establishing a new paradigm for practical SSR.
Results
Design for Intent-Driven SSR
The developed wearable smart silent speech interface, mounted on the hand, enables language recognition and machine control through mouth movements alone, without audible speech. By placing the fingertip near the mouth only when intended (Fig. 1a), the exposed electrode on the fingertip pad makes contact with the area around the lips, enabling the acquisition of EMG signals from the perioral muscles. These signals are then processed by machine-learning algorithms to enable character recognition and control of smart devices such as drones (Fig. 1b).
In contrast to devices that require sensors to be continuously attached to the face, this device operates only when the user intentionally touches the mouth area with the fingertip, performing language recognition or machine input only at that moment. As the device is physically separated from the perioral skin except during input, no silent speech is sensed outside of intended usage, and a third party cannot trigger sensing externally. This addresses privacy and security concerns. Moreover, it mitigates aesthetic issues for users in cultural contexts where attaching sensors to the face or throat is undesirable. In addition, as a feature of silent speech, words can be recognized in situations where one wishes to remain quiet and avoid being overheard. No audible sound is used; hence, device operation remains unaffected even in noisy environments.
The EMG electrode attached to the fingertip is composed of a transparent FPC substrate and Ecoflex, with the electrode arranged on the inner surface of a tubular fingertip structure. An EMG amplification circuit is mounted on the dorsum of the hand, connected to the fingertip electrode via LM interconnects. As the finger is structurally complex and highly stretchable, LM wiring is used; this wiring offers superior electrical conductivity, minimal resistance fluctuation under deformation, and stable resistance under repeated strain compared with conventional silver paste, PEDOT:PSS, and carbon-based conductors. By employing LM wiring between the fingertip and the back of the hand, EMG signals can be stably transmitted to the circuit even during repeated finger bending. This technical feature is crucial when incorporating machine learning, which requires large amounts of training data, with stretchable devices22 (Fig. 1c).
An example of device usage is shown in Fig. 1d. The device is worn on the hand, the fingertip is placed near four muscles surrounding the lips, and EMG signals are acquired from these perioral muscles. From these signal patterns, silent speech is recognized via machine learning.
Device Design and Characterization
An active EMG interface was fabricated that combines high skin conformability with mechanical flexibility. The device body employs Ecoflex—a soft elastomer material—as the packaging layer, forming a structure that can be mounted on the dorsum of the hand (Fig. 2a). An EMG circuit for amplification and signal processing (Supplementary Fig. 1) is implemented on the Ecoflex substrate, while surface electrodes (EMG electrodes) for direct skin contact are placed on the inner side of the fingertip (Supplementary Fig. 2).
The EMG electrodes were fabricated as a bipolar pair on a transparent FPC substrate. The reference electrode was placed on the circuit side on the back of the hand. The inter-electrode distance was set to 5 mm to enable EMG measurement within the limited fingertip area. The electrodes were soldered onto copper foils on the transparent FPC substrate to facilitate skin contact. They were adhered to the Ecoflex substrate using Sil-Poxy and then encapsulated with Ecoflex. The EMG electrodes and circuit were interconnected using LM (Supplementary Fig. 3). The device was designed for repeated use and removability; tensile testing of the fabricated electrodes confirmed that the resistance of the electrode–circuit interconnection via LM remained stable after repeated use (Supplementary Fig. 4). Finally, the entire structure was encapsulated by spray-coating Ecoflex. Owing to its planar structure, the device can be folded around the fingertip, positioning the EMG electrodes on the finger pad to form the interface (Fig. 2b).
The interface was designed to conform closely to the fingers, with fingertip electrodes functioning as dry electrodes. When placed on the perioral muscles, these electrodes detect weak EMG signals associated with speech articulation with high sensitivity. The acquired EMG signals are amplified within the EMG circuit, passed through low-pass and high-pass filters to remove noise, subjected to analog-to-digital conversion (ADC), and then transmitted to an external computing device for real-time processing (Fig. 2c).
The fingertip electrodes were positioned anatomically to correspond to four speech-related muscles: the buccinator, orbicularis oris, depressor labii inferioris, and mentalis. The buccinator stabilizes the cheeks during plosive and fricative production. The orbicularis oris is responsible for rounding and protrusion of the lips, while the depressor labii inferioris moves the lower lip, and the mentalis acts in coordination with movements that retract the mouth corners. By aligning the electrodes with these muscles, EMG activity associated with speech articulation could be effectively recorded (Fig. 3a).
An example EMG waveform of the mentalis muscle is shown for a single channel during speech and rest (Fig. 3b). The signal indicates EMG activity during 1 s of silent speech, followed by 1 s of rest and another 1 s of silent speech. In the amplification circuit used in this study, muscle activity is captured within the range 0–3.3 V, centered at 1.6 V. This baseline facilitates analog-to-digital conversion for preprocessing EMG signals as inputs to machine-learning models.
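As a rough illustration of how a 0–3.3 V signal resting at 1.6 V might be prepared after digitization, the sketch below converts raw ADC counts to volts and subtracts the baseline. The 12-bit resolution, the function names, and the sample readings are assumptions made for this example; the text does not state the ADC resolution.

```python
ADC_BITS = 12        # assumed resolution; not stated in the text
V_SUPPLY = 3.3       # upper end of the measurement range (V)
V_BASELINE = 1.6     # resting level of the amplified signal (V)


def counts_to_volts(count, bits=ADC_BITS, v_supply=V_SUPPLY):
    """Convert a raw ADC count to volts over the 0-3.3 V input range."""
    full_scale = (1 << bits) - 1
    if not 0 <= count <= full_scale:
        raise ValueError("ADC count out of range")
    return count * v_supply / full_scale


def center(samples_v, baseline=V_BASELINE):
    """Subtract the resting baseline so that rest sits near 0 V."""
    return [v - baseline for v in samples_v]


raw = [0, 1985, 4095]                      # hypothetical ADC readings
volts = [counts_to_volts(c) for c in raw]
centered = center(volts)                   # negative below rest, positive above
```

Centering around zero in this way is a common first step before framing and spectral analysis, since a constant offset would otherwise dominate the lowest frequency bin.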
Using this device, we conducted experiments to measure muscle activity during silent speech. Figure 3c presents four-channel EMG signals recorded during silent articulation of the words “Move” and “Turn,” which were selected from a 30-word vocabulary. The signals correspond to the buccinator, orbicularis oris, depressor labii inferioris, and mentalis muscles, reflecting localized muscle activity associated with articulation. Distinct EMG patterns were observed between Move and Turn, with differences corresponding to coordinated movements of the respective muscles. These results suggest that the proposed device can capture input signals sufficient to distinguish between intended words, thereby functioning as an effective silent speech interface.
Implementation of Machine Learning-Integrated Silent Speech Interface
Using the high-conformability EMG device described in the previous section, we constructed a machine-learning model for word recognition during silent speech and achieved classification of voiceless articulation. During silent speech, fingertip electrodes were placed near four perioral muscles, and EMG signals were acquired at a sampling rate of 1 kHz (1-ms intervals). From the four-channel EMG signals, Mel-frequency cepstral coefficients (MFCCs) were computed and used as inputs to a deep neural network (DNN) trained on these EMG-derived MFCC features (Fig. 4a). The architecture of the DNN (Supplementary Fig. 5) comprised multiple fully connected (Dense) layers following the input layer, with dropout and batch normalization applied to each layer. This design stabilized training and suppressed overfitting.
The EMG signals acquired from the device were preprocessed via framing into fixed time windows; a fast Fourier transform (FFT) was then applied to each frame. The resulting frequency components were processed through a Mel filter bank and subjected to a discrete cosine transform (DCT) to obtain MFCCs. The first- through sixth-order MFCCs were extracted as features and used as DNN inputs (Fig. 4b).
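The framing, FFT, Mel filter bank, and DCT steps can be sketched for a single channel as follows. This is a dependency-free illustration rather than the authors' implementation: the frame length, hop size, number of Mel filters, Hann window, and the choice to skip the zeroth coefficient are all assumptions, and a direct DFT stands in for an FFT because the frames are short. Only the 1 kHz sampling rate and the six coefficients come from the text.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """One-sided power spectrum of a Hann-windowed frame (direct DFT)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        spec.append(abs(s) ** 2 / n)
    return spec


def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters equally spaced on the Mel scale from 0 Hz to fs/2."""
    top = hz_to_mel(fs / 2)
    mel_pts = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    edges = [mel_to_hz(m) * n_fft / fs for m in mel_pts]  # fractional bin positions
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(signal, fs=1000, frame_len=64, hop=32, n_filters=12, n_coeffs=6):
    """Return one list of n_coeffs MFCCs per frame."""
    bank = mel_filterbank(n_filters, frame_len, fs)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        p = power_spectrum(signal[start:start + frame_len])
        log_e = [math.log(sum(w * v for w, v in zip(row, p)) + 1e-10)
                 for row in bank]
        # DCT-II of the log Mel energies; orders 1..n_coeffs are kept (assumption)
        features.append([
            sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, e in enumerate(log_e))
            for c in range(1, n_coeffs + 1)
        ])
    return features


# Synthetic 1 s test tone at 80 Hz, sampled at 1 kHz (not real EMG data)
signal = [math.sin(2 * math.pi * 80 * t / 1000) for t in range(1000)]
features = mfcc(signal)
print(len(features), len(features[0]))
```

In practice an FFT routine from a numerical library would replace the direct DFT; the structure of the pipeline stays the same.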
Model performance was maximized by introducing optimization techniques step-by-step and verifying their effects. Ultimately, the combination of input-channel selection, data splitting, cross-validation, callbacks, hyperparameter tuning, and normalization achieved the highest accuracy of 94.3% (Fig. 4c).
We examined whether adjusting the number of input channels allowed the model to learn features more effectively. The performance with three input channels is shown in Supplementary Fig. 6. The four-channel configuration achieved the best results overall. The generalization performance was fairly evaluated by randomly dividing the dataset into 1,500 training samples and 300 test samples from a total of 1,800 samples. Furthermore, five-fold cross-validation was applied to enhance the reliability of model evaluation. By repeatedly using the entire dataset for both training and testing, variance due to specific splits was reduced, enabling more robust estimation of model performance. For classification results, t-distributed stochastic neighbor embedding (t-SNE) was applied to visualize the feature space, confirming that separability between words was improved after application of the machine-learning model (Figs. 4d and e). These results demonstrate that the proposed approach is effective for EMG-based silent speech interfaces and highlight its practicality as a non-audible communication method.
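The data split and five-fold cross-validation described above can be illustrated with index bookkeeping alone. The sketch below is an assumption-laden example: the function names and the random seed are hypothetical, and how the hold-out split and the cross-validation were combined in the study is not specified, so both are simply shown over the 1,800 sample indices.

```python
import random


def hold_out_split(indices, n_test, seed=0):
    """Randomly set aside n_test indices for testing; the rest are for training."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    return idx[n_test:], idx[:n_test]


def k_fold(indices, k=5, seed=0):
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train, val


all_idx = range(1800)
train_idx, test_idx = hold_out_split(all_idx, n_test=300)   # 1,500 / 300
for fold, (tr, va) in enumerate(k_fold(all_idx, k=5)):
    print(fold, len(tr), len(va))
```

Because every sample appears in a validation fold exactly once, averaging the per-fold accuracies reduces the dependence of the estimate on any single split.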
Application to Human–Computer Interaction: Demonstration via Drone Control
To evaluate the applicability of the developed high-conformability EMG device, we conducted a proof-of-concept experiment in which silent speech was used for real-time word classification and subsequent human–computer interaction through drone control.
Participants wore the device and articulated operation commands without producing audible speech. The corresponding EMG signals were analyzed and classified in real time, enabling recognition of drone control commands (Fig. 5a). The device was connected to a microcontroller, while the trained model was executed on a PC to classify silent speech in real time.
Six-dimensional MFCCs were computed as features for each of the four EMG channels on which silently spoken drone commands were recorded (Fig. 5b). MFCCs, which are designed to emulate the human auditory system, were calculated by segmenting the target signal into short time windows and performing the short-time Fourier transform (STFT) to obtain the power spectrum. A Mel filter bank was then applied to generate the Mel-frequency spectrum. Subsequently, a DCT was performed on the logarithmic Mel spectrum. The six lowest-order MFCCs were extracted from each channel and used as input features.
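In symbols, this chain can be written as below. The Mel mapping and the DCT-II form are the common textbook conventions and are assumed here, since the text does not state which variants were used. The notation is introduced only for illustration: X_t(k) is the STFT of frame t at frequency bin k, H_m(k) is the m-th triangular Mel filter, and M is the number of filters. Indexing the retained coefficients from n = 1 is likewise an assumption.

```latex
\begin{aligned}
P_t(k) &= \lvert X_t(k) \rvert^{2}, \\
\mathrm{mel}(f) &= 2595 \log_{10}\!\left(1 + \frac{f}{700}\right), \\
E_t(m) &= \sum_{k} H_m(k)\, P_t(k), \qquad m = 1, \dots, M, \\
c_t(n) &= \sum_{m=1}^{M} \log E_t(m)\, \cos\!\left[\frac{\pi n}{M}\left(m - \tfrac{1}{2}\right)\right], \qquad n = 1, \dots, 6 .
\end{aligned}
```

If the per-channel coefficients are concatenated, each frame yields 4 × 6 = 24 features.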
Using these features, we implemented a simulation in which four silent speech commands—Start, Move, Turn, and Stop—controlled a drone’s flight sequence. To verify the feasibility of drone control via silent speech, we conducted a flight simulation using a small quadcopter (Tello), with the system classifying EMG-based MFCC features and transmitting commands over Wi-Fi (Fig. 5c). The drone executed a full sequence of operations—from takeoff to circling and landing—using only these four silent speech commands. Recognition of Start triggered autonomous takeoff from the desk and a transition to stable hovering. Once the drone was airborne, recognition of Move prompted it to fly straight over a preset distance, effectively replacing the forward and backward control of the Tello app’s virtual joystick. Recognition of Turn caused the drone to rotate by 45° either clockwise or counterclockwise, emulating joystick-based yaw control. Finally, recognition of Stop after the flight sequence instructed the drone to land automatically. All commands were classified in real time from EMG-derived MFCC features and transmitted to the drone via Wi-Fi (Fig. 5d).
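The mapping from recognized words to drone actions can be sketched as follows, assuming the drone is driven through the plain-text UDP commands of the Tello SDK (`command`, `takeoff`, `forward`, `cw`, `land`). The text does not say how the commands were sent, so this interface, the 50 cm travel distance, and the clockwise direction are placeholders chosen for the example, as are the function names.

```python
import socket

TELLO_ADDR = ("192.168.10.1", 8889)   # default address in the Tello SDK

# Hypothetical mapping; distance (cm) and turn direction are placeholders.
COMMAND_MAP = {
    "Start": "takeoff",
    "Move": "forward 50",
    "Turn": "cw 45",
    "Stop": "land",
}


def to_sdk_command(word):
    """Translate a recognized word into a Tello SDK text command."""
    try:
        return COMMAND_MAP[word]
    except KeyError:
        raise ValueError(f"unrecognized command word: {word!r}") from None


def send(sock, text, addr=TELLO_ADDR):
    """Send one SDK command as a UDP datagram."""
    sock.sendto(text.encode("ascii"), addr)


def fly(words, sock):
    """Enter SDK mode, then forward each recognized word as a command."""
    send(sock, "command")
    for word in words:
        send(sock, to_sdk_command(word))


# With a drone connected over Wi-Fi, usage would look like:
# sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# fly(["Start", "Move", "Turn", "Stop"], sock)
```

A practical controller would also wait for the drone's reply to each command before sending the next one; that handshake is omitted here for brevity.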
As a result, complex drone maneuvers, including takeoff, translation, turning, and landing, were accurately executed using only EMG-based command inputs through silent speech, without producing audible sound (Fig. 5e). These findings demonstrate that the proposed device functions as a silent speech interface with both real-time capability and high classification accuracy. Moreover, the results highlight its utility as an intuitive, nonverbal method for interaction with machines and robots, particularly in environments where speech is impractical or confidentiality is required. A demonstration of this operation is shown in Supplementary Video 1.
Discussion
In EMG measurements for SSR, particularly with daily use in mind, electrode portability and reusability have been critical challenges. To address this, we developed EMG measurement electrodes and amplification circuits designed to prioritize portability and reusability. Both wet and dry electrodes are available for use23 to acquire EMG signals from the skin around the mouth; an appropriate type has to be selected depending on the purpose. Gel-based electrodes24 and microneedle electrodes25,26 exhibit high adhesion. Although gel electrodes enable secure fixation, they require attachment of gel to the skin, which is unsuitable for a device designed for repeated attachment and removal. Meanwhile, microneedle electrodes are invasive and therefore deemed unsuitable for daily use, which would involve repeated skin stimulation. Hence, we adopted an active EMG mechanism to acquire EMG signals noninvasively from the skin surface.
Active EMG enables noninvasive measurement without gels. Typically, such electrodes require fixation using adhesive tape or bands; insufficient fixation increases susceptibility to noise, rendering stable EMG acquisition difficult27. Electrodes based on PEDOT:PSS13,28, carbon nanotubes or carbon black29,30,31, and silver paste32 have been reported. Although reusable PEDOT:PSS electrodes have been investigated33, their resistance fluctuates with strain and temperature. Thus, they were not suitable for this study, where the electrodes had to withstand pressure during manual fixation near the face and remain stable despite environmental variations when worn on the palm. Regarding carbon and silver, prior work compared cPDMS electrodes made of carbon-PDMS, AgPDMS electrodes made of silver-PDMS, and stainless-steel electrodes34. The material characteristics of dry electrodes have also been evaluated35. Considering these results, we adopted solder-coated electrodes that connect easily to copper foil on transparent FPC substrates and exhibit minimal resistance changes due to strain.
Ecoflex was applied around the electrode’s contact area with the skin to prevent slippage during fingertip-based manual fixation against the perioral muscles. In this study, by combining Ecoflex as a flexible base material with transparent FPC electrodes and employing LM wiring that followed finger motion, we designed a system that enabled users to intentionally press electrodes against the skin for stable placement on target muscles (Fig. 2b). This structure eliminated the need for adhesives or tape, while still achieving stable fixation by the user and overcoming the limitations of conventional dry electrodes. Thus, the system combined ease of use with reusability. The EMG amplification circuit incorporated both low-pass and high-pass filters, enabling extraction and amplification of weak EMG signals while removing environmental noise and unwanted frequency components (Fig. 2c).
Although the fingertip electrodes employ solder-coated metallic contacts, which exhibit conventional dry-electrode characteristics, the overall interface integrates soft-elastomer encapsulation and LM interconnects. Thus, while the electrode surface is not intrinsically soft, the device as a whole provides high mechanical compliance and wearability, justifying the designation of a “soft active EMG interface.”
For silent speech classification using machine learning, electrodes were placed on the perioral region without skin pretreatment.
Participants repeated a cycle of 1 s of rest followed by 1 s of silent speech, during which EMG signals were acquired. In a single-channel measurement of the mentalis muscle, although noise was present, voltage increases corresponding to muscle activation during speech movements were observed (Fig. 3b). During articulation of Move (characterized by lip protrusion), EMG signals were recorded in channel 3 (depressor labii inferioris) and channel 4 (mentalis). These muscles are closely involved in lip protrusion and closure36. The depressor labii inferioris lowers the lower lip, while the mentalis protrudes it forward. Detecting their activity during the lip-pursing motion of Move supports the anatomical validity of the measurement. Similarly, during silent articulation of Turn, EMG signals were recorded for approximately 0.75 s in channel 2 (orbicularis oris) and channel 4 (mentalis) as the lips closed. The orbicularis oris functions in closing and rounding the lips, while the mentalis raises the lower lip and wrinkles the chin, moving the lower lip upward and forward37. Therefore, the results indicate that relevant muscle activity during silent speech was successfully captured.
These findings demonstrate that distinctive features of words can be detected during silent speech. Articulatory gestures such as bilabial closure (plosives), constriction (labiodental fricatives), and rounding (approximants) are widely observed across languages. According to the UCLA Phonological Segment Inventory Database, 446 of 451 languages (99%) include bilabial plosives; many languages also feature labiodental fricatives and rounded approximants38. This suggests that the results of the present study may be extended beyond English to other languages (Fig. 3c).
By integrating the EMG signals acquired using the proposed device into a machine-learning model, the word classification accuracy for silent speech was increased to 94.3% (Fig. 4c). This result suggests that speech can be sufficiently distinguished based on the activity of four perioral muscles. In contrast to previous studies that directly processed EMG signals, our classification model applied a feature transformation into MFCCs, a representation widely used in speech recognition, and used these as training data (Figs. 4a and b, Supplementary Fig. 7). This approach has proven highly effective in EMG-based SSR research39,40,41 and similarly produced a significant improvement in classification accuracy in this study. The use of MFCCs with convolutional neural networks (CNNs) has been reported to achieve higher accuracy than classifiers such as support vector machines (SVMs)41, and we adopted this perspective in selecting the model. We further attempted classification using a Conformer model42 based on Transformer architectures42,43; however, this achieved an accuracy of only 90.6%. Ultimately, the MFCC + DNN model achieved the highest accuracy of 94.3% (Supplementary Fig. 8). This indicates that under the condition of limited data in our experiments, the simpler DNN had a better generalization capability than the more complex Conformer model.
The device is designed to be worn on the hand, with electrodes placed on the fingertips to measure perioral muscle activity. Owing to anatomical constraints of the hand, the maximum number of electrodes is five. In this study, we evaluated four electrodes placed from the thumb to the ring finger, which achieved the high classification accuracy of 94.3%. This demonstrates that high-performance recognition can be realized without adhesives and with a limited number of channels. Previous studies have indicated that increasing the number of electrodes (e.g., 7, 8, or 12) leads to more feature information and thus higher classification accuracy39. In our system, the practical constraint is that users manually press the device against the mouth region. This results in a dynamic measurement environment rather than a completely static one, with noise and signal instability caused by electrode placement and hand movement. Despite these limitations—only four electrodes and dynamic operation—the device achieved a high classification accuracy of 94.3%. Comparisons using two- and three-electrode configurations also indicated that four electrodes yielded the highest accuracy, consistent with prior studies in which accuracy improved with additional electrodes (Supplementary Table 1).
Furthermore, the classification accuracy achieved in this study is considered sufficiently high compared with previous studies that assumed fixed electrodes under static measurement environments39. Visualization using t-SNE confirmed that words articulated silently could be distinguished in the feature space. This result demonstrates the feasibility of SSR based on EMG signals, consistent with prior studies that used t-SNE to evaluate classification of movements44. These findings support the effectiveness of the proposed device and preprocessing approach (Figs. 4d and e).
As a concrete application example of the proposed wearable silent speech interface, we conducted drone operation experiments. Conventional voice recognition systems are susceptible to interference from loud operational sounds such as drone startup and flight noise, as well as background noise in environments such as industrial or disaster sites, rendering acceptance of speech input from soft or whispered voices challenging45,46. In contrast, the present system performs SSR based on EMG signals, without requiring microphone input. As a result, control via command input remained feasible during the experiments, unaffected by environmental noise or the drone’s operational sounds. This characteristic enables the system to overcome the practical limitations faced by conventional speech recognition devices in noisy environments and function as an HMI with high robustness against environmental noise. Specifically, EMG signals acquired by the device were fed into a pretrained silent speech classification model, which recognized each command and enabled drone operation.
The speech commands adopted were Start, Move, Turn, and Stop, which are commonly used in smart-device operation. Silent speech-based drone control has exhibited promise in prior research47; this study extended its applicability by mapping classification results to corresponding control commands and transmitting them to the drone in real time. Experimental results demonstrated that a full sequence of operations—including takeoff, movement, turning, and landing—could be executed continuously.
Supplementary Table 2 presents smart devices capable of SSR via surface sensing of the face, including EMG, as demonstrated in this study. While conventional systems assume continuous attachment and suffer from privacy issues caused by unintended acquisition of muscle activity, the proposed design initiates signal acquisition only when the user deliberately brings the hand close to the mouth. This approach transfers control of information acquisition to the user and introduces a new paradigm that addresses both privacy and security concerns.
Traditional silent speech interfaces have mainly been investigated with the goal of protecting privacy during communication. However, many patch-type EMG systems48,49, while capable of stable measurement, require attaching electrodes over wide areas of the face or neck. This implies that EMG signals are continuously acquired, even when the user does not intend it, raising privacy concerns. Systems that detect laryngeal muscle motion using magnetoelasticity50 have also been studied, but they still require continuous attachment using tape or similar fixation methods and therefore suffer from the same issue of constant signal acquisition. In addition, studies have used cameras worn between the neck and chest to recognize mouth movements2, but these pose privacy concerns because unwanted content may be captured.
In EEG- and neural network-based studies10, participants are required to wear caps, leading to constant acquisition of brain activity. SSR has also been attempted with everyday objects, such as glasses, masks, or earphones, integrated with accelerometers51, infrared sensors11, TENGs14, and strain sensors43. Although these devices achieve high wearability and a familiar appearance, they continuously capture mouth movement, posing a risk of privacy invasion. Additionally, personal information or behavioral patterns can be collected without user consent; thus, always-on wearable sensors inherently carry risks of unintended speech capture and security threats from external attacks. These issues stem from the fact that the devices are physically affixed to the body. Hence, studies have also been conducted on silent speech devices that do not require continuous facial attachment, with daily use in mind52.
Ultrasound echo-based silent speech devices6 have demonstrated intentional placement against the throat during measurement; however, they require gels and are not designed for daily use, and handheld operation reduces portability. In contrast to these prior works, our device is structured not as a standalone sensor but as an integrated system that does not require continuous facial attachment. By employing soft materials, it enables EMG acquisition only when deliberately pressed against the mouth, without the need for gels or adhesive tape.
This design enables data collection only when intended, mitigating privacy concerns. Although the measurement conditions are less favorable compared with adhesive-based electrodes—because no tape fixation or skin pretreatment is used—our device achieved 94.3% silent speech classification accuracy when combined with machine learning. This performance is comparable to that of similar devices. Drone control experiments further demonstrated the practicality of our device as an HMI.
In this study, we developed a user-controlled wearable soft EMG interface that provides stable signal acquisition and enables SSR. By integrating LM interconnects with elastomer encapsulation, the device maintained reliable electrical performance under repeated finger movements and delivered consistent biosignals suitable for machine learning. Using a DNN, the system classified a 30-word vocabulary with 94.3% accuracy, achieving word-level recognition without continuous facial attachment.
Furthermore, drone control experiments in noisy and privacy-sensitive environments verified its utility for intuitive human–machine interaction. The significance of this study lies in its integration of on-demand operation with machine learning-based linguistic recognition. It represents a new paradigm that addresses the challenges related to privacy, social acceptability, and signal stability inherent in previous wearable SSR systems. This work highlights the potential of soft and stretchable bioelectronics as safe and intuitive communication tools. Future challenges include systematically evaluating the impact of electrode placement variability on recognition accuracy, developing methods to mitigate these effects, expanding the vocabulary, and extending recognition to continuous speech. Overcoming these limitations would enable this technology to evolve into a broadly applicable platform for communication, assistive devices, and control systems in real-world environments.
Methods
Fabrication of EMG amplification circuit board
The circuit for EMG measurement comprised an instrumentation amplifier (AD627, Analog Devices) in the front stage and two operational amplifiers (AD8607, Analog Devices) in the subsequent stage to amplify weak potentials. High-pass and low-pass filters were incorporated to extract only the relevant EMG signals, following established designs for biological signal-processing circuits⁵³. The reference voltage was set to 1.6 V, enabling measurement of muscle potential changes within the range 0–3.3 V (Fig. S1).
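The relationship between the digitized samples and the 1.6 V reference can be illustrated with a minimal Python sketch. The 10-bit converter resolution and the function names are assumptions made for illustration; only the 1.6 V reference and the 0–3.3 V range come from the description above.

```python
ADC_BITS = 10    # assumption: 10-bit converter, as found on many Arduino boards
V_SUPPLY = 3.3   # upper end of the measurement range stated in the text (V)
V_REF = 1.6      # reference (baseline) voltage stated in the text (V)


def counts_to_volts(counts):
    """Convert a raw ADC reading to the voltage at the analog input."""
    full_scale = (1 << ADC_BITS) - 1
    if not 0 <= counts <= full_scale:
        raise ValueError("ADC reading outside the converter range")
    return counts * V_SUPPLY / full_scale


def emg_deviation(counts):
    """Signed deviation of the amplified signal from the reference level (V)."""
    return counts_to_volts(counts) - V_REF


# Available swing on either side of the reference level.
negative_headroom = emg_deviation(0)                     # about -1.6 V
positive_headroom = emg_deviation((1 << ADC_BITS) - 1)   # about +1.7 V
```

Under these assumptions, centering the signal near the middle of the supply range leaves roughly equal headroom for positive and negative excursions of the amplified EMG.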
The circuit board was fabricated using a transparent polyimide substrate covered on one side with copper foil (L71KTS 1012EDR T10, Arisawa Mfg. Co., Ltd.). AZ photoresist was spin-coated, dried on a hot plate at 100 °C, and patterned via mask exposure. The circuit pattern was then developed through etching (Fig. S2), after which the necessary components were mounted. Two measurement electrodes and one reference electrode were applied to the skin for signal acquisition.
Preparation and wiring of LM paste
LM paste was prepared as follows. Nickel powder (3–7 µm, Alfa Aesar Co.) was dispersed into Galinstan (Maruya.com) at mass ratios of 2% and 5%. The mixture was sonicated using an ultrasonic probe (SFX 550, BRANSON) with a duty ratio of 30% and a total energy of 6 kJ. The mixture was then exposed to air overnight to promote oxidation. Oxidized LM paste was also prepared by stirring Galinstan under ambient conditions with a stirrer (Azone) at 750 rpm for 60 min to induce oxidation.
Fabrication of silent speech device
A mold produced via 3D printing was filled with silicone rubber Ecoflex 00-20 (Ecoflex, Smooth-On; A:B = 1:1 weight ratio) and cured in an oven.
The EMG amplification circuit board was fixed onto the cured elastomer using Sil-Poxy adhesive (Sil-Poxy, Smooth-On). Fingertip electrodes were attached in the same manner using Sil-Poxy. A polyimide film was cut using a laser marker (Keyence) to form a stencil mask for LM wiring. LM paste was applied through the mask, and after the mask was removed, wiring connections between the fingertip electrodes and the EMG circuit were formed.
The connection areas between the circuit board and the LM wiring were sealed with silicone rubber (KE-1606, Shin-Etsu Chemical) and cured. Finally, Ecoflex was coated over the entire device and cured in an oven to encapsulate the system.
Software for silent speech classification
A microcontroller (Arduino) was used to digitize the signals obtained from the electrodes and EMG amplification circuit. EMG signals from four perioral muscles were sampled at 1-ms intervals (1,000 Hz) via four analog input channels and transmitted to a computer (Fig. 2c). These data were used to generate test datasets for prediction with machine-learning models. The EMG signals were transformed into MFCCs, which served as the input features for machine learning. A five-layer DNN was used for training. The final layer employed the softmax function to perform multiclass classification of silent speech. The classification results were validated using t-SNE, confirming that silent speech words could be distinguished in the feature space (Figs. 4d and e).
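The feature-extraction and classification steps described above can be sketched in Python with NumPy. This is an illustrative outline, not the implementation used in this study: the frame length, hop size, number of mel filters, number of coefficients, hidden-layer widths, ReLU activations, and the 1-s recording length are all assumptions, and the weights are random rather than trained. Only the 1,000 Hz sampling rate, the four channels, the five-layer structure, the softmax output, and the 30-word vocabulary come from the text.

```python
import numpy as np

FS = 1000        # sampling rate in Hz (stated in the text)
FRAME_LEN = 128  # assumption: 128-sample analysis frames
HOP = 64         # assumption: 50 % frame overlap
N_MELS = 20      # assumption: number of mel filters
N_MFCC = 12      # assumption: number of cepstral coefficients kept


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_fft, fs, n_mels):
    """Triangular filters spaced evenly on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_mels + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    bank = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges_hz[i:i + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def mfcc(signal, fs=FS, frame_len=FRAME_LEN, hop=HOP,
         n_mels=N_MELS, n_mfcc=N_MFCC):
    """Return an array of shape (n_frames, n_mfcc) for one EMG channel."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (signal.size - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / frame_len
    log_mel = np.log(power @ mel_filterbank(frame_len, fs, n_mels).T + 1e-10)
    # DCT-II over the mel axis decorrelates the log filterbank energies.
    k = np.arange(n_mfcc)[:, None]
    n = np.arange(n_mels)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T


def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def dnn_forward(x, weights, biases):
    """Dense layers with ReLU on hidden layers and softmax on the last one."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(x @ w + b, 0.0)
    return softmax(x @ weights[-1] + biases[-1])


rng = np.random.default_rng(0)
emg = rng.standard_normal((4, FS))  # placeholder data: 4 channels x 1 s
features = np.concatenate([mfcc(ch).ravel() for ch in emg])

# Assumption: five dense layers; 30 outputs match the vocabulary size.
sizes = [features.size, 256, 128, 64, 32, 30]
weights = [rng.standard_normal((a, b)) * np.sqrt(2.0 / a)
           for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

probs = dnn_forward(features, weights, biases)  # one probability per word
predicted_class = int(np.argmax(probs))
```

Because the weights here are untrained, `predicted_class` carries no meaning; the sketch only shows how per-channel MFCCs can be flattened into one feature vector and mapped to a probability distribution over the vocabulary.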
Drone operation application
A control program was implemented in Python. The PC attached to the device was linked to a drone (Tello, Ryze Technology) via Wi-Fi. During silent speech articulation with the device placed at the mouth, recognized words were transmitted via HTTP to control the drone (Fig. 5c). Recognized words were displayed in a web browser, and the drone executed movements within the room based on the recognized commands (Figs. 5d and e).
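A minimal Python sketch of the final step, converting a recognized word into a drone command, is given below. It is an illustration under stated assumptions rather than the control program used in this study: the word-to-command table, the 50-cm step size, and the confidence threshold are hypothetical, and the HTTP relay between the recognizer and the controller is omitted. The address, port, and text commands follow the publicly documented Tello SDK, which accepts plain-text commands over UDP.

```python
import socket

# Default drone address and command port in the Tello SDK documentation.
TELLO_ADDR = ("192.168.10.1", 8889)

# Hypothetical mapping; the actual command vocabulary is not reproduced here.
COMMANDS = {
    "takeoff": "takeoff",
    "land": "land",
    "up": "up 50",
    "down": "down 50",
    "forward": "forward 50",
    "back": "back 50",
    "left": "left 50",
    "right": "right 50",
}


def word_to_command(word, confidence, threshold=0.8):
    """Return a Tello SDK command string, or None if the word is ignored.

    The confidence threshold is an assumption added to avoid acting on
    uncertain classifications.
    """
    if confidence < threshold:
        return None
    return COMMANDS.get(word.strip().lower())


def send_command(command, sock, addr=TELLO_ADDR):
    """Send one plain-text SDK command to the drone over UDP."""
    sock.sendto(command.encode("ascii"), addr)


def fly(recognized):
    """Forward (word, confidence) pairs from the classifier to the drone."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        send_command("command", sock)  # switches the drone into SDK mode
        for word, confidence in recognized:
            command = word_to_command(word, confidence)
            if command is not None:
                send_command(command, sock)
```

For example, `fly([("takeoff", 0.97), ("forward", 0.91), ("land", 0.95)])` would send three movement commands when the PC is connected to the drone's Wi-Fi network, whereas a low-confidence or unknown word would be skipped.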
All participants gave informed consent after receiving a complete explanation of the study’s purpose and procedures.
The research protocol was approved by the Ethics Committee of the Yokohama National University Graduate School of Engineering Science (No. 2020-16, approved on February 12, 2021).