Electrical Engineering and Systems Science
See recent articles
- [1] arXiv:2406.12931 [pdf, html, other]
-
Title: Automatic Speech Recognition for Biomedical Data in Bengali LanguageSubjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
This paper presents the development of a prototype Automatic Speech Recognition (ASR) system specifically designed for Bengali biomedical data. Recent advancements in Bengali ASR are encouraging, but a lack of domain-specific data limits the creation of practical healthcare ASR models. This project bridges this gap by developing an ASR system tailored for Bengali medical terms like symptoms, severity levels, and diseases, encompassing two major dialects: Bengali and Sylheti. We train and evaluate two popular ASR frameworks on a comprehensive 46-hour Bengali medical corpus. Our core objective is to create deployable health-domain ASR systems for digital health applications, ultimately increasing accessibility for non-technical users in the healthcare sector.
- [2] arXiv:2406.12937 [pdf, html, other]
-
Title: Self-Train Before You TranscribeComments: Accepted at Interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
When there is a mismatch between the training and test domains, current speech recognition systems show significant performance degradation. Self-training methods, such as noisy student teacher training, can help address this and enable the adaptation of models under such domain shifts. However, self-training typically requires a collection of unlabelled target domain data. For settings where this is not practical, we investigate the benefit of performing noisy student teacher training on recordings in the test set as a test-time adaptation approach. Similarly to the dynamic evaluation approach in language modelling, this enables the transfer of information across utterance boundaries and functions as a method of domain adaptation. A range of in-domain and out-of-domain datasets are used for experiments demonstrating large relative gains of up to 32.2%. Interestingly, our method showed larger gains than the typical self-training setup that utilises separate adaptation data.
- [3] arXiv:2406.12943 [pdf, other]
-
Title: A square cross-section FOV rotational CL (SC-CL) and its analytical reconstruction methodSubjects: Image and Video Processing (eess.IV)
Rotational computed laminography (CL) has broad application potential in three-dimensional imaging of plate-like objects, as it only needs x-ray to pass through the tested object in the thickness direction during the imaging process. In this study, a square cross-section FOV rotational CL (SC-CL) was proposed. Then, the FDK-type analytical reconstruction algorithm applicable to the SC-CL was derived. On this basis, the proposed method was validated through numerical experiments.
- [4] arXiv:2406.12946 [pdf, other]
-
Title: Instruction Data Generation and Unsupervised Adaptation for Speech Language ModelsComments: Accepted for Interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
In this paper, we propose three methods for generating synthetic samples to train and evaluate multimodal large language models capable of processing both text and speech inputs. Addressing the scarcity of samples containing both modalities, synthetic data generation emerges as a crucial strategy to enhance the performance of such systems and facilitate the modeling of cross-modal relationships between the speech and text domains. Our process employs large language models to generate textual components and text-to-speech systems to generate speech components. The proposed methods offer a practical and effective means to expand the training dataset for these models. Experimental results show progress in achieving an integrated understanding of text and speech. We also highlight the potential of using unlabeled speech data to generate synthetic samples comparable in quality to those with available transcriptions, enabling the expansion of these models to more languages.
- [5] arXiv:2406.12998 [pdf, html, other]
-
Title: Articulatory Encodec: Vocal Tract Kinematics as a Codec for SpeechSubjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech -- articulatory encodec. The articulatory encodec comprises an articulatory analysis model that infers articulatory features from speech audio, and an articulatory synthesis model that synthesizes speech audio from articulatory features. The articulatory features are kinematic traces of vocal tract articulators and source features, which are intuitively interpretable and controllable, being the actual physical interface of speech production. An additional speaker identity encoder is jointly trained with the articulatory synthesizer to inform the voice texture of individual speakers. By training on large-scale speech data, we achieve a fully intelligible, high-quality articulatory synthesizer that generalizes to unseen speakers. Furthermore, the speaker embedding is effectively disentangled from articulations, which enables accent-perserving zero-shot voice conversion. To the best of our knowledge, this is the first demonstration of universal, high-performance articulatory inference and synthesis, suggesting the proposed framework as a powerful coding system of speech.
- [6] arXiv:2406.13059 [pdf, html, other]
-
Title: Learned Compression of Encoding DistributionsComments: 7 pages, 5 figures, IEEE ICIP 2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
The entropy bottleneck introduced by Ballé et al. is a common component used in many learned compression models. It encodes a transformed latent representation using a static distribution whose parameters are learned during training. However, the actual distribution of the latent data may vary wildly across different inputs. The static distribution attempts to encompass all possible input distributions, thus fitting none of them particularly well. This unfortunate phenomenon, sometimes known as the amortization gap, results in suboptimal compression. To address this issue, we propose a method that dynamically adapts the encoding distribution to match the latent data distribution for a specific input. First, our model estimates a better encoding distribution for a given input. This distribution is then compressed and transmitted as an additional side-information bitstream. Finally, the decoder reconstructs the encoding distribution and uses it to decompress the corresponding latent data. Our method achieves a Bjøntegaard-Delta (BD)-rate gain of -7.10% on the Kodak test dataset when applied to the standard fully-factorized architecture. Furthermore, considering computational complexity, the transform used by our method is an order of magnitude cheaper in terms of Multiply-Accumulate (MAC) operations compared to related side-information methods such as the scale hyperprior.
- [7] arXiv:2406.13139 [pdf, html, other]
-
Title: Audio Fingerprinting with Holographic Reduced RepresentationsComments: accepted at Interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
This paper proposes an audio fingerprinting model with holographic reduced representation (HRR). The proposed method reduces the number of stored fingerprints, whereas conventional neural audio fingerprinting requires many fingerprints for each audio track to achieve high accuracy and time resolution. We utilize HRR to aggregate multiple fingerprints into a composite fingerprint via circular convolution and summation, resulting in fewer fingerprints with the same dimensional space as the original. Our search method efficiently finds a combined fingerprint in which a query fingerprint exists. Using HRR's inverse operation, it can recover the relative position within a combined fingerprint, retaining the original time resolution. Experiments show that our method can reduce the number of fingerprints with modest accuracy degradation while maintaining the time resolution, outperforming simple decimation and summation-based aggregation methods.
- [8] arXiv:2406.13145 [pdf, html, other]
-
Title: Constructing and Evaluating Digital Twins: An Intelligent Framework for DT DevelopmentSubjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
The development of Digital Twins (DTs) represents a transformative advance for simulating and optimizing complex systems in a controlled digital space. Despite their potential, the challenge of constructing DTs that accurately replicate and predict the dynamics of real-world systems remains substantial. This paper introduces an intelligent framework for the construction and evaluation of DTs, specifically designed to enhance the accuracy and utility of DTs in testing algorithmic performance. We propose a novel construction methodology that integrates deep learning-based policy gradient techniques to dynamically tune the DT parameters, ensuring high fidelity in the digital replication of physical systems. Moreover, the Mean STate Error (MSTE) is proposed as a robust metric for evaluating the performance of algorithms within these digital space. The efficacy of our framework is demonstrated through extensive simulations that show our DT not only accurately mirrors the physical reality but also provides a reliable platform for algorithm evaluation. This work lays a foundation for future research into DT technologies, highlighting pathways for both theoretical enhancements and practical implementations in various industries.
- [9] arXiv:2406.13150 [pdf, other]
-
Title: MCAD: Multi-modal Conditioned Adversarial Diffusion Model for High-Quality PET Image ReconstructionComments: Early accepted by MICCAI2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Radiation hazards associated with standard-dose positron emission tomography (SPET) images remain a concern, whereas the quality of low-dose PET (LPET) images fails to meet clinical requirements. Therefore, there is great interest in reconstructing SPET images from LPET images. However, prior studies focus solely on image data, neglecting vital complementary information from other modalities, e.g., patients' clinical tabular, resulting in compromised reconstruction with limited diagnostic utility. Moreover, they often overlook the semantic consistency between real SPET and reconstructed images, leading to distorted semantic contexts. To tackle these problems, we propose a novel Multi-modal Conditioned Adversarial Diffusion model (MCAD) to reconstruct SPET images from multi-modal inputs, including LPET images and clinical tabular. Specifically, our MCAD incorporates a Multi-modal conditional Encoder (Mc-Encoder) to extract multi-modal features, followed by a conditional diffusion process to blend noise with multi-modal features and gradually map blended features to the target SPET images. To balance multi-modal inputs, the Mc-Encoder embeds Optimal Multi-modal Transport co-Attention (OMTA) to narrow the heterogeneity gap between image and tabular while capturing their interactions, providing sufficient guidance for reconstruction. In addition, to mitigate semantic distortions, we introduce the Multi-Modal Masked Text Reconstruction (M3TRec), which leverages semantic knowledge extracted from denoised PET images to restore the masked clinical tabular, thereby compelling the network to maintain accurate semantics during reconstruction. To expedite the diffusion process, we further introduce an adversarial diffusive network with a reduced number of diffusion steps. Experiments show that our method achieves the state-of-the-art performance both qualitatively and quantitatively.
- [10] arXiv:2406.13165 [pdf, html, other]
-
Title: Cardiac Copilot: Automatic Probe Guidance for Echocardiography with World ModelComments: Early Accepted by MICCAI 2024Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Echocardiography is the only technique capable of real-time imaging of the heart and is vital for diagnosing the majority of cardiac diseases. However, there is a severe shortage of experienced cardiac sonographers, due to the heart's complex structure and significant operational challenges. To mitigate this situation, we present a Cardiac Copilot system capable of providing real-time probe movement guidance to assist less experienced sonographers in conducting freehand echocardiography. This system can enable non-experts, especially in primary departments and medically underserved areas, to perform cardiac ultrasound examinations, potentially improving global healthcare delivery. The core innovation lies in proposing a data-driven world model, named Cardiac Dreamer, for representing cardiac spatial structures. This world model can provide structure features of any cardiac planes around the current probe position in the latent space, serving as an precise navigation map for autonomous plane localization. We train our model with real-world ultrasound data and corresponding probe motion from 110 routine clinical scans with 151K sample pairs by three certified sonographers. Evaluations on three standard planes with 37K sample pairs demonstrate that the world model can reduce navigation errors by up to 33\% and exhibit more stable performance.
- [11] arXiv:2406.13191 [pdf, html, other]
-
Title: GPU-Accelerated DCOPF using Gradient-Based OptimizationSubjects: Systems and Control (eess.SY)
DC Optimal Power Flow (DCOPF) is a key operational tool for power system operators, and it is embedded as a subproblem in many challenging optimization problems (e.g., line switching). However, traditional CPU-based solve routines (e.g., simplex) have saturated in speed and are hard to parallelize. This paper focuses on solving DCOPF problems using gradient-based routines on Graphics Processing Units (GPUs), which have massive parallelization capability. To formulate these problems, we pose a Lagrange dual associated with DCOPF (linear and quadratic cost curves), and then we explicitly solve the inner (primal) minimization problem with a dual norm. The resulting dual problem can be efficiently iterated using projected gradient ascent. After solving the dual problem on both CPUs and GPUs to find tight lower bounds, we benchmark against Gurobi and MOSEK, comparing convergence speed and tightness on the IEEE 2000 and 10000 bus systems. We provide reliable and tight lower bounds for these problems with, at best, 5.4x speedup over a conventional solver.
- [12] arXiv:2406.13194 [pdf, html, other]
-
Title: A Hybrid Intelligent System for Protection of Transmission Lines Connected to PV Farms based on Linear TrendsComments: 27 pages, 20 figuresSubjects: Signal Processing (eess.SP)
Conventional relays face challenges for transmission lines connected to inverter-based resources (IBRs). In this article, a single-ended intelligent protection of the transmission line in the zone between the grid and the PV farm is suggested. The method employs a fuzzy logic and random forest (RF)-based hybrid system to detect faults based on combined linear trend attributes of the 3-phase currents. The fault location is determined and the faulty phase is detected. RF feature selection is used to obtain the optimal linear trend feature. The performance of the methodology is examined for abnormal events such as faults, capacitor and load-switching operations simulated in PSCAD/EMTDC on IEEE 9-bus system obtained by varying various fault and switching parameters. Additionally, when validating the suggested strategy, consideration is given to the effects of conditions such as the presence of double circuit lines, PV capacity, sampling rate, data window length, noise, high impedance faults, CT saturation, compensation devices, evolving and cross-country faults, and far-end and near-end faults. The findings indicate that the suggested strategy can be used to deal with a variety of system configurations and situations while still safeguarding such complex power transmission networks.
- [13] arXiv:2406.13205 [pdf, other]
-
Title: Application of Computer Deep Learning Model in Diagnosis of Pulmonary NodulesYutian Yang (1), Hongjie Qiu (2), Yulu Gong (3), Xiaoyi Liu (4), Yang Lin (5), Muqing Li (6) ((1) University of California, Davis, (2) University of Washington, (3) Northern Arizona University, (4) Arizona State University, (5) University of Pennsylvania, (6) University of California San Diego)Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
The 3D simulation model of the lung was established by using the reconstruction method. A computer aided pulmonary nodule detection model was constructed. The process iterates over the images to refine the lung nodule recognition model based on neural networks. It is integrated with 3D virtual modeling technology to improve the interactivity of the system, so as to achieve intelligent recognition of lung nodules. A 3D RCNN (Region-based Convolutional Neural Network) was utilized for feature extraction and nodule identification. The LUNA16 large sample database was used as the research dataset. FROC (Free-response Receiver Operating Characteristic) analysis was applied to evaluate the model, calculating sensitivity at various false positive rates to derive the average FROC. Compared with conventional diagnostic methods, the recognition rate was significantly improved. This technique facilitates the detection of pulmonary abnormalities at an initial phase, which holds immense value for the prompt diagnosis of lung malignancies.
- [14] arXiv:2406.13209 [pdf, html, other]
-
Title: Diffusion Model-based FOD Restoration from High Distortion in dMRIComments: 11 pages, 7 figuresSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Fiber orientation distributions (FODs) is a popular model to represent the diffusion MRI (dMRI) data. However, imaging artifacts such as susceptibility-induced distortion in dMRI can cause signal loss and lead to the corrupted reconstruction of FODs, which prohibits successful fiber tracking and connectivity analysis in affected brain regions such as the brain stem. Generative models, such as the diffusion models, have been successfully applied in various image restoration tasks. However, their application on FOD images poses unique challenges since FODs are 4-dimensional data represented by spherical harmonics (SPHARM) with the 4-th dimension exhibiting order-related dependency. In this paper, we propose a novel diffusion model for FOD restoration that can recover the signal loss caused by distortion artifacts. We use volume-order encoding to enhance the ability of the diffusion model to generate individual FOD volumes at all SPHARM orders. Moreover, we add cross-attention features extracted across all SPHARM orders in generating every individual FOD volume to capture the order-related dependency across FOD volumes. We also condition the diffusion model with low-distortion FODs surrounding high-distortion areas to maintain the geometric coherence of the generated FODs. We trained and tested our model using data from the UK Biobank (n = 1315). On a test set with ground truth (n = 43), we demonstrate the high accuracy of the generated FODs in terms of root mean square errors of FOD volumes and angular errors of FOD peaks. We also apply our method to a test set with large distortion in the brain stem area (n = 1172) and demonstrate the efficacy of our method in restoring the FOD integrity and, hence, greatly improving tractography performance in affected brain regions.
- [15] arXiv:2406.13266 [pdf, other]
-
Title: Advancements in Orthopaedic Arm Segmentation: A Comprehensive ReviewComments: 29 pages, 20 figuresSubjects: Image and Video Processing (eess.IV)
The most recent advances in medical imaging that have transformed diagnosis, especially in the case of interpreting X-ray images, are actively involved in the healthcare sector. The advent of digital image processing technology and the implementation of deep learning models such as Convolutional Neural Networks (CNNs) have made the analysis of X-rays much more accurate and efficient. In this article, some essential techniques such as edge detection, region-growing technique, and thresholding approach, and the deep learning models such as variants of YOLOv8-which is the best object detection and segmentation framework-are reviewed. We further investigate that the traditional image processing techniques like segmentation are very much simple and provides the alternative to the advanced methods as well. Our review gives useful knowledge on the practical usage of the innovative and traditional approaches of manual X-ray interpretation. The discovered information will help professionals and researchers to gain more profound knowledge in digital interpretation techniques in medical imaging.
- [16] arXiv:2406.13268 [pdf, html, other]
-
Title: CEC: A Noisy Label Detection Method for Speaker RecognitionComments: interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Noisy labels are inevitable, even in well-annotated datasets. The detection of noisy labels is of significant importance to enhance the robustness of speaker recognition models. In this paper, we propose a novel noisy label detection approach based on two new statistical metrics: Continuous Inconsistent Counting (CIC) and Total Inconsistent Counting (TIC). These metrics are calculated through Cross-Epoch Counting (CEC) and correspond to the early and late stages of training, respectively. Additionally, we categorize samples based on their prediction results into three categories: inconsistent samples, hard samples, and easy samples. During training, we gradually increase the difficulty of hard samples to update model parameters, preventing noisy labels from being overfitted. Compared to contrastive schemes, our approach not only achieves the best performance in speaker verification but also excels in noisy label detection.
- [17] arXiv:2406.13312 [pdf, html, other]
-
Title: Pushing the Limit of Sound Event Detection with Multi-Dilated Frequency Dynamic ConvolutionSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Frequency dynamic convolution (FDY conv) has been a milestone in the sound event detection (SED) field, but it involves a substantial increase in model size due to multiple basis kernels. In this work, we propose partial frequency dynamic convolution (PFD conv), which concatenates static conventional 2D convolution branch output and dynamic FDY conv branch output in order to minimize model size increase while maintaining the performance. Additionally, we propose multi-dilated frequency dynamic convolution (MDFD conv), which integrates multiple dilated frequency dynamic convolution (DFD conv) branches with different dilation size sets and a static branch within a single convolution module, achieving a 3.2% improvement in polyphonic sound detection score (PSDS) over FDY conv. Proposed methods with extensive ablation studies further enhance understanding and usability of FDY conv variants.
- [18] arXiv:2406.13337 [pdf, html, other]
-
Title: Medical Spoken Named Entity RecognitionComments: Preprint, 40 pagesSubjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Spoken Named Entity Recognition (NER) aims to extracting named entities from speech and categorizing them into types like person, location, organization, etc. In this work, we present VietMed-NER - the first spoken NER dataset in the medical domain. To our best knowledge, our real-world dataset is the largest spoken NER dataset in the world in terms of the number of entity types, featuring 18 distinct types. Secondly, we present baseline results using various state-of-the-art pre-trained models: encoder-only and sequence-to-sequence. We found that pre-trained multilingual models XLM-R outperformed all monolingual models on both reference text and ASR output. Also in general, encoders perform better than sequence-to-sequence models for the NER task. By simply translating, the transcript is applicable not just to Vietnamese but to other languages as well. All code, data and models are made publicly available here: this https URL
- [19] arXiv:2406.13374 [pdf, html, other]
-
Title: State Anti-windup: A New Methodology for Tackling State Constraints at Both Synthesis and Implementation LevelsComments: 15 pages 21 FiguresSubjects: Systems and Control (eess.SY)
The anti-windup compensation typically addresses strict control limitations in control systems. However, there is a clear need for an equivalent solution for the states/outputs of the system. This paper introduces a novel methodology for the state anti-windup compensator. Unlike state-constrained control methods, which often focus on incorporating soft constraints into the design or fail to react adequately to constraint violations in practical settings, the proposed methodology treats state constraints as implement-oriented soft-hard constraints. This is achieved by integrating a saturation block within the structure of the safety compensator, referred to as the state anti-windup (SANTW) compensator. Similar to input anti-windup schemes, the SANTW design is separated from the nominal controller design. The problem is formulated as a disturbance rejection one to directly minimize the saturation. The paper develops two Hinf optimization frameworks using frequency-domain solutions and linear matrix inequalities. It then addresses constraints on both inputs and states, resulting in a unified Input-State Anti-windup (IS-ANTW) compensator synthesized using non-smooth Hinf optimization. This method also offers the flexibility of having a fixed-order compensator, crucial in many practical applications. Additionally, the study evaluates the proposed compensator's performance in managing current fluctuations from renewable energy sources during grid faults, demonstrating its effectiveness through detailed Electromagnetic Transient (EMT) simulations of grid-connected DC-AC converters.
- [20] arXiv:2406.13385 [pdf, html, other]
-
Title: Explainable by-design Audio Segmentation through Non-Negative Matrix Factorization and ProbingComments: Accepted at Interspeech 2024, 5 pages, 2 figures, 3 tablesSubjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Audio segmentation is a key task for many speech technologies, most of which are based on neural networks, usually considered as black boxes, with high-level performances. However, in many domains, among which health or forensics, there is not only a need for good performance but also for explanations about the output decision. Explanations derived directly from latent representations need to satisfy "good" properties, such as informativeness, compactness, or modularity, to be interpretable. In this article, we propose an explainable-by-design audio segmentation model based on non-negative matrix factorization (NMF) which is a good candidate for the design of interpretable representations. This paper shows that our model reaches good segmentation performances, and presents deep analyses of the latent representation extracted from the non-negative matrix. The proposed approach opens new perspectives toward the evaluation of interpretable representations according to "good" properties.
- [21] arXiv:2406.13386 [pdf, html, other]
-
Title: Online Domain-Incremental Learning Approach to Classify Acoustic Scenes in All LocationsComments: Accepted to EUSIPCO 2024Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
In this paper, we propose a method for online domain-incremental learning of acoustic scene classification from a sequence of different locations. Simply training a deep learning model on a sequence of different locations leads to forgetting of previously learned knowledge. In this work, we only correct the statistics of the Batch Normalization layers of a model using a few samples to learn the acoustic scenes from a new location without any excessive training. Experiments are performed on acoustic scenes from 11 different locations, with an initial task containing acoustic scenes from 6 locations and the remaining 5 incremental tasks each representing the acoustic scenes from a different location. The proposed approach outperforms fine-tuning based methods and achieves an average accuracy of 48.8% after learning the last task in sequence without forgetting acoustic scenes from the previously learned locations.
- [22] arXiv:2406.13396 [pdf, other]
-
Title: Safe and Non-Conservative Trajectory Planning for Autonomous Driving Handling Unanticipated Behaviors of Traffic ParticipantsSubjects: Systems and Control (eess.SY)
Trajectory planning for autonomous driving is challenging because the unknown future motion of traffic participants must be accounted for, yielding large uncertainty. Stochastic Model Predictive Control (SMPC)-based planners provide non-conservative planning, but do not rule out a (small) probability of collision. We propose a control scheme that yields an efficient trajectory based on SMPC when the traffic scenario allows, still avoiding that the vehicle causes collisions with traffic participants if the latter move according to the prediction assumptions. If some traffic participant does not behave as anticipated, no safety guarantee can be given. Then, our approach yields a trajectory which minimizes the probability of collision, using Constraint Violation Probability Minimization techniques. Our algorithm can also be adapted to minimize the anticipated harm caused by a collision. We provide a thorough discussion of the benefits of our novel control scheme and compare it to a previous approach through numerical simulations from the CommonRoad database.
- [23] arXiv:2406.13413 [pdf, html, other]
-
Title: Recurrent Inference Machine for Medical Image RegistrationComments: PreprintSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Image registration is essential for medical image applications where alignment of voxels across multiple images is needed for qualitative or quantitative analysis. With recent advancements in deep neural networks and parallel computing, deep learning-based medical image registration methods become competitive with their flexible modelling and fast inference capabilities. However, compared to traditional optimization-based registration methods, the speed advantage may come at the cost of registration performance at inference time. Besides, deep neural networks ideally demand large training datasets while optimization-based methods are training-free. To improve registration accuracy and data efficiency, we propose a novel image registration method, termed Recurrent Inference Image Registration (RIIR) network. RIIR is formulated as a meta-learning solver to the registration problem in an iterative manner. RIIR addresses the accuracy and data efficiency issues, by learning the update rule of optimization, with implicit regularization combined with explicit gradient input.
We evaluated RIIR extensively on brain MRI and quantitative cardiac MRI datasets, in terms of both registration accuracy and training data efficiency. Our experiments showed that RIIR outperformed a range of deep learning-based methods, even with only $5\%$ of the training data, demonstrating high data efficiency. Key findings from our ablation studies highlighted the important added value of the hidden states introduced in the recurrent inference framework for meta-learning. Our proposed RIIR offers a highly data-efficient framework for deep learning-based medical image registration. - [24] arXiv:2406.13420 [pdf, html, other]
-
Title: The effect of control barrier functions on energy transfers in controlled physical systemsSubjects: Systems and Control (eess.SY)
Using a port-Hamiltonian formalism, we show the qualitative and quantitative effect of safety-critical control implemented with control barrier functions (CBFs) on the power balance of controlled physical systems. The presented results will provide novel tools to design CBFs inducing desired energetic behaviors of the closed-loop system, including nontrivial damping injection effects and non-passive control actions, effectively injecting energy in the system in a controlled manner. Simulations validate the stated results.
- [25] arXiv:2406.13441 [pdf, html, other]
-
Title: Robust Melanoma Thickness Prediction via Deep Transfer Learning enhanced by XAI TechniquesSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
This study focuses on analyzing dermoscopy images to determine the depth of melanomas, which is a critical factor in diagnosing and treating skin cancer. The Breslow depth, measured from the top of the granular layer to the deepest point of tumor invasion, serves as a crucial parameter for staging melanoma and guiding treatment decisions. This research aims to improve the prediction of the depth of melanoma through the use of machine learning models, specifically deep learning, while also providing an analysis of the possible existance of graduation in the images characteristics which correlates with the depth of the melanomas. Various datasets, including ISIC and private collections, were used, comprising a total of 1162 images. The datasets were combined and balanced to ensure robust model training. The study utilized pre-trained Convolutional Neural Networks (CNNs). Results indicated that the models achieved significant improvements over previous methods. Additionally, the study conducted a correlation analysis between model's predictions and actual melanoma thickness, revealing a moderate correlation that improves with higher thickness values. Explainability methods such as feature visualization through Principal Component Analysis (PCA) demonstrated the capability of deep features to distinguish between different depths of melanoma, providing insight into the data distribution and model behavior. In summary, this research presents a dual contribution: enhancing the state-of-the-art classification results through advanced training techniques and offering a detailed analysis of the data and model behavior to better understand the relationship between dermoscopy images and melanoma thickness.
- [26] arXiv:2406.13462 [pdf, other]
-
Title: Design of Phase Locked Loop in 180 nm TechnologySubjects: Systems and Control (eess.SY)
The presented paper introduces a design for a phase-locked loop (PLL) that is utilized in frequency synthesis and modulation-demodulation within communication systems and in VLSI applications. The CMOS PLL is designed using 180 nm Fabrication Technology on Cadence Virtuoso Tool with a supply voltage of 1.8 V. The performance is evaluated through simulations and measurements, which demonstrate its ability to track and lock onto the input frequency. The PLL is a frequency synthesizer implemented to generate 2.4 GHz frequency. The input reference clock from a crystal oscillator is 150 MHz square wave. Negative feedback is given by divide-by-16 frequency divider, ensuring the phase and frequency synchronization between the divided signal and the reference signal. The design has essential components such as a phase frequency detector, charge pump, loop filter, current-starved voltage-controlled oscillator (CSVCO), and frequency divider. Through their collaborative operation, the system generates an output frequency that is 16 times the input frequency. The centre frequency of the 3-stage CSVCO is 3.208 GHz at 900 mV input voltage. With an input voltage ranging from 0.4 V to 1.8 V, the VCO offers a tuning range that spans from 1.066 GHz to 3.731 GHz.PLL demonstrates a lock-in range spanning from 70.4 MHz to 173 MHz, with an output frequency range of 1.12 GHz to 2.78 GHz. It achieves a lock time of 260.03 ns and consumes a maximum power of 5.15 mW at 2.4 GHz.
- [27] arXiv:2406.13464 [pdf, html, other]
-
Title: An Efficient yet High-Performance Method for Precise Radar-Based Imaging of Human Hand PosesComments: 4 pages, 4 figures, accepted at European Microwave Week (EuMW 2024) to the topic "R28 Human Activity Monitoring, including Gesture Recognition" (EuRAD)Subjects: Signal Processing (eess.SP)
Contactless hand pose estimation requires sensors that provide precise spatial information and low computational complexity for real-time processing. Unlike vision-based systems, radar offers lighting independence and direct motion assessments. Yet, there is limited research balancing real-time constraints, suitable frame rates for motion evaluations, and the need for precise 3D data. To address this, we extend the ultra-efficient two-tone hand imaging method from our prior work to a three-tone approach. Maintaining high frame rates and real-time constraints, this approach significantly enhances reconstruction accuracy and precision. We assess these measures by evaluating reconstruction results for different hand poses obtained by an imaging radar. Accuracy is assessed against ground truth from a spatially calibrated photogrammetry setup, while precision is measured using 3D-printed hand poses. The results emphasize the method's great potential for future radar-based hand sensing.
- [28] arXiv:2406.13470 [pdf, html, other]
-
Title: Automatic Voice Classification Of Autistic SubjectsComments: Accepted for publication at EAI BODYNETS 2023 2024 - 18th EAI International Conference on Body Area Networks: Intelligent Edge Cloud for Dependable Globally Connected BAN, February 5-6, 2024 Milan, ItalySubjects: Signal Processing (eess.SP); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Autism Spectrum Disorders (ASD) describe a heterogeneous set of conditions classified as neurodevelopmental disorders. Although the mechanisms underlying ASD are not yet fully understood, more recent literature focused on multiple genetics and/or environmental risk factors. Heterogeneity of symptoms, especially in milder forms of this condition, could be a challenge for the clinician. In this work, an automatic speech classification algorithm is proposed to characterize the prosodic elements that best distinguish autism, to support the traditional diagnosis. The performance of the proposed algorithm is evaluted by testing the classification algorithms on a dataset composed of recorded speeches, collected among both autustic and non autistic subjects.
- [29] arXiv:2406.13471 [pdf, html, other]
-
Title: Diffusion-based Generative Modeling with Discriminative Guidance for Streamable Speech EnhancementSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Diffusion-based generative models (DGMs) have recently attracted attention in speech enhancement research (SE) as previous works showed a remarkable generalization capability. However, DGMs are also computationally intensive, as they usually require many iterations in the reverse diffusion process (RDP), making them impractical for streaming SE systems. In this paper, we propose to use discriminative scores from discriminative models in the first steps of the RDP. These discriminative scores require only one forward pass with the discriminative model for multiple RDP steps, thus greatly reducing computations. This approach also allows for performance improvements. We show that we can trade off between generative and discriminative capabilities as the number of steps with the discriminative score increases. Furthermore, we propose a novel streamable time-domain generative model with an algorithmic latency of 50 ms, which has no significant performance degradation compared to offline models.
- [30] arXiv:2406.13522 [pdf, html, other]
-
Title: Measured-state conditioned recursive feasibility for stochastic model predictive controlSubjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
In this paper, we address the problem of designing stochastic model predictive control (MPC) schemes for linear systems affected by unbounded disturbances. The contribution of the paper is twofold. First, motivated by the difficulty of guaranteeing recursive feasibility in this framework, due to the nonzero probability of violating chance-constraints in the case of unbounded noise, we introduce the novel definition of measured-state conditioned recursive feasibility in expectation. Second, we construct a stochastic MPC scheme, based on the introduction of ellipsoidal probabilistic reachable sets, which implements a closed-loop initialization strategy, i.e., the current measured-state is employed for initializing the optimization problem. This new scheme is proven to satisfy the novel definition of recursive feasibility, and its superiority with respect to open-loop initialization schemes, arising from the fact that one never neglects the information brought by the current measurement, is shown through numerical examples.
- [31] arXiv:2406.13526 [pdf, html, other]
-
Title: Using Geometrical information to Measure the Vibration of A Swaying Millimeter-wave RadarComments: 5 pages, 4 figures,submitted to the IEEE for publicationSubjects: Signal Processing (eess.SP)
This paper presents two new, simple yet effective approaches to measure the vibration of a swaying millimeter-wave radar (mmRadar) utilizing geometrical information. Specifically, for the planar vibrations, we firstly establish an equation based on the area difference between the swaying mmRadar and the reference objects at different moments, which enables the quantification of planar displacement. Secondly, volume differences are also utilized with the same idea, achieving the self-vibration measurement of a swaying mmRadar for spatial vibrations. Experimental results confirm the effectiveness of our methods, demonstrating its capability to estimate both the amplitude and a crude direction of the mmRadar's self-vibration.
- [32] arXiv:2406.13645 [pdf, html, other]
-
Title: Advancing UWF-SLO Vessel Segmentation with Source-Free Active Domain Adaptation and a Novel Multi-Center DatasetComments: MICCAI 2024 Early AcceptSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Accurate vessel segmentation in Ultra-Wide-Field Scanning Laser Ophthalmoscopy (UWF-SLO) images is crucial for diagnosing retinal diseases. Although recent techniques have shown encouraging outcomes in vessel segmentation, models trained on one medical dataset often underperform on others due to domain shifts. Meanwhile, manually labeling high-resolution UWF-SLO images is an extremely challenging, time-consuming and expensive task. In response, this study introduces a pioneering framework that leverages a patch-based active domain adaptation approach. By actively recommending a few valuable image patches by the devised Cascade Uncertainty-Predominance (CUP) selection strategy for labeling and model-finetuning, our method significantly improves the accuracy of UWF-SLO vessel segmentation across diverse medical centers. In addition, we annotate and construct the first Multi-center UWF-SLO Vessel Segmentation (MU-VS) dataset to promote this topic research, comprising data from multiple institutions. This dataset serves as a valuable resource for cross-center evaluation, verifying the effectiveness and robustness of our approach. Experimental results demonstrate that our approach surpasses existing domain adaptation and active learning methods, considerably reducing the gap between the Upper and Lower bounds with minimal annotations, highlighting our method's practical clinical value. We will release our dataset and code to facilitate relevant research: this https URL.
- [33] arXiv:2406.13650 [pdf, html, other]
-
Title: Advanced Maximum Adhesion Tracking Strategies in Railway Traction DrivesAhmed Fathy Abouzeid, Juan Manuel Guerrero, Lander Lejarza, Iker Muniategui, Aitor Endemaño, Fernando BrizComments: 16 pages, 21 figuresJournal-ref: IEEE Transactions on Transportation Electrification, vol. 10, no. 2, pp. 3645-3660, June 2024Subjects: Systems and Control (eess.SY)
Modern railway traction systems are often equipped with anti-slip control strategies to comply with performance and safety requirements. A certain amount of slip is needed to increase the torque transferred by the traction motors onto the rail. Commonly, constant slip control is used to limit the slip velocity between the wheel and rail avoiding excessive slippage and vehicle derailment. This is at the price of not fully utilizing the train's traction and braking capabilities. Finding the slip at which maximum traction force occurs is challenging due to the non-linear relationship between slip and wheel-rail adhesion coefficient, as well as to its dependence on rail and wheel conditions. Perturb and observe (P\&O) and steepest gradient (SG) methods have been reported for the Maximum Adhesion Tracking (MAT) search. However, both methods exhibit weaknesses. Two new MAT strategies are proposed in this paper which overcome the limitations of existing methods, using Fuzzy Logic Controller (FLC) and Particle Swarm Optimization (PSO) respectively. Existing and proposed methods are first simulated and further validated experimentally using a scaled roller rig under identical conditions. The results show that the proposed methods improve the traction capability with lower searching time and oscillations compared to existing solutions. Tuning complexity and computational requirements will also be shown to be favorable to the proposed methods.
- [34] arXiv:2406.13651 [pdf, html, other]
-
Title: CLAMP: Majorized Plug-and-Play for Coherent 3D LIDAR ImagingSubjects: Image and Video Processing (eess.IV)
Coherent LIDAR uses a chirped laser pulse for 3D imaging of distant targets. However, existing coherent LIDAR image reconstruction methods do not account for the system's aperture, resulting in sub-optimal resolution. Moreover, these methods use majorization-minimization for computational efficiency, but do so without a theoretical treatment of convergence.
In this paper, we present Coherent LIDAR Aperture Modeled Plug-and-Play (CLAMP) for multi-look coherent LIDAR image reconstruction. CLAMP uses multi-agent consensus equilibrium (a form of PnP) to combine a neural network denoiser with an accurate physics-based forward model. CLAMP introduces an FFT-based method to account for the effects of the aperture and uses majorization of the forward model for computational efficiency. We also formalize the use of majorization-minimization in consensus optimization problems and prove convergence to the exact consensus equilibrium solution. Finally, we apply CLAMP to synthetic and measured data to demonstrate its effectiveness in producing high-resolution, speckle-free, 3D imagery. - [35] arXiv:2406.13674 [pdf, html, other]
-
Title: Rethinking Abdominal Organ Segmentation (RAOS) in the clinical scenario: A robustness evaluation benchmark with challenging casesComments: 10 pages, 1 figure, 6 tables, Early Accept to MICCAI 2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Deep learning has enabled great strides in abdominal multi-organ segmentation, even surpassing junior oncologists on common cases or organs. However, robustness on corner cases and complex organs remains a challenging open problem for clinical adoption. To investigate model robustness, we collected and annotated the RAOS dataset comprising 413 CT scans ($\sim$80k 2D images, $\sim$8k 3D organ annotations) from 413 patients each with 17 (female) or 19 (male) labelled organs, manually delineated by oncologists. We grouped scans based on clinical information into 1) diagnosis/radiotherapy (317 volumes), 2) partial excision without the whole organ missing (22 volumes), and 3) excision with the whole organ missing (74 volumes). RAOS provides a potential benchmark for evaluating model robustness including organ hallucination. It also includes some organs that can be very hard to access on public datasets like the rectum, colon, intestine, prostate and seminal vesicles. We benchmarked several state-of-the-art methods in these three clinical groups to evaluate performance and robustness. We also assessed cross-generalization between RAOS and three public datasets. This dataset and comprehensive analysis establish a potential baseline for future robustness research: \url{this https URL}.
- [36] arXiv:2406.13705 [pdf, html, other]
-
Title: EndoUIC: Promptable Diffusion Transformer for Unified Illumination Correction in Capsule EndoscopyLong Bai, Qiaozhi Tan, Tong Chen, Wan Jun Nah, Yanheng Li, Zhicheng He, Sishen Yuan, Zhen Chen, Jinlin Wu, Mobarakol Islam, Zhen Li, Hongbin Liu, Hongliang RenComments: To appear in MICCAI 2024. Code and dataset availability: this https URLSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Wireless Capsule Endoscopy (WCE) is highly valued for its non-invasive and painless approach, though its effectiveness is compromised by uneven illumination from hardware constraints and complex internal dynamics, leading to overexposed or underexposed images. While researchers have discussed the challenges of low-light enhancement in WCE, the issue of correcting for different exposure levels remains underexplored. To tackle this, we introduce EndoUIC, a WCE unified illumination correction solution using an end-to-end promptable diffusion transformer (DFT) model. In our work, the illumination prompt module shall navigate the model to adapt to different exposure levels and perform targeted image enhancement, in which the Adaptive Prompt Integration (API) and Global Prompt Scanner (GPS) modules shall further boost the concurrent representation learning between the prompt parameters and features. Besides, the U-shaped restoration DFT model shall capture the long-range dependencies and contextual information for unified illumination restoration. Moreover, we present a novel Capsule-endoscopy Exposure Correction (CEC) dataset, including ground-truth and corrupted image pairs annotated by expert photographers. Extensive experiments against a variety of state-of-the-art (SOTA) methods on four datasets showcase the effectiveness of our proposed method and components in WCE illumination restoration, and the additional downstream experiments further demonstrate its utility for clinical diagnosis and surgical assistance.
- [37] arXiv:2406.13707 [pdf, html, other]
-
Title: Safety-Critical Formation Control of Non-Holonomic Multi-Robot Systems in Communication-Limited EnvironmentsComments: Under reviewSubjects: Systems and Control (eess.SY); Robotics (cs.RO)
This paper presents a robust estimator-based safety-critical controller for formation control of non-holonomic mobile robots in communication-limited environments. The proposed decentralized framework integrates a robust state estimator with a formation tracking control law that guarantees inter-agent collision avoidance using control barrier functions. String stability is incorporated into the control design to maintain stability against noise from predecessors in leader-follower formations. Rigorous stability analysis using Lyapunov functions ensures the stability of estimation errors and the convergence of the formation to desired configurations. The effectiveness and robustness of the proposed approach are validated through numerical simulations of various maneuvers and realistic Gazebo experiments involving formations in a warehouse environment. The results demonstrate the controller's ability to maintain safety, achieve precise formation control, and mitigate disturbances in scenarios without inter-robot communication.
- [38] arXiv:2406.13708 [pdf, other]
-
Title: Low-rank based motion correction followed by automatic frame selection in DT-CMRFanwen Wang, Pedro F.Ferreira, Camila Munoz, Ke Wen, Yaqing Luo, Jiahao Huang, Yinzhe Wu, Dudley J.Pennell, Andrew D. Scott, Sonia Nielles-Vallespin, Guang YangComments: Accepted as ISMRM 2024 Digital poster 2141Journal-ref: ISMRM 2024 Digital poster 2141Subjects: Image and Video Processing (eess.IV); Medical Physics (physics.med-ph)
Motivation: Post-processing of in-vivo diffusion tensor CMR (DT-CMR) is challenging due to the low SNR and variation in contrast between frames which makes image registration difficult, and the need to manually reject frames corrupted by motion. Goals: To develop a semi-automatic post-processing pipeline for robust DT-CMR registration and automatic frame selection. Approach: We used low intrinsic rank averaged frames as the reference to register other low-ranked frames. A myocardium-guided frame selection rejected the frames with signal loss, through-plane motion and poor registration. Results: The proposed method outperformed our previous noise-robust rigid registration on helix angle data quality and reduced negative eigenvalues in healthy volunteers.
- [39] arXiv:2406.13709 [pdf, html, other]
-
Title: A Study on the Effect of Color Spaces in Learned Image CompressionSrivatsa Prativadibhayankaram, Mahadev Prasad Panda, Jürgen Seiler, Thomas Richter, Heiko Sparenberg, Siegfried Fößel, André KaupComments: Accepter pre-print version for ICIP 2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
In this work, we present a comparison between color spaces namely YUV, LAB, RGB and their effect on learned image compression. For this we use the structure and color based learned image codec (SLIC) from our prior work, which consists of two branches - one for the luminance component (Y or L) and another for chrominance components (UV or AB). However, for the RGB variant we input all 3 channels in a single branch, similar to most learned image codecs operating in RGB. The models are trained for multiple bitrate configurations in each color space. We report the findings from our experiments by evaluating them on various datasets and compare the results to state-of-the-art image codecs. The YUV model performs better than the LAB variant in terms of MS-SSIM with a Bjøntegaard delta bitrate (BD-BR) gain of 7.5\% using VTM intra-coding mode as the baseline. Whereas the LAB variant has a better performance than YUV model in terms of CIEDE2000 having a BD-BR gain of 8\%. Overall, the RGB variant of SLIC achieves the best performance with a BD-BR gain of 13.14\% in terms of MS-SSIM and a gain of 17.96\% in CIEDE2000 at the cost of a higher model complexity.
- [40] arXiv:2406.13750 [pdf, html, other]
-
Title: Empowering Tuberculosis Screening with Explainable Self-Supervised Deep Neural NetworksComments: 9 pages, 3 figuresSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Tuberculosis persists as a global health crisis, especially in resource-limited populations and remote regions, with more than 10 million individuals newly infected annually. It stands as a stark symbol of inequity in public health. Tuberculosis impacts roughly a quarter of the global populace, with the majority of cases concentrated in eight countries, accounting for two-thirds of all tuberculosis infections. Although a severe ailment, tuberculosis is both curable and manageable. However, early detection and screening of at-risk populations are imperative. Chest x-ray stands as the predominant imaging technique utilized in tuberculosis screening efforts. However, x-ray screening necessitates skilled radiologists, a resource often scarce, particularly in remote regions with limited resources. Consequently, there is a pressing need for artificial intelligence (AI)-powered systems to support clinicians and healthcare providers in swift screening. However, training a reliable AI model necessitates large-scale high-quality data, which can be difficult and costly to acquire. Inspired by these challenges, in this work, we introduce an explainable self-supervised self-train learning network tailored for tuberculosis case screening. The network achieves an outstanding overall accuracy of 98.14% and demonstrates high recall and precision rates of 95.72% and 99.44%, respectively, in identifying tuberculosis cases, effectively capturing clinically significant features.
- [41] arXiv:2406.13752 [pdf, html, other]
-
Title: COAC: Cross-layer Optimization of Accelerator Configurability for Efficient CNN ProcessingComments: 14 pages,17 figures.Journal IEEE Transactions on Very Large Scale Integration (VLSI) SystemsJournal-ref: in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 31, no. 7, pp. 945-958, July 2023Subjects: Systems and Control (eess.SY)
To achieve high accuracy, convolutional neural networks (CNNs) are increasingly growing in complexity and diversity in layer types and topologies. This makes it very challenging to efficiently deploy such networks on custom processor architectures for resource-scarce edge devices. Existing mapping exploration frameworks enable searching for the optimal execution schedules or hardware mappings of individual network layers, by optimizing each layer's spatial (dataflow parallelization) and temporal unrolling (execution order). However, these tools fail to take into account the overhead of supporting different unrolling schemes within a common hardware architecture. Using a fixed unrolling scheme across all layers is also not ideal, as this misses significant opportunities for energy and latency savings from optimizing the mapping of diverse layer types. A balanced approach assesses the right amount of mapping flexibility needed across target neural networks, while taking into account the overhead to support multiple unrollings. This paper, therefore, presents COAC, a cross-layer design space exploration and mapping framework to optimize the flexibility of neural processing architectures by balancing configurability overhead against resulting energy and latency savings for end-to-end inference. COAC does not only provide a systematical analysis of the architectural overhead in function of the supported spatial unrollings, but also builds an automated flow to find the best unrolling combination(s) for efficient end-to-end inference with limited hardware overhead. Results demonstrate that architectures with carefully optimized flexibility can achieve up to 38% EDP (energy-delay-product) savings for a set of six neural networks at the expense of a relative area increase of 9.5%.
- [42] arXiv:2406.13788 [pdf, html, other]
-
Title: Groupwise Deformable Registration of Diffusion Tensor Cardiovascular Magnetic Resonance: Disentangling Diffusion Contrast, Respiratory and Cardiac MotionsFanwen Wang, Yihao Luo, Ke Wen, Jiahao Huang, Pedro F.Ferreira, Yaqing Luo, Yinzhe Wu, Camila Munoz, Dudley J.Pennell, Andrew D. Scott, Sonia Nielles-Vallespin, Guang YangComments: Accepted by MICCAI 2024Subjects: Signal Processing (eess.SP)
Diffusion tensor based cardiovascular magnetic resonance (DT-CMR) offers a non-invasive method to visualize the myocardial microstructure. With the assumption that the heart is stationary, frames are acquired with multiple repetitions for different diffusion encoding directions. However, motion from poor breath-holding and imprecise cardiac triggering complicates DT-CMR analysis, further challenged by its inherently low SNR, varied contrasts, and diffusion-induced textures. Our solution is a novel framework employing groupwise registration with an implicit template to isolate respiratory and cardiac motions, while a tensor-embedded branch preserves diffusion contrast textures. We've devised a loss refinement tailored for non-linear least squares fitting and low SNR conditions. Additionally, we introduce new physics-based and clinical metrics for performance evaluation. Access code and supplementary materials at: this https URL
- [43] arXiv:2406.13794 [pdf, html, other]
-
Title: Adaptive Curves for Optimally Efficient Market MakingSubjects: Systems and Control (eess.SY); Computational Engineering, Finance, and Science (cs.CE); Trading and Market Microstructure (q-fin.TR)
Automated Market Makers (AMMs) are essential in Decentralized Finance (DeFi) as they match liquidity supply with demand. They function through liquidity providers (LPs) who deposit assets into liquidity pools. However, the asset trading prices in these pools often trail behind those in more dynamic, centralized exchanges, leading to potential arbitrage losses for LPs. This issue is tackled by adapting market maker bonding curves to trader behavior, based on the classical market microstructure model of Glosten and Milgrom. Our approach ensures a zero-profit condition for the market maker's prices. We derive the differential equation that an optimal adaptive curve should follow to minimize arbitrage losses while remaining competitive. Solutions to this optimality equation are obtained for standard Gaussian and Lognormal price models using Kalman filtering. A key feature of our method is its ability to estimate the external market price without relying on price or loss oracles. We also provide an equivalent differential equation for the implied dynamics of canonical static bonding curves and establish conditions for their optimality. Our algorithms demonstrate robustness to changing market conditions and adversarial perturbations, and we offer an on-chain implementation using Uniswap v4 alongside off-chain AI co-processors.
- [44] arXiv:2406.13815 [pdf, other]
-
Title: IG-CFAT: An Improved GAN-Based Framework for Effectively Exploiting Transformers in Real-World Image Super-ResolutionSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
In the field of single image super-resolution (SISR), transformer-based models, have demonstrated significant advancements. However, the potential and efficiency of these models in applied fields such as real-world image super-resolution are less noticed and there are substantial opportunities for improvement. Recently, composite fusion attention transformer (CFAT), outperformed previous state-of-the-art (SOTA) models in classic image super-resolution. This paper extends the CFAT model to an improved GAN-based model called IG-CFAT to effectively exploit the performance of transformers in real-world image super-resolution. IG-CFAT incorporates a semantic-aware discriminator to reconstruct image details more accurately, significantly improving perceptual quality. Moreover, our model utilizes an adaptive degradation model to better simulate real-world degradations. Our methodology adds wavelet losses to conventional loss functions of GAN-based super-resolution models to reconstruct high-frequency details more efficiently. Empirical results demonstrate that IG-CFAT sets new benchmarks in real-world image super-resolution, outperforming SOTA models in both quantitative and qualitative metrics.
- [45] arXiv:2406.13817 [pdf, html, other]
-
Title: SkyGrid: Energy-Flow Optimization at Harmonized Aerial IntersectionsComments: 8 pages, 12 figures - Submitted to IEEE VTC Fall 2024 - Under ReviewSubjects: Systems and Control (eess.SY)
The rapid evolution of urban air mobility (UAM) is reshaping the future of transportation by integrating aerial vehicles into urban transit systems. The design of aerial intersections plays a critical role in the phased development of UAM systems to ensure safe and efficient operations in air corridors. This work adapts the concept of rhythmic control of connected and automated vehicles (CAVs) at unsignalized intersections to address complex traffic control problems. This control framework assigns UAM vehicles to different movement groups and significantly reduces the computation of routing strategies to avoid conflicts. In contrast to ground traffic, the objective is to balance three measures: minimizing energy utilization, maximizing intersection flow (throughput), and maintaining safety distances. This optimization method dynamically directs traffic with various demands, considering path assignment distributions and segment-level trajectory coefficients for straight and curved paths as control variables. To the best of our knowledge, this is the first work to consider a multi-objective optimization approach for unsignalized intersection control in the air and to propose such optimization in a rhythmic control setting with time arrival and UAM operational constraints. A sensitivity analysis with respect to inter-platoon safety and straight/left demand balance demonstrates the effectiveness of our method in handling traffic under various scenarios.
- [46] arXiv:2406.13895 [pdf, html, other]
-
Title: INFusion: Diffusion Regularized Implicit Neural Representations for 2D and 3D accelerated MRI reconstructionComments: 6 pages, 4 figures, asilomar 2024 submissionSubjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Implicit Neural Representations (INRs) are a learning-based approach to accelerate Magnetic Resonance Imaging (MRI) acquisitions, particularly in scan-specific settings when only data from the under-sampled scan itself are available. Previous work demonstrates that INRs improve rapid MRI through inherent regularization imposed by neural network architectures. Typically parameterized by fully-connected neural networks, INRs support continuous image representations by taking a physical coordinate location as input and outputting the intensity at that coordinate. Previous work has applied unlearned regularization priors during INR training and have been limited to 2D or low-resolution 3D acquisitions. Meanwhile, diffusion based generative models have received recent attention as they learn powerful image priors decoupled from the measurement model. This work proposes INFusion, a technique that regularizes the optimization of INRs from under-sampled MR measurements with pre-trained diffusion models for improved image reconstruction. In addition, we propose a hybrid 3D approach with our diffusion regularization that enables INR application on large-scale 3D MR datasets. 2D experiments demonstrate improved INR training with our proposed diffusion regularization, and 3D experiments demonstrate feasibility of INR training with diffusion regularization on 3D matrix sizes of 256 by 256 by 80.
- [47] arXiv:2406.13935 [pdf, html, other]
-
Title: CONMOD: Controllable Neural Frame-based Modulation EffectsSubjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Deep learning models have seen widespread use in modelling LFO-driven audio effects, such as phaser and flanger. Although existing neural architectures exhibit high-quality emulation of individual effects, they do not possess the capability to manipulate the output via control parameters. To address this issue, we introduce Controllable Neural Frame-based Modulation Effects (CONMOD), a single black-box model which emulates various LFO-driven effects in a frame-wise manner, offering control over LFO frequency and feedback parameters. Additionally, the model is capable of learning the continuous embedding space of two distinct phaser effects, enabling us to steer between effects and achieve creative outputs. Our model outperforms previous work while possessing both controllability and universality, presenting opportunities to enhance creativity in modern LFO-driven audio effects.
- [48] arXiv:2406.13977 [pdf, html, other]
-
Title: Similarity-aware Syncretic Latent Diffusion Model for Medical Image Translation with Representation LearningSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Non-contrast CT (NCCT) imaging may reduce image contrast and anatomical visibility, potentially increasing diagnostic uncertainty. In contrast, contrast-enhanced CT (CECT) facilitates the observation of regions of interest (ROI). Leading generative models, especially the conditional diffusion model, demonstrate remarkable capabilities in medical image modality transformation. Typical conditional diffusion models commonly generate images with guidance of segmentation labels for medical modal transformation. Limited access to authentic guidance and its low cardinality can pose challenges to the practical clinical application of conditional diffusion models. To achieve an equilibrium of generative quality and clinical practices, we propose a novel Syncretic generative model based on the latent diffusion model for medical image translation (S$^2$LDM), which can realize high-fidelity reconstruction without demand of additional condition during inference. S$^2$LDM enhances the similarity in distinct modal images via syncretic encoding and diffusing, promoting amalgamated information in the latent space and generating medical images with more details in contrast-enhanced regions. However, syncretic latent spaces in the frequency domain tend to favor lower frequencies, commonly locate in identical anatomic structures. Thus, S$^2$LDM applies adaptive similarity loss and dynamic similarity to guide the generation and supplements the shortfall in high-frequency details throughout the training process. Quantitative experiments confirm the effectiveness of our approach in medical image translation. Our code will release lately.
- [49] arXiv:2406.13979 [pdf, html, other]
-
Title: Knowledge-driven Subspace Fusion and Gradient Coordination for Multi-modal LearningSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Multi-modal learning plays a crucial role in cancer diagnosis and prognosis. Current deep learning based multi-modal approaches are often limited by their abilities to model the complex correlations between genomics and histology data, addressing the intrinsic complexity of tumour ecosystem where both tumour and microenvironment contribute to malignancy. We propose a biologically interpretative and robust multi-modal learning framework to efficiently integrate histology images and genomics by decomposing the feature subspace of histology images and genomics, reflecting distinct tumour and microenvironment features. To enhance cross-modal interactions, we design a knowledge-driven subspace fusion scheme, consisting of a cross-modal deformable attention module and a gene-guided consistency strategy. Additionally, in pursuit of dynamically optimizing the subspace knowledge, we further propose a novel gradient coordination learning strategy. Extensive experiments demonstrate the effectiveness of the proposed method, outperforming state-of-the-art techniques in three downstream tasks of glioma diagnosis, tumour grading, and survival analysis. Our code is available at this https URL.
- [50] arXiv:2406.14028 [pdf, html, other]
-
Title: Reliable State Estimation in a Truck-Semitrailer Combination using an Artificial Neural Network-Aided Extended Kalman FilterComments: 8 pages, 3 figures, accepted for publication at 2024 European Control Conference (ECC)Subjects: Systems and Control (eess.SY)
Advanced driver assistance systems are critically dependent on reliable and accurate information regarding a vehicles' driving state. For estimation of unknown quantities, model-based and learning-based methods exist, but both suffer from individual limitations. On the one hand, model-based estimation performance is often limited by the models' accuracy. On the other hand, learning-based estimators usually do not perform well in "unknown" conditions (bad generalization), which is particularly critical for semitrailers as their payload changes significantly in operation. To the best of the authors' knowledge, this work is the first to analyze the capability of state-of-the-art estimators for semitrailers to generalize across "unknown" loading states. Moreover, a novel hybrid Extended Kalman Filter (H-EKF) that takes advantage of accurate Artificial Neural Network (ANN) estimates while preserving reliable generalization capability is presented. It estimates the articulation angle between truck and semitrailer, lateral tire forces and the truck steering angle utilizing sensor data of a standard semitrailer only. An experimental comparison based on a full-scale truck-semitrailer combination indicates the superiority of the H-EKF compared to a state-of-the-art extended Kalman filter and an ANN estimator.
- [51] arXiv:2406.14052 [pdf, html, other]
-
Title: Perspective+ Unet: Enhancing Segmentation with Bi-Path Fusion and Efficient Non-Local Attention for Superior Receptive FieldsComments: 13 pages, 5 figuresSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Precise segmentation of medical images is fundamental for extracting critical clinical information, which plays a pivotal role in enhancing the accuracy of diagnoses, formulating effective treatment plans, and improving patient outcomes. Although Convolutional Neural Networks (CNNs) and non-local attention methods have achieved notable success in medical image segmentation, they either struggle to capture long-range spatial dependencies due to their reliance on local features, or face significant computational and feature integration challenges when attempting to address this issue with global attention mechanisms. To overcome existing limitations in medical image segmentation, we propose a novel architecture, Perspective+ Unet. This framework is characterized by three major innovations: (i) It introduces a dual-pathway strategy at the encoder stage that combines the outcomes of traditional and dilated convolutions. This not only maintains the local receptive field but also significantly expands it, enabling better comprehension of the global structure of images while retaining detail sensitivity. (ii) The framework incorporates an efficient non-local transformer block, named ENLTB, which utilizes kernel function approximation for effective long-range dependency capture with linear computational and spatial complexity. (iii) A Spatial Cross-Scale Integrator strategy is employed to merge global dependencies and local contextual cues across model stages, meticulously refining features from various levels to harmonize global and local information. Experimental results on the ACDC and Synapse datasets demonstrate the effectiveness of our proposed Perspective+ Unet. The code is available in the supplementary material.
- [52] arXiv:2406.14069 [pdf, html, other]
-
Title: Towards Multi-modality Fusion and Prototype-based Feature Refinement for Clinically Significant Prostate Cancer Classification in Transrectal UltrasoundSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Prostate cancer is a highly prevalent cancer and ranks as the second leading cause of cancer-related deaths in men globally. Recently, the utilization of multi-modality transrectal ultrasound (TRUS) has gained significant traction as a valuable technique for guiding prostate biopsies. In this study, we propose a novel learning framework for clinically significant prostate cancer (csPCa) classification using multi-modality TRUS. The proposed framework employs two separate 3D ResNet-50 to extract distinctive features from B-mode and shear wave elastography (SWE). Additionally, an attention module is incorporated to effectively refine B-mode features and aggregate the extracted features from both modalities. Furthermore, we utilize few shot segmentation task to enhance the capacity of classification encoder. Due to the limited availability of csPCa masks, a prototype correction module is employed to extract representative prototypes of csPCa. The performance of the framework is assessed on a large-scale dataset consisting of 512 TRUS videos with biopsy-proved prostate cancer. The results demonstrate the strong capability in accurately identifying csPCa, achieving an area under the curve (AUC) of 0.86. Moreover, the framework generates visual class activation mapping (CAM), which can serve as valuable assistance for localizing csPCa. These CAM images may offer valuable guidance during TRUS-guided targeted biopsies, enhancing the efficacy of the biopsy procedure.The code is available at this https URL.
- [53] arXiv:2406.14107 [pdf, html, other]
-
Title: Efficient Transmission Scheme for LEO Satellite-Based NB-IoT: A Data-Driven PerspectiveSubjects: Signal Processing (eess.SP)
This study analyses the medium access control (MAC) layer aspects of a low-Earth-orbit (LEO) satellite-based Internet of Things (IoT) network. A transmission scheme based on change detection is proposed to accommodate more users within the network and improve energy efficiency. Machine learning (ML) algorithms are also proposed to reduce the payload size by leveraging the correlation among the sensed parameters. Real-world data from an IoT testbed deployed for a smart city application is utilised to analyse the performance regarding collision probability, effective data received and average battery lifetime. The findings reveal that the traffic pattern, post-implementation of the proposed scheme, differs from the commonly assumed Poisson traffic, thus proving the effectiveness of having IoT data from actual deployment. It is demonstrated that the transmission scheme facilitates accommodating more devices while targeting a specific collision probability. Considering the link budget for a direct access NB-IoT scenario, more data is effectively offloaded to the server within the limited visibility of LEO satellites. The average battery lifetimes are also demonstrated to increase by many folds by using the proposed access schemes and ML algorithms.
- [54] arXiv:2406.14116 [pdf, html, other]
-
Title: Efficient Design and Implementation of Fast-Convolution-Based Variable-Bandwidth FiltersSubjects: Signal Processing (eess.SP)
This paper introduces an efficient design approach for a fast-convolution-based variable-bandwidth (VBW) filter. The proposed approach is based on a hybrid of frequency sampling and optimization (HFSO), that offers significant computational complexity reduction compared to existing solutions for a given performance. The paper provides a design procedure based on minimax optimization to obtain the minimum complexity of the overall filter. A design example includes a comparison of the proposed design-based VBW filter and time-domain designed VBW filters implemented in the time domain and in the frequency domain. It is shown that not only the implementation complexity can be reduced but also the design complexity by excluding any computations when the bandwidth of the filter is adjusted. Moreover, memory requirements are also decreased compared to the existing frequency-domain implementations.
- [55] arXiv:2406.14118 [pdf, html, other]
-
Title: Prediction and Reference Quality Adaptation for Learned Video CompressionSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Temporal prediction is one of the most important technologies for video compression. Various prediction coding modes are designed in traditional video codecs. Traditional video codecs will adaptively to decide the optimal coding mode according to the prediction quality and reference quality. Recently, learned video codecs have made great progress. However, they ignore the prediction and reference quality adaptation, which leads to incorrect utilization of temporal prediction and reconstruction error propagation. Therefore, in this paper, we first propose a confidence-based prediction quality adaptation (PQA) module to provide explicit discrimination for the spatial and channel-wise prediction quality difference. With this module, the prediction with low quality will be suppressed and that with high quality will be enhanced. The codec can adaptively decide which spatial or channel location of predictions to use. Then, we further propose a reference quality adaptation (RQA) module and an associated repeat-long training strategy to provide dynamic spatially variant filters for diverse reference qualities. With the filters, it is easier for our codec to achieve the target reconstruction quality according to reference qualities, thus reducing the propagation of reconstruction errors. Experimental results show that our codec obtains higher compression performance than the reference software of H.266/VVC and the previous state-of-the-art learned video codecs in both RGB and YUV420 colorspaces.
- [56] arXiv:2406.14126 [pdf, other]
-
Title: Joint Optimization of Switching Point and Power Control in Dynamic TDD Cell-Free Massive MIMOComments: Presented at the Asilomar Conference on Signals, Systems, and Computers 2023Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)
We consider a cell-free massive multiple-input multiple-output (CFmMIMO) network operating in dynamic time division duplex (DTDD). The switching point between the uplink (UL) and downlink (DL) data transmission phases can be adapted dynamically to the instantaneous quality-of-service (QoS) requirements in order to improve energy efficiency (EE). To this end, we formulate a problem of optimizing the DTDD switching point jointly with the UL and DL power control coefficients, and the large-scale fading decoding (LSFD) weights for EE maximization. Then, we propose an iterative algorithm to solve the formulated challenging problem using successive convex approximation with an approximate stationary solution. Simulation results show that optimizing switching points remarkably improves EE compared with baseline schemes that adjust switching points heuristically.
- [57] arXiv:2406.14141 [pdf, html, other]
-
Title: Online Learning of Weakly Coupled MDP Policies for Load Balancing and Auto ScalingSubjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Load balancing and auto scaling are at the core of scalable, contemporary systems, addressing dynamic resource allocation and service rate adjustments in response to workload changes. This paper introduces a novel model and algorithms for tuning load balancers coupled with auto scalers, considering bursty traffic arriving at finite queues. We begin by presenting the problem as a weakly coupled Markov Decision Processes (MDP), solvable via a linear program (LP). However, as the number of control variables of such LP grows combinatorially, we introduce a more tractable relaxed LP formulation, and extend it to tackle the problem of online parameter learning and policy optimization using a two-timescale algorithm based on the LP Lagrangian.
- [58] arXiv:2406.14179 [pdf, html, other]
-
Title: Single Channel-based Motor Imagery Classification using Fisher's Ratio and Pearson CorrelationSubjects: Signal Processing (eess.SP)
Motor imagery-based BCI systems have been promising and gaining popularity in rehabilitation and Activities of daily life(ADL). Despite this, the technology is still emerging and has not yet been outside the laboratory constraints. Channel reduction is one contributing avenue to make these systems part of ADL. Although Motor Imagery classification heavily depends on spatial factors, single channel-based classification remains an avenue to be explored thoroughly. Since Fisher's ratio and Pearson Correlation are powerful measures actively used in the domain, we propose an integrated framework (FRPC integrated framework) that integrates Fisher's Ratio to select the best channel and Pearson correlation to select optimal filter banks and extract spectral and temporal features respectively. The framework is tested for a 2-class motor imagery classification on 2 open-source datasets and 1 collected dataset and compared with state-of-art work. Apart from implementing the framework, this study also explores the most optimal channel among all the subjects and later explores classes where the single-channel framework is efficient.
- [59] arXiv:2406.14186 [pdf, html, other]
-
Title: CriDiff: Criss-cross Injection Diffusion Framework via Generative Pre-train for Prostate SegmentationComments: Accepted in MICCAI 2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Recently, the Diffusion Probabilistic Model (DPM)-based methods have achieved substantial success in the field of medical image segmentation. However, most of these methods fail to enable the diffusion model to learn edge features and non-edge features effectively and to inject them efficiently into the diffusion backbone. Additionally, the domain gap between the images features and the diffusion model features poses a great challenge to prostate segmentation. In this paper, we proposed CriDiff, a two-stage feature injecting framework with a Crisscross Injection Strategy (CIS) and a Generative Pre-train (GP) approach for prostate segmentation. The CIS maximizes the use of multi-level features by efficiently harnessing the complementarity of high and low-level features. To effectively learn multi-level of edge features and non-edge features, we proposed two parallel conditioners in the CIS: the Boundary Enhance Conditioner (BEC) and the Core Enhance Conditioner (CEC), which discriminatively model the image edge regions and non-edge regions, respectively. Moreover, the GP approach eases the inconsistency between the images features and the diffusion model without adding additional parameters. Extensive experiments on four benchmark datasets demonstrate the effectiveness of the proposed method and achieve state-of-the-art performance on four evaluation metrics.
- [60] arXiv:2406.14210 [pdf, html, other]
-
Title: Self-Supervised Pretext Tasks for Alzheimer's Disease Classification using 3D Convolutional Neural Networks on Large-Scale Synthetic Neuroimaging DatasetSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Structural magnetic resonance imaging (MRI) studies have shown that Alzheimer's Disease (AD) induces both localised and widespread neural degenerative changes throughout the brain. However, the absence of segmentation that highlights brain degenerative changes presents unique challenges for training CNN-based classifiers in a supervised fashion. In this work, we evaluated several unsupervised methods to train a feature extractor for downstream AD vs. CN classification. Using the 3D T1-weighted MRI data of cognitive normal (CN) subjects from the synthetic neuroimaging LDM100K dataset, lightweight 3D CNN-based models are trained for brain age prediction, brain image rotation classification, brain image reconstruction and a multi-head task combining all three tasks into one. Feature extractors trained on the LDM100K synthetic dataset achieved similar performance compared to the same model using real-world data. This supports the feasibility of utilising large-scale synthetic data for pretext task training. All the training and testing splits are performed on the subject-level to prevent data leakage issues. Alongside the simple preprocessing steps, the random cropping data augmentation technique shows consistent improvement across all experiments.
- [61] arXiv:2406.14251 [pdf, html, other]
-
Title: Enhanced Optimal Power Flow Based Droop Control in MMC-MTDC SystemsSubjects: Systems and Control (eess.SY)
Optimizing operational set points for modular multilevel converters (MMCs) in Multi-Terminal Direct Current (MTDC) transmission systems is crucial for ensuring efficient power distribution and control. This paper presents an enhanced Optimal Power Flow (OPF) model for MMC-MTDC systems, integrating a novel adaptive voltage droop control strategy. The strategy aims to minimize generation costs and DC voltage deviations while ensuring the stable operation of the MTDC grid by dynamically adjusting the system operation points. The modified Nordic 32 test system with an embedded 4-terminal DC grid is modeled in Julia and the proposed control strategy is applied to the power model. The results demonstrate the feasibility and effectiveness of the proposed droop control strategy, affirming its potential value in enhancing the performance and reliability of hybrid AC-DC power systems.
- [62] arXiv:2406.14264 [pdf, html, other]
-
Title: Zero-Shot Image Denoising for High-Resolution Electron MicroscopyComments: 12 pages, 12 figuresSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
High-resolution electron microscopy (HREM) imaging technique is a powerful tool for directly visualizing a broad range of materials in real-space. However, it faces challenges in denoising due to ultra-low signal-to-noise ratio (SNR) and scarce data availability. In this work, we propose Noise2SR, a zero-shot self-supervised learning (ZS-SSL) denoising framework for HREM. Within our framework, we propose a super-resolution (SR) based self-supervised training strategy, incorporating the Random Sub-sampler module. The Random Sub-sampler is designed to generate approximate infinite noisy pairs from a single noisy image, serving as an effective data augmentation in zero-shot denoising. Noise2SR trains the network with paired noisy images of different resolutions, which is conducted via SR strategy. The SR-based training facilitates the network adopting more pixels for supervision, and the random sub-sampling helps compel the network to learn continuous signals enhancing the robustness. Meanwhile, we mitigate the uncertainty caused by random-sampling by adopting minimum mean squared error (MMSE) estimation for the denoised results. With the distinctive integration of training strategy and proposed designs, Noise2SR can achieve superior denoising performance using a single noisy HREM image. We evaluate the performance of Noise2SR in both simulated and real HREM denoising tasks. It outperforms state-of-the-art ZS-SSL methods and achieves comparable denoising performance with supervised methods. The success of Noise2SR suggests its potential for improving the SNR of images in material imaging domains.
- [63] arXiv:2406.14287 [pdf, html, other]
-
Title: Segmentation of Non-Small Cell Lung Carcinomas: Introducing DRU-Net and Multi-Lens DistortionSoroush Oskouei, Marit Valla, André Pedersen, Erik Smistad, Vibeke Grotnes Dale, Maren Høibø, Sissel Gyrid Freim Wahl, Mats Dehli Haugum, Thomas Langø, Maria Paula Ramnefjell, Lars Andreas Akslen, Gabriel Kiss, Hanne SorgerComments: 16 pages, 7 figures, submitted to Scientific ReportsSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Considering the increased workload in pathology laboratories today, automated tools such as artificial intelligence models can help pathologists with their tasks and ease the workload. In this paper, we are proposing a segmentation model (DRU-Net) that can provide a delineation of human non-small cell lung carcinomas and an augmentation method that can improve classification results. The proposed model is a fused combination of truncated pre-trained DenseNet201 and ResNet101V2 as a patch-wise classifier followed by a lightweight U-Net as a refinement model. We have used two datasets (Norwegian Lung Cancer Biobank and Haukeland University Hospital lung cancer cohort) to create our proposed model. The DRU-Net model achieves an average of 0.91 Dice similarity coefficient. The proposed spatial augmentation method (multi-lens distortion) improved the network performance by 3%. Our findings show that choosing image patches that specifically include regions of interest leads to better results for the patch-wise classifier compared to other sampling methods. The qualitative analysis showed that the DRU-Net model is generally successful in detecting the tumor. On the test set, some of the cases showed areas of false positive and false negative segmentation in the periphery, particularly in tumors with inflammatory and reactive changes.
- [64] arXiv:2406.14301 [pdf, html, other]
-
Title: Resource Optimization for Tail-Based Control in Wireless Networked Control SystemsComments: Accepted in PIMRC 2024 conference, 6 pages, 5 figuresSubjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Achieving control stability is one of the key design challenges of scalable Wireless Networked Control Systems (WNCS) under limited communication and computing resources. This paper explores the use of an alternative control concept defined as tail-based control, which extends the classical Linear Quadratic Regulator (LQR) cost function for multiple dynamic control systems over a shared wireless network. We cast the control of multiple control systems as a network-wide optimization problem and decouple it in terms of sensor scheduling, plant state prediction, and control policies. Toward this, we propose a solution consisting of a scheduling algorithm based on Lyapunov optimization for sensing, a mechanism based on Gaussian Process Regression (GPR) for state prediction and uncertainty estimation, and a control policy based on Reinforcement Learning (RL) to ensure tail-based control stability. A set of discrete time-invariant mountain car control systems is used to evaluate the proposed solution and is compared against four variants that use state-of-the-art scheduling, prediction, and control methods. The experimental results indicate that the proposed method yields 22% reduction in overall cost in terms of communication and control resource utilization compared to state-of-the-art methods.
- [65] arXiv:2406.14308 [pdf, html, other]
-
Title: FIESTA: Fourier-Based Semantic Augmentation with Uncertainty Guidance for Enhanced Domain Generalizability in Medical Image SegmentationComments: 40 pages, 7 figures, 5 tablesSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Single-source domain generalization (SDG) in medical image segmentation (MIS) aims to generalize a model using data from only one source domain to segment data from an unseen target domain. Despite substantial advances in SDG with data augmentation, existing methods often fail to fully consider the details and uncertain areas prevalent in MIS, leading to mis-segmentation. This paper proposes a Fourier-based semantic augmentation method called FIESTA using uncertainty guidance to enhance the fundamental goals of MIS in an SDG context by manipulating the amplitude and phase components in the frequency domain. The proposed Fourier augmentative transformer addresses semantic amplitude modulation based on meaningful angular points to induce pertinent variations and harnesses the phase spectrum to ensure structural coherence. Moreover, FIESTA employs epistemic uncertainty to fine-tune the augmentation process, improving the ability of the model to adapt to diverse augmented data and concentrate on areas with higher ambiguity. Extensive experiments across three cross-domain scenarios demonstrate that FIESTA surpasses recent state-of-the-art SDG approaches in segmentation performance and significantly contributes to boosting the applicability of the model in medical imaging modalities.
- [66] arXiv:2406.14351 [pdf, html, other]
-
Title: Automatic Labels are as Effective as Manual Labels in Biomedical Images Classification with Deep LearningNiccolò Marini, Stefano Marchesin, Lluis Borras Ferris, Simon Püttmann, Marek Wodzinski, Riccardo Fratti, Damian Podareanu, Alessandro Caputo, Svetla Boytcheva, Simona Vatrano, Filippo Fraggetta, Iris Nagtegaal, Gianmaria Silvello, Manfredo Atzori, Henning MüllerComments: pre-print of the journal paperSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The increasing availability of biomedical data is helping to design more robust deep learning (DL) algorithms to analyze biomedical samples. Currently, one of the main limitations to train DL algorithms to perform a specific task is the need for medical experts to label data. Automatic methods to label data exist, however automatic labels can be noisy and it is not completely clear when automatic labels can be adopted to train DL models. This paper aims to investigate under which circumstances automatic labels can be adopted to train a DL model on the classification of Whole Slide Images (WSI). The analysis involves multiple architectures, such as Convolutional Neural Networks (CNN) and Vision Transformer (ViT), and over 10000 WSIs, collected from three use cases: celiac disease, lung cancer and colon cancer, which one including respectively binary, multiclass and multilabel data. The results allow identifying 10% as the percentage of noisy labels that lead to train competitive models for the classification of WSIs. Therefore, an algorithm generating automatic labels needs to fit this criterion to be adopted. The application of the Semantic Knowledge Extractor Tool (SKET) algorithm to generate automatic labels leads to performance comparable to the one obtained with manual labels, since it generates a percentage of noisy labels between 2-5%. Automatic labels are as effective as manual ones, reaching solid performance comparable to the one obtained training models with manual labels.
- [67] arXiv:2406.14355 [pdf, other]
-
Title: A tensor model for calibration and imaging with air-coupled ultrasonic sensor arraysRaphael Müller, Gianni Allevato, Matthias Rutsch, Christoph Haugwitz, Mario Kupnik, Marius PesaventoComments: 22 pages, 6 figures. This work has been submitted to Elsevier B.V. for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Signal Processing (eess.SP)
Arrays of ultrasonic sensors are capable of 3D imaging in air and an affordable supplement to other sensing modalities, such as radar, lidar, and camera, i.e. in heterogeneous sensing systems. However, manufacturing tolerances of air-coupled ultrasonic sensors may lead to amplitude and phase deviations. Together with artifacts from imperfect knowledge of the array geometry, there are numerous factors that can impair the imaging performance of an array. We propose a reference-based calibration method to overcome possible limitations. First, we introduce a novel tensor signal model to capture the characteristics of piezoelectric ultrasonic transducers (PUTs) and the underlying multidimensional nature of a multiple-input multiple-output (MIMO) sensor array. Second, we formulate an optimization problem based on the proposed tensor model to obtain the calibrated parameters of the array and solve the problem using a modified block coordinate descent (BCD) method. Third, we assess both our model and the commonly used analytical model using real data from a 3D imaging experiment. The experiment reveals that our array response model we learned with calibration data yields an imaging performance similar to that of the analytical array model, which requires perfect array geometry information.
- [68] arXiv:2406.14372 [pdf, html, other]
-
Title: Ring-LWE based encrypted controller with unlimited number of recursive multiplications and effect of error growthComments: 12 pages, 3 figuresSubjects: Systems and Control (eess.SY)
In this paper, we propose a method to encrypt linear dynamic controllers that enables an unlimited number of recursive homomorphic multiplications on a Ring Learning With Errors (Ring-LWE) based cryptosystem without bootstrapping. Unlike LWE based schemes, where a scalar error is injected during encryption for security, Ring-LWE based schemes are based on polynomial rings and inject error as a polynomial having multiple error coefficients. Such errors accumulate under recursive homomorphic operations, and it has been studied that their effect can be suppressed by the closed-loop stability when dynamic controllers are encrypted using LWE based schemes. We show that this also holds for the proposed controller encrypted using a Ring-LWE based scheme. Specifically, only the constant terms of the error polynomials affect the control performance, and their effect can be arbitrarily bounded even when the noneffective terms diverge. Furthermore, a novel packing algorithm is applied, resulting in reduced computation time and enhanced memory efficiency. Simulation results demonstrate the effectiveness of the proposed method.
- [69] arXiv:2406.14379 [pdf, html, other]
-
Title: Decoding Vocal Articulations from Acoustic Latent RepresentationsComments: Presented at AES Europe 2024 in MadridSubjects: Audio and Speech Processing (eess.AS)
We present a novel neural encoder system for acoustic-to-articulatory inversion. We leverage the Pink Trombone voice synthesizer that reveals articulatory parameters (e.g tongue position and vocal cord configuration). Our system is designed to identify the articulatory features responsible for producing specific acoustic characteristics contained in a neural latent representation. To generate the necessary latent embeddings, we employed two main methodologies. The first was a self-supervised variational autoencoder trained from scratch to reconstruct the input signal at the decoder stage. We conditioned its bottleneck layer with a subnetwork called the "projector," which decodes the voice synthesizer's parameters.
The second methodology utilized two pretrained models: EnCodec and Wav2Vec. They eliminate the need to train the encoding process from scratch, allowing us to focus on training the projector network. This approach aimed to explore the potential of these existing models in the context of acoustic-to-articulatory inversion. By reusing the pretrained models, we significantly simplified the data processing pipeline, increasing efficiency and reducing computational overhead.
The primary goal of our project was to demonstrate that these neural architectures can effectively encapsulate both acoustic and articulatory features. This prediction-based approach is much faster than traditional methods focused on acoustic feature-based parameter optimization. We validated our models by predicting six different parameters and evaluating them with objective and ViSQOL subjective-equivalent metric using both synthesizer- and human-generated sounds. The results show that the predicted parameters can generate human-like vowel sounds when input into the synthesizer. We provide the dataset, code, and detailed findings to support future research in this field. - [70] arXiv:2406.14421 [pdf, html, other]
-
Title: Learning Binary Color Filter Arrays with Trainable Hard ThresholdingComments: Pre-publication journal paper, 17 pages, 9 figures. Potential submissions include IEEE Transactions on Computational Imaging, IEEE Transactions on Image Processing, IEEE Access, and MDPI SensorsSubjects: Image and Video Processing (eess.IV)
Color Filter Arrays (CFA) are optical filters in digital cameras that capture specific color channels. Current commercial CFAs are hand-crafted patterns with different physical and application-specific considerations. This study proposes a binary CFA learning module based on hard thresholding with a deep learning-based demosaicing network in a joint architecture. Unlike most existing learnable CFAs that learn a channel from the whole color spectrum or linearly combine available digital colors, this method learns a binary channel selection, resulting in CFAs that are practical and physically implementable to digital cameras. The binary selection is based on adapting the hard thresholding operation into neural networks via a straight-through estimator, and therefore it is named HardMax. This paper includes the background on the CFA design problem, the description of the HardMax method, and the performance evaluation results. The evaluation of the proposed method includes tests for different demosaicing models, color configurations, filter sizes, and a comparison with existing methods in various reconstruction metrics. The proposed approach is tested with Kodak and BSDS500 datasets and provides higher reconstruction performance than hand-crafted or alternative learned binary filters.
- [71] arXiv:2406.14430 [pdf, html, other]
-
Title: Adaptive Deep Neural Network-Based Control Barrier FunctionsComments: 7 pages, 2 figures, 28 referencesSubjects: Systems and Control (eess.SY)
Safety constraints of nonlinear control systems are commonly enforced through the use of control barrier functions (CBFs). Uncertainties in the dynamic model can disrupt forward invariance guarantees or cause the state to be restricted to an overly conservative subset of the safe set. In this paper, adaptive deep neural networks (DNNs) are combined with CBFs to produce a family of controllers that ensure safety while learning the system's dynamics in real-time without the requirement for pre-training. By basing the least squares adaptation law on a state derivative estimator-based identification error, the DNN parameter estimation error is shown to be uniformly ultimately bounded. The convergent bound on the parameter estimation error is then used to formulate CBF-constraints in an optimization-based controller to guarantee safety despite model uncertainty. Furthermore, the developed method is applicable for use under intermittent loss of state-feedback. Comparative simulation results demonstrate the ability of the developed method to ensure safety in an adaptive cruise control problem and when feedback is lost, unlike baseline methods.
- [72] arXiv:2406.14440 [pdf, html, other]
-
Title: LLM4CP: Adapting Large Language Models for Channel PredictionSubjects: Signal Processing (eess.SP)
Channel prediction is an effective approach for reducing the feedback or estimation overhead in massive multi-input multi-output (m-MIMO) systems. However, existing channel prediction methods lack precision due to model mismatch errors or network generalization issues. Large language models (LLMs) have demonstrated powerful modeling and generalization abilities, and have been successfully applied to cross-modal tasks, including the time series analysis. Leveraging the expressive power of LLMs, we propose a pre-trained LLM-empowered channel prediction method (LLM4CP) to predict the future downlink channel state information (CSI) sequence based on the historical uplink CSI sequence. We fine-tune the network while freezing most of the parameters of the pre-trained LLM for better cross-modality knowledge transfer. To bridge the gap between the channel data and the feature space of the LLM, preprocessor, embedding, and output modules are specifically tailored by taking into account unique channel characteristics. Simulations validate that the proposed method achieves SOTA prediction performance on full-sample, few-shot, and generalization tests with low training and inference costs.
- [73] arXiv:2406.14474 [pdf, html, other]
-
Title: Spatio-temporal Patterns between ENSO and Weather-related Power Outages in the Continental United StatesSubjects: Systems and Control (eess.SY)
El Niño-Southern Oscillation (ENSO) exhibits significant impacts on the frequency of extreme weather events and its socio-economic implications prevail on a global scale. However, a fundamental gap still exists in understanding the relationship between the ENSO and weather-related power outages in the continental United States. Through 24-year (2000-2023) composite and statistical analysis, our study reveals that higher power outage numbers (PONs) are observed from the developing winter to the decaying summer of La Niña phases. In particular, during the decaying spring, high La Niña intensity favors the occurrences of power outage over the west coast and east of the United States, by modulating the frequency of extreme precipitations and heatwaves. Furthermore, projected increasing heatwaves from the Coupled Model Intercomparison Project Phase 6 (CMIP6) indicate that spring-time PONs over the eastern United States occur about 11 times higher for the mid-term future (2041-2060) and almost 26 times higher for the long-term future (2081-2100), compared with 2000-2023. Our study provides a strong recommendation for building a more climate-resilient power system.
- [74] arXiv:2406.14486 [pdf, html, other]
-
Title: Rule-based outlier detection of AI-generated anatomy segmentationsDeepa Krishnaswamy, Vamsi Krishna Thiriveedhi, Cosmin Ciausu, David Clunie, Steve Pieper, Ron Kikinis, Andrey FedorovSubjects: Image and Video Processing (eess.IV)
There is a dire need for medical imaging datasets with accompanying annotations to perform downstream patient analysis. However, it is difficult to manually generate these annotations, due to the time-consuming nature, and the variability in clinical conventions. Artificial intelligence has been adopted in the field as a potential method to annotate these large datasets, however, a lack of expert annotations or ground truth can inhibit the adoption of these annotations. We recently made a dataset publicly available including annotations and extracted features of up to 104 organs for the National Lung Screening Trial using the TotalSegmentator method. However, the released dataset does not include expert-derived annotations or an assessment of the accuracy of the segmentations, limiting its usefulness. We propose the development of heuristics to assess the quality of the segmentations, providing methods to measure the consistency of the annotations and a comparison of results to the literature. We make our code and related materials publicly available at this https URL and interactive tools at this https URL.
- [75] arXiv:2406.14534 [pdf, html, other]
-
Title: Epicardium Prompt-guided Real-time Cardiac Ultrasound Frame-to-volume RegistrationLong Lei, Jun Zhou, Jialun Pei, Baoliang Zhao, Yueming Jin, Yuen-Chun Jeremy Teoh, Jing Qin, Pheng-Ann HengComments: This paper has been accepted by MICCAI 2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
A comprehensive guidance view for cardiac interventional surgery can be provided by the real-time fusion of the intraoperative 2D images and preoperative 3D volume based on the ultrasound frame-to-volume registration. However, cardiac ultrasound images are characterized by a low signal-to-noise ratio and small differences between adjacent frames, coupled with significant dimension variations between 2D frames and 3D volumes to be registered, resulting in real-time and accurate cardiac ultrasound frame-to-volume registration being a very challenging task. This paper introduces a lightweight end-to-end Cardiac Ultrasound frame-to-volume Registration network, termed CU-Reg. Specifically, the proposed model leverages epicardium prompt-guided anatomical clues to reinforce the interaction of 2D sparse and 3D dense features, followed by a voxel-wise local-global aggregation of enhanced features, thereby boosting the cross-dimensional matching effectiveness of low-quality ultrasound modalities. We further embed an inter-frame discriminative regularization term within the hybrid supervised learning to increase the distinction between adjacent slices in the same ultrasound volume to ensure registration stability. Experimental results on the reprocessed CAMUS dataset demonstrate that our CU-Reg surpasses existing methods in terms of registration accuracy and efficiency, meeting the guidance requirements of clinical cardiac interventional surgery.
New submissions for Friday, 21 June 2024 (showing 75 of 75 entries )
- [76] arXiv:2405.03262 (cross-list from cs.LG) [pdf, html, other]
-
Title: End-to-End Reinforcement Learning of Curative Curtailment with Partial Measurement AvailabilityHinrikus Wolf, Luis Böttcher, Sarra Bouchkati, Philipp Lutat, Jens Breitung, Bastian Jung, Tina Möllemann, Viktor Todosijević, Jan Schiefelbein-Lach, Oliver Pohl, Andreas Ulbig, Martin GroheSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
In the course of the energy transition, the expansion of generation and consumption will change, and many of these technologies, such as PV systems, electric cars and heat pumps, will influence the power flow, especially in the distribution grids. Scalable methods that can make decisions for each grid connection are needed to enable congestion-free grid operation in the distribution grids. This paper presents a novel end-to-end approach to resolving congestion in distribution grids with deep reinforcement learning. Our architecture learns to curtail power and set appropriate reactive power to determine a non-congested and, thus, feasible grid state. State-of-the-art methods such as the optimal power flow (OPF) demand high computational costs and detailed measurements of every bus in a grid. In contrast, the presented method enables decisions under sparse information with just some buses observable in the grid. Distribution grids are generally not yet fully digitized and observable, so this method can be used for decision-making on the majority of low-voltage grids. On a real low-voltage grid the approach resolves 100\% of violations in the voltage band and 98.8\% of asset overloads. The results show that decisions can also be made on real grids that guarantee sufficient quality for congestion-free grid operation.
- [77] arXiv:2406.13006 (cross-list from cs.CV) [pdf, html, other]
-
Title: Weighted Sum of Segmented Correlation: An Efficient Method for Spectra Matching in Hyperspectral ImagesComments: Accepted in IEEE IGARSS 2024 conferenceSubjects: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Image and Video Processing (eess.IV)
Matching a target spectrum with known spectra in a spectral library is a common method for material identification in hyperspectral imaging research. Hyperspectral spectra exhibit precise absorption features across different wavelength segments, and the unique shapes and positions of these absorptions create distinct spectral signatures for each material, aiding in their identification. Therefore, only the specific positions can be considered for material identification. This study introduces the Weighted Sum of Segmented Correlation method, which calculates correlation indices between various segments of a library and a test spectrum, and derives a matching index, favoring positive correlations and penalizing negative correlations using assigned weights. The effectiveness of this approach is evaluated for mineral identification in hyperspectral images from both Earth and Martian surfaces.
- [78] arXiv:2406.13025 (cross-list from cs.LG) [pdf, html, other]
-
Title: ABNet: Attention BarrierNet for Safe and Scalable Robot LearningComments: 18 pagesSubjects: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Safe learning is central to AI-enabled robots where a single failure may lead to catastrophic results. Barrier-based method is one of the dominant approaches for safe robot learning.
However, this method is not scalable, hard to train, and tends to generate unstable signals under noisy inputs that are challenging to be deployed for robots. To address these challenges, we propose a novel Attention BarrierNet (ABNet) that is scalable to build larger foundational safe models in an incremental manner.
Each head of BarrierNet in the ABNet could learn safe robot control policies from different features and focus on specific part of the observation. In this way, we do not need to one-shotly construct a large model for complex tasks, which significantly facilitates the training of the model while ensuring its stable output. Most importantly, we can still formally prove the safety guarantees of the ABNet. We demonstrate the strength of ABNet in 2D robot obstacle avoidance, safe robot manipulation, and vision-based end-to-end autonomous driving, with results showing much better robustness and guarantees over existing models. - [79] arXiv:2406.13038 (cross-list from cs.AI) [pdf, html, other]
-
Title: Traffic Prediction considering Multiple Levels of Spatial-temporal Information: A Multi-scale Graph Wavelet-based ApproachSubjects: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Although traffic prediction has been receiving considerable attention with a number of successes in the context of intelligent transportation systems, the prediction of traffic states over a complex transportation network that contains different road types has remained a challenge. This study proposes a multi-scale graph wavelet temporal convolution network (MSGWTCN) to predict the traffic states in complex transportation networks. Specifically, a multi-scale spatial block is designed to simultaneously capture the spatial information at different levels, and the gated temporal convolution network is employed to extract the temporal dependencies of the data. The model jointly learns to mount multiple levels of the spatial interactions by stacking graph wavelets with different scales. Two real-world datasets are used in this study to investigate the model performance, including a highway network in Seattle and a dense road network of Manhattan in New York City. Experiment results show that the proposed model outperforms other baseline models. Furthermore, different scales of graph wavelets are found to be effective in extracting local, intermediate and global information at the same time and thus enable the model to learn a complex transportation network topology with various types of road segments. By carefully customizing the scales of wavelets, the model is able to improve the prediction performance and better adapt to different network configurations.
- [80] arXiv:2406.13118 (cross-list from cs.RO) [pdf, html, other]
-
Title: Thruster-Assisted Incline WalkingKaushik Venkatesh Krishnamurthy, Chenghao Wang, Shreyansh Pitroda, Adarsh Salagame, Eric Sihite, Reza Nemovi, Alireza Ramezani, Morteza GharibComments: 7 pages, 7 figures, submitted to CDC 2024 conference. arXiv admin note: text overlap with arXiv:2405.06070Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
In this study, our aim is to evaluate the effectiveness of thruster-assisted steep slope walking for the Husky Carbon, a quadrupedal robot equipped with custom-designed actuators and plural electric ducted fans, through simulation prior to conducting experimental trials. Thruster-assisted steep slope walking draws inspiration from wing-assisted incline running (WAIR) observed in birds, and intriguingly incorporates posture manipulation and thrust vectoring, a locomotion technique not previously explored in the animal kingdom. Our approach involves developing a reduced-order model of the Husky robot, followed by the application of an optimization-based controller utilizing collocation methods and dynamics interpolation to determine control actions. Through simulation testing, we demonstrate the feasibility of hardware implementation of our controller.
- [81] arXiv:2406.13179 (cross-list from cs.SD) [pdf, html, other]
-
Title: Global-Local Convolution with Spiking Neural Networks for Energy-efficient Keyword SpottingSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS)
Thanks to Deep Neural Networks (DNNs), the accuracy of Keyword Spotting (KWS) has made substantial progress. However, as KWS systems are usually implemented on edge devices, energy efficiency becomes a critical requirement besides performance. Here, we take advantage of spiking neural networks' energy efficiency and propose an end-to-end lightweight KWS model. The model consists of two innovative modules: 1) Global-Local Spiking Convolution (GLSC) module and 2) Bottleneck-PLIF module. Compared to the hand-crafted feature extraction methods, the GLSC module achieves speech feature extraction that is sparser, more energy-efficient, and yields better performance. The Bottleneck-PLIF module further processes the signals from GLSC with the aim to achieve higher accuracy with fewer parameters. Extensive experiments are conducted on the Google Speech Commands Dataset (V1 and V2). The results show our method achieves competitive performance among SNN-based KWS models with fewer parameters.
- [82] arXiv:2406.13196 (cross-list from quant-ph) [pdf, html, other]
-
Title: Quantum Generative Learning for High-Resolution Medical Image GenerationSubjects: Quantum Physics (quant-ph); Image and Video Processing (eess.IV)
Integration of quantum computing in generative machine learning models has the potential to offer benefits such as training speed-up and superior feature extraction. However, the existing quantum generative adversarial networks (QGANs) fail to generate high-quality images due to their patch-based, pixel-wise learning approaches. These methods capture only local details, ignoring the global structure and semantic information of images. In this work, we address these challenges by proposing a quantum image generative learning (QIGL) approach for high-quality medical image generation. Our proposed quantum generator leverages variational quantum circuit approach addressing scalability issues by extracting principal components from the images instead of dividing them into patches. Additionally, we integrate the Wasserstein distance within the QIGL framework to generate a diverse set of medical samples. Through a systematic set of simulations on X-ray images from knee osteoarthritis and medical MNIST datasets, our model demonstrates superior performance, achieving the lowest Fréchet Inception Distance (FID) scores compared to its classical counterpart and advanced QGAN models reported in the literature.
- [83] arXiv:2406.13248 (cross-list from cs.IT) [pdf, html, other]
-
Title: Overlay Space-Air-Ground Integrated Networks with SWIPT-Empowered Aerial CommunicationsComments: 36 pages, 14 figures, This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
In this article, we consider overlay space-air-ground integrated networks (OSAGINs) where a low earth orbit (LEO) satellite communicates with ground users (GUs) with the assistance of an energy-constrained coexisting air-to-air (A2A) network. Particularly, a non-linear energy harvester with a hybrid SWIPT utilizing both power-splitting and time-switching energy harvesting (EH) techniques is employed at the aerial transmitter. Specifically, we take the random locations of the satellite, ground and aerial receivers to investigate the outage performance of both the satellite-to-ground and aerial networks leveraging the stochastic tools. By taking into account the Shadowed-Rician fading for satellite link, the Nakagami-\emph{m} for ground link, and the Rician fading for aerial link, we derive analytical expressions for the outage probability of these networks. For a comprehensive analysis of aerial network, we consider both the perfect and imperfect successive interference cancellation (SIC) scenarios. Through our analysis, we illustrate that, unlike linear EH, the implementation of non-linear EH provides accurate figures for any target rate, underscoring the significance of using non-linear EH models. Additionally, the influence of key parameters is emphasized, providing guidelines for the practical design of an energy-efficient as well as spectrum-efficient future non-terrestrial networks. Monte Carlo simulations validate the accuracy of our theoretical developments.
- [84] arXiv:2406.13251 (cross-list from cs.CV) [pdf, html, other]
-
Title: Freq-Mip-AA : Frequency Mip Representation for Anti-Aliasing Neural Radiance FieldsComments: Accepted to ICIP 2024, 7 pages, 3 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
Neural Radiance Fields (NeRF) have shown remarkable success in representing 3D scenes and generating novel views. However, they often struggle with aliasing artifacts, especially when rendering images from different camera distances from the training views. To address the issue, Mip-NeRF proposed using volumetric frustums to render a pixel and suggested integrated positional encoding (IPE). While effective, this approach requires long training times due to its reliance on MLP architecture. In this work, we propose a novel anti-aliasing technique that utilizes grid-based representations, usually showing significantly faster training time. In addition, we exploit frequency-domain representation to handle the aliasing problem inspired by the sampling theorem. The proposed method, FreqMipAA, utilizes scale-specific low-pass filtering (LPF) and learnable frequency masks. Scale-specific low-pass filters (LPF) prevent aliasing and prioritize important image details, and learnable masks effectively remove problematic high-frequency elements while retaining essential information. By employing a scale-specific LPF and trainable masks, FreqMipAA can effectively eliminate the aliasing factor while retaining important details. We validated the proposed technique by incorporating it into a widely used grid-based method. The experimental results have shown that the FreqMipAA effectively resolved the aliasing issues and achieved state-of-the-art results in the multi-scale Blender dataset. Our code is available at this https URL .
- [85] arXiv:2406.13269 (cross-list from cs.AI) [pdf, other]
-
Title: Investigating Low-Cost LLM Annotation for~Spoken Dialogue Understanding DatasetsJournal-ref: 27th International Conference on Text, Speech and Dialogue, Sep 2024, Brno (R{\'e}p. Tch{\`e}que), Czech RepublicSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
In spoken Task-Oriented Dialogue (TOD) systems, the choice of the semantic representation describing the users' requests is key to a smooth interaction. Indeed, the system uses this representation to reason over a database and its domain knowledge to choose its next action. The dialogue course thus depends on the information provided by this semantic representation. While textual datasets provide fine-grained semantic representations, spoken dialogue datasets fall behind. This paper provides insights into automatic enhancement of spoken dialogue datasets' semantic representations. Our contributions are three fold: (1) assess the relevance of Large Language Model fine-tuning, (2) evaluate the knowledge captured by the produced annotations and (3) highlight semi-automatic annotation implications.
- [86] arXiv:2406.13275 (cross-list from cs.SD) [pdf, html, other]
-
Title: Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio EncodingJizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin WangComments: Accepted by Interspeech 2024Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened up possibilities for improving AAC. Thus, we explore enhancing AAC from three aspects: 1) a pre-trained audio encoder via consistent ensemble distillation (CED) is used to improve the effectivity of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to LLM and compress acoustic tokens; 2) we investigate the advantages of using a Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and text decoder are optimized by -Base (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a 33.0 SPIDEr-FL score, outperforming the winner of DCASE 2023 Task 6A.
- [87] arXiv:2406.13292 (cross-list from q-bio.QM) [pdf, html, other]
-
Title: An interpretable generative multimodal neuroimaging-genomics framework for decoding Alzheimer's diseaseGiorgio Dolci (1,2), Federica Cruciani (1), Md Abdur Rahaman (2), Anees Abrol (2), Jiayu Chen (2), Zening Fu (2), Ilaria Boscolo Galazzo (1), Gloria Menegaz (1), Vince D. Calhoun (2) ((1) Department of Engineering for Innovation Medicine, University of Verona, Verona, Italy, (2) Tri-Institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, Emory University, Atlanta, GA, USA)Comments: 27 pages, 7 figures, submitted to a journalSubjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Alzheimer's disease (AD) is the most prevalent form of dementia with a progressive decline in cognitive abilities. The AD continuum encompasses a prodormal stage known as Mild Cognitive Impairment (MCI), where patients may either progress to AD or remain stable. In this study, we leveraged structural and functional MRI to investigate the disease-induced grey matter and functional network connectivity changes. Moreover, considering AD's strong genetic component, we introduce SNPs as a third channel. Given such diverse inputs, missing one or more modalities is a typical concern of multimodal methods. We hence propose a novel deep learning-based classification framework where generative module employing Cycle GANs was adopted to impute missing data within the latent space. Additionally, we adopted an Explainable AI method, Integrated Gradients, to extract input features relevance, enhancing our understanding of the learned representations. Two critical tasks were addressed: AD detection and MCI conversion prediction. Experimental results showed that our model was able to reach the SOA in the classification of CN/AD reaching an average test accuracy of $0.926\pm0.02$. For the MCI task, we achieved an average prediction accuracy of $0.711\pm0.01$ using the pre-trained model for CN/AD. The interpretability analysis revealed significant grey matter modulations in cortical and subcortical brain areas well known for their association with AD. Moreover, impairments in sensory-motor and visual resting state network connectivity along the disease continuum, as well as mutations in SNPs defining biological processes linked to amyloid-beta and cholesterol formation clearance and regulation, were identified as contributors to the achieved performance. Overall, our integrative deep learning approach shows promise for AD detection and MCI prediction, while shading light on important biological insights.
- [88] arXiv:2406.13335 (cross-list from cs.NI) [pdf, html, other]
-
Title: AI-Empowered Multiple Access for 6G: A Survey of Spectrum Sensing, Protocol Designs, and OptimizationsSubjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
With the rapidly increasing number of bandwidth-intensive terminals capable of intelligent computing and communication, such as smart devices equipped with shallow neural network models, the complexity of multiple access for these intelligent terminals is increasing due to the dynamic network environment and ubiquitous connectivity in 6G systems. Traditional multiple access (MA) design and optimization methods are gradually losing ground to artificial intelligence (AI) techniques that have proven their superiority in handling complexity. AI-empowered MA and its optimization strategies aimed at achieving high Quality-of-Service (QoS) are attracting more attention, especially in the area of latency-sensitive applications in 6G systems. In this work, we aim to: 1) present the development and comparative evaluation of AI-enabled MA; 2) provide a timely survey focusing on spectrum sensing, protocol design, and optimization for AI-empowered MA; and 3) explore the potential use cases of AI-empowered MA in the typical application scenarios within 6G systems. Specifically, we first present a unified framework of AI-empowered MA for 6G systems by incorporating various promising machine learning techniques in spectrum sensing, resource allocation, MA protocol design, and optimization. We then introduce AI-empowered MA spectrum sensing related to spectrum sharing and spectrum interference management. Next, we discuss the AI-empowered MA protocol designs and implementation methods by reviewing and comparing the state-of-the-art, and we further explore the optimization algorithms related to dynamic resource management, parameter adjustment, and access scheme switching. Finally, we discuss the current challenges, point out open issues, and outline potential future research directions in this field.
- [89] arXiv:2406.13340 (cross-list from cs.CL) [pdf, html, other]
-
Title: SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond WordsJunyi Ao, Yuancheng Wang, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng WuSubjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. We argue that this is due to the lack of principles on task definition and model development, which requires open-source datasets and metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data. The data is aggregated from eight public datasets, representing four perspectives: emotion, accent, age, and background sound. To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a similar process as SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct a comprehensive evaluation using objective evaluation methods (e.g. BLEU and ROUGE), subjective evaluations and LLM-based metrics for the generated responses. Models conditioned with paralinguistic and environmental information outperform their counterparts in both objective and subjective measures. Moreover, experiments demonstrate LLM-based metrics show a higher correlation with human evaluation compared to traditional metrics. We open-source SD-Eval at this https URL.
- [90] arXiv:2406.13345 (cross-list from cs.CV) [pdf, html, other]
-
Title: Low Latency Visual Inertial Odometry with On-Sensor Accelerated Optical Flow for Resource-Constrained UAVsComments: This article has been accepted for publication in the IEEE Sensors Journal (JSEN)Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Visual Inertial Odometry (VIO) is the task of estimating the movement trajectory of an agent from an onboard camera stream fused with additional Inertial Measurement Unit (IMU) measurements. A crucial subtask within VIO is the tracking of features, which can be achieved through Optical Flow (OF). As the calculation of OF is a resource-demanding task in terms of computational load and memory footprint, which needs to be executed at low latency, especially in robotic applications, OF estimation is today performed on powerful CPUs or GPUs. This restricts its use in a broad spectrum of applications where the deployment of such powerful, power-hungry processors is unfeasible due to constraints related to cost, size, and power consumption. On-sensor hardware acceleration is a promising approach to enable low latency VIO even on resource-constrained devices such as nano drones. This paper assesses the speed-up in a VIO sensor system exploiting a compact OF sensor consisting of a global shutter camera and an Application Specific Integrated Circuit (ASIC). By replacing the feature tracking logic of the VINS-Mono pipeline with data from this OF camera, we demonstrate a 49.4% reduction in latency and a 53.7% reduction of compute load of the VIO pipeline over the original VINS-Mono implementation, allowing VINS-Mono operation up to 50 FPS instead of 20 FPS on the quad-core ARM Cortex-A72 processor of a Raspberry Pi Compute Module 4.
- [91] arXiv:2406.13357 (cross-list from cs.CL) [pdf, html, other]
-
Title: Transferable speech-to-text large language model alignment moduleComments: Accepted by InterSpeech 2024; 5 pages, 2 figuresSubjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
By leveraging the power of Large Language Models(LLMs) and speech foundation models, state of the art speech-text bimodal works can achieve challenging tasks like spoken translation(ST) and question answering(SQA) altogether with much simpler architectures. In this paper, we utilize the capability of Whisper encoder and pre-trained Yi-6B. Empirical results reveal that modal alignment can be achieved with one layer module and hundred hours of speech-text multitask corpus. We further swap the Yi-6B with human preferences aligned version of Yi-6B-Chat during inference, and discover that the alignment capability is applicable as well. In addition, the alignment subspace revealed by singular value decomposition(SVD) also implies linear alignment subspace is sparse, which leaves the possibility to concatenate other features like voice-print or video to expand modality.
- [92] arXiv:2406.13358 (cross-list from cs.CV) [pdf, html, other]
-
Title: Multi-scale Restoration of Missing Data in Optical Time-series Images with Masked Spatial-Temporal Attention NetworkSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Due to factors such as thick cloud cover and sensor limitations, remote sensing images often suffer from significant missing data, resulting in incomplete time-series information. Existing methods for imputing missing values in remote sensing images do not fully exploit spatio-temporal auxiliary information, leading to limited accuracy in restoration. Therefore, this paper proposes a novel deep learning-based approach called MS2TAN (Multi-scale Masked Spatial-Temporal Attention Network), for reconstructing time-series remote sensing images. Firstly, we introduce an efficient spatio-temporal feature extractor based on Masked Spatial-Temporal Attention (MSTA), to obtain high-quality representations of the spatio-temporal neighborhood features in the missing regions. Secondly, a Multi-scale Restoration Network consisting of the MSTA-based Feature Extractors, is employed to progressively refine the missing values by exploring spatio-temporal neighborhood features at different scales. Thirdly, we propose a ``Pixel-Structure-Perception'' Multi-Objective Joint Optimization method to enhance the visual effects of the reconstruction results from multiple perspectives and preserve more texture structures. Furthermore, the proposed method reconstructs missing values in all input temporal phases in parallel (i.e., Multi-In Multi-Out), achieving higher processing efficiency. Finally, experimental evaluations on two typical missing data restoration tasks across multiple research areas demonstrate that the proposed method outperforms state-of-the-art methods with an improvement of 0.40dB/1.17dB in mean peak signal-to-noise ratio (mPSNR) and 3.77/9.41 thousandths in mean structural similarity (mSSIM), while exhibiting stronger texture and structural consistency.
- [93] arXiv:2406.13384 (cross-list from cs.SD) [pdf, html, other]
-
Title: Straight Through Gumbel Softmax Estimator based Bimodal Neural Architecture Search for Audio-Visual Deepfake DetectionSubjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Deepfakes are a major security risk for biometric authentication. This technology creates realistic fake videos that can impersonate real people, fooling systems that rely on facial features and voice patterns for identification. Existing multimodal deepfake detectors rely on conventional fusion methods, such as majority rule and ensemble voting, which often struggle to adapt to changing data characteristics and complex patterns. In this paper, we introduce the Straight-through Gumbel-Softmax (STGS) framework, offering a comprehensive approach to search multimodal fusion model architectures. Using a two-level search approach, the framework optimizes the network architecture, parameters, and performance. Initially, crucial features were efficiently identified from backbone networks, whereas within the cell structure, a weighted fusion operation integrated information from various sources. An architecture that maximizes the classification performance is derived by varying parameters such as temperature and sampling time. The experimental results on the FakeAVCeleb and SWAN-DF datasets demonstrated an impressive AUC value 94.4\% achieved with minimal model parameters.
- [94] arXiv:2406.13431 (cross-list from cs.CL) [pdf, html, other]
-
Title: Children's Speech Recognition through Discrete Token EnhancementComments: Accepted at Interspeech 2024Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Children's speech recognition is considered a low-resource task mainly due to the lack of publicly available data. There are several reasons for such data scarcity, including expensive data collection and annotation processes, and data privacy, among others. Transforming speech signals into discrete tokens that do not carry sensitive information but capture both linguistic and acoustic information could be a solution for privacy concerns. In this study, we investigate the integration of discrete speech tokens into children's speech recognition systems as input without significantly degrading the ASR performance. Additionally, we explored single-view and multi-view strategies for creating these discrete labels. Furthermore, we tested the models for generalization capabilities with unseen domain and nativity dataset. Results reveal that the discrete token ASR for children achieves nearly equivalent performance with an approximate 83% reduction in parameters.
- [95] arXiv:2406.13501 (cross-list from physics.optics) [pdf, html, other]
-
Title: Assessing the 3D resolution of refocused correlation plenoptic images using a general-purpose image quality estimatorSubjects: Optics (physics.optics); Image and Video Processing (eess.IV)
Correlation plenoptic imaging (CPI) is emerging as a promising approach to light-field imaging (LFI), a technique enabling simultaneous measurement of light intensity distribution and propagation direction from a scene. LFI allows single-shot 3D sampling, offering fast 3D reconstruction for a wide range of applications. However, the array of micro-lenses typically used in LFI to obtain 3D information limits image resolution, which rapidly declines with enhanced volumetric reconstruction capabilities. CPI addresses this limitation by decoupling light-field information measurement using two photodetectors with spatial resolution, eliminating the need for micro-lenses. 3D information is encoded in a four-dimensional correlation function, which is decoded in post-processing to reconstruct images without the resolution loss seen in conventional LFI. This paper evaluates the tomographic performance of CPI, demonstrating that the refocusing reconstruction method provides axial sectioning capabilities comparable to conventional imaging systems. A general-purpose analytical approach based on image fidelity is proposed to quantitatively study axial and lateral resolution. This analysis fully characterizes the volumetric resolution of any CPI architecture, offering a comprehensive evaluation of its imaging performance.
- [96] arXiv:2406.13502 (cross-list from cs.CL) [pdf, html, other]
-
Title: ManWav: The First Manchu ASR ModelComments: ACL2024/Field MattersSubjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This study addresses the widening gap in Automatic Speech Recognition (ASR) research between high resource and extremely low resource languages, with a particular focus on Manchu, a critically endangered language. Manchu exemplifies the challenges faced by marginalized linguistic communities in accessing state-of-the-art technologies. In a pioneering effort, we introduce the first-ever Manchu ASR model ManWav, leveraging Wav2Vec2-XLSR-53. The results of the first Manchu ASR is promising, especially when trained with our augmented data. Wav2Vec2-XLSR-53 fine-tuned with augmented data demonstrates a 0.02 drop in CER and 0.13 drop in WER compared to the same base model fine-tuned with original data.
- [97] arXiv:2406.13579 (cross-list from cs.SD) [pdf, other]
-
Title: Automated Bioacoustic Monitoring for South African Bird Species on Unlabeled DataMichael Doell, Dominik Kuehn, Vanessa Suessle, Matthew J. Burnett, Colleen T. Downs, Andreas Weinmann, Elke HergenroetherComments: preprintJournal-ref: International Conferences in Central Europe on Computer Graphics, Visualization and Computer Vision 2024Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Analyses for biodiversity monitoring based on passive acoustic monitoring (PAM) recordings is time-consuming and challenged by the presence of background noise in recordings. Existing models for sound event detection (SED) worked only on certain avian species and the development of further models required labeled data. The developed framework automatically extracted labeled data from available platforms for selected avian species. The labeled data were embedded into recordings, including environmental sounds and noise, and were used to train convolutional recurrent neural network (CRNN) models. The models were evaluated on unprocessed real world data recorded in urban KwaZulu-Natal habitats. The Adapted SED-CRNN model reached a F1 score of 0.73, demonstrating its efficiency under noisy, real-world conditions. The proposed approach to automatically extract labeled data for chosen avian species enables an easy adaption of PAM to other species and habitats for future conservation projects.
- [98] arXiv:2406.13602 (cross-list from cs.ET) [pdf, other]
-
Title: Parameter Training Efficiency Aware Resource Allocation for AIGC in Space-Air-Ground Integrated NetworksComments: submitted to a journalSubjects: Emerging Technologies (cs.ET); Signal Processing (eess.SP)
With the evolution of artificial intelligence-generated content (AIGC) techniques and the development of space-air-ground integrated networks (SAGIN), there will be a growing opportunity to enhance more users' mobile experience with customized AIGC applications. This is made possible through the use of parameter-efficient fine-tuning (PEFT) training alongside mobile edge computing. In this paper, we formulate the optimization problem of maximizing the parameter training efficiency of the SAGIN system over wireless networks under limited resource constraints. We propose the Parameter training efficiency Aware Resource Allocation (PARA) technique to jointly optimize user association, data offloading, and communication and computational resource allocation. Solid proofs are presented to solve this difficult sum of ratios problem based on quadratically constrained quadratic programming (QCQP), semidefinite programming (SDP), graph theory, and fractional programming (FP) techniques. Our proposed PARA technique is effective in finding a stationary point of this non-convex problem. The simulation results demonstrate that the proposed PARA method outperforms other baselines.
- [99] arXiv:2406.13612 (cross-list from math.OC) [pdf, html, other]
-
Title: On Computation of Approximate Solutions to Large-Scale Backstepping Kernel Equations via Continuum ApproximationComments: 13 pages, 5 figures, submitted to Systems & Control LettersSubjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
We provide two methods for computation of continuum backstepping kernels that arise in control of continua (ensembles) of linear hyperbolic PDEs and which can approximate backstepping kernels arising in control of a large-scale, PDE system counterpart (with computational complexity that does not grow with the number of state components of the large-scale system). In the first method, we identify a class of systems for which the solution to the continuum (and hence, also an approximate solution to the respective large-scale) kernel equations can be constructed in closed form. In the second method, we provide explicit formulae for the solution to the continuum kernels PDEs, employing a (triple) power series representation of the continuum kernel and establishing its convergence properties. In this case, we also provide means for reducing computational complexity by properly truncating the power series (in the powers of the ensemble variable). We also present numerical examples to illustrate computational efficiency/accuracy of the approaches, as well as to validate the stabilization properties of the approximate control kernels, constructed based on the continuum.
- [100] arXiv:2406.13712 (cross-list from cs.MM) [pdf, html, other]
-
Title: Convex-hull Estimation using XPSNR for Versatile Video CodingComments: Accepted at 2024 IEEE International Conference on Image Processing (ICIP)Subjects: Multimedia (cs.MM); Image and Video Processing (eess.IV)
As adaptive streaming becomes crucial for delivering high-quality video content across diverse network conditions, accurate metrics to assess perceptual quality are essential. This paper explores using the eXtended Peak Signal-to-Noise Ratio (XPSNR) metric as an alternative to the popular Video Multimethod Assessment Fusion (VMAF) metric for determining optimized bitrate-resolution pairs in the context of Versatile Video Coding (VVC). Our study is rooted in the observation that XPSNR shows a superior correlation with subjective quality scores for VVC-coded Ultra-High Definition (UHD) content compared to VMAF. We predict the average XPSNR of VVC-coded bitstreams using spatiotemporal complexity features of the video and the target encoding configuration and then determine the convex-hull online. On average, the proposed convex-hull using XPSNR (VEXUS) achieves an overall quality improvement of 5.84 dB PSNR and 0.62 dB XPSNR while maintaining the same bitrate, compared to the default UHD encoding using the VVenC encoder, accompanied by an encoding time reduction of 44.43% and a decoding time reduction of 65.46%. This shift towards XPSNR as a guiding metric shall enhance the effectiveness of adaptive streaming algorithms, ensuring an optimal balance between bitrate efficiency and perceptual fidelity with advanced video coding standards.
- [101] arXiv:2406.13722 (cross-list from cs.IT) [pdf, html, other]
-
Title: Channel Charting in Real-World Coordinates with Distributed MIMOComments: Submitted to a journal. arXiv admin note: substantial text overlap with arXiv:2308.14498Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Channel charting is an emerging self-supervised method that maps channel-state information (CSI) to a low-dimensional latent space (the channel chart) that represents pseudo-positions of user equipments (UEs). While channel charts preserve local geometry, i.e., nearby UEs are nearby in the channel chart (and vice versa), the pseudo-positions are in arbitrary coordinates and global geometry is typically not preserved. In order to embed channel charts in real-world coordinates, we first propose a bilateration loss for distributed multiple-input multiple-output (D-MIMO) wireless systems in which only the access point (AP) positions are known. The idea behind this loss is to compare the received power at pairs of APs to determine whether a UE should be placed closer to one AP or the other in the channel chart. Second, we propose a line-of-sight (LoS) bounding-box loss that places the UE in a predefined LoS area of each AP that is estimated to have a LoS path to the UE. We demonstrate the efficacy of combining both of these loss functions with neural-network-based channel charting using ray-tracing-based and measurement-based channel vectors. Our approach outperforms several baselines and maintains the self-supervised nature of channel charting as it does not rely on geometrical propagation models or require ground-truth UE position information.
- [102] arXiv:2406.13842 (cross-list from cs.CL) [pdf, html, other]
-
Title: Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic ControlComments: Accepted at Interspeech 2024Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Utilizing air-traffic control (ATC) data for downstream natural-language processing tasks requires preprocessing steps. Key steps are the transcription of the data via automatic speech recognition (ASR) and speaker diarization, respectively speaker role detection (SRD) to divide the transcripts into pilot and air-traffic controller (ATCO) transcripts. While traditional approaches take on these tasks separately, we propose a transformer-based joint ASR-SRD system that solves both tasks jointly while relying on a standard ASR architecture. We compare this joint system against two cascaded approaches for ASR and SRD on multiple ATC datasets. Our study shows in which cases our joint system can outperform the two traditional approaches and in which cases the other architectures are preferable. We additionally evaluate how acoustic and lexical differences influence all architectures and show how to overcome them for our joint architecture.
- [103] arXiv:2406.13982 (cross-list from cs.SD) [pdf, html, other]
-
Title: Improved Remixing Process for Domain Adaptation-Based Speech Enhancement by Mitigating Data Imbalance in Signal-to-Noise RatioComments: Accepted at Interspeech2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
RemixIT and Remixed2Remixed are domain adaptation-based speech enhancement (DASE) methods that use a teacher model trained in full supervision to generate pseudo-paired data by remixing the outputs of the teacher model. The student model for enhancing real-world recorded signals is trained using the pseudo-paired data without ground truth. Since the noisy signals are recorded in natural environments, the dataset inevitably suffers data imbalance in some acoustic properties, leading to subpar performance for the underrepresented data. The signal-to-noise ratio (SNR), inherently balanced in supervised learning, is a prime example. In this paper, we provide empirical evidence that the SNR of pseudo data has a significant impact on model performance using the dataset of the CHiME-7 UDASE task, highlighting the importance of balanced SNR in DASE. Furthermore, we propose adopting curriculum learning to encompass a broad range of SNRs to boost performance for underrepresented data.
- [104] arXiv:2406.13992 (cross-list from cs.MA) [pdf, html, other]
-
Title: Robust Cooperative Multi-Agent Reinforcement Learning:A Mean-Field Type Game PerspectiveComments: Accepted for publication in L4DC 2024Subjects: Multiagent Systems (cs.MA); Systems and Control (eess.SY)
In this paper, we study the problem of robust cooperative multi-agent reinforcement learning (RL) where a large number of cooperative agents with distributed information aim to learn policies in the presence of \emph{stochastic} and \emph{non-stochastic} uncertainties whose distributions are respectively known and unknown. Focusing on policy optimization that accounts for both types of uncertainties, we formulate the problem in a worst-case (minimax) framework, which is is intractable in general. Thus, we focus on the Linear Quadratic setting to derive benchmark solutions. First, since no standard theory exists for this problem due to the distributed information structure, we utilize the Mean-Field Type Game (MFTG) paradigm to establish guarantees on the solution quality in the sense of achieved Nash equilibrium of the MFTG. This in turn allows us to compare the performance against the corresponding original robust multi-agent control problem. Then, we propose a Receding-horizon Gradient Descent Ascent RL algorithm to find the MFTG Nash equilibrium and we prove a non-asymptotic rate of convergence. Finally, we provide numerical experiments to demonstrate the efficacy of our approach relative to a baseline algorithm.
- [105] arXiv:2406.14000 (cross-list from math.OC) [pdf, html, other]
-
Title: Robust nonlinear state-feedback control of second-order systemsComments: 6 pages, 6 figuresSubjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
This note proposes a novel nonlinear state feedback controller for perturbed second-order systems. In analogy to a linear proportional-derivative (PD) output feedback control, the proposed nonlinear scheme uses the output state of interest and its time derivative for a robust finite-time regulation. The control has only one free design parameter, and the closed-loop system is shown to be uniformly asymptotically stable in the presence of matched disturbances. We derive a strict Lyapunov function for the closed control loop with a bounded exogenous perturbation, and use it for both the control tuning and analysis of the finite-time convergence. Apart from the numerical results, a revealing experimental example is also shown in favor of the proposed control and in comparison with PD and sub-optimal nonlinear damping regulators.
- [106] arXiv:2406.14011 (cross-list from math.OC) [pdf, html, other]
-
Title: Primal-Dual Strategy (PDS) for Composite Optimization Over Directed graphsSubjects: Optimization and Control (math.OC); Signal Processing (eess.SP)
We investigate the distributed multi-agent sharing optimization problem in a directed graph, with a composite objective function consisting of a smooth function plus a convex (possibly non-smooth) function shared by all agents. While adhering to the network connectivity structure, the goal is to minimize the sum of smooth local functions plus a non-smooth function. The proposed Primal-Dual algorithm (PD) is similar to a previous algorithm \cite{b27}, but it has additional benefits. To begin, we investigate the problem in directed graphs, where agents can only communicate in one direction and the combination matrix is not symmetric. Furthermore, the combination matrix is changing over time, and the condition coefficient weights are produced using an adaptive approach. The strong convexity assumption, adaptive coefficient weights, and a new upper bound on step-sizes are used to demonstrate that linear convergence is possible. New upper bounds on step-sizes are derived under the strong convexity assumption and adaptive coefficient weights that are time-varying in the presence of both smooth and non-smooth terms. Simulation results show the efficacy of the proposed algorithm compared to some other algorithms.
- [107] arXiv:2406.14064 (cross-list from cs.IT) [pdf, html, other]
-
Title: PAPR Reduction with Pre-chirp Selection for Affine Frequency Division MultipleSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Affine frequency division multiplexing (AFDM) is a promising new multicarrier technique based on discrete affine Fourier transform (DAFT). By properly tuning pre-chirp parameter and post-chirp parameter in the DAFT, the effective channel in the DAFT domain can completely avoid overlap of different paths, thus constitutes a full representation of delay-Doppler profile, which significantly improves the system performance in high mobility scenarios. However, AFDM has the crucial problem of high peak-to-average power ratio (PAPR) caused by phase randomness of modulated symbols. In this letter, an algorithm named grouped pre-chirp selection (GPS) is proposed to reduce the PAPR by changing the value of pre-chirp parameter on sub-carriers group by group. Specifically, it is demonstrated first that the important properties of AFDM system are maintained when implementing GPS. Secondly, we elaborate the operation steps of GPS algorithm, illustrating its effect on PAPR reduction and its advantage in terms of computational complexity compared with the ungrouped approach. Finally, simulation results of PAPR reduction in the form of complementary cumulative distribution function (CCDF) show the effectiveness of the proposed GPS algorithm.
- [108] arXiv:2406.14067 (cross-list from physics.optics) [pdf, other]
-
Title: A microwave photonic prototype for concurrent radar detection and spectrum sensing over an 8 to 40 GHz bandwidthTaixia Shi, Dingding Liang, Lu Wang, Lin Li, Shaogang Guo, Jiawei Gao, Xiaowei Li, Chulun Lin, Lei Shi, Baogang Ding, Shiyang Liu, Fangyi Yang, Chi Jiang, Yang ChenComments: 18 pages, 12 figures, 1 tableSubjects: Optics (physics.optics); Signal Processing (eess.SP)
In this work, a microwave photonic prototype for concurrent radar detection and spectrum sensing is proposed, designed, built, and investigated. A direct digital synthesizer and an analog electronic circuit are integrated to generate an intermediate frequency (IF) linearly frequency-modulated (LFM) signal with a tunable center frequency from 2.5 to 9.5 GHz and an instantaneous bandwidth of 1 GHz. The IF LFM signal is converted to the optical domain via an intensity modulator and then filtered by a fiber Bragg grating (FBG) to generate only two 2nd-order optical LFM sidebands. In radar detection, the two optical LFM sidebands beat with each other to generate a frequency-and-bandwidth-quadrupled LFM signal, which is used for ranging, radial velocity measurement, and imaging. By changing the center frequency of the IF LFM signal, the radar function can be operated within 8 to 40 GHz. In spectrum sensing, one 2nd-order optical LFM sideband is selected by another FBG, which then works in conjunction with the stimulated Brillouin scattering gain spectrum to map the frequency of the signal under test to time with an instantaneous measurement bandwidth of 2 GHz. By using a frequency shift module to adjust the pump frequency, the frequency measurement range can be adjusted from 0 to 40 GHz. The prototype is comprehensively studied and tested, which is capable of achieving a range resolution of 3.75 cm, a range error of less than $\pm$ 2 cm, a radial velocity error within $\pm$ 1 cm/s, delivering clear imaging of multiple small targets, and maintaining a frequency measurement error of less than $\pm$ 7 MHz and a frequency resolution of better than 20 MHz.
- [109] arXiv:2406.14082 (cross-list from cs.LG) [pdf, other]
-
Title: FLoCoRA: Federated learning compression with low-rank adaptationLucas Grativol Ribeiro (IMT Atlantique - MEE, Lab\_STICC\_BRAIn, Lab-STICC\_2AI, LHC), Mathieu Leonardon (IMT Atlantique - MEE, Lab\_STICC\_BRAIn), Guillaume Muller (Mines Saint-Étienne MSE, FAYOL-ENSMSE, FAYOL-ENSMSE), Virginie Fresse (LHC, TSE), Matthieu Arzel (IMT Atlantique - MEE, Lab-STICC\_2AI)Journal-ref: 32nd European Signal Processing Conference EUSIPCO, Aug 2024, Lyon, FranceSubjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Low-Rank Adaptation (LoRA) methods have gained popularity in efficient parameter fine-tuning of models containing hundreds of billions of parameters. In this work, instead, we demonstrate the application of LoRA methods to train small-vision models in Federated Learning (FL) from scratch. We first propose an aggregation-agnostic method to integrate LoRA within FL, named FLoCoRA, showing that the method is capable of reducing communication costs by 4.8 times, while having less than 1% accuracy degradation, for a CIFAR-10 classification task with a ResNet-8. Next, we show that the same method can be extended with an affine quantization scheme, dividing the communication cost by 18.6 times, while comparing it with the standard method, with still less than 1% of accuracy loss, tested with on a ResNet-18 model. Our formulation represents a strong baseline for message size reduction, even when compared to conventional model compression works, while also reducing the training memory requirements due to the low-rank adaptation.
- [110] arXiv:2406.14092 (cross-list from cs.CL) [pdf, html, other]
-
Title: Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised ModelsComments: Accepted by Interspeech 2024Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Self-supervised (SSL) models have shown great performance in various downstream tasks. However, they are typically developed for limited languages, and may encounter new languages in real-world. Developing a SSL model for each new language is costly. Thus, it is vital to figure out how to efficiently adapt existed SSL models to a new language without impairing its original abilities. We propose adaptation methods which integrate LoRA to existed SSL models to extend new language. We also develop preservation strategies which include data combination and re-clustering to retain abilities on existed languages. Applied to mHuBERT, we investigate their effectiveness on speech re-synthesis task. Experiments show that our adaptation methods enable mHuBERT to be applied to a new language (Mandarin) with MOS value increased about 1.6 and the relative value of WER reduced up to 61.72%. Also, our preservation strategies ensure that the performance on both existed and new languages remains intact.
- [111] arXiv:2406.14176 (cross-list from cs.SD) [pdf, html, other]
-
Title: A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake DetectionSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
This paper addresses the challenge of developing a robust audio-visual deepfake detection model. In practical use cases, new generation algorithms are continually emerging, and these algorithms are not encountered during the development of detection methods. This calls for the generalization ability of the method. Additionally, to ensure the credibility of detection methods, it is beneficial for the model to interpret which cues from the video indicate it is fake. Motivated by these considerations, we then propose a multi-stream fusion approach with one-class learning as a representation-level regularization technique. We study the generalization problem of audio-visual deepfake detection by creating a new benchmark by extending and re-splitting the existing FakeAVCeleb dataset. The benchmark contains four categories of fake video(Real Audio-Fake Visual, Fake Audio-Fake Visual, Fake Audio-Real Visual, and unsynchronized video). The experimental results show that our approach improves the model's detection of unseen attacks by an average of 7.31% across four test sets, compared to the baseline model. Additionally, our proposed framework offers interpretability, indicating which modality the model identifies as fake.
- [112] arXiv:2406.14177 (cross-list from cs.CL) [pdf, html, other]
-
Title: SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech TranslationSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This paper describes the FBK's participation in the Simultaneous Translation Evaluation Campaign at IWSLT 2024. For this year's submission in the speech-to-text translation (ST) sub-track, we propose SimulSeamless, which is realized by combining AlignAtt and SeamlessM4T in its medium configuration. The SeamlessM4T model is used "off-the-shelf" and its simultaneous inference is enabled through the adoption of AlignAtt, a SimulST policy based on cross-attention that can be applied without any retraining or adaptation of the underlying model for the simultaneous task. We participated in all the Shared Task languages (English->{German, Japanese, Chinese}, and Czech->English), achieving acceptable or even better results compared to last year's submissions. SimulSeamless, covering more than 143 source languages and 200 target languages, is released at: this https URL.
- [113] arXiv:2406.14234 (cross-list from physics.med-ph) [pdf, html, other]
-
Title: Zero field active shieldingComments: 26 pages, 7 figuresSubjects: Medical Physics (physics.med-ph); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP); Instrumentation and Detectors (physics.ins-det); Neurons and Cognition (q-bio.NC)
Ambient field suppression is critical for accurate magnetic field measurements, and a requirement for certain low-field sensors to operate. The difference in magnitude between noise and signal (up to 10$^9$) makes the problem challenging, and solutions such as passive shielding, post-hoc processing, and most active shielding designs do not address it completely. Zero field active shielding (ZFS) achieves accurate field suppression with a feed-forward structure in which correction coils are fed by reference sensors via a matrix found using data-driven methods. Requirements are a sufficient number of correction coils and reference sensors to span the ambient field at the sensors, and to zero out the coil-to-reference sensor coupling. The solution assumes instantaneous propagation and mixing, but it can be extended to handle convolutional effects. Precise calculations based on sensor and coil geometries are not necessary, other than to improve efficiency and usability. The solution is simulated here but not implemented in hardware.
- [114] arXiv:2406.14294 (cross-list from cs.SD) [pdf, html, other]
-
Title: DASB -- Discrete Audio and Speech BenchmarkComments: 9 pages, 5 tablesSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently proposed, identifying the optimal tokenizer for various tasks is challenging due to the inconsistent evaluation settings in existing studies. To address this gap, we release the Discrete Audio and Speech Benchmark (DASB), a comprehensive leaderboard for benchmarking discrete audio tokens across a wide range of discriminative tasks, including speech recognition, speaker identification and verification, emotion recognition, keyword spotting, and intent classification, as well as generative tasks such as speech enhancement, separation, and text-to-speech. Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks. However, the performance gap between semantic tokens and standard continuous representations remains substantial, highlighting the need for further research in this field.
- [115] arXiv:2406.14329 (cross-list from cs.LG) [pdf, html, other]
-
Title: Adaptive Adversarial Cross-Entropy Loss for Sharpness-Aware MinimizationComments: Accepted in ICIP2024. The project page can be accessed at this http URLSubjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Recent advancements in learning algorithms have demonstrated that the sharpness of the loss surface is an effective measure for improving the generalization gap. Building upon this concept, Sharpness-Aware Minimization (SAM) was proposed to enhance model generalization and achieved state-of-the-art performance. SAM consists of two main steps, the weight perturbation step and the weight updating step. However, the perturbation in SAM is determined by only the gradient of the training loss, or cross-entropy loss. As the model approaches a stationary point, this gradient becomes small and oscillates, leading to inconsistent perturbation directions and also has a chance of diminishing the gradient. Our research introduces an innovative approach to further enhancing model generalization. We propose the Adaptive Adversarial Cross-Entropy (AACE) loss function to replace standard cross-entropy loss for SAM's perturbation. AACE loss and its gradient uniquely increase as the model nears convergence, ensuring consistent perturbation direction and addressing the gradient diminishing issue. Additionally, a novel perturbation-generating function utilizing AACE loss without normalization is proposed, enhancing the model's exploratory capabilities in near-optimum stages. Empirical testing confirms the effectiveness of AACE, with experiments demonstrating improved performance in image classification tasks using Wide ResNet and PyramidNet across various datasets. The reproduction code is available online
- [116] arXiv:2406.14333 (cross-list from cs.IR) [pdf, html, other]
-
Title: LARP: Language Audio Relational Pre-training for Cold-Start Playlist ContinuationSubjects: Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
As online music consumption increasingly shifts towards playlist-based listening, the task of playlist continuation, in which an algorithm suggests songs to extend a playlist in a personalized and musically cohesive manner, has become vital to the success of music streaming. Currently, many existing playlist continuation approaches rely on collaborative filtering methods to perform recommendation. However, such methods will struggle to recommend songs that lack interaction data, an issue known as the cold-start problem. Current approaches to this challenge design complex mechanisms for extracting relational signals from sparse collaborative data and integrating them into content representations. However, these approaches leave content representation learning out of scope and utilize frozen, pre-trained content models that may not be aligned with the distribution or format of a specific musical setting. Furthermore, even the musical state-of-the-art content modules are either (1) incompatible with the cold-start setting or (2) unable to effectively integrate cross-modal and relational signals. In this paper, we introduce LARP, a multi-modal cold-start playlist continuation model, to effectively overcome these limitations. LARP is a three-stage contrastive learning framework that integrates both multi-modal and relational signals into its learned representations. Our framework uses increasing stages of task-specific abstraction: within-track (language-audio) contrastive loss, track-track contrastive loss, and track-playlist contrastive loss. Experimental results on two publicly available datasets demonstrate the efficacy of LARP over uni-modal and multi-modal models for playlist continuation in a cold-start setting. Code and dataset are released at: this https URL.
- [117] arXiv:2406.14338 (cross-list from cs.RO) [pdf, html, other]
-
Title: Adaptive Robust Controller for handling Unknown Uncertainty of Robotic ManipulatorsSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
The ability to achieve precise and smooth trajectory tracking is crucial for ensuring the successful execution of various tasks involving robotic manipulators. State-of-the-art techniques require accurate mathematical models of the robot dynamics, and robustness to model uncertainties is achieved by relying on precise bounds on the model mismatch. In this paper, we propose a novel adaptive robust feedback linearization scheme able to compensate for model uncertainties without any a-priori knowledge on them, and we provide a theoretical proof of convergence under mild assumptions. We evaluate the method on a simulated RR robot. First, we consider a nominal model with known model mismatch, which allows us to compare our strategy with state-of-the-art uncertainty-aware methods. Second, we implement the proposed control law in combination with a learned model, for which uncertainty bounds are not available. Results show that our method leads to performance comparable to uncertainty-aware methods while requiring less prior knowledge.
- [118] arXiv:2406.14361 (cross-list from cs.AI) [pdf, html, other]
-
Title: Robustness Analysis of AI Models in Critical Energy SystemsSubjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
This paper analyzes the robustness of state-of-the-art AI-based models for power grid operations under the $N-1$ security criterion. While these models perform well in regular grid settings, our results highlight a significant loss in accuracy following the disconnection of a line.%under this security criterion. Using graph theory-based analysis, we demonstrate the impact of node connectivity on this loss. Our findings emphasize the need for practical scenario considerations in developing AI methodologies for critical infrastructure.
- [119] arXiv:2406.14458 (cross-list from cs.LG) [pdf, html, other]
-
Title: Centimeter Positioning Accuracy using AI/ML for 6G ApplicationsComments: 2 Pages, 2 Figures, ICMLCN Conference, Stockholm, SwedenSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Signal Processing (eess.SP)
This research looks at using AI/ML to achieve centimeter-level user positioning in 6G applications such as the Industrial Internet of Things (IIoT). Initial results show that our AI/ML-based method can estimate user positions with an accuracy of 17 cm in an indoor factory environment. In this proposal, we highlight our approaches and future directions.
- [120] arXiv:2406.14464 (cross-list from cs.SD) [pdf, html, other]
-
Title: A Review of Common Online Speaker Diarization MethodsComments: 6 pagesSubjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Speaker diarization provides the answer to the question "who spoke when?" for an audio file. This information can be used to complete audio transcripts for further processing steps. Most speaker diarization systems assume that the audio file is available as a whole. However, there are scenarios in which the speaker labels are needed immediately after the arrival of an audio segment. Speaker diarization with a correspondingly low latency is referred to as online speaker diarization. This paper provides an overview. First the history of online speaker diarization is briefly presented. Next a taxonomy and datasets for training and evaluation are given. In the sections that follow, online diarization methods and systems are discussed in detail. This paper concludes with the presentation of challenges that still need to be solved by future research in the field of online speaker diarization.
- [121] arXiv:2406.14485 (cross-list from cs.AI) [pdf, other]
-
Title: Proceedings of The second international workshop on eXplainable AI for the Arts (XAIxArts)Nick Bryan-Kinns, Corey Ford, Shuoyang Zheng, Helen Kennedy, Alan Chamberlain, Makayla Lewis, Drew Hemment, Zijin Li, Qiong Wu, Lanxi Xiao, Gus Xia, Jeba Rezwana, Michael Clemens, Gabriel VigliensoniSubjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This second international workshop on explainable AI for the Arts (XAIxArts) brought together a community of researchers in HCI, Interaction Design, AI, explainable AI (XAI), and digital arts to explore the role of XAI for the Arts. Workshop held at the 16th ACM Conference on Creativity and Cognition (C&C 2024), Chicago, USA.
- [122] arXiv:2406.14559 (cross-list from cs.SD) [pdf, html, other]
-
Title: Disentangled Representation Learning for Environment-agnostic Speaker RecognitionComments: Interspeech 2024. The official webpage can be found at this https URLSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
This work presents a framework based on feature disentanglement to learn speaker embeddings that are robust to environmental variations. Our framework utilises an auto-encoder as a disentangler, dividing the input speaker embedding into components related to the speaker and other residual information. We employ a group of objective functions to ensure that the auto-encoder's code representation - used as the refined embedding - condenses only the speaker characteristics. We show the versatility of our framework through its compatibility with any existing speaker embedding extractor, requiring no structural modifications or adaptations for integration. We validate the effectiveness of our framework by incorporating it into two popularly used embedding extractors and conducting experiments across various benchmarks. The results show a performance improvement of up to 16%. We release our code for this work to be available this https URL
Cross submissions for Friday, 21 June 2024 (showing 47 of 47 entries )
- [123] arXiv:2206.02909 (replaced) [pdf, html, other]
-
Title: Self-supervised Learning for Human Activity Recognition Using 700,000 Person-days of Wearable DataHang Yuan, Shing Chan, Andrew P. Creagh, Catherine Tong, Aidan Acquah, David A. Clifton, Aiden DohertyJournal-ref: npj Digit. Med. 7, 91 (2024)Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Advances in deep learning for human activity recognition have been relatively limited due to the lack of large labelled datasets. In this study, we leverage self-supervised learning techniques on the UK-Biobank activity tracker dataset--the largest of its kind to date--containing more than 700,000 person-days of unlabelled wearable sensor data. Our resulting activity recognition model consistently outperformed strong baselines across seven benchmark datasets, with an F1 relative improvement of 2.5%-100% (median 18.4%), the largest improvements occurring in the smaller datasets. In contrast to previous studies, our results generalise across external datasets, devices, and environments. Our open-source model will help researchers and developers to build customisable and generalisable activity classifiers with high performance.
- [124] arXiv:2307.01927 (replaced) [pdf, html, other]
-
Title: Safe Connectivity Maintenance of Underactuated Multi-Agent Networks in Dynamic Oceanic EnvironmentsComments: 8 pages, Published at European Control Conference 2024 (ECC 2024) Nicolas Hoischen and Marius Wiggert contributed equally to this workSubjects: Systems and Control (eess.SY)
Autonomous multi-agent systems are increasingly being deployed in environments where winds and ocean currents have a significant influence. Recent work has developed control policies for single agents that leverage flows to achieve their objectives in dynamic environments. However, in multi-agent systems, these flows can cause agents to collide or drift apart and lose direct inter-agent communications, especially when agents have low propulsion capabilities. To address these challenges, we propose a hierarchical multi-agent control approach that allows arbitrary single-agent performance policies that are unaware of other agents to be used in multi-agent systems while ensuring safe operation. We first develop a safety controller using potential functions, solely dedicated to avoiding collisions and maintaining inter-agent communication. Next, we design a low-interference safe interaction (LISIC) policy that trades off the performance policy and the safety control to ensure safe and performant operation. Specifically, when the agents are at an appropriate distance, LISIC prioritizes the performance policy while smoothly increasing the safety controller when necessary. We prove that under mild assumptions on the flows experienced by the agents, our approach can guarantee safety. Additionally, we demonstrate the effectiveness of our method in realistic settings through an extensive empirical analysis with simulations of fleets of underactuated autonomous surface vehicles operating in dynamic ocean currents where these assumptions do not always hold.
- [125] arXiv:2309.06718 (replaced) [pdf, html, other]
-
Title: Immersion and Invariance-based Disturbance Observer and Its Application to Safe ControlComments: Accepted to IEEE Transactions on Automatic ControlJournal-ref: 10.1109/TAC.2024.3416323Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
When the disturbance input matrix is nonlinear, existing disturbance observer design methods rely on the solvability of a partial differential equation or the existence of an output function with a uniformly well-defined disturbance relative degree, which can pose significant limitations. This note introduces a systematic approach for designing an Immersion and Invariance-based Disturbance Observer (IIDOB) that circumvents these strong assumptions. The proposed IIDOB ensures the disturbance estimation error is globally uniformly ultimately bounded by approximately solving a partial differential equation while compensating for the approximation error. Furthermore, by integrating IIDOB into the framework of control barrier functions, a filter-based safe control design method for control-affine systems with disturbances is established where the filter is used to generate an alternative disturbance estimation signal with a known derivative. Sufficient conditions are established to guarantee the safety of the disturbed systems. Simulation results demonstrate the effectiveness of the proposed method.
- [126] arXiv:2309.14645 (replaced) [pdf, html, other]
-
Title: A nonparametric learning framework for nonlinear robust output regulationComments: 17 pages; Nonlinear control; iISS stability; output regulation; parameter estimation; Non-adaptive controlSubjects: Systems and Control (eess.SY); Optimization and Control (math.OC); Adaptation and Self-Organizing Systems (nlin.AO)
A nonparametric learning solution framework is proposed for the global nonlinear robust output regulation problem. We first extend the assumption that the steady-state generator is linear in the exogenous signal to the more relaxed assumption that it is polynomial in the exogenous signal. Additionally, a nonparametric learning framework is proposed to eliminate the construction of an explicit regressor, as required in the adaptive method, which can potentially simplify the implementation and reduce the computational complexity of existing methods. With the help of the proposed framework, the robust nonlinear output regulation problem can be converted into a robust non-adaptive stabilization problem for the augmented system with integral input-to-state stable (iISS) inverse dynamics. Moreover, a dynamic gain approach can adaptively raise the gain to a sufficiently large constant to achieve stabilization without requiring any a priori knowledge of the uncertainties appearing in the dynamics of the exosystem and the system. Furthermore, we apply the nonparametric learning framework to globally reconstruct and estimate multiple sinusoidal signals with unknown frequencies without the need for adaptive parametric techniques. An explicit nonlinear mapping can directly provide the estimated parameters, which will exponentially converge to the unknown frequencies. Finally, a feedforward control design is proposed to solve the linear output regulation problem using the nonparametric learning framework. Two simulation examples are provided to illustrate the effectiveness of the theoretical results.
- [127] arXiv:2309.16792 (replaced) [pdf, html, other]
-
Title: Agent Coordination via Contextual Regression (AgentCONCUR) for Data Center FlexibilitySubjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
A network of spatially distributed data centers can provide operational flexibility to power systems by shifting computing tasks among electrically remote locations. However, harnessing this flexibility in real-time through the standard optimization techniques is challenged by the need for sensitive operational datasets and substantial computational resources. To alleviate the data and computational requirements, this paper introduces a coordination mechanism based on contextual regression. This mechanism, abbreviated as AgentCONCUR, associates cost-optimal task shifts with public and trusted contextual data (e.g., real-time prices) and uses regression on this data as a coordination policy. Notably, regression-based coordination does not learn the optimal coordination actions from a labeled dataset. Instead, it exploits the optimization structure of the coordination problem to ensure feasible and cost-effective actions. A NYISO-based study reveals large coordination gains and the optimal features for the successful regression-based coordination.
- [128] arXiv:2310.03559 (replaced) [pdf, html, other]
-
Title: MedSyn: Text-guided Anatomy-aware Synthesis of High-Fidelity 3D CT ImagesSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
This paper introduces an innovative methodology for producing high-quality 3D lung CT images guided by textual information. While diffusion-based generative models are increasingly used in medical imaging, current state-of-the-art approaches are limited to low-resolution outputs and underutilize radiology reports' abundant information. The radiology reports can enhance the generation process by providing additional guidance and offering fine-grained control over the synthesis of images. Nevertheless, expanding text-guided generation to high-resolution 3D images poses significant memory and anatomical detail-preserving challenges. Addressing the memory issue, we introduce a hierarchical scheme that uses a modified UNet architecture. We start by synthesizing low-resolution images conditioned on the text, serving as a foundation for subsequent generators for complete volumetric data. To ensure the anatomical plausibility of the generated samples, we provide further guidance by generating vascular, airway, and lobular segmentation masks in conjunction with the CT images. The model demonstrates the capability to use textual input and segmentation tasks to generate synthesized images. The results of comparative assessments indicate that our approach exhibits superior performance compared to the most advanced models based on GAN and diffusion techniques, especially in accurately retaining crucial anatomical features such as fissure lines, airways, and vascular structures. This innovation introduces novel possibilities. This study focuses on two main objectives: (1) the development of a method for creating images based on textual prompts and anatomical components, and (2) the capability to generate new images conditioning on anatomical elements. The advancements in image generation can be applied to enhance numerous downstream tasks.
- [129] arXiv:2311.00433 (replaced) [pdf, other]
-
Title: Decentralized PI-control and Anti-windup in Resource Sharing NetworksSubjects: Systems and Control (eess.SY)
We consider control of multiple stable first-order \review{agents} which have a control coupling described by an M-matrix. These agents are subject to incremental sector-bounded \review{input} nonlinearities. We show that such plants can be globally asymptotically stabilized to a unique equilibrium using fully decentralized proportional-integral controllers equipped with anti-windup and subject to local tuning rules. In addition, we show that when the nonlinearities correspond to the saturation function, the closed loop asymptotically minimizes a weighted 1-norm of the agents state mismatch. The control strategy is finally compared to other state-of-the-art controllers on a numerical district heating example.
- [130] arXiv:2311.00483 (replaced) [pdf, html, other]
-
Title: DEFN: Dual-Encoder Fourier Group Harmonics Network for Three-Dimensional Indistinct-Boundary Object SegmentationXiaohua Jiang, Yihao Guo, Jian Huang, Yuting Wu, Meiyi Luo, Zhaoyang Xu, Qianni Zhang, Xingru Huang, Hong He, Shaowei Jiang, Jing Ye, Mang XiaoComments: 36pages,16figures,7tablesSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
The precise spatial and quantitative delineation of indistinct-boundary medical objects is paramount for the accuracy of diagnostic protocols, efficacy of surgical interventions, and reliability of postoperative assessments. Despite their significance, the effective segmentation and instantaneous three-dimensional reconstruction are significantly impeded by the paucity of representative samples in available datasets and noise artifacts. To surmount these challenges, we introduced Stochastic Defect Injection (SDi) to augment the representational diversity of challenging indistinct-boundary objects within training corpora. Consequently, we propose the Dual-Encoder Fourier Group Harmonics Network (DEFN) to tailor noise filtration, amplify detailed feature recognition, and bolster representation across diverse medical imaging scenarios. By incorporating Dynamic Weight Composing (DWC) loss dynamically adjusts model's focus based on training progression, DEFN achieves SOTA performance on the OIMHS public dataset, showcasing effectiveness in indistinct boundary contexts. Source code for DEFN is available at: this https URL.
- [131] arXiv:2311.10224 (replaced) [pdf, html, other]
-
Title: CV-Attention UNet: Attention-based UNet for 3D Cerebrovascular Segmentation of Enhanced TOF-MRA ImagesSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Due to the lack of automated methods, to diagnose cerebrovascular disease, time-of-flight magnetic resonance angiography (TOF-MRA) is assessed visually, making it time-consuming. The commonly used encoder-decoder architectures for cerebrovascular segmentation utilize redundant features, eventually leading to the extraction of low-level features multiple times. Additionally, convolutional neural networks (CNNs) suffer from performance degradation when the batch size is small, and deeper networks experience the vanishing gradient problem. Methods: In this paper, we attempt to solve these limitations and propose the 3D cerebrovascular attention UNet method, named CV-AttentionUNet, for precise extraction of brain vessel images. We proposed a sequence of preprocessing techniques followed by deeply supervised UNet to improve the accuracy of segmentation of the brain vessels leading to a stroke. To combine the low and high semantics, we applied the attention mechanism. This mechanism focuses on relevant associations and neglects irrelevant anatomical information. Furthermore, the inclusion of deep supervision incorporates different levels of features that prove to be beneficial for network convergence. Results: We demonstrate the efficiency of the proposed method by cross-validating with an unlabeled dataset, which was further labeled by us. We believe that the novelty of this algorithm lies in its ability to perform well on both labeled and unlabeled data with image processing-based enhancement. The results indicate that our method performed better than the existing state-of-the-art methods on the TubeTK dataset. Conclusion: The proposed method will help in accurate segmentation of cerebrovascular structure leading to stroke
- [132] arXiv:2312.04022 (replaced) [pdf, html, other]
-
Title: Analysis of Coding Gain Due to In-Loop ReshapingComments: Published in IEEE Transactions on Image ProcessingSubjects: Image and Video Processing (eess.IV); Information Theory (cs.IT)
Reshaping, a point operation that alters the characteristics of signals, has been shown capable of improving the compression ratio in video coding practices. Out-of-loop reshaping that directly modifies the input video signal was first adopted as the supplemental enhancement information (SEI) for the HEVC/H.265 without the need to alter the core design of the video codec. VVC/H.266 further improves the coding efficiency by adopting in-loop reshaping that modifies the residual signal being processed in the hybrid coding loop. In this paper, we theoretically analyze the rate-distortion performance of the in-loop reshaping and use experiments to verify the theoretical result. We prove that the in-loop reshaping can improve coding efficiency when the entropy coder adopted in the coding pipeline is suboptimal, which is in line with the practical scenarios that video codecs operate in. We derive the PSNR gain in a closed form and show that the theoretically predicted gain is consistent with that measured from experiments using standard testing video sequences.
- [133] arXiv:2312.05547 (replaced) [pdf, html, other]
-
Title: Signatures Meet Dynamic Programming: Generalizing Bellman Equations for Trajectory FollowingComments: 48 pages, 21 figuresJournal-ref: 6th Annual Conference on Learning for Dynamics and Control (2024)Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
Path signatures have been proposed as a powerful representation of paths that efficiently captures the path's analytic and geometric characteristics, having useful algebraic properties including fast concatenation of paths through tensor products. Signatures have recently been widely adopted in machine learning problems for time series analysis. In this work we establish connections between value functions typically used in optimal control and intriguing properties of path signatures. These connections motivate our novel control framework with signature transforms that efficiently generalizes the Bellman equation to the space of trajectories. We analyze the properties and advantages of the framework, termed signature control. In particular, we demonstrate that (i) it can naturally deal with varying/adaptive time steps; (ii) it propagates higher-level information more efficiently than value function updates; (iii) it is robust to dynamical system misspecification over long rollouts. As a specific case of our framework, we devise a model predictive control method for path tracking. This method generalizes integral control, being suitable for problems with unknown disturbances. The proposed algorithms are tested in simulation, with differentiable physics models including typical control and robotics tasks such as point-mass, curve following for an ant model, and a robotic manipulator.
- [134] arXiv:2401.06422 (replaced) [pdf, html, other]
-
Title: Joint Mechanical and Electrical Adjustment of IRS-aided LEO Satellite MIMO CommunicationsComments: 5 pages, 6 figuresSubjects: Signal Processing (eess.SP); Systems and Control (eess.SY)
In this letter, we propose a joint mechanical and electrical adjustment of intelligent reflecting surface (IRS) for the performance improvements of low-earth orbit (LEO) satellite multiple-input multiple-output (MIMO) communications. In particular, we construct a three-dimensional (3D) MIMO channel model for the mechanically-tilted IRS in general deployment, and consider two types of scenarios with and without the direct path of LEO-ground user link due to the orbital flight. With the aim of maximizing the end-to-end performance, we jointly optimize tilting angle and phase shift of IRS along with the transceiver beamforming, whose performance superiority is verified via simulations with the Orbcomm LEO satellite using a real orbit data.
- [135] arXiv:2402.05210 (replaced) [pdf, html, other]
-
Title: Anatomically-Controllable Medical Image Generation with Segmentation-Guided Diffusion ModelsComments: Accepted at MICCAI 2024. Code and synthetic dataset: this https URLSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Diffusion models have enabled remarkably high-quality medical image generation, yet it is challenging to enforce anatomical constraints in generated images. To this end, we propose a diffusion model-based method that supports anatomically-controllable medical image generation, by following a multi-class anatomical segmentation mask at each sampling step. We additionally introduce a random mask ablation training algorithm to enable conditioning on a selected combination of anatomical constraints while allowing flexibility in other anatomical areas. We compare our method ("SegGuidedDiff") to existing methods on breast MRI and abdominal/neck-to-pelvis CT datasets with a wide range of anatomical objects. Results show that our method reaches a new state-of-the-art in the faithfulness of generated images to input anatomical masks on both datasets, and is on par for general anatomical realism. Finally, our model also enjoys the extra benefit of being able to adjust the anatomical similarity of generated images to real images of choice through interpolation in its latent space. SegGuidedDiff has many applications, including cross-modality translation, and the generation of paired or counterfactual data. Our code is available at this https URL.
- [136] arXiv:2403.10271 (replaced) [pdf, html, other]
-
Title: SuperM2M: Supervised and Mixture-to-Mixture Co-Learning for Speech Enhancement and Robust ASRComments: in submissionSubjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
The current dominant approach for neural speech enhancement is based on supervised learning by using simulated training data. The trained models, however, often exhibit limited generalizability to real-recorded data. To address this, this paper investigates training enhancement models directly on real target-domain data. We propose to adapt mixture-to-mixture (M2M) training, originally designed for speaker separation, for speech enhancement, by modeling multi-source noise signals as a single, combined source. In addition, we propose a co-learning algorithm that improves M2M with the help of supervised algorithms. When paired close-talk and far-field mixtures are available for training, M2M realizes speech enhancement by training a deep neural network (DNN) to produce speech and noise estimates in a way such that they can be linearly filtered to reconstruct the close-talk and far-field mixtures. This way, the DNN can be trained directly on real mixtures, and can leverage close-talk and far-field mixtures as a weak supervision to enhance far-field mixtures. To improve M2M, we combine it with supervised approaches to co-train the DNN, where mini-batches of real close-talk and far-field mixture pairs and mini-batches of simulated mixture and clean speech pairs are alternately fed to the DNN, and the loss functions are respectively (a) the mixture reconstruction loss on the real close-talk and far-field mixtures and (b) the regular enhancement loss on the simulated clean speech and noise. We find that, this way, the DNN can learn from real and simulated data to achieve better generalization to real data. We name this algorithm SuperM2M (supervised and mixture-to-mixture co-learning). Evaluation results on the CHiME-4 dataset show its effectiveness and potential.
- [137] arXiv:2403.18564 (replaced) [pdf, html, other]
-
Title: Reachability Analysis Using Constrained Polynomial Logical ZonotopesComments: IEEE Control Systems Letters (2024)Subjects: Systems and Control (eess.SY); Logic in Computer Science (cs.LO)
In this paper, we propose reachability analysis using constrained polynomial logical zonotopes. We perform reachability analysis to compute the set of states that could be reached. To do this, we utilize a recently introduced set representation called polynomial logical zonotopes for performing computationally efficient and exact reachability analysis on logical systems. Notably, polynomial logical zonotopes address the "curse of dimensionality" when analyzing the reachability of logical systems since the set representation can represent $2^h$ binary vectors using $h$ generators. After finishing the reachability analysis, the formal verification involves verifying whether the intersection of the calculated reachable set and the unsafe set is empty or not. Polynomial logical zonotopes lack closure under intersections, prompting the formulation of constrained polynomial logical zonotopes, which preserve the computational efficiency and exactness of polynomial logical zonotopes for reachability analysis while enabling exact intersections. Additionally, an extensive empirical study is presented to demonstrate and validate the advantages of constrained polynomial logical zonotopes.
- [138] arXiv:2404.01929 (replaced) [pdf, other]
-
Title: Towards Enhanced Analysis of Lung Cancer Lesions in EBUS-TBNA -- A Semi-Supervised Video Object Detection MethodSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
This study aims to establish a computer-aided diagnostic system for lung lesions using endobronchial ultrasound (EBUS) to assist physicians in identifying lesion areas. During EBUS-transbronchial needle aspiration (EBUS-TBNA) procedures, hysicians rely on grayscale ultrasound images to determine the location of lesions. However, these images often contain significant noise and can be influenced by surrounding tissues or blood vessels, making identification challenging. Previous research has lacked the application of object detection models to EBUS-TBNA, and there has been no well-defined solution for the lack of annotated data in the EBUS-TBNA dataset. In related studies on ultrasound images, although models have been successful in capturing target regions for their respective tasks, their training and predictions have been based on two-dimensional images, limiting their ability to leverage temporal features for improved predictions. This study introduces a three-dimensional video-based object detection model. It first generates a set of improved queries using a diffusion model, then captures temporal correlations through an attention mechanism. A filtering mechanism selects relevant information from previous frames to pass to the current frame. Subsequently, a teacher-student model training approach is employed to further optimize the model using unlabeled data. By incorporating various data augmentation and feature alignment, the model gains robustness against interference. Test results demonstrate that this model, which captures spatiotemporal information and employs semi-supervised learning methods, achieves an Average Precision (AP) of 48.7 on the test dataset, outperforming other models. It also achieves an Average Recall (AR) of 79.2, significantly leading over existing models.
- [139] arXiv:2404.07970 (replaced) [pdf, html, other]
-
Title: Differentiable All-pole Filters for Time-varying Audio SystemsChin-Yun Yu, Christopher Mitcheltree, Alistair Carson, Stefan Bilbao, Joshua D. Reiss, György FazekasComments: Accepted at DAFx 2024Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Infinite impulse response filters are an essential building block of many time-varying audio systems, such as audio effects and synthesisers. However, their recursive structure impedes end-to-end training of these systems using automatic differentiation. Although non-recursive filter approximations like frequency sampling and frame-based processing have been proposed and widely used in previous works, they cannot accurately reflect the gradient of the original system. We alleviate this difficulty by re-expressing a time-varying all-pole filter to backpropagate the gradients through itself, so the filter implementation is not bound to the technical limitations of automatic differentiation frameworks. This implementation can be employed within audio systems containing filters with poles for efficient gradient evaluation. We demonstrate its training efficiency and expressive capabilities for modelling real-world dynamic audio systems on a phaser, time-varying subtractive synthesiser, and feed-forward compressor. We make our code and audio samples available and provide the trained audio effect and synth models in a VST plugin at this https URL.
- [140] arXiv:2405.11032 (replaced) [pdf, html, other]
-
Title: Game-theoretic Energy Management Strategies With Interacting Agents in Formula 1Subjects: Systems and Control (eess.SY)
This paper presents an interaction-aware energy management optimization framework for Formula 1 racing. The considered scenario involves two agents and a drag reduction model. Strategic interactions between the agents are captured by a Stackelberg game formulated as a bilevel program. To address the computational challenges associated with bilevel optimization, the problem is reformulated as a single-level nonlinear program employing the Karush-Kuhn-Tucker conditions. The proposed framework contributes towards the development of new energy management and allocation strategies, caused by the presence of another agent. For instance, it provides valuable insights on how to redistribute the energy in order to optimally exploit the wake effect, showcasing a notable difference with the behavior studied in previous works. Robust energy allocations can be identified to reduce the lap time loss associated with unexpected choices of the other agent. It allows to recognize the boundary conditions for the interaction to become relevant, impacting the system's behavior, and to assess if overtaking is possible and beneficial. Overall, the framework provides a comprehensive approach for a two-agent Formula 1 racing problem with strategic interactions, offering physically intuitive and practical results.
- [141] arXiv:2406.00621 (replaced) [pdf, html, other]
-
Title: Log-Scale Quantization in Distributed First-Order Methods: Gradient-based Learning from Distributed DataMohammadreza Doostmohammadian, Muhammad I. Qureshi, Mohammad Hossein Khalesi, Hamid R. Rabiee, Usman A. KhanSubjects: Systems and Control (eess.SY); Signal Processing (eess.SP); Optimization and Control (math.OC)
Decentralized strategies are of interest for learning from large-scale data over networks. This paper studies learning over a network of geographically distributed nodes/agents subject to quantization. Each node possesses a private local cost function, collectively contributing to a global cost function, which the proposed methodology aims to minimize. In contrast to many existing literature, the information exchange among nodes is quantized. We adopt a first-order computationally-efficient distributed optimization algorithm (with no extra inner consensus loop) that leverages node-level gradient correction based on local data and network-level gradient aggregation only over nearby nodes. This method only requires balanced networks with no need for stochastic weight design. It can handle log-scale quantized data exchange over possibly time-varying and switching network setups. We analyze convergence over both structured networks (for example, training over data-centers) and ad-hoc multi-agent networks (for example, training over dynamic robotic networks). Through analysis and experimental validation, we show that (i) structured networks generally result in a smaller optimality gap, and (ii) logarithmic quantization leads to smaller optimality gap compared to uniform quantization.
- [142] arXiv:2406.03657 (replaced) [pdf, html, other]
-
Title: UrBAN: Urban Beehive Acoustics and PheNotyping DatasetMahsa Abdollahi, Yi Zhu, Heitor R. Guimarães, Nico Coallier, Ségolène Maucourt, Pierre Giovenazzo, Tiago H. FalkSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
In this paper, we present a multimodal dataset obtained from a honey bee colony in Montréal, Quebec, Canada, spanning the years of 2021 to 2022. This apiary comprised 10 beehives, with microphones recording more than 2000 hours of high quality raw audio, and also sensors capturing temperature, and humidity. Periodic hive inspections involved monitoring colony honey bee population changes, assessing queen-related conditions, and documenting overall hive health. Additionally, health metrics, such as Varroa mite infestation rates and winter mortality assessments were recorded, offering valuable insights into factors affecting hive health status and resilience. In this study, we first outline the data collection process, sensor data description, and dataset structure. Furthermore, we demonstrate a practical application of this dataset by extracting various features from the raw audio to predict colony population using the number of frames of bees as a proxy.
- [143] arXiv:2406.04188 (replaced) [pdf, html, other]
-
Title: Digital Twin Aided RIS Communication: Robust Beamforming and Interference ManagementComments: Dataset and code files will be available soon on the DeepMIMIO website: this https URLSubjects: Signal Processing (eess.SP); Information Theory (cs.IT)
Reconfigurable intelligent surfaces (RISs) are envisioned to play a key role in future wireless communication networks. However, channel estimation in RIS-aided wireless networks is challenging due to their passive nature and the large number of reflective elements, leading to high channel estimation overhead. Additionally, conventional methods like beam sweeping, which do not rely on explicit channel state information, often struggle in managing interference in multi-user networks. In this paper, we propose a novel approach that leverages digital twins (DTs) of the physical environments to approximate channels using electromagnetic 3D models and ray tracing, thus relaxing the need for channel estimation and extensive over-the-air computations in RIS-aided wireless networks. To address the digital twins channel approximation errors, we further refine this approach with a DT-specific robust transmission design that reliably meets minimum desired rates. The results show that our method secures these rates over 90% of the time, significantly outperforming beam sweeping, which achieves these rates less than 8% of the time due to its poor management of transmitting power and interference.
- [144] arXiv:2406.04786 (replaced) [pdf, html, other]
-
Title: Asymptotic Analysis of Near-Field Coupling in Massive MISO and Massive SIMO SystemsComments: Accepted version of the article published in IEEE Communications Letters, 2024. DOI: https://doi.org/10.1109/LCOMM.2024.3416044Subjects: Signal Processing (eess.SP); Systems and Control (eess.SY)
This paper studies the receiver to transmitter antenna coupling in near-field communications with massive arrays. Although most works in the literature consider that it is negligible and approximate it by zero, there is no rigorous analysis on its relevance for practical systems. In this work, we leverage multiport communication theory to obtain conditions for the aforementioned approximation to be valid in MISO and SIMO systems. These conditions are then particularized for arrays with fixed inter-element spacing and arrays with fixed size.
- [145] arXiv:2406.05128 (replaced) [pdf, html, other]
-
Title: Differentiable Time-Varying Linear Prediction in the Context of End-to-End Analysis-by-SynthesisComments: Accepted at Interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Training the linear prediction (LP) operator end-to-end for audio synthesis in modern deep learning frameworks is slow due to its recursive formulation. In addition, frame-wise approximation as an acceleration method cannot generalise well to test time conditions where the LP is computed sample-wise. Efficient differentiable sample-wise LP for end-to-end training is the key to removing this barrier. We generalise the efficient time-invariant LP implementation from the GOLF vocoder to time-varying cases. Combining this with the classic source-filter model, we show that the improved GOLF learns LP coefficients and reconstructs the voice better than its frame-wise counterparts. Moreover, in our listening test, synthesised outputs from GOLF scored higher in quality ratings than the state-of-the-art differentiable WORLD vocoder.
- [146] arXiv:2406.05763 (replaced) [pdf, html, other]
-
Title: WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model BenchmarkLinhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei XieComments: Accepted by INTERSPEECH2024Subjects: Audio and Speech Processing (eess.AS)
With the development of large text-to-speech (TTS) models and scale-up of the training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. Tailored for the text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio quality, and eliminating speaker mixing within each segment. Following a more accurate transcription process and quality-based data filtering process, the obtained WenetSpeech4TTS corpus contains $12,800$ hours of paired audio-text data. Furthermore, we have created subsets of varying sizes, categorized by segment quality scores to allow for TTS model training and fine-tuning. VALL-E and NaturalSpeech 2 systems are trained and fine-tuned on these subsets to validate the usability of WenetSpeech4TTS, establishing baselines on benchmark for fair comparison of TTS systems. The corpus and corresponding benchmarks are publicly available on huggingface.
- [147] arXiv:2406.06247 (replaced) [pdf, other]
-
Title: Image Compression with Isotropic and Anisotropic Shepard InpaintingComments: 37 pages, 8 figuresSubjects: Image and Video Processing (eess.IV)
Inpainting-based codecs store sparse selected pixel data and decode by reconstructing the discarded image parts by inpainting. Successful codecs (coders and decoders) traditionally use inpainting operators that solve partial differential equations. This requires some numerical expertise if efficient implementations are necessary. Our goal is to investigate variants of Shepard inpainting as simple alternatives for inpainting-based compression. They can be implemented efficiently when we localise their weighting function. To turn them into viable codecs, we have to introduce novel extensions of classical Shepard interpolation that adapt successful ideas from previous codecs: Anisotropy allows direction-dependent inpainting, which improves reconstruction quality. Additionally, we incorporate data selection by subdivision as an efficient way to tailor the stored information to the image structure. On the encoding side, we introduce the novel concept of joint inpainting and prediction for isotropic Shepard codecs, where storage cost can be reduced based on intermediate inpainting results. In an ablation study, we show the usefulness of these individual contributions and demonstrate that they offer synergies which elevate the performance of Shepard inpainting to surprising levels. Our resulting approaches offer a more favourable trade-off between simplicity and quality than traditional inpainting-based codecs. Experiments show that they can outperform JPEG and JPEG2000 at high compression ratios.
- [148] arXiv:2406.10453 (replaced) [pdf, html, other]
-
Title: Fast Geometric Learning of MIMO Signal Detection over Grassmannian ManifoldsSubjects: Systems and Control (eess.SY)
Domain or statistical distribution shifts are a key staple of the wireless communication channel, because of the dynamics of the environment. Deep learning (DL) models for detecting multiple-input multiple-output (MIMO) signals in dynamic communication require large training samples (in the order of hundreds of thousands to millions) and online retraining to adapt to domain shift. Some dynamic networks, such as vehicular networks, cannot tolerate the waiting time associated with gathering a large number of training samples or online fine-tuning which incurs significant end-to-end delay. In this paper, a novel classification technique based on the concept of geodesic flow kernel (GFK) is proposed for MIMO signal detection. In particular, received MIMO signals are first represented as points on Grassmannian manifolds by formulating basis of subspaces spanned by the rows vectors of the received signal. Then, the domain shift is modeled using a geodesic flow kernel integrating the subspaces that lie on the geodesic to characterize changes in geometric and statistical properties of the received signals. The kernel derives low-dimensional representations of the received signals over the Grassman manifolds that are invariant to domain shift and is used in a geometric support vector machine (G-SVM) algorithm for MIMO signal detection in an unsupervised manner. Simulation results reveal that the proposed method achieves promising performance against the existing baselines like OAMPnet and MMNet with only 1,200 training samples and without online retraining.
- [149] arXiv:2406.11697 (replaced) [pdf, html, other]
-
Title: GridSweep Simulation: Measuring Subsynchronous Impedance Spectra of Distribution FeederComments: 10 pages, 18 figuresSubjects: Systems and Control (eess.SY)
Peaks and troughs in the subsynchronous impedance spectrum of a distribution feeder may be a useful indication of oscillation risk, or more importantly lack of oscillation risk, if inverter-based resource (IBR) deployments are increased on that feeder. GridSweep is a new instrument for measuring the subsynchronous impedance spectra of distribution feeders. It combines an active probing device that modulates a 120-volt 1-kW load sinusoidally at a user-selected GPS-phase locked frequency from 1.0 to 40.0 Hz, and with a recorder that takes ultra-high-precision continuous point-on-wave (CPOW) 120-volt synchrowaveforms at 4 kHz. This paper presents a computer simulation of GridSweep's probing and measurement capability. We construct an electromagnetic transient (EMT) simulation of a single-phase distribution feeder equipped with multiple inverter-based resources (IBRs). We include a model of the GridSweep probing device, then demonstrate the model's capability to measure the subsynchronous apparent impedance spectrum of the feeder. Peaks in that spectrum align with the system's dominant oscillation modes caused by IBRs.
- [150] arXiv:2406.12186 (replaced) [pdf, html, other]
-
Title: Unlocking the Potential of Early Epochs: Uncertainty-aware CT Metal Artifact ReductionSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
In computed tomography (CT), the presence of metallic implants in patients often leads to disruptive artifacts in the reconstructed images, hindering accurate diagnosis. Recently, a large amount of supervised deep learning-based approaches have been proposed for metal artifact reduction (MAR). However, these methods neglect the influence of initial training weights. In this paper, we have discovered that the uncertainty image computed from the restoration result of initial training weights can effectively highlight high-frequency regions, including metal artifacts. This observation can be leveraged to assist the MAR network in removing metal artifacts. Therefore, we propose an uncertainty constraint (UC) loss that utilizes the uncertainty image as an adaptive weight to guide the MAR network to focus on the metal artifact region, leading to improved restoration. The proposed UC loss is designed to be a plug-and-play method, compatible with any MAR framework, and easily adoptable. To validate the effectiveness of the UC loss, we conduct extensive experiments on the public available Deeplesion and CLINIC-metal dataset. Experimental results demonstrate that the UC loss further optimizes the network training process and significantly improves the removal of metal artifacts.
- [151] arXiv:2204.06328 (replaced) [pdf, html, other]
-
Title: HuBERT-EE: Early Exiting HuBERT for Efficient Speech RecognitionComments: Accepted by INTERSPEECH 2024Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Pre-training with self-supervised models, such as Hidden-unit BERT (HuBERT) and wav2vec 2.0, has brought significant improvements in automatic speech recognition (ASR). However, these models usually require an expensive computational cost to achieve outstanding performance, slowing down the inference speed. To improve the model efficiency, we introduce an early exit scheme for ASR, namely HuBERT-EE, that allows the model to stop the inference dynamically. In HuBERT-EE, multiple early exit branches are added at the intermediate layers. When the intermediate prediction of the early exit branch is confident, the model stops the inference, and the corresponding result can be returned early. We investigate the proper early exiting criterion and fine-tuning strategy to effectively perform early exiting. Experimental results on the LibriSpeech show that HuBERT-EE can accelerate the inference of the HuBERT while simultaneously balancing the trade-off between the performance and the latency.
- [152] arXiv:2209.11920 (replaced) [pdf, html, other]
-
Title: Tradeoffs between convergence rate and noise amplification for momentum-based accelerated optimization algorithmsComments: 23 pages; 7 figuresSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS)
We study momentum-based first-order optimization algorithms in which the iterations utilize information from the two previous steps and are subject to an additive white noise. This setup uses noise to account for uncertainty in either gradient evaluation or iteration updates, and it includes Polyak's heavy-ball and Nesterov's accelerated methods as special cases. For strongly convex quadratic problems, we use the steady-state variance of the error in the optimization variable to quantify noise amplification and identify fundamental stochastic performance tradeoffs. Our approach utilizes the Jury stability criterion to provide a novel geometric characterization of conditions for linear convergence, and it reveals the relation between the noise amplification and convergence rate as well as their dependence on the condition number and the constant algorithmic parameters. This geometric insight leads to simple alternative proofs of standard convergence results and allows us to establish ``uncertainty principle'' of strongly convex optimization: for the two-step momentum method with linear convergence rate, the lower bound on the product between the settling time and noise amplification scales quadratically with the condition number. Our analysis also identifies a key difference between the gradient and iterate noise models: while the amplification of gradient noise can be made arbitrarily small by sufficiently decelerating the algorithm, the best achievable variance for the iterate noise model increases linearly with the settling time in the decelerating regime. Finally, we introduce two parameterized families of algorithms that strike a balance between noise amplification and settling time while preserving order-wise Pareto optimality for both noise models.
- [153] arXiv:2212.13854 (replaced) [pdf, other]
-
Title: A DRL Approach for RIS-Assisted Full-Duplex UL and DL Transmission: Beamforming, Phase Shift and Power OptimizationSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
We propose a deep reinforcement learning (DRL) approach for a full-duplex (FD) transmission that predicts the phase shifts of the reconfigurable intelligent surface (RIS), base station (BS) active beamformers, and the transmit powers to maximize the weighted sum rate of uplink and downlink users. Existing methods require channel state information (CSI) and residual self-interference (SI) knowledge to calculate exact active beamformers or the DRL rewards, which typically fail without CSI or residual SI. Especially for time-varying channels, estimating and signaling CSI to the DRL agent is required at each time step and is costly. We propose a two-stage DRL framework with minimal signaling overhead to address this. The first stage uses the least squares method to initiate learning by partially canceling the residual SI. The second stage uses DRL to achieve performance comparable to existing CSI-based methods without requiring the CSI or the exact residual SI. Further, the proposed DRL framework for quantized RIS phase shifts reduces the signaling from BS to the RISs using $32$ times fewer bits than the continuous version. The quantized methods reduce action space, resulting in faster convergence and $7.1\%$ and $22.28\%$ better UL and DL rates, respectively than the continuous method.
- [154] arXiv:2301.07409 (replaced) [pdf, html, other]
-
Title: Representing Noisy Image Without DenoisingComments: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
A long-standing topic in artificial intelligence is the effective recognition of patterns from noisy images. In this regard, the recent data-driven paradigm considers 1) improving the representation robustness by adding noisy samples in training phase (i.e., data augmentation) or 2) pre-processing the noisy image by learning to solve the inverse problem (i.e., image denoising). However, such methods generally exhibit inefficient process and unstable result, limiting their practical applications. In this paper, we explore a non-learning paradigm that aims to derive robust representation directly from noisy images, without the denoising as pre-processing. Here, the noise-robust representation is designed as Fractional-order Moments in Radon space (FMR), with also beneficial properties of orthogonality and rotation invariance. Unlike earlier integer-order methods, our work is a more generic design taking such classical methods as special cases, and the introduced fractional-order parameter offers time-frequency analysis capability that is not available in classical methods. Formally, both implicit and explicit paths for constructing the FMR are discussed in detail. Extensive simulation experiments and an image security application are provided to demonstrate the uniqueness and usefulness of our FMR, especially for noise robustness, rotation invariance, and time-frequency discriminability.
- [155] arXiv:2305.05738 (replaced) [pdf, html, other]
-
Title: DOCTOR: A Multi-Disease Detection Continual Learning Framework Based on Wearable Medical SensorsComments: 39 pages, 14 figures. This work has been submitted to the ACM for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Modern advances in machine learning (ML) and wearable medical sensors (WMSs) in edge devices have enabled ML-driven disease detection for smart healthcare. Conventional ML-driven methods for disease detection rely on customizing individual models for each disease and its corresponding WMS data. However, such methods lack adaptability to distribution shifts and new task classification classes. In addition, they need to be rearchitected and retrained from scratch for each new disease. Moreover, installing multiple ML models in an edge device consumes excessive memory, drains the battery faster, and complicates the detection process. To address these challenges, we propose DOCTOR, a multi-disease detection continual learning (CL) framework based on WMSs. It employs a multi-headed deep neural network (DNN) and a replay-style CL algorithm. The CL algorithm enables the framework to continually learn new missions where different data distributions, classification classes, and disease detection tasks are introduced sequentially. It counteracts catastrophic forgetting with a data preservation method and a synthetic data generation (SDG) module. The data preservation method preserves the most informative subset of real training data from previous missions for exemplar replay. The SDG module models the probability distribution of the real training data and generates synthetic data for generative replay while retaining data privacy. The multi-headed DNN enables DOCTOR to detect multiple diseases simultaneously based on user WMS data. We demonstrate DOCTOR's efficacy in maintaining high disease classification accuracy with a single DNN model in various CL experiments. In complex scenarios, DOCTOR achieves 1.43 times better average test accuracy, 1.25 times better F1-score, and 0.41 higher backward transfer than the naive fine-tuning framework with a small model size of less than 350KB.
- [156] arXiv:2305.14736 (replaced) [pdf, html, other]
-
Title: Optimal Control of Logically Constrained Partially Observable and Multi-Agent Markov Decision ProcessesComments: arXiv admin note: substantial text overlap with arXiv:2203.09038Subjects: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Systems and Control (eess.SY)
Autonomous systems often have logical constraints arising, for example, from safety, operational, or regulatory requirements. Such constraints can be expressed using temporal logic specifications. The system state is often partially observable. Moreover, it could encompass a team of multiple agents with a common objective but disparate information structures and constraints. In this paper, we first introduce an optimal control theory for partially observable Markov decision processes (POMDPs) with finite linear temporal logic constraints. We provide a structured methodology for synthesizing policies that maximize a cumulative reward while ensuring that the probability of satisfying a temporal logic constraint is sufficiently high. Our approach comes with guarantees on approximate reward optimality and constraint satisfaction. We then build on this approach to design an optimal control framework for logically constrained multi-agent settings with information asymmetry. We illustrate the effectiveness of our approach by implementing it on several case studies.
- [157] arXiv:2306.09774 (replaced) [pdf, html, other]
-
Title: Vessim: A Testbed for Carbon-Aware Applications and SystemsComments: HotCarbon'24Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY)
To reduce the carbon footprint of computing and stabilize electricity grids, there is an increasing focus on approaches that align the power usage of IT infrastructure with the availability of clean energy. Unfortunately, research on energy-aware and carbon-aware applications, as well as the interfaces between computing and energy systems, remains complex due to the scarcity of available testing environments. To this day, almost all new approaches are evaluated on custom simulation testbeds, which leads to repeated development efforts and limited comparability of results.
In this paper, we present Vessim, a co-simulation environment for testing applications and computing systems that interact with their energy systems. Our testbed connects domain-specific simulators for renewable power generation and energy storage, and enables users to implement interfaces to integrate real systems through software and hardware-in-the-loop simulation. Vessim offers an easy-to-use interface, is extendable to new simulators, and provides direct access to historical datasets. We aim to not only accelerate research in carbon-aware computing but also facilitate development and operation, as in continuous testing or digital twins. Vessim is publicly available: this https URL. - [158] arXiv:2307.06090 (replaced) [pdf, html, other]
-
Title: Can Large Language Models Aid in Annotating Speech Emotional Data? Uncovering New FrontiersComments: Accepted in IEEE Computational Intelligence MagazineSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Despite recent advancements in speech emotion recognition (SER) models, state-of-the-art deep learning (DL) approaches face the challenge of the limited availability of annotated data. Large language models (LLMs) have revolutionised our understanding of natural language, introducing emergent properties that broaden comprehension in language, speech, and vision. This paper examines the potential of LLMs to annotate abundant speech data, aiming to enhance the state-of-the-art in SER. We evaluate this capability across various settings using publicly available speech emotion classification datasets. Leveraging ChatGPT, we experimentally demonstrate the promising role of LLMs in speech emotion data annotation. Our evaluation encompasses single-shot and few-shots scenarios, revealing performance variability in SER. Notably, we achieve improved results through data augmentation, incorporating ChatGPT-annotated samples into existing datasets. Our work uncovers new frontiers in speech emotion classification, highlighting the increasing significance of LLMs in this field moving forward.
- [159] arXiv:2307.13520 (replaced) [pdf, html, other]
-
Title: Using power system modelling outputs to identify weather-induced extreme events in highly renewable systemsJournal-ref: Environmental Research Letters, Volume 19, Number 5 (2024)Subjects: Physics and Society (physics.soc-ph); Systems and Control (eess.SY)
In highly renewable power systems the increased weather dependence can result in new resilience challenges, such as renewable energy droughts, or a lack of sufficient renewable generation at times of high demand. The weather conditions responsible for these challenges have been well-studied in the literature. However, in reality multi-day resilience challenges are triggered by complex interactions between high demand, low renewable availability, electricity transmission constraints and storage dynamics. We show these challenges cannot be rigorously understood from an exclusively power systems, or meteorological, perspective. We propose a new method that uses electricity shadow prices - obtained by a European power system model based on 40 years of reanalysis data - to identify the most difficult periods driving system investments. Such difficult periods are driven by large-scale weather conditions such as low wind and cold temperature periods of various lengths associated with stationary high pressure over Europe. However, purely meteorological approaches fail to identify which events lead to the largest system stress over the multi-decadal study period due to the influence of subtle transmission bottlenecks and storage issues across multiple regions. These extreme events also do not relate strongly to traditional weather patterns (such as Euro-Atlantic weather regimes or the North Atlantic Oscillation index). We therefore compile a new set of weather patterns to define energy system stress events which include the impacts of electricity storage and large-scale interconnection. Without interdisciplinary studies combining state-of-the-art energy meteorology and modelling, further strive for adequate renewable power systems will be hampered.
- [160] arXiv:2309.04505 (replaced) [pdf, html, other]
-
Title: COVID-19 Detection System: A Comparative Analysis of System Performance Based on Acoustic Features of Cough Audio SignalsComments: 8 pages, 3 figuresJournal-ref: 2023 IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Exeter, United Kingdom, 2023, pp. 2706-2713Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
A wide range of respiratory diseases, such as cold and flu, asthma, and COVID-19, affect people's daily lives worldwide. In medical practice, respiratory sounds are widely used in medical services to diagnose various respiratory illnesses and lung disorders. The traditional diagnosis of such sounds requires specialized knowledge, which can be costly and reliant on human expertise. Despite this, recent advancements, such as cough audio recordings, have emerged as a means to automate the detection of respiratory conditions. Therefore, this research aims to explore various acoustic features that enhance the performance of machine learning (ML) models in detecting COVID-19 from cough signals. It investigates the efficacy of three feature extraction techniques, including Mel Frequency Cepstral Coefficients (MFCC), Chroma, and Spectral Contrast features, when applied to two machine learning algorithms, Support Vector Machine (SVM) and Multilayer Perceptron (MLP), and therefore proposes an efficient CovCepNet detection system. The proposed system provides a practical solution and demonstrates state-of-the-art classification performance, with an AUC of 0.843 on the COUGHVID dataset and 0.953 on the Virufy dataset for COVID-19 detection from cough audio signals.
- [161] arXiv:2310.08753 (replaced) [pdf, html, other]
-
Title: CompA: Addressing the Gap in Compositional Reasoning in Audio-Language ModelsSreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, Ramaneswaran S, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh ManochaComments: ICLR 2024Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perform compositional reasoning remains largely unexplored and necessitates additional research. In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute-binding of acoustic events. An instance from either benchmark consists of two audio-caption pairs, where both audios have the same acoustic events but with different compositions. An ALM is evaluated on how well it matches the right audio to the right caption. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, thereby struggling with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. To train CompA-CLAP, we first propose improvements to contrastive training with composition-aware hard negatives, allowing for more focused training. Next, we propose a novel modular contrastive loss that helps the model learn fine-grained compositional understanding and overcomes the acute scarcity of openly available compositional audios. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities.
- [162] arXiv:2401.00816 (replaced) [pdf, html, other]
-
Title: GLIMPSE: Generalized Local Imaging with MLPsComments: 12 pages, 10 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Deep learning is the current de facto state of the art in tomographic imaging. A common approach is to feed the result of a simple inversion, for example the backprojection, to a convolutional neural network (CNN) which then computes the reconstruction. Despite strong results on 'in-distribution' test data similar to the training data, backprojection from sparse-view data delocalizes singularities, so these approaches require a large receptive field to perform well. As a consequence, they overfit to certain global structures which leads to poor generalization on out-of-distribution (OOD) samples. Moreover, their memory complexity and training time scale unfavorably with image resolution, making them impractical for application at realistic clinical resolutions, especially in 3D: a standard U-Net requires a substantial 140GB of memory and 2600 seconds per epoch on a research-grade GPU when training on 1024x1024 images. In this paper, we introduce GLIMPSE, a local processing neural network for computed tomography which reconstructs a pixel value by feeding only the measurements associated with the neighborhood of the pixel to a simple MLP. While achieving comparable or better performance with successful CNNs like the U-Net on in-distribution test data, GLIMPSE significantly outperforms them on OOD samples while maintaining a memory footprint almost independent of image resolution; 5GB memory suffices to train on 1024x1024 images. Further, we built GLIMPSE to be fully differentiable, which enables feats such as recovery of accurate projection angles if they are out of calibration.
- [163] arXiv:2401.10747 (replaced) [pdf, other]
-
Title: Multimodal Sentiment Analysis with Missing Modality: A Knowledge-Transfer ApproachComments: I request to withdraw my paper due to the discovery of significant errors in the manuscript that affect the validity of the experimental results. These errors necessitate a substantial revision, and the current version should not be used or cited in its present formSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Multimodal sentiment analysis aims to identify the emotions expressed by individuals through visual, language, and acoustic cues. However, most of the existing research efforts assume that all modalities are available during both training and testing, making their algorithms susceptible to the missing modality scenario. In this paper, we propose a novel knowledge-transfer network to translate between different modalities to reconstruct the missing audio modalities. Moreover, we develop a cross-modality attention mechanism to retain the maximal information of the reconstructed and observed modalities for sentiment prediction. Extensive experiments on three publicly available datasets demonstrate significant improvements over baselines and achieve comparable results to the previous methods with complete multi-modality supervision.
- [164] arXiv:2402.04216 (replaced) [pdf, other]
-
Title: Resource-Aware Hierarchical Federated Learning in Wireless Video Caching NetworksComments: Under review for possible publication in IEEE Transactions on Wireless CommunicationsSubjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Backhaul traffic congestion caused by the video traffic of a few popular files can be alleviated by storing the to-be-requested content at various levels in wireless video caching networks. Typically, content service providers (CSPs) own the content, and the users request their preferred content from the CSPs using their (wireless) internet service providers (ISPs). As these parties do not reveal their private information and business secrets, traditional techniques may not be readily used to predict the dynamic changes in users' future demands. Motivated by this, we propose a novel resource-aware hierarchical federated learning (RawHFL) solution for predicting user's future content requests. A practical data acquisition technique is used that allows the user to update its local training dataset based on its requested content. Besides, since networking and other computational resources are limited, considering that only a subset of the users participate in the model training, we derive the convergence bound of the proposed algorithm. Based on this bound, we minimize a weighted utility function for jointly configuring the controllable parameters to train the RawHFL energy efficiently under practical resource constraints. Our extensive simulation results validate the proposed algorithm's superiority, in terms of test accuracy and energy cost, over existing baselines.
- [165] arXiv:2402.14400 (replaced) [pdf, html, other]
-
Title: Modeling 3D Infant Kinetics Using Adaptive Graph Convolutional NetworksDaniel Holmberg, Manu Airaksinen, Viviana Marchi, Andrea Guzzetta, Anna Kivi, Leena Haataja, Sampsa Vanhatalo, Teemu RoosComments: 10 pages, 3 figures. Code repository available via this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Reliable methods for the neurodevelopmental assessment of infants are essential for early detection of medical issues that may need prompt interventions. Spontaneous motor activity, or 'kinetics', is shown to provide a powerful surrogate measure of upcoming neurodevelopment. However, its assessment is by and large qualitative and subjective, focusing on visually identified, age-specific gestures. Here, we follow an alternative approach, predicting infants' neurodevelopmental maturation based on data-driven evaluation of individual motor patterns. We utilize 3D video recordings of infants processed with pose-estimation to extract spatio-temporal series of anatomical landmarks, and apply adaptive graph convolutional networks to predict the actual age. We show that our data-driven approach achieves improvement over traditional machine learning baselines based on manually engineered features.
- [166] arXiv:2402.19325 (replaced) [pdf, html, other]
-
Title: Do End-to-End Neural Diarization Attractors Need to Encode Speaker Characteristic Information?Comments: Accepted to Odyssey 2024. This arXiv version includes an appendix for more visualizations. Code: this https URLSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
In this paper, we apply the variational information bottleneck approach to end-to-end neural diarization with encoder-decoder attractors (EEND-EDA). This allows us to investigate what information is essential for the model. EEND-EDA utilizes attractors, vector representations of speakers in a conversation. Our analysis shows that, attractors do not necessarily have to contain speaker characteristic information. On the other hand, giving the attractors more freedom to allow them to encode some extra (possibly speaker-specific) information leads to small but consistent diarization performance improvements. Despite architectural differences in EEND systems, the notion of attractors and frame embeddings is common to most of them and not specific to EEND-EDA. We believe that the main conclusions of this work can apply to other variants of EEND. Thus, we hope this paper will be a valuable contribution to guide the community to make more informed decisions when designing new systems.
- [167] arXiv:2403.07675 (replaced) [pdf, html, other]
-
Title: Multichannel Long-Term Streaming Neural Speech Enhancement for Static and Moving SpeakersComments: Accepted by IEEE Signal Processing LettersSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
In this work, we extend our previously proposed offline SpatialNet for long-term streaming multichannel speech enhancement in both static and moving speaker scenarios. SpatialNet exploits spatial information, such as the spatial/steering direction of speech, for discriminating between target speech and interferences, and achieved outstanding performance. The core of SpatialNet is a narrow-band self-attention module used for learning the temporal dynamic of spatial vectors. Towards long-term streaming speech enhancement, we propose to replace the offline self-attention network with online networks that have linear inference complexity w.r.t signal length and meanwhile maintain the capability of learning long-term information. Three variants are developed based on (i) masked self-attention, (ii) Retention, a self-attention variant with linear inference complexity, and (iii) Mamba, a structured-state-space-based RNN-like network. Moreover, we investigate the length extrapolation ability of different networks, namely test on signals that are much longer than training signals, and propose a short-signal training plus long-signal fine-tuning strategy, which largely improves the length extrapolation ability of the networks within limited training time. Overall, the proposed online SpatialNet achieves outstanding speech enhancement performance for long audio streams, and for both static and moving speakers. The proposed method is open-sourced in this https URL.
- [168] arXiv:2403.13086 (replaced) [pdf, html, other]
-
Title: Listenable Maps for Audio ClassifiersComments: Accepted to ICML 2024 (Oral)Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Despite the impressive performance of deep learning models across diverse tasks, their complexity poses challenges for interpretation. This challenge is particularly evident for audio signals, where conveying interpretations becomes inherently difficult. To address this issue, we introduce Listenable Maps for Audio Classifiers (L-MAC), a posthoc interpretation method that generates faithful and listenable interpretations. L-MAC utilizes a decoder on top of a pretrained classifier to generate binary masks that highlight relevant portions of the input audio. We train the decoder with a loss function that maximizes the confidence of the classifier decision on the masked-in portion of the audio while minimizing the probability of model output for the masked-out portion. Quantitative evaluations on both in-domain and out-of-domain data demonstrate that L-MAC consistently produces more faithful interpretations than several gradient and masking-based methodologies. Furthermore, a user study confirms that, on average, users prefer the interpretations generated by the proposed technique.
- [169] arXiv:2405.00495 (replaced) [pdf, html, other]
-
Title: The Loewner framework for parametric systems: Taming the curse of dimensionalityComments: 32 pages, 4 figuresSubjects: Numerical Analysis (math.NA); Systems and Control (eess.SY)
The Loewner framework is an interpolatory approach designed for approximating linear and nonlinear systems. The goal here is to extend this framework to linear parametric systems with an arbitrary number n of parameters. One main innovation established here is the construction of data-based realizations for any number of parameters. Equally importantly, we show how to alleviate the computational burden, by avoiding the explicit construction of large-scale n-dimensional Loewner matrices of size $N \times N$. This reduces the complexity from $O(N^3)$ to about $O(N^{1.4})$, thus taming the curse of dimensionality and making the solution scalable to very large data sets. To achieve this, a new generalized multivariate rational function realization is defined. Then, we introduce the n-dimensional multivariate Loewner matrices and show that they can be computed by solving a coupled set of Sylvester equations. The null space of these Loewner matrices then allows the construction of the multivariate barycentric transfer function. The principal result of this work is to show how the null space of the n-dimensional Loewner matrix can be computed using a sequence of 1-dimensional Loewner matrices, leading to a drastic computational burden reduction. Finally, we suggest two algorithms (one direct and one iterative) to construct, directly from data, multivariate (or parametric) realizations ensuring (approximate) interpolation. Numerical examples highlight the effectiveness and scalability of the method.
- [170] arXiv:2405.05669 (replaced) [pdf, html, other]
-
Title: Passive Obstacle Aware Control to Follow Desired VelocitiesSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
Evaluating and updating the obstacle avoidance velocity for an autonomous robot in real-time ensures robustness against noise and disturbances. A passive damping controller can obtain the desired motion with a torque-controlled robot, which remains compliant and ensures a safe response to external perturbations. Here, we propose a novel approach for designing the passive control policy. Our algorithm complies with obstacle-free zones while transitioning to increased damping near obstacles to ensure collision avoidance. This approach ensures stability across diverse scenarios, effectively mitigating disturbances. Validation on a 7DoF robot arm demonstrates superior collision rejection capabilities compared to the baseline, underlining its practicality for real-world applications. Our obstacle-aware damping controller represents a substantial advancement in secure robot control within complex and uncertain environments.
- [171] arXiv:2405.16090 (replaced) [pdf, html, other]
-
Title: EEG-DBNet: A Dual-Branch Network for Temporal-Spectral Decoding in Motor-Imagery Brain-Computer InterfacesSubjects: Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Motor imagery electroencephalogram (EEG)-based brain-computer interfaces (BCIs) offer significant advantages for individuals with restricted limb mobility. However, challenges such as low signal-to-noise ratio and limited spatial resolution impede accurate feature extraction from EEG signals, thereby affecting the classification accuracy of different actions. To address these challenges, this study proposes an end-to-end dual-branch network (EEG-DBNet) that decodes the temporal and spectral sequences of EEG signals in parallel through two distinct network branches. Each branch comprises a local convolutional block and a global convolutional block. The local convolutional block transforms the source signal from the temporal-spatial domain to the temporal-spectral domain. By varying the number of filters and convolution kernel sizes, the local convolutional blocks in different branches adjust the length of their respective dimension sequences. Different types of pooling layers are then employed to emphasize the features of various dimension sequences, setting the stage for subsequent global feature extraction. The global convolution block splits and reconstructs the feature of the signal sequence processed by the local convolution block in the same branch and further extracts features through the dilated causal convolutional neural networks. Finally, the outputs from the two branches are concatenated, and signal classification is completed via a fully connected layer. Our proposed method achieves classification accuracies of 85.84% and 91.60% on the BCI Competition 4-2a and BCI Competition 4-2b datasets, respectively, surpassing existing state-of-the-art models. The source code is available at this https URL.
- [172] arXiv:2406.04589 (replaced) [pdf, html, other]
-
Title: MUSE: Flexible Voiceprint Receptive Fields and Multi-Path Fusion Enhanced Taylor Transformer for U-Net-based Speech EnhancementComments: This paper was accepted by Interspeech 2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Achieving a balance between lightweight design and high performance remains a challenging task for speech enhancement. In this paper, we introduce Multi-path Enhanced Taylor (MET) Transformer based U-net for Speech Enhancement (MUSE), a lightweight speech enhancement network built upon the Unet architecture. Our approach incorporates a novel Multi-path Enhanced Taylor (MET) Transformer block, which integrates Deformable Embedding (DE) to enable flexible receptive fields for voiceprints. The MET Transformer is uniquely designed to fuse Channel and Spatial Attention (CSA) branches, facilitating channel information exchange and addressing spatial attention deficits within the Taylor-Transformer framework. Through extensive experiments conducted on the VoiceBank+DEMAND dataset, we demonstrate that MUSE achieves competitive performance while significantly reducing both training and deployment costs, boasting a mere 0.51M parameters.
- [173] arXiv:2406.08416 (replaced) [pdf, html, other]
-
Title: TokSing: Singing Voice Synthesis based on Discrete TokensComments: Accepted by Interspeech 2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Recent advancements in speech synthesis witness significant benefits by leveraging discrete tokens extracted from self-supervised learning (SSL) models. Discrete tokens offer higher storage efficiency and greater operability in intermediate representations compared to traditional continuous Mel spectrograms. However, when it comes to singing voice synthesis(SVS), achieving higher levels of melody expression poses a great challenge for utilizing discrete tokens. In this paper, we introduce TokSing, a discrete-based SVS system equipped with a token formulator that offers flexible token blendings. We observe a melody degradation during discretization, prompting us to integrate a melody signal with the discrete token and incorporate a specially-designed melody enhancement strategy in the musical encoder. Extensive experiments demonstrate that our TokSing achieves better performance against the Mel spectrogram baselines while offering advantages in intermediate representation space cost and convergence speed.
- [174] arXiv:2406.08905 (replaced) [pdf, html, other]
-
Title: SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech ModelsComments: Accepted by Interspeech 2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Discrete representation has shown advantages in speech generation tasks, wherein discrete tokens are derived by discretizing hidden features from self-supervised learning (SSL) pre-trained models. However, the direct application of speech SSL models to singing generation encounters domain gaps between speech and singing. Furthermore, singing generation necessitates a more refined representation than typical speech. To address these challenges, we introduce SingOMD, a novel method to extract singing-oriented multi-resolution discrete representations from speech SSL models. Specifically, we first adapt the features from speech SSL through a resynthesis task and incorporate multi-resolution modules based on resampling to better serve singing generation. These adapted multi-resolution features are then discretized via clustering. Extensive experiments demonstrate the robustness, efficiency, and effectiveness of these representations in singing vocoders and singing voice synthesis.
- [175] arXiv:2406.08931 (replaced) [pdf, html, other]
-
Title: Exploring Multilingual Unseen Speaker Emotion Recognition: Leveraging Co-Attention Cues in Multitask LearningComments: 5 pages, Accepted to INTERSPEECH 2024. The first two authors contributed equallySubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Advent of modern deep learning techniques has given rise to advancements in the field of Speech Emotion Recognition (SER). However, most systems prevalent in the field fail to generalize to speakers not seen during training. This study focuses on handling challenges of multilingual SER, specifically on unseen speakers. We introduce CAMuLeNet, a novel architecture leveraging co-attention based fusion and multitask learning to address this problem. Additionally, we benchmark pretrained encoders of Whisper, HuBERT, Wav2Vec2.0, and WavLM using 10-fold leave-speaker-out cross-validation on five existing multilingual benchmark datasets: IEMOCAP, RAVDESS, CREMA-D, EmoDB and CaFE and, release a novel dataset for SER on the Hindi language (BhavVani). CAMuLeNet shows an average improvement of approximately 8% over all benchmarks on unseen speakers determined by our cross-validation strategy.
- [176] arXiv:2406.09272 (replaced) [pdf, html, other]
-
Title: Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric VideosComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Generating realistic audio for human interactions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets Ego4D and EPIC-KITCHENS. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our work is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.
- [177] arXiv:2406.10911 (replaced) [pdf, html, other]
-
Title: SingMOS: An extensive Open-Source Singing Voice Dataset for MOS PredictionSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
In speech generation tasks, human subjective ratings, usually referred to as the opinion score, are considered the "gold standard" for speech quality evaluation, with the mean opinion score (MOS) serving as the primary evaluation metric. Due to the high cost of human annotation, several MOS prediction systems have emerged in the speech domain, demonstrating good performance. These MOS prediction models are trained using annotations from previous speech-related challenges. However, compared to the speech domain, the singing domain faces data scarcity and stricter copyright protections, leading to a lack of high-quality MOS-annotated datasets for singing. To address this, we propose SingMOS, a high-quality and diverse MOS dataset for singing, covering a range of Chinese and Japanese datasets. These synthesized vocals are generated using state-of-the-art models in singing synthesis, conversion, or resynthesis tasks and are rated by professional annotators alongside real vocals. Data analysis demonstrates the diversity and reliability of our dataset. Additionally, we conduct further exploration on SingMOS, providing insights for singing MOS prediction and guidance for the continued expansion of SingMOS.
- [178] arXiv:2406.11723 (replaced) [pdf, html, other]
-
Title: Control of Unknown Quadrotors from a Single ThrowComments: 7 pages, 5 figures, 2 tables. Submitted to the IROS 2024 conferenceSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
This paper presents a method to recover quadrotor UAV from a throw, when no control parameters are known before the throw. We leverage the availability of high-frequency rotor speed feedback available in racing drone hardware and software to find control effectiveness values and fit a motor model using recursive least squares (RLS) estimation. Furthermore, we propose an excitation sequence that provides large actuation commands while guaranteeing to stay within gyroscope sensing limits. After 450ms of excitation, an INDI attitude controller uses the 52 fitted parameters to arrest rotational motion and recover an upright attitude. Finally, a NDI position controller drives the craft to a position setpoint. The proposed algorithm runs efficiently on microcontrollers found in common UAV flight controllers, and was shown to recover an agile quadrotor every time in 57 live experiments with as low as 3.5m throw height, demonstrating robustness against initial rotations and noise. We also demonstrate control of randomized quadrotors in simulated throws, where the parameter fitting RMS error is typically within 10% of the true value.